
ROBUST PROCESSING IN MACHINE TRANSLATION

Rod Johnson, Centre for Computational Linguistics, UMIST, Manchester, M60 8QD, U.K.

Doug Arnold, Centre for Cognitive Studies, University of Essex, Colchester, CO4 3SQ, U.K.

that of [10].

ABSTRACT

Finally, the fact that we are concerned with translation militates against the kind of disregard for input that is characteristic of some robust systems (PARRY [4] is an extreme example), and motivates a concern with the repair or correction of errors. It is not enough that a translation system produces superficially acceptable output for a wide class of inputs; it should aim to produce outputs which represent as nearly as possible translations of the inputs. If it cannot do this, then in some cases it will be better if it indicates as much, so that other action can be taken.

In this paper we provide an abstract characterisation of different kinds of robust processing in Machine Translation and Natural Language Processing systems in terms of the kinds of problem they are supposed to solve. We focus on one problem which is typically exacerbated by robust processing, and for which we know of no existing solutions. We discuss two possible approaches to this, emphasising the need to correct or repair processing malfunctions.

From the point of view we adopt, it is possible to regard MT and NLP systems generally as sets of processes implementing relations between representations (texts can be considered representations of themselves). It is important to distinguish:

ROBUST PROCESSING IN MACHINE TRANSLATION

This paper is an attempt to provide part of the basis for a general theory of robust processing in Machine Translation (MT) with relevance to other areas of Natural Language Processing (NLP). That is, processing which is resistant to malfunctioning however caused. The background to the paper is work on a general purpose fully automatic multi-lingual MT system within a highly decentralised organisational framework (specifically, the Eurotra system under development by the EEC). This influences us in a number of ways.

(i) R: the correct, or intended relation that holds between representations (e.g. the relation "is a (correct) translation of", or "is the surface constituent structure of"): we have only fairly vague, pre-theoretical ideas about Rs, in virtue of being bi-lingual speakers, or having some intuitive grasp of the semantics of artificial representations;

(ii) T: a theoretical construct which is supposed to embody R;

(iii) P: a process or program that is supposed to implement T.

Decentralised development, and the fact that the system is to be general purpose, motivate the formulation of a general theory, which abstracts away from matters of purely local relevance, and does not e.g. depend on exploiting special properties of a particular subject field (compare [7], e.g.).

By a robust process P, we mean one which operates error free for all inputs. Clearly, the notion of error or correctness of P depends on the independent standard provided by T and R. If, for the sake of simplicity, we ignore the possibility of ambiguous inputs here, we can define correctness thus:

The fact that we consider robustness at all can be seen as a result of the difficulty of MT, and the aim of full automation is reflected in our concentration on a theory of robust processing, rather than "developmental robustness". We will not be concerned here with problems that arise in designing systems so that they are capable of extension and repair (e.g. not being prone to unforeseen "ripple effects" under modification). Developmental robustness is clearly essential, and such problems are serious, but no system which relies on this kind of robustness can ever be fully automatic. For the same reason, we will not consider the use of "interactive" approaches to robustness such as

(1) Given P(x)=y, and a set W such that for all w in W, R(w)=y, then y is correct with respect to R and w iff x is a member of W. Intuitively, W is the set of items for which y is the correct representation according to R. One possible source of errors in P would be if P correctly implemented T, but T did not embody R. Clearly, in this case, the only sensible solution is to modify T. Since we can imagine no automatic way of finding such errors and doing this, we will


produced within P, or alternative representations constructed within P which can be more reliably computed than the "normal" intended output of P (the representational theories of GETA and Eurotra are designed with this in mind: cf. [2], [3], [5], [6], and references there, and see [1] for fuller discussion of these issues). Again, it is conceivable that the result of this may be to produce a robust P that implements T more correctly or completely, but again this will be fortuitous. The most likely result will be that the robust P will now produce errors of the third type:

ignore this possibility, and assume that T is a well-defined, correct and complete embodiment of R. We can thus replace R by T in (1), and treat T as the standard of correctness below. There appear to be two possible sources of error in P:

Problem (i): where P is not a correct implementation of T. One would expect this to be common where (as often in MT and NLP) T is very complex, and serious problems arise in devising implementations for them.

case (c): P(x)=y, where y is a legal output for P according to T, but is not the intended output according to T, i.e. y is in the range of T, but y≠T(x).

Problem (ii): where P is a correct implementation so far as it goes, but is incomplete, so that the domain of P is a proper subset of the domain of T. This will also be very common: in reality processes are often faced with inputs that violate the expectations implicit in an implementation.

Suppose both input x and output y of some process are legal objects; it nevertheless does not follow that they have been correctly paired by the process: e.g. in the case of a parsing process, x may be some sentence and y some representation. Obviously, the fact that x and y are legal objects for the parsing process and that y is the output of the parser for input x does not guarantee that y is a correct representation of x. Of course, robust processing should be resistant to this kind of malfunctioning also.

If we disregard hardware errors, low level bugs and such malfunctions as non-termination of P (for which there are well-known solutions), there are three possible manifestations of malfunction. We will discuss them in turn.

case (a): P(x)=∅, where T(x)≠∅

Case-(c) errors are by far the most serious and resistant to solution because they are the hardest to detect, and because in many cases no output is preferable to superficially (misleadingly) well-formed but incorrect output. Notice also that while any process may be subject to this kind of error, making a system robust in response to case-(a) and case-(b) errors will make this class of errors more widespread: we have suggested that the likely result of changing P to make it robust will be that it no longer pairs representations in the manner required by T, but since any process that takes the output of P should be set up so as to expect inputs that conform to T (since this is the "correct" embodiment of R, we have assumed), we can expect that in general making a process robust will lead to cascades of errors. If we assume that a system is resistant to case-(a) and case-(b) errors, then it follows that inputs for which the system has to resort to robust processing will be likely to lead to case-(c) errors.

i.e. P halts producing the empty output ∅ for input x, where this is not the intended output. This would be a typical response to unforeseen or ill-formed input, and is the case of process fragility that is most often dealt with. There are two obvious solutions: (i) to manipulate the input so that it conforms to the expectations implicit in P (cf. the LIFER [8] approach to ellipsis), or (ii) to change P itself, modifying (generally relaxing) its expectations (cf. e.g. the approaches of [7], [9], [10] and [11]). If successful, these guarantee that P produces some output for input x. However, there is of course no guarantee that it is correct with respect to T. It may be that P plus the input manipulation process, or P with relaxed expectations, is simply a more correct or complete implementation of T, but this will be fortuitous. It is more likely that making P robust in these ways will lead to errors of another kind:

case (b): P(x)=z, where z is not a legal output for P according to T (i.e. z is not in the range of T).
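The three manifestations of malfunction, cases (a), (b) and (c), can be illustrated with a small Python sketch. It assumes, purely for illustration, that T is available as an executable oracle with a finite, enumerable range, which is rarely true of a real MT system. The names `classify`, `EMPTY`, `range_of_T` and the toy lexicon are hypothetical and do not come from the paper.

```python
from typing import Callable, Hashable, Optional, Set

# Hypothetical stand-in for the empty output; an assumption of this sketch.
EMPTY = None

Process = Callable[[str], Optional[Hashable]]


def classify(P: Process, T: Process, range_of_T: Set[Hashable], x: str) -> str:
    """Label the behaviour of P on input x relative to the theory T."""
    intended = T(x)
    actual = P(x)
    if actual == intended:
        return "correct"
    if actual is EMPTY:
        return "case (a)"  # no output although T defines one
    if actual not in range_of_T:
        return "case (b)"  # output is not legal according to T
    return "case (c)"      # legal output, but not the intended one


# Toy theory: a two-word lexicon mapped to category labels.
lexicon = {"dog": "N", "runs": "V"}
T = lexicon.get
range_of_T = set(lexicon.values())


def fragile(x: str) -> Optional[str]:
    """Fails (returns EMPTY) on one input that T covers."""
    return EMPTY if x == "runs" else lexicon.get(x)


def ill_formed(x: str) -> str:
    """Always emits something outside the range of T."""
    return "??"


def plausible_but_wrong(x: str) -> str:
    """Always emits a legal label, whether or not it is the intended one."""
    return "N"


print(classify(fragile, T, range_of_T, "runs"))              # case (a)
print(classify(ill_formed, T, range_of_T, "dog"))            # case (b)
print(classify(plausible_but_wrong, T, range_of_T, "runs"))  # case (c)
print(classify(plausible_but_wrong, T, range_of_T, "dog"))   # correct
```

Only the first two tests (`actual is EMPTY` and membership in `range_of_T`) can be made without knowing the intended output, which is why case (c) is the hard one to detect in practice.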

Moreover, we can expect that making P robust will have made case-(c) errors more difficult to deal with. The likely result of making P robust is that it no longer implements T, but some T' which is distinct from T, and for which assumptions about correctness in relation to R no longer hold. It is obvious that the possibility of detecting case-(c) errors depends on the possibility of distinguishing T from T'. Theoretically, this is unproblematic. However, in a domain such as MT it will be rather unusual for T and T' to exist separately from the processes that implement them. Thus, if we are to have any chance of detecting case-(c) errors, we must be able to clearly distinguish those aspects of a process that relate to "normal" processing from

Typically, such an error will show itself by malfunctioning in a process that P feeds. Detection of such errors is straightforward: a well-formedness check on the output of P is sufficient. By itself, of course, this will lead to a proliferation of case-(a) errors in P. These can be avoided by a number of methods, in particular: (i) introducing some process to manipulate the output of P to make it well-formed according to T, or (ii) attempting to set up processes that feed on P so that they can use "abnormal" or "non-standard" output from P (e.g. partial representations, or complete intermediate representations


assumptions:

those that relate to robust processing. This distinction is not one that is made in most robust systems,

(i) That P⁻¹ terminates in reasonable time. This cannot be guaranteed, but the assumption can be rendered more reasonable by observing characteristics of the input, and thus restricting W (e.g. restricting the members of W in relation to the length of the input to P⁻¹).

We know of no existing solutions to case-(c) malfunctions. Here we will outline two possible approaches. To begin with we might consider a partial solution derived from a well-known technique in systems theory: insuring against the effect of faulty components in crucial parts of a system by computing the result for a given input by a number of different routes. For our purposes, the method would consist essentially in implementing the same theory T as a number of distinct processes P1, ..., Pn to be run in parallel, comparing outputs and using statistical criteria to determine the correctness of processing. We will call this the "statistical solution". (Notice that certain kinds of system architecture make this quite feasible, even given real time constraints).
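As a rough illustration of the statistical solution, the following Python sketch runs several implementations on the same input and accepts an output only when a sufficient share of them agree. The paper does not specify the statistical criteria, so the simple majority threshold used here is an assumption, as are the names `statistical_solution` and `min_share`. The processes are also called one after another rather than in parallel, to keep the sketch short.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def statistical_solution(
    processes: Sequence[Callable[[str], Hashable]],
    x: str,
    min_share: float = 0.5,
) -> Optional[Hashable]:
    """Return the output most processes agree on, or None if agreement is too weak.

    min_share is a hypothetical acceptance threshold: the winning output must be
    produced by strictly more than this fraction of the processes.
    """
    if not processes:
        raise ValueError("at least one process is required")
    outputs = [p(x) for p in processes]  # sequential here; could be run in parallel
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes / len(outputs) > min_share:
        return winner
    return None  # signal that no output could be trusted


# Three toy "implementations of T"; the third one is faulty.
def p1(x: str) -> str:
    return x.upper()


def p2(x: str) -> str:
    return "".join(ch.upper() for ch in x)


def p3(x: str) -> str:
    return x[::-1]


print(statistical_solution([p1, p2, p3], "abc"))  # ABC
print(statistical_solution([p1, p3], "abc"))      # None
```

Returning `None` when agreement is weak reflects the point made earlier in the paper that, in some cases, no output is preferable to superficially well-formed but incorrect output.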

(ii) That construction of P⁻¹ is somehow more straightforward than construction of P, so that P⁻¹ is likely to be more reliable (correct and complete) than P. In fact this is not implausible for some applications (e.g. consider the case where P is a parser: it is a widely held idea that generators are easier to build than parsers).

Granted these assumptions, detection of case-(c) errors is straightforward given this "inverse mapping" approach: one simply examines the enumeration for the actual input. If it is present, then given that P⁻¹ is likely to be more reliable than P, it is likely that the output of P was T-correct, and hence did not constitute a case-(c) error. At least, the chances of the output of P being correct have been increased. If the input is not present, then it is likely that P has produced a case-(c) error. The response to this will depend on the domain and application, e.g. on whether incorrect but superficially well-formed output is preferable to no output at all.
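The inverse mapping check can be sketched as follows, taking P to be a parser and the inverse process to be a generator. This is a minimal sketch assuming the inverse process returns a finite set W of candidate inputs; the function name `inverse_mapping_check`, the toy `generations` table and the deliberately over-tolerant `robust_parser` are all hypothetical.

```python
from typing import Callable, Hashable, Set, Tuple


def inverse_mapping_check(
    P: Callable[[str], Hashable],
    P_inverse: Callable[[Hashable], Set[str]],
    x: str,
) -> Tuple[Hashable, bool]:
    """Run P on x, then test whether x is among the inputs the inverse yields.

    Returns the output y and a flag; False suggests a likely case-(c) error.
    """
    y = P(x)
    W = P_inverse(y)  # enumeration of inputs that could have yielded y
    return y, x in W


# Toy generator (the inverse process): representation -> set of sentences.
generations = {
    "NP(dog)": {"the dog", "a dog"},
    "NP(cat)": {"the cat"},
}


def generator(y: Hashable) -> Set[str]:
    return generations.get(y, set())


def robust_parser(x: str) -> str:
    """A parser made 'robust' by falling back to NP(dog) for anything unknown."""
    return "NP(cat)" if x == "the cat" else "NP(dog)"


print(inverse_mapping_check(robust_parser, generator, "the dog"))  # ('NP(dog)', True)
print(inverse_mapping_check(robust_parser, generator, "the dgo"))  # ('NP(dog)', False)
```

In the second call the parser produces a legal representation for an input it did not really understand, and the check flags it because the generator cannot produce that input from the representation.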

Clearly, while this should significantly improve the chances that output will be correct, it can provide no guarantee. Moreover, the kind of situation we are considering is more complex than that arising given failure of relatively simple pieces of hardware. In particular, to make this worthwhile, we must be able to ensure that the different Ps are genuinely distinct, and that they are reasonably complete and correct implementations of T, at the very least sufficiently complete and correct that their outputs can be sensibly compared.

In the nature of things, we will ultimately be led to the original problems of robustness, but now in connection with P⁻¹. For this reason we cannot foresee any complete solution to problems of robustness generally. What we have seen is that solutions to one sort of fragility are normally only partly successful, leading to errors of another kind elsewhere. Clearly, what we have to hope is that each attempt to eliminate a source of error nevertheless leads to a net decrease in the overall number of errors.

Unfortunately, this will be very difficult to ensure, particularly in a field such as MT, where Ts are generally very complex, and (as we have noted) are often not stated separately from the processes that implement them. The statistical approach is attractive because it seems to provide a simultaneous solution to both the detection and repair of case-(c) errors, and we consider such solutions are certainly worth further consideration. However, realistically, we expect the normal situation to be that it is difficult to produce reasonably correct and complete distinct implementations, so that we are forced to look for an alternative approach to the detection of case-(c) errors.

On the one hand, this hope is reasonable, since sometimes the faults that give rise to processing errors are actually fixed. But there can be no general guarantee of this, so that it seems clear that merely making systems or processes robust in the ways described provides only a partial solution to the problem of processing errors.

It is obvious that reliable detection of (c)-type errors requires the implementation of a relation that pairs representations in exactly the same way as T: the obvious candidate is a process P⁻¹, implementing T⁻¹, the inverse of T.

This should not be surprising. Because our primary concern is with automatic error detection and repair, we have assumed throughout that T could be considered a correct and complete embodiment of R. Of course, this is unrealistic, and in fact it is probable that for many processes, at least as many processing errors will arise from the inadequacy of T with respect to R as arise from the inadequacy of P with respect to T. Our pre-theoretical and intuitive ability to relate representations far exceeds our ability to formulate clear theoretical statements about these relations. Given this, it would seem that error free processing depends at least as much on the correctness of theoretical models as the capacity

The basic method here would be to compute an enumeration of the set of all possible inputs W that could have yielded the actual output, given T, and some hypothetical ideal P which correctly implements it. (Again, this is not unrealistic; certain system architectures would allow forward computation to proceed while this inverse processing is carried out). To make this worthwhile would involve two


of a system to take advantage of the techniques described above.

of the ideas in this paper were first aired in Eurotra report ETL-3 ([4]), and in a paper presented at the Cranfield conference on MT earlier this year. We would like to thank all our friends and colleagues in the project and our institutions. The views (and, in particular, the errors) in this paper are our own responsibility, and should not be interpreted as "official" Eurotra doctrine.

We should emphasise this because it sometimes appears as though techniques for ensuring process robustness might have a wider importance. We assumed above that T was to be regarded as a correct embodiment of R. Suppose this assumption is relaxed, and in addition that (as we have argued is likely to be the case) the robust version of P implements a relation T' which is distinct from T. Now, it could, in principle, turn out that T' is a better embodiment of R than T. It is worth saying that this possibility is remote, because it is a possibility that seems to be taken seriously elsewhere: almost all the strategies we have mentioned as enhancing process robustness were originally proposed as theoretical devices to increase the adequacy of Ts in relation to Rs (e.g. by providing an account of metaphorical or other "problematic" usage). There can be no question that apart from improvements of T, such theoretical developments can have the side effect of increasing robustness. But notice that their justification is then not to do with robustness, but with theoretical adequacy. What must be emphasised is that the chances that a modification of a process to enhance robustness (and improve reliability) will also have the effect of improving the quality of its performance are extremely slim. We cannot expect robust processing to produce results which are as good as those that would result from "ideal" (optimal/non-robust) processing. In fact, we have suggested that existing techniques for ensuring process robustness typically have the effect of changing the theory the process implements, changing the relationship between representations that the system defines in ways which do not preserve the relationship between representations that the designers intended, so that processes that have been made robust by existing methods can be expected to produce output of lower than intended quality.

These remarks are intended to emphasise the importance of clear, complete, and correct theoretical models of the pre-theoretical relationships between the representations involved in systems for which error free "robust" operation is important, and to emphasise the need for approaches to robustness (such as the two we have outlined above) that make it more likely that robust processes will maintain the relationship between representations that the designers of the "normal/optimal" processes intended. That is, to emphasise the need to detect and repair malfunctions, so as to promote correct processing.

REFERENCES

1. ARNOLD, D.J. & JOHNSON, R. (1984) "Approaches to Robust Processing in Machine Translation", Cognitive Studies Memo, University of Essex.

2. BOITET, CH. (1984) "Research and Development on MT and Related Techniques at Grenoble University", paper presented at Lugano MT tutorial, April 1984.

3. BOITET, CH. & NEDOBEJKINE, N. (1980) "Russian-French at GETA: an outline of method and a detailed example", RR 219, GETA, Grenoble.

4. COLBY, K. (1975) Artificial Paranoia, Pergamon Press, Oxford.

5. ETL-1-NL/B "Transfer (Taxonomy, Safety Nets, Strategy)", Report by the Belgo-Dutch Eurotra Group, August 1983.

6. ETL-3 Final "Trio" Report by the Eurotra Central Linguistics Team (Arnold, Jaspaert, Des Tombe), February 1984.

7. HAYES, P.J. and MOURADIAN, G.V. (1981) "Flexible parsing", AJCL 7, 4:232-242.

8. HENDRIX, G.G. (1977) "Human Engineering for Applied Natural Language Processing", Proc. 5th IJCAI, 183-191, MIT Press.

9. KWASNY, S.C. and SONDHEIMER, N.K. (1981) "Relaxation Techniques for Parsing Grammatically Ill-formed Input in Natural Language Understanding Systems", AJCL 7, 2:99-108.

10. WEISCHEDEL, R.M. and BLACK, J. (1980) "Responding Intelligently to Unparsable Inputs", AJCL 6, 2:97-109.

11. WILKS, Y. (1975) "A Preferential Pattern Matching Semantics for Natural Language", A.I. 6:53-74.

ACKNOWLEDGEMENTS

Our debt to the Eurotra project is great: collaboration on this paper developed out of work on Eurotra and has only been possible because of opportunities made available by the project. Some
