Edinburgh Working Papers in Cognitive Science Volume 11
Incremental Interpretation
Edited by David Milward and Patrick Sturt
Centre for Cognitive Science The University of Edinburgh
Copyright © 1995 the individual authors
Centre for Cognitive Science The University of Edinburgh 2 Buccleuch Place Edinburgh EH8 9LW UK
Contents

Contributors
Preface
Acknowledgements

1 What is incremental interpretation?
  Nick Chater, Martin Pickering and David Milward
2 Incrementality and Monotonicity in Syntactic Parsing
  Patrick Sturt and Matthew W. Crocker
3 Incremental Interpretation of Categorial Grammar
  David Milward
4 Incremental Interpretation: Applications, Theory, and Relationship to Dynamic Semantics
  David Milward and Robin Cooper
5 Incrementality in Discourse Understanding
  Simon Garrod and Anthony Sanford
6 Incremental Syntax in a Holistic Language Model
  David Tugwell
Contributors

Nick Chater
University of Oxford Department of Experimental Psychology South Parks Road Oxford, OX1 3UD UK
[email protected]
Robin Cooper
Centre for Cognitive Science University of Edinburgh 2 Buccleuch Place Edinburgh, EH8 9LW Scotland, UK
[email protected]
Matthew Crocker
Centre for Cognitive Science University of Edinburgh 2 Buccleuch Place Edinburgh, EH8 9LW Scotland, UK
[email protected]
Simon Garrod
Human Communication Research Centre Department of Psychology 56 Hillhead Street University of Glasgow Glasgow G12 9YR Scotland, UK
[email protected]
David Milward
Computerlinguistik Bau 17.2 Universität des Saarlandes Postfach 1150 D-66041 Saarbrücken Germany
[email protected]
Martin Pickering
Human Communication Research Centre Department of Psychology 56 Hillhead Street University of Glasgow Glasgow G12 9YR Scotland, UK
[email protected]
Anthony Sanford
Human Communication Research Centre Department of Psychology 56 Hillhead Street University of Glasgow Glasgow G12 9YR Scotland, UK
[email protected]
Patrick Sturt
Centre for Cognitive Science University of Edinburgh 2 Buccleuch Place Edinburgh, EH8 9LW Scotland, UK
[email protected]
David Tugwell
Centre for Cognitive Science University of Edinburgh 2 Buccleuch Place Edinburgh, EH8 9LW Scotland, UK
[email protected]
Preface

The processes involved in language comprehension have been studied from two relatively distinct perspectives. On the one hand, cognitive psychologists and psycholinguists have used experimental techniques to explore the nature and time course of information flow in human language comprehension. On the other hand, computational linguists have developed formal models for linguistic knowledge representation, and algorithms to manipulate them. The computational tradition has been interested chiefly in providing algorithms which can be proven to be sound and complete. For example, a sound parsing algorithm is one which provides only correct syntactic analyses according to a specified grammar. A complete parser is one which builds syntactic structures for all sentences allowed by a grammar, even for sentences involving multiple centre embeddings or unrecoverable garden paths. Typically there is concern only with average processing times, not with relative processing times between a simple sentence and, for example, a centre embedded sentence. In contrast, the psychological tradition has spent considerable effort in investigating the relative time course of on-line processes.

The present volume of papers arises from the conviction that a convergence of approaches can benefit both traditions. Psychological processing models can benefit from the precision of computational linguistics. In particular, given the inevitable complexity of language processing, a computational implementation can give a more precise picture of the range of predictions of a psychological model, and may help us to pinpoint inconsistencies that would otherwise go unnoticed. On the other hand, the computational tradition can benefit equally from taking into consideration psychological evidence concerning on-line processing; for example, knowledge about the nature of incremental interpretation suggests novel processing algorithms which may eventually be able to provide help for speech recognition, as well as provide general techniques for dealing with fragmented input.

This volume brings together papers from both psychologists and computational linguists with interests in the incremental processing of natural language. Many of the papers derived inspiration from the workshop on Sentence Processing held in the Centre for Cognitive Science at Edinburgh, and the Department of Psychology in Glasgow. This workshop brings together computational linguists and psychologists and has resulted in much collaborative research. Two papers which particularly show this are Sturt and Crocker's contribution, and Chater, Pickering and Milward's.

The volume starts with the paper "What is Incremental Interpretation?", which shows that
there is a gap between the kinds of representations provided by formal theories and the kinds of representations that are needed to explain psycholinguistic results. It argues for a solution in terms of a two-level interpretation process.

The next two papers focus on the mapping from words to syntactic structure/semantic representation. The two papers reflect the different starting positions often taken in Psycholinguistics and Computational Linguistics. "Incrementality and Monotonicity in Syntactic Parsing" starts by modelling recovery from mild garden path phenomena in English and Japanese, and then builds on this to move towards a full parsing model. "Incremental Interpretation of Categorial Grammar" considers the abstract problem of providing representations word by word, and presents a processing model which provides all parses of a sentence (if allowed to backtrack). Instead of considering particular garden path phenomena, the paper considers how the parsing model might be refined using statistical language tuning.

The fourth paper, "Incremental Interpretation: Applications, Theory and Relationship to Dynamic Semantics", provides the outline of a computational model of the whole process from strings of words to dynamically interpreted semantic representations. There is discussion of the issue of how quantifiers are dealt with incrementally which, it is hoped, will provoke some badly needed experiments in this area. The fifth paper, "Incrementality in Discourse Understanding", extends the discussion to phenomena outside the sentence, such as pronouns and definite descriptions. The paper addresses several important issues for both discourse and sentence processing, including the possibility of shallow semantic processing. Finally the volume concludes with a paper which comes from the empirical tradition of computational linguistics. In "Incremental Syntax in a Holistic Language Model" there is an interesting attempt to derive a processing model directly from language corpora. There is some convergence here between empirical and symbolic traditions, with semantic representations encoding much of the structure which is more explicit in traditional processing models using syntactic representations.

David Milward and Patrick Sturt
Acknowledgements

The editors would like to extend special thanks to Matt Crocker, general editor of the Working Papers series, Dyane McGavin, departmental publications librarian, and to Lex Holt, who provided the LaTeX macros. We would also like to thank Rosemary Stevenson and Chris Brew for their help with the reviewing.
1 What is incremental interpretation?

Nick Chater, Martin Pickering and David Milward

1 Introduction
  1.1 Incremental Understanding
  1.2 Formal Semantics of Sentence Fragments
  1.3 The Paradox of Incremental Interpretation
2 One-Level Solutions
  2.1 Reasoning with Non-Propositions
  2.2 Interpreting Sentence Fragments as Propositions
3 Two Level Incremental Interpretation
  3.1 The input and knowledge level representations
  3.2 Retraction and Reasoning
4 Conclusions

Edinburgh Working Papers in Cognitive Science, Vol. 11: Incremental Interpretation, pp. 1–23. D. Milward and P. Sturt, eds. Copyright © 1995 Nick Chater, Martin Pickering and David Milward.
Abstract
Experimental evidence demonstrates that understanding begins before the end of a sentence. This presumably means that some proposition based on what is heard is integrated with contextually relevant propositions drawn from general knowledge. However, formal treatments of the semantics of natural language do not typically give fragments a propositional interpretation. We call this The Paradox of Incremental Interpretation. We argue against two potential solutions to this paradox, one where reasoning can take place over non-propositions, the other where sentence fragments are actually given propositional interpretations. In this second case, the processor would either have to draw extremely weak conclusions, or have to incorporate large amounts of syntactic information to allow for wholesale retraction. In contrast, we propose a two-level account, in which the processor constructs a standard logical form representation at an input level, which is not normally propositional, and a derived propositional representation at a knowledge level. We consider reasoning, recovery from misanalysis and modularity in terms of this account.¹
¹ The order of authorship is arbitrary. We thank Gerry Altmann and two anonymous reviewers for their comments. N.C. was partly supported by JCI grant SPG 9029590; M.P. was supported by SERC Research Fellowship B/90/ITF/293 under the Logic for IT Initiative, a British Academy Post-Doctoral Fellowship and by ESRC Grant no. R000234542; D.M. was supported by SERC Research Fellowship B/90/ITF/288 under the Logic for IT Initiative.

1 Introduction

This paper discusses a problem in the explanation of language comprehension. Experimental evidence has demonstrated that understanding an utterance takes place as it is encountered, before the end of the sentence. But this should only be possible if the interpretation of a sentence fragment is a proposition, and, according to standard formal semantics, this is typically not the case. We call this the paradox of incremental interpretation. This section discusses experimental evidence and formal semantics, and shows how they lead to the paradox. The next section considers and rejects two "one-level" solutions to the paradox in which the role of the logical form is altered. Instead, we propose a "two-level" solution in which there is a separation between a generally non-propositional logical form and a derived propositional representation which serves as the basis for understanding. We discuss reasoning, recovery from misanalysis and modularity in terms of this account.
1.1 Incremental Understanding

People can obviously think about utterances that they hear (or read) for as long as they like. But understanding begins when the meaning of what is heard is integrated with general knowledge (i.e. beliefs). At this point, the perceiver becomes aware of the plausibility of what has been heard: it is plausible if it is compatible with general knowledge, and implausible if it is not. After this, the perceiver will be able to conduct more extensive reasoning with what
has been heard, in combination with relevant general knowledge. This paper is concerned with the initial process of understanding.

The following processes do not constitute understanding an utterance: performing lexical access on individual words; resolving referring expressions; building syntactic representations. They are prerequisites of understanding, but not understanding itself. However, all of these processes may be affected by "top-down" processes dependent on understanding. For instance, ambiguous pronouns are often resolved on the basis of the meaning of an utterance in relation to background knowledge. In other words, general knowledge can be used in resolving anaphors, but understanding an utterance involves more than resolving anaphors.

Incremental processing occurs when a stimulus is analysed piece-by-piece as it is encountered. Sentence processing is incremental if the processor analyses each part of the sentence as that part is reached. Fine-Grained Incremental Interpretation occurs if the processor analyses each small portion of a sentence, such as each word or morpheme, immediately it is encountered. In contrast, Coarse-Grained Incremental Interpretation occurs if the processor waits until larger chunks of a sentence are encountered.

Some aspects of sentence processing certainly involve very fine-grained incremental processing. Word recognition normally occurs extremely quickly (e.g. Ehrlich & Rayner, 1981; Tyler & Marslen-Wilson, 1980). Resolution of pronouns can occur very rapidly (e.g. Nicol, 1988; Shillcock, 1982). The existence of garden-path phenomena indicates that syntactic analysis takes place over sentence fragments (e.g. Bever, 1970; Frazier, 1979; Frazier & Rayner, 1982). The resolution of an NP with respect to a discourse context can occur quickly and appears able to have rapid effects on choice of syntactic analysis under certain circumstances (Altmann, Garnham, & Dennis, 1992; Altmann, Garnham, & Henstra, 1994; Altmann & Steedman, 1988; Britt, Perfetti, Garrod, & Rayner, 1992; Britt, 1994; Spivey-Knowlton, Trueswell, & Tanenhaus, 1993).

Intuitively, understanding is also incremental, beginning well before the end of the sentence. For instance, a hearer can often complete a speaker's sentence. Off-line, it is possible to judge the implausibility of sentence fragments like the book devours, singing silently or the rich beggar. More convincingly, evidence from language processing experiments suggests that sentence fragments can be incrementally understood, in a fairly fine-grained manner. Good evidence comes from experiments that investigate the effects of plausibility during sentence processing. Traxler and Pickering (in press) monitored subjects' eye-movements whilst reading (1):
(1) a. That's the pistol with which the man shot the gangster yesterday afternoon.
    b. That's the garage with which the man shot the gangster yesterday afternoon.
    c. That's the pistol in which the man shot the gangster yesterday afternoon.
    d. That's the garage in which the man shot the gangster yesterday afternoon.

Sentences (1a & d) are plausible, and (1b & c) are not. Subjects took longer to read shot in the implausible sentences than in the plausible sentences when they first encountered the word. Hence they regularly understood the fragment that's ... shot immediately. Note that the design rules out the possibility that the effect is due to any form of lexical relationship between shot and pistol or garage. The effect must be due to the plausibility of two of the fragments (considered as a whole) and the implausibility of the other two.

Marslen-Wilson, Tyler and Koster (1994) had subjects listen to passages like (2), as part of a larger experiment:
(2) Mary lost hope of winning the race to the ocean when she heard Andrew's footsteps approaching her from behind. She was slowed down by the deep sand. She had trouble keeping her balance. Overtaking ...
Subjects named the word her faster than him when it was visually presented at the end of the passage. The fragment overtaking her is pragmatically sensible, but the fragment overtaking him is not. Hence subjects immediately understood the fragment overtaking him/her in relation to the discourse context and in terms of their general knowledge about races. In this condition, Mary is in discourse focus at the beginning of the target sentence, so an explanation in terms of focus would predict the opposite effect.

Garrod, Freudenthal and Boyle (1994) conducted a related study using eye-tracking. Two conditions had a pronoun refer to a focused antecedent:
(3) Elizabeth was an inexperienced swimmer and wouldn't have gone if the male lifeguard hadn't been standing by the pool. But as soon as she got out of her depth she started to panic and wave her hands about in a frenzy.
    a. Within seconds she sank into the pool.
    b. Within seconds she jumped into the pool.
Garrod et al. detected a plausibility effect on the verb sank/jumped when it was first encountered. The processor must have immediately resolved the pronoun as referring to Elizabeth and assessed the plausibility of the pronoun-verb fragment. They did not find immediate plausibility effects when the pronoun referred to an unfocused entity. Notice that these results
cannot be explained in terms of lexical associations (for instance) since the subject of each verb is a pronoun.

Boland, Tanenhaus, Garnsey and Carlson (in press) found immediate effects of plausibility with some wh-questions using a version of self-paced reading in which subjects monitored the point at which the sentence stopped making sense, and Garnsey, Tanenhaus and Chapman (1989) showed similar effects using event-related brain potentials. Other studies showing rapid plausibility effects include Tyler and Marslen-Wilson (1977), Stowe (1989), Holmes, Stowe and Cupples (1989), Clifton (1993) and Trueswell, Tanenhaus and Garnsey (1994). In addition, Marslen-Wilson (1973, 1975) showed very rapid evidence for incremental understanding using a shadowing task. Finally, priming studies (e.g. Swinney, 1979) indicate that contextually inappropriate senses of words are rapidly discarded, even when the context is not a complete sentence.

These findings suggest that the processor performs fine-grained incremental understanding; there is a constant effort after meaning. However, some of these experiments show plausibility effects over fragments whose standardly assumed interpretation is non-propositional. This leads to a paradox, as we discuss below.
1.2 Formal Semantics of Sentence Fragments

In standard theories of semantics, propositions are the only entities that can be true or false, and can serve as the premises in inferences. General knowledge therefore consists of a data base of propositions. When a new fact (i.e. belief) is incorporated into general knowledge, it is also encoded as a proposition. Reasoning can employ propositions corresponding to what has just been processed together with propositions drawn from general knowledge. The conclusions take the form of a new proposition or propositions. However, most sentence fragments do not have logical forms that correspond to propositions. It is therefore unclear how they can be integrated into general knowledge. In fact, the only strings that have propositional logical forms are main and embedded declarative sentences. Fragments such as NPs or sentences missing one or more NPs do not have propositional logical forms.

We illustrate this using first-order logic and the lambda calculus (Montague, 1974; see Dowty, Wall, & Peters, 1981). Readings of natural language expressions are represented as disambiguated logical forms. For instance, the proposition expressed by John sees Mary is represented as sees(john,mary). The verb sees is interpreted as a predicate that takes two arguments, the first being the agent of the action, and the second being the patient. The proposition
6 Nick Chater, Martin Pickering and David Milward represented by this logical form is true if and only if John and Mary stand in the relation of seeing, with John being the seer and Mary the seen. In contrast, the meaning of the verb phrase sees Mary is represented as xsees(x,mary). This does not represent a proposition, but rather a function that can be combined with the meaning of a subject NP to give a propositional logical form. Similarly, the meaning of John sees is represented as xloves(john,x); this is a function that can be combined with the meaning of an object NP to give a propositional logical form. These representations xsees(x,mary) and xloves(john,x) can be given model-theoretic interpretations, just as sees(john,mary) can. But their interpretations are non-propositional; they cannot be asserted to make a true or false claim about the world, and neither can they be used as premises in inference according to standard semantics.
1.3 The Paradox of Incremental Interpretation

The experimental evidence discussed above demonstrates fine-grained incremental understanding. It appears to show that the processor understands fragments whose logical forms, according to standard formal semantics, are not propositional. This should not be possible. For example, Marslen-Wilson et al. (1994) showed that the fragment overtaking him was implausible in their contexts. This fragment is standardly assigned a non-propositional logical form, essentially because overtaking is not a tensed verb. Traxler and Pickering (in press) and Garrod et al. (1994) found plausibility effects for fragments consisting of a sentence missing at least one NP (in technical terms, containing a verb whose argument frame was unsaturated).

In some experiments, some of the critical fragments might have formed acceptable sentences. If so, they could have been assigned propositional logical forms during incremental processing. But this cannot serve as a resolution of the paradox. In some experiments (e.g. Marslen-Wilson et al., 1994), the fragments (e.g. overtaking him) could not be complete sentences. In others (e.g. Traxler & Pickering, in press), the sentences would have been highly elliptical at best (e.g. That's the garage in which the man shot.). In these cases, the elliptical argument (e.g. the thing shot) would have to be supplied semantically for plausibility effects to occur. Finally, note that such an account leads to the unlikely conclusion that any sentence containing a transitive verb used non-elliptically would produce a garden path. Hence, we can conclude that the processor regularly treats sentence fragments as incomplete, but at the same time allows their interpretations to be used in understanding.

Experimental evidence therefore suggests that the meanings of many sentence fragments with non-propositional semantic representations are integrated with general knowledge during sentence comprehension. We call this The Paradox of Incremental Interpretation. We now
consider possible ways to resolve this paradox. We first discuss and reject two "one-level" accounts of incremental interpretation, in which the logical form is directly employed in understanding. Instead, we propose a "two-level" account, which separates the construction of the logical form from understanding.
2 One-Level Solutions

2.1 Reasoning with Non-Propositions

We have assumed that reasoning (i.e. inference) is generally taken to be the process of deriving conclusions from sets of premises. Both premises and conclusions must be propositional. These assumptions are central to standard semantics. For instance, the standard definition of logically valid inference involves propositions: an inference is logically valid iff the truth of the premises guarantees the truth of the conclusions. Non-logical forms of reasoning, such as induction, abduction, case-based reasoning and analogical reasoning are also defined over propositions. However, we could try to circumvent the paradox of incremental interpretation by allowing reasoning with non-propositions.

Reasoning over non-propositions can be defined in terms of an "informativeness relation" (Keenan & Faltz, 1985). For example, sings loudly is taken to be more informative than sings, and John loves as more informative than John likes. Thus, from the interpretation of sings loudly, the interpretation of sings can be inferred, and from the interpretation of John loves, the interpretation of John likes can be inferred. For properties such as is an animal, the notion of inference obtained closely corresponds to the relations used in many semantic networks, in which properties are arranged in hierarchies, from specific to general (e.g. Collins & Quillian, 1969).

This account allows inferences between non-propositions and other non-propositions. However, it does not license inference from non-propositions to propositions, or from a non-proposition and a proposition to a proposition. This is because non-propositions make no claim about the world; and hence no proposition, which does make a claim about the world, can follow from a non-proposition. Hence, if fragments are interpreted as non-propositions, it is possible to derive other non-propositions in inference, but not to derive propositions. We would then require a non-propositional theory of knowledge representation as a whole. If this is coherent at all, it would require a drastic change in our basic assumptions about the nature of belief. It would therefore only be worth considering as a last resort.
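The informativeness relation itself is easy to state over a closed domain: one property is at least as informative as another just in case it entails it pointwise. A minimal sketch, with an invented three-entity domain and toy extensions (not Keenan and Faltz's own formulation):

```haskell
-- Pointwise entailment between properties over a small closed domain.
data Entity = John | Mary | Rex deriving (Eq, Enum, Bounded, Show)

domain :: [Entity]
domain = [minBound .. maxBound]

-- p is at least as informative as q iff every entity with p also has q
atLeastAsInformative :: (Entity -> Bool) -> (Entity -> Bool) -> Bool
atLeastAsInformative p q = all (\x -> not (p x) || q x) domain

singsLoudly, sings :: Entity -> Bool   -- toy extensions
singsLoudly = (== Mary)
sings x = x == Mary || x == John

check :: Bool                          -- "sings loudly" entails "sings"
check = atLeastAsInformative singsLoudly sings   -- True
```

As the text notes, such inference stays within non-propositions: atLeastAsInformative relates two properties but never yields the truth of either.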
2.2 Interpreting Sentence Fragments as Propositions

An alternative solution is to abandon standard formal semantics and to assign propositional interpretations to fragments such as that's the garage with which the man shot, overtaking her and within seconds she jumped. We can associate propositional "constraints" with sentence fragments (Haddock, 1987, 1989; Mellish, 1985). As more of the sentence is encountered, more constraints are added. The proposition expressed by the complete sentence is simply the conjunction of all the constraints. For example, in John loves Mary, the processor first encounters John and imposes the constraint that there exists someone called John. It then encounters loves, and adds the constraint that there exists some entity that John loves. On encountering Mary, it adds the further constraint that there exists a person called Mary that John loves. These constraints incrementally narrow down the meaning of the sentence. Since fragments are assigned propositional interpretations, their meanings can be integrated with general knowledge and can serve as premises in reasoning. Thus we can explain why some fragments appear plausible and some implausible.

However, the meaning of the complete sentence may not be compatible with the meaning assigned to particular fragments. After John loves, the processor assumes that there exists someone called John and there exists some entity that John loves. But none of the following is compatible with that assumption:
(4) a. John loves nothing.
    b. John loves Mary if he loves anybody at all.
    c. John loves Mary as far as I know.
    d. John loves Mary or hates her.
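For concreteness, the constraint accumulation just described can be sketched as follows. The World representation and the particular constraints are invented for illustration; the cited constraint-based systems use richer structures.

```haskell
-- Constraints as predicates on (toy) worlds; a fragment's meaning is
-- the conjunction of the constraints accumulated so far.
type World = [(String, String)]          -- worlds as simple fact lists
type Constraint = World -> Bool

conjoin :: [Constraint] -> Constraint
conjoin cs w = all ($ w) cs

existsJohn, johnLovesSomething, belovedIsMary :: Constraint
existsJohn w         = ("exists", "john") `elem` w
johnLovesSomething w = ("loves-something", "john") `elem` w
belovedIsMary w      = ("loved-by-john", "mary") `elem` w

-- "John" / "John loves" / "John loves Mary": monotone narrowing
afterJohn, afterJohnLoves, afterJohnLovesMary :: Constraint
afterJohn          = conjoin [existsJohn]
afterJohnLoves     = conjoin [existsJohn, johnLovesSomething]
afterJohnLovesMary = conjoin [existsJohn, johnLovesSomething, belovedIsMary]
```

The trouble, as examples (4) show, is that johnLovesSomething is exactly the kind of constraint that a continuation like nothing forces the processor to withdraw.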
Within this approach, we have two choices. We can assume some weaker constraint for a sentence fragment like John loves that will be compatible with any potential continuation. Alternatively, we can allow constraints to be retracted.

The first solution fails because there is no constraint strong enough to account for incremental understanding which is also compatible with any continuation. Examples (4) suggest that John loves does not place restrictions on the state of the world. More precisely, notice that John loves something and John loves nothing are possible sentences beginning with John loves. Assuming that the semantics of John loves nothing is the negation of the semantics of John loves something, every possible state of the world is consistent with one of these continuations. Hence nothing can be excluded for certain at John loves.
Formally, any constraint made after John loves must be compatible with both the proposition p, that John loves something, and its negation, ¬p. In classical logic, the only thing compatible with both p and ¬p is the proposition ⊤, which provides no constraint. Proof: if proposition p implies constraint i and the negation of the proposition, ¬p, also implies i, then p ∨ ¬p implies i. In classical logic, p ∨ ¬p is always true, so the only things that it implies are also true. But if i is always true, it provides no constraint upon how the world is. Therefore, assuming classical logic, no certain yet contentful constraints can be made after processing John loves.
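The proof can be stated compactly. Writing i for any constraint licensed with certainty by the fragment:

\[
(p \rightarrow i) \;\wedge\; (\neg p \rightarrow i) \;\vdash\; (p \vee \neg p) \rightarrow i \;\vdash\; i
\]

Since p ∨ ¬p is a theorem of classical logic, i must itself be a theorem, i.e. equivalent to ⊤ and hence contentless.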
The processor might perhaps store more than one set of constraints in parallel. For example, the fragment John loves might generate the constraint that there exists some entity that John loves on one analysis, and the constraint that there exists no entity that John loves on the other analysis. The first analysis might be preferred, hence leading to the psycholinguistic manifestations of incremental understanding. This account is similar to ranked parallel accounts of parsing (e.g. Gorrell, 1989), and clearly requires a considerable amount of information to be stored. However, there is a more serious problem. Storing both analyses accounts for the sentences John loves something, John loves Mary and John loves nothing. But it cannot account for John loves no woman. This sentence is compatible either with John loving something other than a woman, or with John loving nothing at all. In fact, there is no finite set of constraint sets for John loves that the processor could maintain which would be compatible with all possible continuations. Hence a parallel constraint-based account of incremental interpretation that avoids the need for retraction is impossible.

These arguments only apply to the propositional content of a sentence or fragment. Some other information may not be subject to retraction. For instance, it may be that the fragment John loves presupposes that John exists. This remains true whether the sentence ends with something or nothing. However, some presuppositions are subject to retraction. The fragment A unicorn is coming presupposes that a unicorn exists, but the complete sentence could be A unicorn is coming, my father said, from which it does not follow that a unicorn exists. More importantly, the information that can be gained from computing presuppositions incrementally cannot explain the plausibility effects discussed above.

Information about what has been heard may not be subject to retraction. For instance, the fragment John loves is about John and about an act of loving (which may or may not actually occur). This will remain the case however the sentence concludes. But these facts do not imply that there exists some entity that John loves. Hence we cannot explain the experimental findings in this way.

We can now conclude that any propositional interpretations for fragments sufficient to explain incremental understanding may have to be retracted. Incremental understanding necessitates that the processor jumps to uncertain conclusions whilst processing a sentence. More specifically, any one-level account of incremental interpretation must have mechanisms for retracting constraints, even in cases where there is no syntactic ambiguity.

If the processor retracts by simply backtracking, it encounters the same problems faced by the parallelism account discussed above. If it assumes after John loves that there exists an entity that John loves, it must backtrack on encountering John loves nothing. But how would it know how to interpret John loves this time? If the sentence continued John loves no woman, then this reanalysis would have been inappropriate. It is clear that the processor could only decide what constraints to apply to John loves after having provided an interpretation for the whole sentence. It would then have to use this to determine the interpretation of a fragment, which is clearly the wrong way round.

Alternatively, the processor could seek to modify the interpretation assigned to a fragment. A major problem would be knowing which part of the interpretation to retract. In the sentence John arrived with nothing, the processor would wish to retract the assumption that John arrived with something. It would not wish to retract the assumption that John arrived (somewhere). The processor would have to label the constraints that it had applied with respect to their position within the original syntax. It would therefore have to store a large amount of syntactic information. This is incompatible with standard constraint-based approaches (Haddock, 1987, 1989; Mellish, 1985), in which the only information retained at each stage is the semantic information associated with the constraints (a set of possible worlds and possible referents).

We conclude that no one-level account of incremental interpretation that incorporates the information necessary for incremental understanding into the incremental construction of logical form is appropriate. Instead, these components of incremental interpretation should be separated.
3 Two Level Incremental Interpretation

We now propose that incremental interpretation involves two levels of representation: an input level, which serves as the vehicle for compositional semantics, and a knowledge level, which forms the basis for understanding. We define these two levels of representation, and show how they are related. We show how they can account for language comprehension, and discuss reasoning, recovery from misanalysis and modularity in terms of this account.
3.1 The input and knowledge level representations

The Input Level of Representation (ILR) is an incrementally constructed logical form which is non-propositional for most fragments. It is a standard lambda expression, as assumed in formal semantics (e.g. Dowty et al., 1981). We assume that it is constructed in a word-by-word manner, so that a logical form is associated with every sentence fragment that is encountered. To do this, we require an appropriate parsing algorithm given a particular grammar. One approach is due to Milward (1994, 1995), who provided word by word construction of lambda expressions for grammars like basic categorial grammar. Other fine-grained approaches to logical form construction include Ades and Steedman (1982), Barry and Pickering (1990), Henderson (1994), Pulman (1986), Stabler (1991, 1994) and Stevenson (1993). Details of grammar and parser are not relevant to this paper.
In this paper, we adopt a simple account of logical form in which the ILRs for Mary gave John chocolates are as follows:

    Mary                         λP.P(mary)
    Mary gave                    λx.λy.gave(mary,y,x)
    Mary gave John               λy.gave(mary,y,john)
    Mary gave John chocolates    gave(mary,chocolates,john)
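In a functional notation the table amounts to successive partial application. A minimal sketch, with an invented Entity type and an invented extension for gave (argument order as in the paper: agent, thing given, recipient):

```haskell
-- Word-by-word ILRs for "Mary gave John chocolates".
data Entity = Mary | John | Chocolates deriving (Eq, Show)

gave :: Entity -> Entity -> Entity -> Bool   -- gave(agent, given, recipient)
gave Mary Chocolates John = True
gave _    _          _    = False

ilrMaryGave :: Entity -> Entity -> Bool      -- \x.\y.gave(mary,y,x)
ilrMaryGave x y = gave Mary y x

ilrMaryGaveJohn :: Entity -> Bool            -- \y.gave(mary,y,john)
ilrMaryGaveJohn = ilrMaryGave John

ilrComplete :: Bool                          -- gave(mary,chocolates,john)
ilrComplete = ilrMaryGaveJohn Chocolates     -- True in this toy model
```

The table's four representations correspond to ilrMaryGave, ilrMaryGaveJohn and ilrComplete, with only the last having a truth value.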
An account which produces these representations is certainly inadequate for many expressions involving quantifiers (e.g. Barwise & Cooper, 1981), but it is adequate for current purposes. A more sophisticated account is provided by Milward and Cooper (1994), who use underspecified representations to account for the incremental processing of sentences containing generalised quantifiers and scope ambiguities.

We explain the Knowledge Level of Representation (KLR) in two stages. First, it involves existential quantification over lambda-abstracted arguments. Notationally, λ is replaced with ∃. This produces the following representations for Mary gave John chocolates:

    Mary                         ∃P.P(mary)
    Mary gave                    ∃x.∃y.gave(mary,y,x)
    Mary gave John               ∃y.gave(mary,y,john)
    Mary gave John chocolates    gave(mary,chocolates,john)
These representations imply that the processor assumes, in turn, that there exists an entity called Mary, that Mary gave some entity to some entity, that Mary gave some entity to John, and that Mary gave chocolates to John.
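Existential closure is then a mechanical step: each remaining lambda abstraction is discharged by quantifying over the domain. A sketch extending the previous block, with a finite domain as an illustrative stand-in for the model:

```haskell
-- Existential closure: from non-propositional ILRs to propositional KLRs.
domain :: [Entity]
domain = [Mary, John, Chocolates]

exists :: (Entity -> Bool) -> Bool
exists p = any p domain

klrMaryGave :: Bool            -- ∃x.∃y.gave(mary,y,x)
klrMaryGave = exists (\x -> exists (\y -> gave Mary y x))

klrMaryGaveJohn :: Bool        -- ∃y.gave(mary,y,john)
klrMaryGaveJohn = exists (\y -> gave Mary y John)
```

Each KLR has type Bool, so it can be assessed against, and combined with, other propositions.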
12 Nick Chater, Martin Pickering and David Milward These propositions could be assessed with respect to general knowledge. The hearer could retain the information or reject it if it appeared to be impossible or implausible. But there would only be these two choices. In fact, a hearer may treat an utterance in many ways. For example, knowledge about the intentions of the speaker, an implausible content and a particular intonation might suggest that a sentence (or fragment) is ironic. Hence, the hearer may adopt the negation of what would otherwise be assumed. More generally, what a hearer concludes from an utterance is determined not just by the logical form of the utterance, but by the speech act that the hearer takes to have been performed, and other world knowledge (Searle, 1969). It is only if they believe they are listening to a reputable speaker talking sensibly and seriously that hearers will make the inference from the fact that a proposition was expressed to the proposition itself. To allow for this, the propositions are nested within the two-place predicate expresses. The rst argument is the speaker, the second the proposition. If Bill utters Mary gave John chocolates, the KLRs are: Mary Mary gave Mary gave John Mary gave John chocolates
expresses(bill,9PP mary) expresses(bill,9x9ygave(mary,y,x)) expresses(bill,9ygave(mary,y,john)) expresses(bill,gave(mary,chocolates,john))
The processor assumes, in turn, that Bill expresses that there exists an entity called Mary, that Bill expresses that Mary gave some entity to some entity, that Bill expresses that Mary gave some entity to John, and that Bill expresses that Mary gave chocolates to John. The listener can then decide whether to accept the propositions literally, or whether to draw any other conclusions. At rst sight, it might appear that the ILR could be embedded within the expresses predicate directly. If Bill utters Mary gave, the representation expresses(bill,xy gave(mary,y,x)) is propositional. However, a listener could not normally assess the plausibility of this fragment, because the plausibility of someone expressing something is likely to be dependent on the plausibility of what is expressed. We now get back to the original problem that the listener cannot determine the plausibility of xy gave(mary,y ,x), because it is not propositional and therefore cannot be compared with general knowledge. The use of KLR representations with existentially quanti ed contents accounts for the experimental evidence for incremental understanding. The KLR representations are of course quite weak, but weaker propositions still (e.g. that Bill expresses that Mary gave either something or nothing to John) would obviously be too weak to account for the ndings. However, the
hearer might assume stronger propositions. The KLR for John killed is expresses(speaker, ∃x.killed(john,x)), i.e. that the speaker is expressing that there is some entity that John killed. But the listener might assume the stronger proposition that this entity is a living thing. The KLR would be expresses(speaker, ∃x.living-thing(x) ∧ killed(john,x)). On this account, the sentence John killed time would precipitate revision at the KLR. Such stronger propositions would be most likely when the verb imposes strong constraints (e.g. selection restrictions) on possible arguments.

An interesting further possibility is for the construction of ILRs to be fine-grained, but for the construction of KLRs to be rather more coarse-grained. The processor might, for instance, only existentially quantify over lambda-expressions that serve as an argument in the ILR (cf. Barry & Pickering, 1990). For this paper, we ignore these refinements, and assume that both ILRs and KLRs are constructed word-by-word, with the KLR employing simple existential quantification.
3.2 Retraction and Reasoning

Garden paths as input level misanalysis. The ILR is constructed monotonically unless the parser realises it has chosen the wrong syntactic analysis. Hence, misanalysis at the input level corresponds to "garden-path" phenomena (Bever, 1970; Frazier, 1979; Frazier & Rayner, 1982). Garden paths are standardly characterised in syntactic terms, but can also be characterised in terms of the ILR. For example, at least some reduced complement sentences like John heard Fred left cause subjects to garden path (e.g. Ferreira & Henderson, 1990; Frazier & Rayner, 1982; Rayner & Frazier, 1987; Pickering & Traxler, 1995b; Trueswell, Tanenhaus, & Kello, 1993). At least if heard preferentially takes an NP-object, most accounts assume that the processor initially attaches Fred as its object. But when it encounters left, it realises that Fred is the subject of left, and that Fred left is the (reduced) complement of heard. This reanalysis causes a garden path effect. But in terms of the logical form, the processor initially analyses John heard Fred as heard(john,fred). This cannot be combined with the logical form of left, so the processor adopts the alternative logical form, under which the second argument of heard has sentential meaning rather than noun phrase meaning. It then constructs the correct logical form heard(john,left(fred)). Similar accounts explain the misanalysis of other sentences like The horse raced past the barn fell and When Mary was eating the pudding became cold.
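The representational shift in John heard Fred left can be pictured with a small datatype for logical forms; the type and constructor names below are illustrative only, not part of the paper's formalism.

```haskell
-- Two candidate logical forms for "John heard Fred ...".
data LF = Ent String             -- an entity-denoting argument
        | HeardLF String LF      -- heard(subject, complement)
        | LeftLF String          -- left(subject)
        deriving Show

npAnalysis :: LF                 -- heard(john,fred): preferred at "John heard Fred"
npAnalysis = HeardLF "john" (Ent "fred")

compAnalysis :: LF               -- heard(john,left(fred)): forced on reading "left"
compAnalysis = HeardLF "john" (LeftLF "fred")
```

Reanalysis replaces the Ent argument with a sentential LF; on the present account it is this ILR revision, rather than KLR retraction alone, that produces the garden path.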
A revision to the ILR will normally cause the KLR to undergo revision as well. Hence garden paths involve misunderstanding as well as syntactic and logical form misanalysis. If the KLR has been used as a premise in inferencing, then the conclusions will normally have to be
abandoned as well. This explains the finding that the processor finds it harder to recover from plausible than implausible misanalyses (Pickering & Traxler, 1995a, 1995b). If the KLR contains a plausible proposition, the reader becomes more committed to the analysis than if the KLR is implausible. A plausible KLR is more likely to precipitate inferences that will subsequently have to be abandoned than an implausible KLR. Alternatively, an implausible KLR may cause the reader to seek another syntactic analysis immediately.

"Pure" retraction of the KLR. In contrast, KLR retraction regularly occurs without corresponding ILR retraction. For example, in John loves nothing, the ILR for John loves is λx.loves(john,x). This is refined to give the ILR ¬∃x.loves(john,x) for John loves nothing, without retraction. But the KLR for John loves states that the speaker is expressing that there is some entity that John loves. This is incompatible with the KLR for John loves nothing, which states that the speaker is expressing that there is no entity that John loves. Hence the KLR for John loves is abandoned.
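The division of labour can be sketched directly: the ILR for John loves is extended monotonically by nothing, while the corresponding KLR is replaced outright. All definitions below are toy assumptions in the style of the earlier sketches.

```haskell
-- "Pure" KLR retraction for "John loves (nothing)".
data Person = John | Mary deriving (Enum, Bounded)

loves :: Person -> Person -> Bool
loves _ _ = False                 -- a model in which John loves nothing

exists :: (Person -> Bool) -> Bool
exists p = any p [minBound .. maxBound]

ilrJohnLoves :: Person -> Bool    -- \x.loves(john,x): kept unchanged
ilrJohnLoves = loves John

klrJohnLoves :: Bool              -- ∃x.loves(john,x): asserted, later retracted
klrJohnLoves = exists ilrJohnLoves

klrJohnLovesNothing :: Bool       -- ¬∃x.loves(john,x): the replacement KLR
klrJohnLovesNothing = not klrJohnLoves
```

Because ilrJohnLoves survives intact, reanalysis after the KLR is withdrawn does not require re-parsing the fragment.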
This model predicts that pure retraction of the KLR will cause disruption. Thus, John loves nothing should cause some disruption not found in, say, John hates everything. Recovery from syntactic (and, in our terms, ILR) misanalysis is hard if the misanalysis is long-lived (Ferreira & Henderson, 1991; Warner & Glass, 1987). Similarly, we predict that recovery from pure KLR misanalysis will be hard if it is long-lived, so that Mary gave her new and extremely clever friend nothing should be disruptive.

Each KLR may be automatically superseded by the next KLR when it is constructed. Problems would only arise if inferences drawn from an earlier KLR were incompatible with the new KLR. For instance, the KLR for John loves no woman is not incompatible with the KLR for John loves. If the previous KLR were not abandoned, then the hearer would conclude (at the end of the sentence) that the speaker is expressing the claim that John loves something that is not a woman. There is no intuitive evidence for this, so we suspect that the earlier KLR is automatically abandoned. We therefore suggest that only one KLR is ever entertained. However, the implications of earlier KLRs will be felt if they have been employed in much knowledge-level processing. This suggests a difference between KLR construction and general processes of belief revision, where all propositions may be retained unless rendered implausible by new evidence.

Inferences based on the KLR: reasoning and retraction. We assume that the KLR can be rapidly employed in reasoning in combination with relevant aspects of general knowledge. Pure KLR retraction (like mixed ILR/KLR retraction) should be particularly disruptive if the KLR has been extensively integrated with general knowledge, and if a lot of reasoning has occurred. We consider "bridging" inferences and elaborative inferences in turn.
\Bridging" inferences link successive parts of a discourse, in order to make it coherent (Haviland & Clark, 1974). In understanding the connection between the successive clauses in John went to the fridge, got the milk and drank some, the hearer must infer that there was milk in the fridge. This can only be derived if the KLR is integrated with general knowledge. In this example, the bridging inference may be made during the sentence, rather than at its end, with the hearer assuming that the milk is in the fridge immediately milk is reached. If so, knowledge-level processing occurs whilst the sentence is encountered. If the sentence turned out to be John opened the fridge, got the milk, and put it in, the bridging inference would be withdrawn, and we would predict some disruption. \Elaborative" inferencing occurs in deeper processes of discourse comprehension. Understanding the rami cations of a very simple story involves making a large number of inferences (e.g. Minsky, 1975). In processing It was Martha's birthday. John thought he would buy a balloon we can make numerous inferences quickly and eortlessly | that John and Martha are young children, that John knows that it is Martha's birthday, that John is a friend of Martha, that John bought the balloon to give to Martha, and so on. It may be necessary to retract any or all of these inferences in the light of subsequent information. For example, the story could continue Martha would be retiring : : : or Martha thought it was for her but : : : . Hearers use knowledge level reasoning to make sense of what they have heard and to integrate its meaning with current context. Much of this reasoning appears to be very fast | fast enough to be computed incrementally as a sentence is encountered | and to be subject to equally rapid retraction. In this regard, reasoning about language is no dierent from reasoning in any other domain. Almost all common-sense, everyday inferences are plausible, defeasible, non-monotonic inferences rather than deductive inferences, and are invariably subject to retraction, if new information is added (Oaksford & Chater, 1991). Not all knowledge level retraction is straightforward. For example, elaborative inferencing can go far astray if a reader misinterprets a story. The above story could continue for some time before it became clear that Martha was not a child, and that John was a very rich adult who was buying her a hot-air balloon. In this case, the reader could be seriously misled. Similarly, not all input level retraction is hard. Locally ambiguous reduced relatives like The horse raced past the barn fell often cause intuitively strong garden path eect, but reduced complements like John heard Fred left often do not (e.g. Pritchett, 1992). Is the ILR used in reasoning? We suggest that the ILR is not used in reasoning. Its role is as an intermediate representation between form and understanding, similar, in this respect, to a syntactic representation like a phrase-structure tree. The KLR is used in understanding, so there is no need to employ the ILR as well.
We assume that general knowledge is unavailable at the input level, so the only possible form of inference at this level would be logical inference. It could not be very successful, because there is strong evidence that people are extremely poor at even very simple logical inference, and find logically complex natural language expressions more or less uninterpretable (Evans, 1989; Johnson-Laird, 1983). In addition, it could only be used to derive new propositions on the rare occasions when the ILR is propositional, as discussed above. On those occasions, it would be possible to perform a very few operations like stripping away double negations. It might also be possible to perform reasoning with multiple propositions (e.g. syllogisms), but only if the processor remembered ILRs for earlier fragments. This would be very strange, because earlier ILRs could otherwise be forgotten.

The non-propositional reasoning based on lexical entailments employed by Keenan and Faltz (1985) could take place within the ILR, but it would commonly produce the wrong results. If the processor inferred λx.likes(john,x) from λx.loves(john,x), it would then incorrectly assume that John likes nothing when it encountered John loves nothing. Even if this only occurred with propositional fragments alone, the wrong results would occur if the language permitted the fragment to be subsequently negated. Pure decomposition (e.g. replacing the meaning of bachelor with its definition as an unmarried man) does not suffer these problems, but its value would be very limited if it could not be employed in reasoning. In addition, some off-line evidence suggests that lexical decomposition does not occur (Fodor, Garrett, Walker, & Parkes, 1980). Note that the experimental evidence for incremental understanding shows rapid effects of lexical content in parsing, but this can be explained purely in terms of KLR processing. We conclude that the ILR is inferentially inert, in that it plays no role in reasoning.
4 Conclusions

This dichotomy between the ILR and the KLR is compatible with Fodor's (1983) distinction between modular and central processes. Modular processes are informationally encapsulated: they use only perceptual input and information from within the module, and do not have access to general world knowledge (or, for that matter, to information internal to other modules). With respect to language, Fodor notes that:
... there is a wide choice of properties of utterances that could be computed by computational systems whose access to background information is, in one way or another, interestingly constrained ... there is, in the case of language, a glaringly obvious galaxy of candidates for modular treatment – viz., those properties that utterances have in virtue of some or other aspects of their linguistic structure (where this means, mostly, grammatical and/or logical form). [1983, p. 88; Fodor's italics]

The construction of the ILR is a modular process, but the use of the KLR in reasoning is part of central processes.

This quotation raises the question of precisely which aspects of meaning are encoded in the ILR. Some intuitions are clear: function-argument structure, for example, must surely be represented, and rhetorical force cannot be. A fully developed formal semantics could specify what could be extracted purely from the form of a sentence without any recourse to general knowledge. This would provide an upper bound on the ILR. This might include rather more than what is standardly considered by formal semantics. For example, it may be possible to include information derived from intonational structure (cf. Steedman, 1991). However, the ILR might not make use of all the information that a formal semantics could specify. For instance, it might not normally compute a fully disambiguated representation in sentences involving multiple quantifiers, but might instead represent underspecified structures that are compatible with different readings. This paper has not attempted to determine the precise nature of the ILR.

We have outlined a two-level account of language comprehension, but the discussion of modularity makes it apparent that similar distinctions may apply in perceptual processing. Quite generally, an input/knowledge level distinction appears to be crucial in keeping separate what is perceived from what is inferred (cf. Fodor, 1989). One level of representation records the perceptual input; a second level of representation extends beyond what has actually been perceived, and can act as a basis for further inference. So, in the perception of partly occluded stimuli, a representation analogous to our ILR records what has been perceived. A representation analogous to our KLR makes suggestions about the nature of the whole stimulus, thus allowing inferences about this putative stimulus to be made. These inferences may have to be withdrawn if the stimulus is not as expected, but will often turn out to be correct, and hence will facilitate efficient processing. Note that our ILR is already quite a "deep" level of processing, in the sense that it is a considerable abstraction from the original form of the stimulus, though it is "shallow" in that it is isolated from general knowledge.

Incremental interpretation refers to two very different processes, the incremental construction of logical form and incremental understanding. We have argued that these processes use two different representations, the ILR and the KLR. The ILR is a standard formal semantic representation, and is non-propositional for most fragments. It is constructed monotonically except in the case of syntactic misanalysis. It is an inferentially inert and short-lived
intermediary representation which serves to facilitate the construction of the KLR. The KLR is propositional, and forms the basis of the understanding of a sentence or sentence fragment. It will quite often have to be retracted, along with any inferences derived from it, when the ILR remains unaltered.

According to this account, the language processor is constantly trying to interpret what it has encountered to as deep a level as possible, as quickly as possible. It does this by assigning propositional interpretations to fragments that do not standardly receive such interpretations, so that it can interpret the fragments in relation to general knowledge and can draw important inferences. However, we showed that any useful propositional interpretation of a fragment will be wrong on occasion. If the interpretation does turn out to be wrong, the processor retracts it. But it retains the non-propositional logical form assigned to the fragment, so reanalysis is facilitated. In other words, when the KLR goes wrong, the processor has the ILR to fall back on. The processor has the best of both worlds: rapid incremental understanding, but minimal problems from over-hasty analysis.
References

Ades, A.E. & Steedman, M.J. (1982). On the order of words. Linguistics and Philosophy, 4, 517–558.

Altmann, G.T.M., Garnham, A. & Dennis, Y.I.L. (1992). Avoiding the garden-path: Eye movements in context. Journal of Memory and Language, 31, 685–712.

Altmann, G.T.M., Garnham, A. & Henstra, J.A. (1994). Effects of syntax in human sentence parsing: Evidence against a structure-based parsing mechanism. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 209–216.

Altmann, G.T.M. & Steedman, M.J. (1988). Interaction with context during human sentence processing. Cognition, 30, 191–238.

Barry, G. & Pickering, M.J. (1990). Dependency and constituency in categorial grammar. In G. Barry & G. Morrill (Eds.) Edinburgh working papers in cognitive science, Vol. 5: Studies in categorial grammar. Edinburgh: Centre for Cognitive Science, University of Edinburgh.

Barwise, J. & Cooper, R. (1981). Generalized quantifiers and natural language. Linguistics and Philosophy, 4, 159–219.

Bever, T. (1970). The cognitive basis for linguistic structures. In J.R. Hayes (Ed.), Cognition and the Development of Language. New York: Wiley.
Boland, J.E., Tanenhaus, M.K., Garnsey, S.M. & Carlson, G.N. (in press). Verb argument structure in parsing and interpretation: Evidence from wh-questions. Journal of Memory and Language.

Britt, M.A. (1994). The interaction of referential ambiguity and argument structure in the parsing of prepositional phrases. Journal of Memory and Language, 33, 251–283.

Britt, M.A., Perfetti, C.A., Garrod, S.C. & Rayner, K. (1992). Parsing in context: Context effects and their limits. Journal of Memory and Language, 31, 293–314.

Clifton, C. (1993). Thematic roles in sentence parsing. Canadian Journal of Experimental Psychology, 47, 222–246.

Clifton, C. & Frazier, L. (1989). Comprehending sentences with long distance dependencies. In G.N. Carlson & M.K. Tanenhaus (Eds.) Linguistic structure in language processing (pp. 273–317). Dordrecht: Kluwer.

Collins, A.M. & Quillian, M.R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8, 240–247.

Dowty, D.R., Wall, R.E. & Peters, S. (1981). Introduction to Montague Semantics. Dordrecht: Reidel.

Evans, J. St. B. T. (1989). Bias in human reasoning: Causes and consequences. Hillsdale, NJ: Erlbaum.

Ferreira, F. & Henderson, J. (1991). Recovery from misanalyses of garden-path sentences. Journal of Memory and Language, 30, 725–745.

Fodor, J.A. (1983). The modularity of mind. Cambridge, MA: MIT Press.

Fodor, J.A. (1989). Why should the mind be modular? In A. George (Ed.) Reflections on Chomsky. Oxford: Basil Blackwell.

Fodor, J.A., Garrett, M.F., Walker, E.C.T. & Parkes, C.H. (1980). Against definitions. Cognition, 8, 263–367.

Frazier, L. (1979). On comprehending sentences: Syntactic parsing strategies. Bloomington, IN: Indiana University Linguistics Club.

Frazier, L. & Rayner, K. (1982). Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14, 178–210.
Garnsey, S., Tanenhaus, M.K. & Chapman, R.M. (1989). Evoked potentials and the study of language comprehension. Journal of Psycholinguistic Research, 18, 51-60.

Garrod, S.C., Freudenthal, D. & Boyle, E. (1994). The role of different types of anaphor in the on-line resolution of sentences in a discourse. Journal of Memory and Language.

Gorrell, P. (1989). Establishing the loci of serial and parallel effects in sentence processing. Journal of Psycholinguistic Research, 18, 61-73.

Haddock, N.J. (1987). Incremental semantic interpretation and incremental syntactic analysis. Unpublished PhD Thesis, University of Edinburgh.

Haddock, N.J. (1989). Computational models of incremental semantic interpretation. Language and Cognitive Processes, 4, SI337-368.

Haviland, S.E. & Clark, H.H. (1974). What's new? Acquiring new information as a process of comprehension. Journal of Verbal Learning and Verbal Behavior, 13, 512-521.

Henderson, J.B. (1994). Description Based Parsing in a Connectionist Network. Unpublished PhD Thesis, University of Pennsylvania.

Holmes, V.M., Stowe, L. & Cupples, L. (1989). Lexical expectations in parsing complement-verb sentences. Journal of Memory and Language, 28, 668-689.

Johnson-Laird, P.N. (1983). Mental Models. Cambridge: Cambridge University Press.

Keenan, E. & Faltz, L. (1985). Boolean Semantics for Natural Language. Boston, MA: Reidel.

Marslen-Wilson, W.D. (1973). Linguistic structure and speech shadowing at very short latencies. Nature, 244, 522-523.

Marslen-Wilson, W.D. (1975). Speech perception as an interactive parallel process. Science, 189, 226-228.

Marslen-Wilson, W.D., Tyler, L.K. & Koster, C. (1994). Integrative processes in utterance resolution. Journal of Memory and Language.

Mellish, C.S. (1985). Computer interpretation of natural language descriptions. Chichester: Ellis Horwood.

Milward, D.R. (1994). Dynamic dependency grammar. Linguistics and Philosophy, 17, 561-605.

Milward, D.R. (1995). Incremental interpretation of categorial grammar. Proceedings of the 7th European ACL, Dublin, Ireland (pp. 119-126).
Milward, D.R. & Cooper, R. (1994). Incremental interpretation: Applications, theory and relationship to dynamic semantics. Proceedings of COLING-94, Kyoto, Japan (pp. 748-754).

Minsky, M. (1975). Frame-system theory. In R. Schank & B.L. Nash-Webber (Eds.) Theoretical Issues in Natural Language Processing. Cambridge, MA.

Nicol, J. (1988). Coreference processing during sentence comprehension. Unpublished PhD dissertation, MIT.

Oaksford, M. & Chater, N. (1991). Against logicist cognitive science. Mind and Language, 6, 1-31.

Pereira, F.C.N. & Pollack, M.E. (1991). Incremental interpretation. Artificial Intelligence, 50, 37-82.

Pickering, M.J. & Traxler, M.J. (1995a). Discourse representation and parsing: Effects of referential ambiguity on syntactic analysis and reanalysis. Unpublished manuscript.

Pickering, M.J. & Traxler, M.J. (1995b). Plausibility and recovery from garden paths: An eye-tracking study. Unpublished manuscript.

Pritchett, B. (1992). Grammatical competence and parsing performance. Chicago, IL: University of Chicago Press.

Pulman, S.G. (1986). Grammars, parsers and memory limitations. Language and Cognitive Processes, 1, 197-225.

Rayner, K. & Frazier, L. (1987). Parsing temporarily ambiguous complements. Quarterly Journal of Experimental Psychology, 39A, 657-673.

Searle, J.R. (1969). Speech acts: An essay in the philosophy of language. Cambridge: Cambridge University Press.

Shillcock, R.C. (1982). The on-line resolution of pronominal anaphora. Language and Speech, 4, 385-402.

Spivey-Knowlton, M., Trueswell, J. & Tanenhaus, M. (1993). Context effects and syntactic ambiguity resolution: Discourse and semantic influences in parsing reduced relative clauses. Canadian Journal of Experimental Psychology, 47, 276-309.

Stabler, E.P. (1991). Avoid the pedestrian's paradox. In R. Berwick, S. Abney & C. Tenny (Eds.) Principle-based parsing: Computation and psycholinguistics. Dordrecht: Kluwer.
Stabler, E.P. (1994). Parsing for incremental interpretation. Ms., UCLA.

Steedman, M.J. (1991). Structure and intonation. Language, 67(2), 260-296.

Stevenson, S. (1993). A competition-based explanation of syntactic attachment preferences and garden path phenomena. In Proceedings of the 31st ACL (pp. 266-273).

Stowe, L. (1989). Thematic structures and sentence comprehension. In G.N. Carlson & M.K. Tanenhaus (Eds.) Linguistic structure in language processing. Dordrecht: Kluwer.

Swinney, D.A. (1979). Lexical access during sentence comprehension: (Re)consideration of context effects. Journal of Verbal Learning and Verbal Behavior, 18, 645-659.

Traxler, M.J. & Pickering, M.J. (in press). Plausibility and the processing of unbounded dependencies: An eye-tracking study. Journal of Memory and Language.

Trueswell, J., Tanenhaus, M.K. & Garnsey, S.M. (1994). Semantic influences on parsing: Use of thematic role information in syntactic disambiguation. Journal of Memory and Language, 33, 285-318.

Tyler, L.K. & Marslen-Wilson, W.D. (1977). The on-line effects of semantic context on syntactic processing. Journal of Verbal Learning and Verbal Behavior, 16, 683-692.

Warner, J. & Glass, A.L. (1987). Context and distance-to-disambiguation effects in ambiguity resolution: Evidence from grammaticality judgments of garden-path sentences. Journal of Memory and Language, 26, 714-738.
2

Incrementality and Monotonicity in Syntactic Parsing

Patrick Sturt and Matthew W. Crocker

1 Introduction 24
  1.1 Overview of this Paper 26
2 Constraints on the parser 27
  2.1 Informational Monotonicity and structural relations 27
  2.2 Conditions on Trees 27
  2.3 Primary and Secondary Relations 28
3 Two ways of viewing the constraint of Informational Monotonicity 31
  3.1 Connectedness and Incrementality 31
4 Coherence-Preserving parsing operations 34
  4.1 Syntactic Representation 34
  4.2 Simple Attachment 34
  4.3 TAG adjunction 36
  4.4 Tree-Lowering 38
5 Top-down prediction 40
6 Search strategies for coherence-preserving reanalysis 40
  6.1 English 40
7 Processing Japanese 42
  7.1 Top-down search 48
8 Easy Double Displacements 50
9 Explaining differences in search strategies 51
  9.1 Three types of lowering 51
  9.2 Head-final and Head-initial languages 53
10 Limitations of the system 54
  10.1 Interaction with discourse context and world knowledge 54
  10.2 Competition effects in Post-modifier attachment preferences 55
11 Non-monotonicity of semantic interpretation and secondary relations 56
12 Other Computational Approaches to Human Syntactic Reanalysis 60
13 Conclusions 61

Edinburgh Working Papers in Cognitive Science, Vol. 11: Incremental Interpretation, pp. 23-66. D. Milward and P. Sturt, eds. Copyright (c) 1995 Patrick Sturt and Matthew W. Crocker.
Abstract
In recent years, a number of researchers have argued for the utility of monotonicity in syntactic representation, as provided by Description Theory (Marcus, Hindle & Fleck, 1983), for building psychological processing models. In this paper, we explore the problem of combining such a monotonic architecture with strict word-by-word incremental processing. Previous monotonic models have focussed on static representational issues, without considering the problem of how the processor decides how to incorporate the incoming word into the analysis. We define two simple parsing operations, simple attachment and tree-lowering, which are related to the grammatical composition operations of substitution and adjunction in the Tree Adjoining Grammar formalism. Since the tree-lowering operation allows the parser to reanalyse in the case of "unconscious" garden paths, it can be used to investigate the consequences of adopting various search strategies for reanalysis, predicting preferences in cases where more than one possibility for reanalysis exists. Considering data from English and Japanese, we show that reanalysis preferences may differ between head-final and head-initial languages, and suggest some reasons why this might be so.1
1 Introduction

As noted in Garrod and Sanford (this volume), "garden path" phenomena, which have become staple fare in psycholinguistics since the work of Bever (1970), have often been taken as evidence for incrementality in human syntactic processing. That is to say that, when faced with a choice point at the onset of a local ambiguity, the parser does not delay its decision until disambiguating information is encountered, but commits itself to an initial analysis, which may well prove to be wrong. When this happens, there is an experimentally detectable "surprise effect" at the point of disambiguation (i.e. at the word written in bold type in the examples below):

(1) a. John knows the truth hurts.
    b. While Philip was washing the dishes crashed onto the floor.
    c. The boat floated down the river sank.
Given the incremental nature of syntactic processing, we are faced with a further question: How fine-grained are the units of incremental processing? In response to this question, many researchers assume a head-driven architecture, in which commitment to a syntactic analysis may only be made when a licensing head has been found in the input (Abney (1987, 1989), Pritchett (1992)). Such a strategy implies that in a head-final language, such as Japanese for example, the processor waits until the final word of a phrase before building that phrase and thus committing itself to an analysis. However, there is experimental evidence from Dutch (Frazier (1987)) and intuitive evidence from Japanese (Inoue and Fodor, 1995) that in such languages, structuring can and does occur before the head has been encountered. Consider the following example (Inoue (1991), p.102):

(2) Bob ga Mary ni [t_nom=i ringo wo tabeta] inu_i wo ageta.
    Bob NOM Mary DAT apple ACC eat-PAST dog ACC give-PAST
    "Bob gave Mary the dog which ate the apple."

1 Parts of the material reported in this paper have been presented at the 8th Annual CUNY Conference on Human Sentence Processing, Tucson, Arizona, 1995, and at the 7th Conference of the European Chapter of the Association for Computational Linguistics, University College Dublin, 1995. A revised version of this paper has been provisionally accepted by Language and Cognitive Processes. We would like to thank the many people who have offered insightful comments on earlier drafts of this paper, and in particular, David Milward and Martin Pickering. The research reported here was supported by ESRC Research Studentship No. R00429334338 to P.S.
Comprehenders report a "surprise" effect on reaching the first verb, tabeta ("ate"). This is explained on the assumption that the nominative, dative and accusative arguments ("Bob", "Mary" and "the apple") are initially postulated as coarguments of the same clause, in advance of reaching the verb. On reaching the transitive verb "ate", this analysis is falsified, since this verb cannot take a dative argument. However, if, as the head-driven models would predict, the arguments are not structured in advance of the verb, but are held in some form of local memory store, then we have no simple explanation for the surprise effect.

For the purposes of this paper, we assume that the relevant unit of incremental processing is the word, so that each word is incorporated into the structure being built as it is encountered, with no use of delaying strategies, such as a stack or look-ahead buffer would provide. We also assume an architecture that is serial, in the sense that only one analysis is constructed at a time.2

2 Note that the claim that the processor can only consider one analysis at a time does not necessarily mean that different aspects of that analysis (e.g. syntactic and semantic) cannot be computed in parallel. See Crocker (1992) for a detailed proposal allowing distributed processing of different representation types within a syntactic module.

It has often been noted that garden path utterances such as those in (1) above can be classified according to their relative difficulty, and this has been used to motivate certain claims about the architecture of the human language processing system. For example, Marcus et al (1983), Abney (1987, 1989), Pritchett (1992), Weinberg (1993) and Gorrell (1995, a,b) all propose two-level architectures in which what we will call a "core" processor is capable of reanalysing only in the easy cases, while the reanalysis of the harder garden paths requires the assistance of a "higher-level resolver" (to use Abney's terminology). The natural assumption behind the adoption of such a two-level architecture is the idea that the core parser is limited in its ability to alter the structure it has already built, either through backtracking, or through destructive parsing operations. Thus, Abney (1989, 1987), for example, allows only one destructive operation, steal, which removes a constituent from the right edge of the subtree at the top of the parser's stack, and attaches it to the left edge of the subtree currently under construction. This operation allows the parser to reanalyse in (1.a) and (1.b) but not in "unrecoverable" examples such as (1.c). A more extreme position is represented by the claim that, at least in certain limited domains of representation, the core parser is incapable of any destructive operations or backtracking at all. This is the assumption
underlying what have been called informationally monotonic models, which are represented by Marcus et al (1983), Weinberg (1993) and Gorrell (1995, a, b).3 In these systems, the parser builds up a tree-description in terms of a set of structural relations (dominance and precedence) holding between its nodes (this is why the original model (Marcus et al (1983)) was called D(escription)-theory). The criterion of informational monotonicity forbids the parser from overriding or deleting any of the information in this set. Thus, reanalysis within the core processor is modelled by the addition of new relations into the set, such that the updated set continues to describe a coherent phrase-marker. This results in a more restricted core parser than Abney's. Thus, in Gorrell's system (1995 (a)), for example, the core parser will be able to reanalyse in (1.a), but not in (1.b) or (1.c), both of which will require the assistance of the higher level resolver. This more constrained coverage is motivated by the (generally agreed) fact that (1.b) causes reanalysis which is available for conscious report, while (1.a) does not (Pritchett (1992)).4 Gorrell (1995 a) and Weinberg (1993) attempt to sharpen the psychological plausibility of the original D-theory parser (Marcus et al (1983)) by allowing for word by word incremental processing where Marcus et al relied crucially on the use of look-ahead buffer cells to delay attachment decisions. The resulting combination of informational monotonicity with incrementality results in a particularly strict processing regime, especially in the case of Gorrell's model, which severely restricts the use of underspecification.

3 Gorrell uses the term structural determinism to describe the constraint which applies to the set of dominance and precedence relations built by the parser. We do not wish to discuss the technical issue of whether such a system can really be seen as deterministic, and therefore use the more neutral term informational monotonicity.

4 Despite this intuitive difference, there is as yet no on-line experimental paradigm which will reliably distinguish between these two sentence types (Martin Pickering, personal communication). However, for the purposes of this paper we will assume that the distinction is, in fact, a psychologically real one.

1.1 Overview of this Paper

D-theory based monotonic models recognise no major conceptual distinction between simple attachment and "unconscious reanalysis", since both can be derived by adding new structural relations into the description of the phrase-marker. However, when implementing such a parser, the central question to be addressed is how the parser decides which new relations to add. Proponents of such monotonic models have concentrated on the definition of the structural conditions under which reanalysis occurs, but have tended to neglect this potentially interesting question; Gorrell, for example, is particularly silent on this. In this paper we attempt to answer this question with respect to the constraints proposed by Gorrell (1995 (a)). In particular, we show that a composition operation for simple attachment can be defined which identifies the root of one tree description with a node on the fringe of another, while a reanalysis operation can be defined which inserts one tree description into another at an intermediate point. We call this reanalysis operation tree lowering, and we will see that it is related (but not identical) to the operation of adjunction in Tree Adjoining Grammars (Joshi et al (1975)). The definition of the tree-lowering operation for reanalysis leads us to the consideration of search strategies for its application: just as the parser may be faced with attachment ambiguities in its initial analysis, so it may be faced with lowering ambiguities in its reanalysis, and we must therefore define processing strategies to guide the parser in its reanalysis decisions as well as in its attachment decisions. The main purpose of this paper is to explore the various search strategies for the application of tree-lowering, and show that the predictive power of the model often depends crucially on the choice of such a strategy. We take a contrastive cross-linguistic approach, considering data from two typologically diverse languages: English and Japanese. The English data we discuss concern examples such as (1.a). The Japanese examples concern a local ambiguity problem which can be found in all pure head-final languages, namely the ambiguity concerning the position of a main/subordinate clause boundary, which often cannot be resolved until after the subordinate clause has been completed. Consideration of these examples reveals that, if we want to capture the data in a uniform way, we have to adopt a different strategy for the two languages. Overall, the paper can be seen as a description of the consequences of adopting a processing architecture constrained by incrementality, seriality and informational monotonicity. In the final section we discuss evidence to the effect that this combination of constraints is too strong.
2 Constraints on the parser

2.1 Informational Monotonicity and structural relations

Unlike Weinberg (1993), who allows the parser to update the set of relations by fleshing out underspecified node labels in the description carried over from the previous state, Gorrell does not permit underspecification of node labels. The parser is monotonic in the sense that all it can do to the set of structural relations is add more relations, though, in this case, the domain of representation to which this constraint may apply is limited to the purely configurational notion of structural relations (i.e. dominance and precedence relations), while licensing relations (such as theta and case assignment, for example) are not so constrained. The claim of these models, then, is that the core parser will be able to perform reanalysis just in those cases where such an action does not violate informational monotonicity.
2.2 Conditions on Trees

We will say that the set of relations describing the phrase marker being constructed is coherent iff it conforms to the following conditions for trees (adapted from Partee et al (1993)).

1. Single Root Condition: There is a single node, the root node, which dominates every node in the tree: ∃x∀y. dom(x, y)

2. Exclusivity Condition: No two nodes can stand in both a dominance and a precedence relation: ∀x, y. (prec(x, y) ∨ prec(y, x)) ↔ (¬dom(x, y) ∧ ¬dom(y, x))

3. Inheritance: (a.k.a. the "non-tangling" condition). All nodes inherit the precedence properties of their ancestors: ∀w, x, y, z. (prec(x, y) ∧ dom(x, w) ∧ dom(y, z)) → prec(w, z)

Dominance and precedence are both defined as transitive relations. In addition, it is usually assumed that dominance is reflexive (every node dominates itself) and precedence is irreflexive. That is to say that dominance defines a weak partial order and precedence defines a strict partial order.
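These conditions are mechanical enough to state directly as a consistency check over a relation set. The following sketch is ours, not the implementation reported in this paper; it treats a description as two sets of node-name pairs and closes them before checking:

    # Sketch (ours, not the paper's implementation): checking the
    # coherence conditions over a tree description given as sets of
    # dom/prec pairs between node names.

    def close(rel):
        """Transitive closure of a set of (x, y) pairs."""
        rel = set(rel)
        while True:
            extra = {(x, w) for (x, y) in rel for (z, w) in rel if y == z} - rel
            if not extra:
                return rel
            rel |= extra

    def coherent(dom, prec):
        nodes = {n for pair in (dom | prec) for n in pair}
        # Dominance is reflexive and transitive.
        dom = close(dom | {(n, n) for n in nodes})
        # Close precedence under transitivity and Inheritance:
        # prec(x,y) & dom(x,w) & dom(y,z)  =>  prec(w,z).
        prec = close(prec)
        while True:
            extra = {(w, z) for (x, y) in prec
                            for (x2, w) in dom if x2 == x
                            for (y2, z) in dom if y2 == y} - prec
            if not extra:
                break
            prec = close(prec | extra)
        # Single Root Condition: some node dominates every node.
        if not any(all((r, n) in dom for n in nodes) for r in nodes):
            return False
        # Exclusivity Condition (with reflexive dominance this also
        # rules out prec(x, x), keeping precedence a strict order).
        return not any((x, y) in dom or (y, x) in dom for (x, y) in prec)

Closing precedence under Inheritance before testing Exclusivity mirrors the way the contradiction in example (3.b) below is derived.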
2.3 Primary and Secondary Relations

Gorrell divides syntactic representation into primary relations (dominance and precedence) and secondary relations (theta-role assignment, c-command, case-assignment, etc), of which only the primary relations are constrained by monotonicity. Recall (1), repeated below in (3):

(3) a. John knows the truth hurts.
    b. While Philip was washing the dishes crashed onto the floor.

Gorrell explains the difficulty of processing (3.b) in terms of the fact that monotonicity is not preserved in primary relations. In (3.a), by contrast, although secondary relations have to be altered during the parse, primary relations can be built up monotonically, and the sentence is therefore predicted to be processable by the core parser. Below we give a brief description of this, though for more detail and a range of further examples, the reader is referred to Gorrell's original work (1995, a). Consider (3.b). At the point where the parser has just received the dishes, this DP will have been attached as the direct object of washing. Thus, the set of relations will encode the fact that a VP dominates this DP (in the following diagram, we box these two nodes for clarity):
[Tree diagram: CP (= S') for "While Philip was washing the dishes", with the DP "the dishes" attached as the object of "washing"; the VP and DP nodes are boxed]

{..., dom(VP,DP), ...}

However, the following word, crashed, forces a reinterpretation in which the DP the dishes appears in the matrix clause.

[Tree diagram: the revised IP structure, in which "the dishes" is the subject of the matrix clause "the dishes crashed onto the floor", with the CP "while Philip was washing" adjoined to it]

In this revised structure, it can be derived that the original VP now precedes the DP (through the Inheritance condition), but this leads to a contradiction, because we now have dom(VP,DP) ∧ prec(VP,DP), against the Exclusivity condition, and thus the parse is predicted to be impossible.

Now consider (3.a). At the point where the truth has just been parsed, the tree-description will include a DP dominating the truth. This DP is dominated by VP and preceded by V:
[Tree diagram: IP (= S) for "John knows the truth", with the DP "the truth" attached as the direct object of "knows"]

The primary relations will include the following:

{..., dom(VP,DP), prec(V,DP), ...}

Among the secondary relations, we will have, for example, the assignment of a thematic role to the direct object DP by the verb. Now, encountering the verb hurts will force a reinterpretation to a complement clause analysis, and a consequent revision of this assignment of thematic roles. However, the set of primary relations can be updated monotonically, with the addition of a new S-node (call it S2).5 We show the relevant section of the corresponding phrase marker below:

[Tree diagram: VP dominating V "knows" and the new node S2, with S2 dominating the DP "the truth"]

This is achieved by adding the following relations:

{dom(VP,S2), prec(V,S2), dom(S2,DP)}

The addition of the above new relations does not falsify any of the relations carried over from the previous state. For example, both dom(VP,DP) and prec(V,DP) are still true.

5 In standard GB theory, and in Gorrell's model, we would also require a new CP node immediately dominating the embedded S (or IP) node. This will mean that the secondary relations of case-assignment and government hold between V and DP before but not after reanalysis. The CP node has been omitted for clarity of exposition.
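Using the consistency-checking sketch from section 2.2 (again ours, with simplified node names rather than the full node triples used later), the contrast between (3.a) and (3.b) comes out as follows:

    # Illustration: the (3.a) update is a pure union that stays
    # coherent, while a (3.b)-style revision is not. One S, VP and DP
    # stand in for the full trees.
    dom = {("S", "VP"), ("VP", "V"), ("VP", "DP")}
    prec = {("V", "DP")}
    assert coherent(dom, prec)

    # (3.a): add the new S2 node monotonically -- still coherent.
    dom |= {("VP", "S2"), ("S2", "DP")}
    prec |= {("V", "S2")}
    assert coherent(dom, prec)

    # (3.b): the original VP would now have to precede the DP,
    # contradicting dom(VP,DP) -- the Exclusivity Condition fails.
    assert not coherent(dom, prec | {("VP", "DP")})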
3 Two ways of viewing the constraint of Informational Monotonicity

There are two different possible ways in which a parser constrained by informational monotonicity can be viewed.

1. In cases where a grammatical continuation is not possible, the syntactic processor is capable of deriving alternative analyses whether or not they are consistent with the conditions on trees. However, the inconsistent analyses are filtered out through a "consistency-checking" procedure.

2. The core syntactic processor is simply incapable of changing the phrase marker in any way which would violate consistency. If this is the case, then there is no need for a consistency checking procedure.

On the first view, the conscious garden path effect is explained in terms of a filter (consistency check) which rejects a parse because it involves too great a structural change from the previous state. On the second view, a much more limited parser is assumed; one that is incomplete with respect to the grammar. That is to say that there are grammatical strings which the parser simply cannot derive without outside help, and these strings correspond to conscious garden paths. It is the second of these two approaches which seems the more attractive. The reason for this, as hinted above, is partly one of efficiency. If the first approach is taken, then the consistency check will have to be applied at every state of incremental processing, i.e. after every word is processed, and a greater range of alternative analyses will have to be considered. The first approach would also require an explanation for the need for a consistency checker. If the parser is capable of finding certain (possibly correct) alternative analyses, then what is the point of filtering some of these out? If, on the other hand, the parser is simply incapable of computing structures which would fail the consistency check, as in the second approach, then there is no reason to employ the check at all.6 Thus, the second approach seems to promise an account of the data in terms of limitations on the parser itself, while if we take the first approach, we must seek an explanation in terms of the reason for the existence of the consistency checker.

6 The consistency checker itself need not be too expensive: Cornell (1994) describes an algorithm for consistency checking which is polynomial in the number of nodes in the description.

3.1 Connectedness and Incrementality

We have seen that Gorrell's model represents a particularly strong instance of the monotonicity hypothesis in that it does not allow underspecification of node labels. The status of the model with respect to incrementality is not so clear. On the one hand, the parser is incremental in the sense that it does not employ the look-ahead device of earlier D-theory (Marcus et al (1983)), and therefore does not consistently delay attachment decisions. This is supported by the principle of incremental licensing:
Incremental Licensing: The parser attempts incrementally to satisfy the principles of grammar.
On the other hand, no formal specification is made of what actions the parser takes if it is unable to satisfy this principle. The crucial question is that of whether the parser should be allowed to buffer constituents. For example, if the current word cannot be attached into the description under construction so as to guarantee a grammatical continuation, is it permissible to keep the word and its associated superstructure in a buffer or stack until the issue is resolved? This question is related to the issue of connectedness in that the size of the stack accessible to the parser corresponds to the amount of structure which may be left unconnected in the parser's memory (see Stabler (1994, (a,b)) for a discussion of this issue). Gorrell does not provide a discussion of this, though implicitly he does allow the parser to retain unconnected material in its memory, in cases where there is insufficient grammatical information to postulate a structural relation between two constituents. One example of this is the occurrence of a sequence of two non-case-marked NPs, as in the following centre-embedded example:

(4) The man the report criticised was demoted.
On this question, Gorrell claims that "there is no justification for asserting any relation between the two NPs" (1995, a, p. 212). This implies that the parser is able to store unconnected material in its memory. If this is so, then it is necessary to constrain the conditions under which material may be added to this store. In particular, if the parser is permitted to shift material onto a stack whenever it fails to make an attachment, then, in a garden path utterance, the error will not be recognised until the end of the input has been reached, when the parser will be faced with a stack full of irreducible structures. This was essentially the problem faced by Abney, whose (1987) model required the use of an unbounded stack. This was because attachment could only be made under a head-driven form of licensing, so that in left-branching structures, where the licenser was still unread in the input, it was essential to shift structure onto the stack until the licenser was found, in order for the attachment to be made. The problem was that the parser could not tell whether to continue adding material to the stack, in the expectation of a licenser later in the input, or whether to abandon the parse. In a later version (Abney, 1989), this problem was solved with the addition of LR-states, to indicate whether or not a grammatical continuation could be expected at the current input.

For the purposes of the implementation reported in this paper, we have taken the most constrained position possible, and insisted on full connectedness; that is to say that at any stage, the parser has access only to a single set of relations describing a fully connected tree,
and each word has to be incorporated within this structure as it is encountered. Behind the adoption of this strict regime lies the acceptance of the following hypothesis, various forms of which have appeared in the work of several researchers in the field (Steedman (1989), Stabler (1994, a,b)):

Every structure associated with every prefix of a readily intelligible, grammatical utterance is connected. (Stabler, 1994 (a))

The implementation described here obeys the following constraints, which are intended to capture the conditions described in Gorrell's work. In particular, informational monotonicity is defined, as well as full specification. The condition of incrementality is stronger than that implied by Gorrell, as we have seen. (A sketch of how these constraints interlock in a parse loop follows the list.)

1. Strict Incrementality: Each word must be connected to the current tree description at the point at which it is encountered through the addition of a non-empty set of relations to the description.

2. Structural Coherence: At each state, the tree description should obey the conditions on trees: (a) Single Root Condition, (b) Exclusivity Condition, (c) Inheritance.

3. Full Specification of Nodes: Tree-descriptions are built through the assertion of dominance and precedence relations between fully specified nodes. In the current implementation, each node is a triple ⟨Cat, Bar, Id⟩, consisting of category Cat, bar-level Bar and an identification number Id. Each of these three arguments must be fully specified once the structure has been asserted.

4. Informational Monotonicity: The tree-description at any state n must be a subset of the tree-description at state n + 1.7 Thus the parser may not delete relations from the tree description.

5. Obligatory Assertion of Precedence: If two or more nodes are introduced as sisters, then precedence relations between them must be specified.

6. Grammatical Coherence: At each state, each local branch of the phrase marker described must be well-formed with respect to the grammar.

7 In fact, since (1) requires the set of relations added to the description at each word to be non-empty, the description at n is a proper subset of the description at n + 1.
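The sketch below shows the shape of such a parse loop; it is ours, and attach_candidates is a hypothetical stand-in for the grammar-driven operations (simple attachment and tree-lowering) defined in the next section:

    # Illustrative parse loop enforcing Strict Incrementality and
    # Informational Monotonicity, with coherence checked as in the
    # sketch of section 2.2.
    def parse(words, attach_candidates):
        dom, prec = set(), set()
        for word in words:
            for new_dom, new_prec in attach_candidates(dom, prec, word):
                # Informational Monotonicity: only additions are allowed.
                assert new_dom >= dom and new_prec >= prec
                # Strict Incrementality: a non-empty set of new relations.
                grew = new_dom > dom or new_prec > prec
                if grew and coherent(new_dom, new_prec):
                    dom, prec = new_dom, new_prec
                    break
            else:
                # No coherence-preserving way to connect the word:
                # predicted conscious garden path (or ungrammatical input).
                raise ValueError("cannot connect %r" % (word,))
        return dom, prec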
4 Coherence-Preserving parsing operations

4.1 Syntactic Representation

The syntactic representation assumed is similar to that used in Tree-Adjoining Grammars (henceforth TAGs) (Joshi et al. (1975)), and in particular, Lexicalised Tree Adjoining Grammars (Schabes et al, 1988). Each lexical category is associated with a set of structural relations, which determine its lexical subtree. For example, the verb category for English contains the following relations:
{dom(VP,V0), dom(V0,Word), dom(S,VP), dom(S,DP), prec(DP,VP)}

These define the following subtree, where we borrow a tradition from the TAG literature and represent an attachment site with a downward-pointing arrow:

[Tree diagram: S dominating a DP attachment site (marked with a downward arrow) and VP; VP dominates V0, which dominates the word]
As each word is encountered in the input, the parser projects the subtree corresponding to its lexical category, and an attempt is made to connect this subtree with the global tree under construction. There are two possibilities for this: simple attachment and tree-lowering, corresponding roughly (though not exactly, as we shall see later) to substitution and adjunction in TAG.
4.2 Simple Attachment

The parser is capable of performing simple right and left attachment. A lexical subtree may contain attachment sites to the left or the right of the word from which it is projected. Intuitively, left-attachment consists in attaching the current global tree on the left corner of the subtree projection of the new word, while right attachment consists in attaching the subtree projection of the new word in the right corner of the current global tree. These are similar to Abney's (1987, 1989) Attach-L and Attach respectively. In the following definitions, we use the term current tree description to refer to the set of relations describing the global phrase-marker currently in the parser's memory, in other words, the parser's left context. The term subtree projection is used to refer to the set of relations corresponding to the lexical category of the new word encountered in the input. The attachment operations are illustrated diagrammatically in Fig. 1.

[Figure 1: Left and Right Attachment. In left attachment, the root R of the current tree description is unified with the left attachment site A of the new projection; in right attachment, the root R of the new projection is unified with the first right attachment site A of the current tree description.]
Left Attachment:
Let D be the current tree description, with root node R. Let S be the subtree projection of the new word, whose left-most attachment site A is of identical syntactic category with R. The updated tree description is S ∪ D, where A is unified with R.
Right Attachment:
Let D be the current tree description, with the first right attachment site A. Let S be the subtree projection of the new word, whose root R is of identical syntactic category with A. The updated tree description is S ∪ D, where A is unified with R.

The parser is also capable of creating a new attachment site with reference to a verb's argument structure. For example, if a transitive verb is found in the input, then a new right attachment site is created for a DP, and a new DP node is "downwardly projected" as a sister to the verb.8

8 Note that this requires a systematic method for choosing amongst alternative subcategorization frames in cases where the verb is ambiguous. Gorrell (1995 (a), p.188) defines the notion of simplicity ("No vacuous structure building"), the principle which is used to explain the initial preference for DP attachment in the case where a verb may subcategorise either for a DP or for a clause. In the case of an optional argument, however, the processor cannot employ downward projection, since there is no guarantee whether or not the argument will appear in the input. This implies that, if the parser is able to cope with the attachment of optional arguments, it must be capable of attaching an argument without first downwardly projecting that argument. Thus, it may be possible to avoid the use of downward projection altogether. It would be interesting to see to what extent this would allow us to replicate the delay in the use of subcategorization information proposed by Mitchell (1989), as opposed to the immediate use which downward projection implies.
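In relation-set terms, both attachment modes amount to a renaming followed by a set union. The following sketch is ours; category matching and the generation of fresh node identifiers are omitted, and substitute() stands in for unification:

    # Sketch of the two simple attachment modes over descriptions,
    # where a description is a (dom, prec) pair of relation sets.
    def substitute(desc, old, new):
        """Rename a node throughout a description (simplified unification)."""
        dom, prec = desc
        ren = lambda n: new if n == old else n
        return ({(ren(x), ren(y)) for x, y in dom},
                {(ren(x), ren(y)) for x, y in prec})

    def left_attach(current, root, projection, site):
        # Unify the projection's left-most attachment site with the
        # root of the current description, then union the two sets.
        pdom, pprec = substitute(projection, site, root)
        cdom, cprec = current
        return (cdom | pdom, cprec | pprec)

    def right_attach(current, site, projection, root):
        # Unify the projection's root with the first right attachment
        # site of the current description, then union the two sets.
        pdom, pprec = substitute(projection, root, site)
        cdom, cprec = current
        return (cdom | pdom, cprec | pprec)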
So, for a simple transitive sentence such as Polly eats grapes, first Polly is projected to a DP and instantiated as the current global tree. This DP will match the left attachment site of eats, and so left attachment will be performed. Since eats is a transitive verb, a new DP attachment site will be created, and this will be suitable for right attachment of the projection of grapes. If simple attachment is not possible, then the parser attempts to perform a second mode of attachment, tree-lowering. Before describing tree lowering, we will briefly review the adjunction operation of TAGs, since the two operations share some common features.

4.3 TAG adjunction

A TAG contains a set of elementary trees, which are divided into initial and auxiliary trees. An initial tree is rooted in a distinguished start symbol non-terminal node, and has terminal symbols along its frontier. An auxiliary tree is rooted in a non-terminal node, and its frontier includes a further non-terminal node, which must be of the same type as the root. This non-terminal node is known as the foot node. Most versions of TAG allow two operations for combining one tree with another: substitution and adjunction. Substitution is essentially the same operation as simple attachment as defined above. To explain adjunction, we refer to an example instantiation given in Fig. 2.

[Figure 2: TAG adjunction. The auxiliary tree β, rooted in c with a foot node of category c, is adjoined into the initial tree α at the node n of category c, yielding the derived tree γ.]
The following definition is adapted from Joshi, Vijay-Shanker & Weir (1991). Let α be an initial tree containing a node n with category c (this is the node marked 'c:n' in Fig. 2). Let β be an auxiliary tree, whose root and foot nodes are also of category c. The adjunction of β to α at node n will yield the tree γ that is the result of the following operations:

1. The subtree of α dominated by n is excised. We call this excised subtree t.

2. The auxiliary tree β is attached at n, and its root node is identified with n.

3. The excised subtree t is attached to the foot node of β, and the root node, n, of t is identified with the foot node of β.

It is simple to define a version of TAG adjunction in terms of sets of structural relations which describe trees. The trees α, β and γ can be described by sets of relations, which we call A, B and Γ respectively. For example, A (which describes α) is as follows (again, we use c:n to represent the node n with category c):
A = {dom(a,b), dom(a,c:n), dom(a,d), dom(c:n,g), dom(c:n,h), prec(b,c:n), prec(c:n,d), prec(g,h)}
The set B is similarly defined. Now, to adjoin β to α, we find the set of local relations L ⊆ A in which c:n participates:9
L = {dom(a,c:n), prec(b,c:n), prec(c:n,d)}
We then build a new set of relations N, which is L with all occurrences of c:n replaced by the root node of β (call it c:r):
N = {dom(a,c:r), prec(b,c:r), prec(c:r,d)}
We then unify the foot node of β with c:n, so that B consists of the following set of relations:

B = {dom(c:r,e), dom(c:r,c:n), dom(c:r,f), prec(c:n,f)}
Now Γ is simply defined as follows: Γ = A ∪ N ∪ B. As the reader can verify, the derived set Γ describes the tree γ, as required.

9 The 'local relations' in which a node N participates at state S are those dominance and precedence relations which define the mother and sisters of N at S.
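Since the derivation is purely set-theoretic, it can be sketched in a few lines (ours, using the node names of Fig. 2, with relations encoded as tagged triples):

    # Sketch of relation-set adjunction. A relation is a
    # ("dom" | "prec", x, y) triple; node names follow Fig. 2.
    A = {("dom","a","b"), ("dom","a","c:n"), ("dom","a","d"),
         ("dom","c:n","g"), ("dom","c:n","h"),
         ("prec","b","c:n"), ("prec","c:n","d"), ("prec","g","h")}
    B = {("dom","c:r","e"), ("dom","c:r","c:n"), ("dom","c:r","f"),
         ("prec","c:n","f")}   # foot node already unified with c:n

    def adjoin(A, B, n="c:n", r="c:r"):
        # L: local relations of n (those defining its mother and
        # sisters, i.e. excluding relations to n's own children).
        L = {(k, x, y) for (k, x, y) in A
             if (k == "dom" and y == n) or (k == "prec" and n in (x, y))}
        ren = lambda m: r if m == n else m
        # N: L with every occurrence of n replaced by beta's root r.
        N = {(k, ren(x), ren(y)) for (k, x, y) in L}
        return A | N | B

    Gamma = adjoin(A, B)
    # Gamma = A union {dom(a,c:r), prec(b,c:r), prec(c:r,d)} union B,
    # exactly the set derived in the text.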
[Figure 3: Reanalysis as the insertion of one tree inside another (β' is inserted into α', yielding γ'). The inserted material is enclosed inside the dotted line.]
4.4 Tree-Lowering

It should be noted that since the root and foot nodes of a TAG auxiliary tree must be of the same syntactic category, we cannot use this device to derive a monotonic, incremental parse of (1.a) (repeated below as (5)).

(5) John knows the truth hurts.
Schematically, what we need in the above case is illustrated in Fig. 3. Here, the nodes corresponding to root and foot nodes of the auxiliary tree are the S node and DP (subject) node of the embedded clause respectively. In order to accommodate β' into α' in the desired monotonic fashion, therefore, we will have to drop the requirement for root and foot nodes to be of identical syntactic category. However, we must constrain this operation so that it may only be employed in cases where the root node of the auxiliary tree is licensed in its new adjoined position. In the above case it is licensed, since know may subcategorize for a clause.10

In order to maintain structural coherence, the new word attached via tree-lowering must be preceded by all other words previously attached into the description. We can guarantee this by requiring the lowered node to dominate the last word to be attached. We also need to ensure that, to avoid crossing branches, the lowered node does not dominate any unsaturated attachment sites (or "dangling nodes"). We therefore define accessibility for tree-lowering as follows:

10 Vijay-Shanker (1992) discusses a version of TAG in which root and foot nodes need not be of the same category. Although it uses dominance relations in a similar way to the approach described here, Vijay-Shanker's system differs in many respects. We will discuss this further in section 11.
[Figure 4: Schematic illustration of Tree Lowering. The projection with root R and attachment site A is inserted at an accessible node N (N = A); the node R must be licensed in the position previously occupied by N.]

Definition (Accessibility):
Let N be a node in the current tree description. Let W be the last word to be attached into the tree. N is accessible iff N dominates W, and N does not dominate any unsaturated attachment sites.

Note that it is not necessarily the case that all nodes on the right edge of the tree are accessible, nor that all accessible nodes are on the right edge of the tree. For example, a "dangling node" will not dominate any lexical material, even though it might be on the right edge of the tree, and therefore it will not be accessible. Also, there may be a node which dominates the last word to be attached, and which is therefore accessible, but which precedes a dangling node, and is therefore not on the right edge of the tree. The tree-lowering operation is defined as follows, and illustrated diagrammatically in Fig. 4:

Definition (Tree-lowering):
Let D be the current tree description. Let S be the subtree projection of the new word. The left attachment site A of S must match a node N accessible in D. The root node R of S must be licensed by the grammar in the position occupied by N. Let L be the set of local relations in which N participates. Let M be the result of substituting all instances of N in L with R. The attachment node A is unified with N. The updated tree-description is D ∪ S ∪ M.

Note that, in order to check whether the root node R of the new subtree projection is licensed in its new position, it may be necessary to access subcategorization information associated with a word long past in the input, and which is no longer accessible in the sense defined above. This is the case in the above example, where the subcategorization frame of knows has to be checked to allow the attachment of the clausal node as its sister, although the V0 node of knows itself is no longer accessible. It would be interesting to investigate how far inside
the tree the parser is capable of looking in order to extract this type of information.11 The parser is constructed in such a way that, if at any point in the parse, simple attachment fails, the accessible nodes of the current tree-description are considered until a node is found at which tree-lowering may be applied. Note that tree-lowering can capture many effects of standard TAG adjunction, and it is therefore possible to use this operation for the attachment of post-modifiers, which is the course of action taken in this implementation. The preference for argument over adjunct attachment is captured by the fact that tree-lowering is only attempted in cases where standard attachment fails.12

11 We are grateful to Martin Pickering for bringing this point to our attention.

12 In Gorrell's original model, this is accounted for by the Principle of Simplicity.
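The definitions above translate fairly directly into the relation-set style of the earlier sketches. This is again our illustration rather than the reported implementation; dom holds immediate dominance, and licensing is abstracted into a predicate supplied by the grammar:

    # Sketch of Accessibility and Tree-lowering over relation sets.
    def dominates(dom, x, y):
        """Reflexive, transitive dominance over immediate-dominance pairs."""
        return x == y or any(a == x and dominates(dom, b, y)
                             for (a, b) in dom)

    def accessible(dom, node, last_word, dangling):
        """N is accessible iff N dominates the last word attached and
        dominates no unsaturated attachment site ("dangling node")."""
        return (dominates(dom, node, last_word)
                and not any(dominates(dom, node, d) for d in dangling))

    def tree_lower(dom, prec, N, proj_dom, proj_prec, A, R, licensed):
        """Insert a projection (root R, left attachment site A) at N."""
        if not licensed(R, N):       # R must be licensed where N stood
            return None
        # M: the local relations of N (its mother and sisters), with
        # N replaced by R throughout.
        ren = lambda m: R if m == N else m
        M_dom = {(ren(x), ren(y)) for (x, y) in dom if y == N}
        M_prec = {(ren(x), ren(y)) for (x, y) in prec if N in (x, y)}
        # Unify the attachment site A with N in the projection.
        ren_a = lambda m: N if m == A else m
        S_dom = {(ren_a(x), ren_a(y)) for (x, y) in proj_dom}
        S_prec = {(ren_a(x), ren_a(y)) for (x, y) in proj_prec}
        # Updated description: D union S union M.
        return (dom | S_dom | M_dom, prec | S_prec | M_prec)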
5 Top-down prediction

We mentioned earlier that (for head-initial languages like English), when the parser encounters a head requiring a following internal phrasal argument, this argument is projected top-down and asserted as a so-called "dangling node". However, there will be cases where the word immediately following the head cannot be directly connected to such a dangling node. Consider the following example:

(6) Mary thinks John .....
On encountering the verb thinks, the parser must project a clausal node, since this verb can only subcategorize for a clause. However, the DP John cannot be directly connected to this node, since it is of the wrong syntactic type. The current implementation addresses this problem by adopting Crocker's approach (1992, 1994), in which the "functional structure" of the clause (CP, IP) is projected top-down along with the DP subject node in the specifier position of IP. This provides an immediate attachment site for the embedded subject.13

13 Milward's rule of state prediction (1994, 1995) gives a more general solution to this problem. We will briefly discuss this approach in section 11.
6 Search strategies for coherence-preserving reanalysis

6.1 English

In example (5), at the point where standard attachment fails, the parser is faced with the task of incorporating the projection of hurts into its representation. In this case, there is only one accessible node at which tree-lowering may be applied in such a way that grammatical licensing conditions are met, and that is the DP boxed in the diagram below:

[Tree diagram: the description for "John knows the truth", with the object DP "the truth" boxed, alongside the subtree projection of "hurts" (S dominating a DP attachment site and a VP dominating V0)]
In the above, then, the word hurts uniquely disambiguates the local ambiguity. However, there may be occasions where reanalysis does not coincide with unique disambiguation. In particular, the parser may be faced with a choice of alternative lowering sites in much the same way as it is faced with a choice of alternative attachment sites at the onset of a standard local ambiguity, and therefore we must consider the search strategy, or heuristics, used to choose between such lowering sites. For example, imagine that the following utterance has just been processed:
(7) I know [DP1 the man who believes [DP2 the countess]]
Now, imagine that the utterance continues with the verb killed. The verb must be attached via tree-lowering, but now there are two accessible nodes where the operation can be applied: DP1 and DP2. Though, as far as we are aware, no experimental studies of this type of structure have been conducted, it intuitively seems that the lower site, DP2, is preferred. This can be seen more clearly in the following sentences, where binding constraints force a particular reading: (8.a), where lowering is obligatorily applied at the lower DP, is easier than (8.b), where lowering can only be applied at the higher DP:

(8) a. I know the man who believes the countess killed herself.
    b. I know the man who believes the countess killed himself.
In the above examples, though the verb killed triggers reanalysis, it is the following reflexive pronoun, himself/herself, which uniquely disambiguates the structure. If it is indeed the "low" reanalysis (corresponding to (8.a)) that is favoured, then in the dispreferred case, (8.b), the parser will mis-reanalyse on encountering killed, only to experience what we might think of as a "second order" garden path effect at the disambiguating signal, himself, where the preferred reanalysis is seen to have been mistaken. A similar preference for low attachment of post-modifiers has been argued for in the literature, as an instance of late closure (Frazier & Rayner, 1982). Thus, the preferred reading for John said Bill left today is (9.a), where the adverbial appears in the lower clause, as opposed to (9.b), where it appears in the higher clause.

(9) a. John said [Bill left today]
    b. John said [Bill left] today
Since tree-lowering is also used for post-modifier attachment, we would expect the search strategy used in examples such as (8.a) to share some features in common with that in (9.a). A possible strategy which can be used, then, is to search the set of accessible nodes in a bottom-up direction. We define the current node path as an ordered set ⟨N1, N2, ..., Nk⟩ such that N1 is the node immediately dominating the last word to be processed, Nk is the root of the tree, and Nj immediately dominates Nj-1 for each pair of adjacent nodes ⟨Nj-1, Nj⟩ in the path. In the bottom-up search, then, the parser considers the first node in the path, N1, and attempts to lower. If this is unsuccessful, it moves to the next node, N2 (i.e. the node immediately dominating N1), and again attempts to lower. The process continues, with the parser considering successively higher nodes until either lowering is successful, and the parser can move on to consider the next word, or the root node is reached, in which case the parser fails, and the string is predicted to be either a conscious garden path or ungrammatical.14

14 In fact, we will see in the final section that preferences for post-modifier attachment are more complicated than would be suggested in a simple bottom-up search. However, it is still possible to discern a general preference for low attachment for English.
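As a sketch (ours, reusing accessible() from the tree-lowering sketch in section 4.4; try_lower is a hypothetical wrapper applying tree-lowering with the projection of the new word):

    # Bottom-up search over the current node path <N1, ..., Nk>.
    def node_path(dom, last_word):
        """Nodes dominating the last word, ordered bottom-up."""
        mother = lambda n: next((x for (x, y) in dom if y == n), None)
        path, node = [], mother(last_word)
        while node is not None:
            path.append(node)
            node = mother(node)
        return path

    def bottom_up_lowering(dom, prec, last_word, dangling, try_lower):
        for node in node_path(dom, last_word):
            if accessible(dom, node, last_word, dangling):
                updated = try_lower(dom, prec, node)
                if updated is not None:
                    return updated       # lowest successful site wins
        return None  # predicted conscious garden path or ungrammatical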
7 Processing Japanese

In the previous section, we saw that, at the point of reanalysis, the parser may be faced with a choice of possibilities at which the lowering operation may be applied, necessitating the definition of a search strategy. In this section we will look at a class of examples involving the reanalysis of relative clauses in Japanese, where just such a choice is found. We will firstly look at Gorrell's explanation for the distinction between conscious and unconscious garden paths for this type of example in terms of a comparison between the set of relations describing the phrase marker at two snap-shots of processing. We then show how this "static" explanation fails to make an adequate distinction between the easy and hard cases of reanalysis. We subsequently demonstrate the important role of the search strategy in explaining this distinction, showing, incidentally, how the bottom-up strategy we proposed to account for the English data predicts the opposite of the observed results. We then give a (speculative) proposal of a top-down, weakly interactive search, which allows a more satisfactory explanation. The issue concerns data such as the following:
(10) a. [Mary ga sinseihin wo t_loc kaihatusita] kaisya ga tubureta.
     Mary NOM new-product ACC developed company NOM went-bankrupt
     "The company where Mary developed the new product went bankrupt." (Inoue 1991)

     b. Yamasita ga [t_nom=i yuuzin wo houmonsita] siriai_i ni tegami wo kaita.
     Yamasita NOM friend ACC visited acquaintance DAT letter ACC wrote
     "Yamasita wrote a letter to an acquaintance who visited his friend." (adapted from Mazuka and Itoh (1995))

     c. Yamasita ga yuuzin wo [pro_nom t_acc=i houmonsita] kaisya_i de mikaketa.
     Yamasita NOM friend ACC visited company LOC saw
     "Yamasita saw his friend at the company he visited." (adapted from Mazuka and Itoh (1995))

In all of the examples in (10), a clause is initially built containing an (overt) subject, object and transitive verb. However, subsequent appearance of the noun shows that this clause, or some part of it, must be reinterpreted as a relative clause modifying that noun. The local ambiguity consists in the fact that the boundary between the main and relative clause may fall at any point between the left edge of the sentence and the immediately pre-verbal position (Japanese is a "super pro-drop" language). We will see how this local ambiguity can be modelled as a choice of lowering sites at the point where the parser receives the immediately post-clausal noun.15 In each of the three sentences in (10.a-c), the initial string may schematically be represented as follows:16

NP_nom NP_acc V_trans

15 At the point where a post-clausal noun has been found, the parser is actually faced with a further local ambiguity; either the clause directly modifies the post-clausal noun through relativisation, or the post-clausal noun is the first word of a new relative clause which modifies a subsequent head noun in coordination with the first relative clause. However, Mazuka et al (1989) show that, for strings which are globally ambiguous for these two readings, the reading consistent with a direct association between a relative clause and a following noun is strongly preferred to the indirect "coordinate" reading, even in cases where plausibility favours the coordinate reading. Inoue (1991), Inoue & Fodor (1995) and Fodor & Inoue (1994) report cases where an initial preference for the "direct" reading results in an unrecoverable garden path effect when this is later found to be untenable.

16 We indicate an NP bearing case C as NP_C, and V_trans denotes a transitive verb (i.e. a verb which takes one nominative and one accusative argument).

On the assumption that the two NPs are initially structured as coarguments of the verb, the parser may have to "displace" one or more arguments from the clause on reaching the relativising noun. This is because a relative clause must contain a gap, which is coindexed with the head noun, and if the constituent to be relativised is already overt in the initially built clause, then that overt constituent will have to be displaced from the clause, and replaced with the gap. However, if the relativised constituent is not overt in the initially built clause, then no material will have to be displaced. For example, if the relativised constituent is represented by an empty pro category in the initially built clause, then the relativisation relation can be established by postulating the pro as the relativised trace, and coindexing it with the head-noun. Otherwise, if the relativised constituent is a non-overt adjunct in the
initially built clause, then the relativisation relation can be established by adding the empty category representing the adjunct to the clause.

Inoue (1991) notes a general preference towards displacing the minimal amount of material from a completed clause to a higher clause. He calls this the "Minimal Expulsion Strategy".17 The sentence in (10.a) is an example of a case where no material is displaced on reaching the head noun. This is because, although both arguments of the verb kaihatusita are present in the structure, the relativised constituent corresponds to a locative adjunct rather than an argument, so no overt constituents need to be displaced by postulating the relativised gap. Note that, in Japanese, relativisation of an adverbial is only possible if the role of that adverbial is temporal or locative. We will refer to such examples as "null-displacement". Null displacement examples of this kind reportedly do not cause conscious processing difficulty (Inoue (1991)). The sentence in (10.b) involves displacing one argument, i.e. the nominative NP Yamasita ga. At the point where the first verb is processed, a clausal structure will have been built, corresponding in meaning to "Yamasita visited his friend". However, on the subsequent input of the noun siriai, a gap has to be found in that clause. This time, since siriai "acquaintance" is not a plausible location or time for an action to take place, adverbial relativisation is not possible (consider the bizarre interpretation of the English NPs "the acquaintance where Yamasita visited his friend", and "the acquaintance when Yamasita visited his friend"). This means that the processor will be forced to postulate a gap in one of the two argument positions. Postulating the gap in the subject position causes the displacement of Yamasita ga, which results in the globally correct structure. We will call such examples "single displacement" sentences. They do not cause conscious processing difficulty (Mazuka and Itoh (1995)). The sentence in (10.c) involves displacing two arguments, the subject argument and the object argument. We will postpone our discussion of how processing might proceed in such an example. These "double displacement" examples do cause conscious processing difficulty (Mazuka and Itoh (in press)). Gorrell (1995) claims the contrast in processability between (10.b) and (10.c) can be derived via structural determinism (the monotonicity requirement on structural relations). However, we will see that, if we simply consider the structural relations before and after reanalysis, not only (10.b) but also (10.c) can be derived in a manner which preserves monotonicity. Consider the schematic representation of a clause below:

17 We prefer to use the term "displacement" over "expulsion", since, as we shall see, in the present model, a displaced element is not expelled from the clause to which it is originally attached, but rather everything except the displaced element is lowered. However, we continue to use the term "minimal expulsion" when we refer to the processing strategy as proposed by Inoue.
          S
         / \
     Arg1   V'
           /  \
       Arg2    V0

(Here V0 is the node immediately dominating the verb itself.)

[Figure 5 here: pre- and post-reanalysis trees for single displacement (the trees themselves are not recoverable from the source; empty categories are written as null elements, and the expected case-marker is marked cs).]
Figure 5: Single displacement via the application of lowering at the V' node. (The material enclosed in the dotted line is inserted into the structure on reanalysis.)
Let us say that the next word in the input is a noun. If the argument corresponding to Arg1 has to be relativised, then Arg1 will have to be displaced by an empty argument (or trace). If the argument corresponding to Arg2 has to be relativised, then both Arg1 and Arg2 have to be displaced. Let us consider the displacement of Arg1 first. An alternative way of looking at this, as Gorrell (1995a,b) has noted, is that everything except Arg1 is "lowered" (that is, in present terms, tree-lowering is performed on V'), and a new S node is created (call it S2) which immediately dominates a newly created empty category in subject position. Then S2 is adjoined as a premodifier to the noun. The noun is postulated to head Arg3, which is attached as a coargument to Arg1. This is illustrated in Fig. 5.18

Footnote 18: Note that the current implementation treats Japanese case-markers as words in their own right for the purposes of incremental processing. (The expected case-marker is marked cs in Figs. 5 and 6.)

Now consider the displacement of both Arg1 and Arg2. Gorrell (1995a,b) explains the difficulty of utterances requiring such double displacement in terms of the need to "delete" a domination relation between (in our terms) V' and Arg2. In fact, however, this double displacement can be derived in an analogous manner to the single displacement example noted above. In this case, we lower the head node, V0, and reconstruct a relative clause structure by adding the relevant nodes up to S2, including two empty argument positions. This is illustrated in Fig. 6.
[Figure 6 here: pre- and post-reanalysis trees for double displacement (the trees themselves are not recoverable from the source; the two empty categories are written as null elements, and the expected case-marker is marked cs).]
Figure 6: Double displacement via the application of lowering at the V0 node.
As the reader can verify, this, as well as the single-argument displacement, preserves informational monotonicity, since the original position of V0 dominates its post-reanalysis position.19

Footnote 19: In fact, the structure Gorrell proposes for this example includes an empty INFL node as a right sister of the VP. The double displacement example could be ruled out if we allow the verb to raise via verb-movement into this position. In this case, the last word to be processed will be in the INFL position, and thus the V0 node will no longer be accessible for lowering. However, this would rule out all reanalyses involving displacement of an object, but in some cases such examples are possible without conscious processing difficulty, as we shall see in section 8.

In contrast to Gorrell's model (1995a,b), where the distinction between single and double displacement is accounted for in terms of the parser's inability to withdraw structural statements at the initial point of disambiguation, we would like to propose instead that the difficulty of examples such as (10.c) may be due to the parser initially performing a mis-reanalysis, which only becomes apparent at a later point of processing, from where recovery is difficult.

The standard definition of tree-lowering offered above will obviously not suffice to deal with this type of example, since here we must add structure (including a new sentential node, and empty argument position(s)) which is not part of the subtree projection of the new head noun found in the input. The definition of tree-lowering has therefore been extended so that this extra structure can be built as part of the operation. The parser includes a head-checking operation, which, on the input of a head, checks for the presence of the required arguments, and, in the case of a verbal head, adds empty categories for any arguments which are missing. It is this head-checking operation which is employed in the extended definition of tree-lowering. Where lowering is applied to a head-projection, this head-checking operation is reapplied, so that, in cases where the arguments of a verb are displaced by reanalysis, the
embedded clause structure is "regrown", including any necessary empty argument positions. In the examples we have been considering, the "regrown" embedded clause can then be attached as a relative clause to the incoming noun, and this noun can then be attached as a coargument to the displaced arguments (Sturt (1994) contains details of this). Given the revised definition, either single or double displacement can be derived, depending on the node at which tree-lowering is applied. This means that, in order to account for the contrast in difficulty between (10.c) and (10.b), we will crucially have to appeal to the search strategy which the parser uses in finding a node for lowering.

It will be clear that the bottom-up search we motivated in the previous section for English will predict exactly the opposite results for these Japanese examples. This is because the last word to be attached into a clause will be a verb, and therefore, at the point where the parser fails to attach the head noun, the lowest node accessible to the lowering operation will be the node immediately dominating this verb, i.e. in the above schema, the V0 node. This means that, if the parser begins its search of the accessible nodes at the bottom, the V0 node will be the first to be tried, and (given that the original clause contains two arguments) the embedded clause will be reconstructed with two empty arguments, in effect resulting in a double displacement. On the other hand, single displacement, which is known to be easy for Japanese perceivers, will be predicted to be more difficult, since it corresponds to choosing a lowering site which is higher in the structure. If we reconsider the single displacement example, (10.b), repeated below as (11), we see how the bottom-up search strategy wrongly predicts a conscious garden path effect:

(11)
Yamasita ga yuuzin wo houmonsita siriai ni tegami wo kaita.
Yamasita NOM friend ACC visited acquaintance DAT letter ACC wrote
"Yamasita wrote a letter to an acquaintance who visited his friend."
Taking the V0 node immediately dominating houmonsita ("visited") as the node chosen for lowering, the parser will displace both the subject (Yamasita ga) and the object (yuuzin wo) into the main clause. This will result in an ungrammatical continuation, in which the verb kaita ("wrote") takes two accusative arguments instead of one, and we must wrongly predict a garden path effect when the parser notices this downstream.20

(12)
Yamasita ga yuuzin wo [pro_nom t_acc=i houmonsita] siriai_i ni tegami wo kaita.
Yamasita NOM friend ACC visited acquaintance DAT letter ACC wrote
Footnote 20: In fact this structure will violate the so-called double-o constraint, which bars the overt presence of two accusative-marked NPs (in our terms, PPs) as arguments of the same predicate (see Kuroda (1988) for details of this constraint).
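The head-checking step on which the extended tree-lowering operation relies can be given a minimal sketch. The following Python fragment is our own illustration, not the paper's implementation; the representation of subcategorisation frames and empty categories is assumed purely for exposition.

    # A toy rendering of head-checking: on (re)applying a verbal head, any
    # selected-for arguments that are not overtly present are filled in with
    # empty categories.  Representation and names are ours, not the paper's.

    def head_check(verb_subcat, overt_cases):
        """verb_subcat: case frame the verb selects, e.g. ["nom", "acc"];
        overt_cases: cases overtly present in the (re)grown clause.
        Returns the empty categories that must be added."""
        return [("pro", case) for case in verb_subcat if case not in overt_cases]

    # Double displacement: both arguments have been displaced to the matrix
    # clause, so the regrown relative clause needs two empty positions.
    print(head_check(["nom", "acc"], []))       # [('pro', 'nom'), ('pro', 'acc')]

    # Single displacement: only the subject has been displaced.
    print(head_check(["nom", "acc"], ["acc"]))  # [('pro', 'nom')]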
7.1 Top-down search

A moment's reflection reveals that, if we want to reproduce the "minimal expulsion" effects using the tree-lowering operation for the type of examples discussed here, we should define a preference to perform lowering at as high a site as possible. It therefore seems reasonable to postulate a top-down search for Japanese. Since, as yet, we have almost no on-line experimental data concerning the processing of this type of example, the following discussion is necessarily speculative, and should be seen as one possible way in which a top-down search could proceed.

It should be noted in particular that the parser may be sensitive to more types of information than simply the configurational question of whether a node is high or low in a tree structure. One such factor may be the pragmatic plausibility of relativisation. Relativisation involves coindexing a head noun with an argument position (which is occupied by a gap site in the relative clause). Assuming a model which allows a certain degree of interaction of non-syntactic knowledge in making parsing decisions (cf. Crain and Steedman (1985)), it may be that the processor takes into account the plausibility of establishing the referent of the head-noun in the argument position concerned. Other factors which may play a part here include the valency preferences of the verb, and the obliqueness of the argument to be relativised. Another factor which has been shown to be important is the case-marking on the relativising head-noun (Inoue (1991)).21

Footnote 21: For example, native speakers seem to avoid on-line parsing decisions which would require postulating two nominative-marked NPs as arguments of the same predicate. This means that, in the class of examples which we have been discussing, if the relativising head-noun is marked with nominative case, then there will be a preference against displacing the (nominative-marked) embedded subject to the matrix clause, i.e. a preference for the null-displacement option.

Imagine a two-argument clause has been built, resulting in the structure illustrated below, and that this is followed immediately by a noun, which must be incorporated somehow into the analysis.
          S
         / \
     Arg1   V'
           /  \
       Arg2    V0
One possible search strategy is that the parser considers each accessible node in a top-down order (S, V', V0), until a plausible relativisation site is found. In a top-down search, the first node to be considered for lowering is S. This solution corresponds to retaining all arguments in the relative clause. We will call this null displacement. Since the relative clause must contain a gap, and all of the arguments are overt, the gap must represent an adjunct, which, as we have seen above, must be either temporal or locative. Thus, the null-displacement option will only be available if the semantic content of the relativising head-noun is a plausible time or location for the event described by the relative clause to have taken place.
If this is not possible, we move on to consider node V', and attempt to lower. This option corresponds to relativising Arg1. A new sentential node will be created, dominating an empty argument in the position occupied by Arg1. This means that S will remain as the matrix sentential node, and will continue to immediately dominate Arg1, or, to put it another way, Arg1 will be displaced from the relative clause. This will be possible if it is plausible to coindex the referent of the head noun with the empty element in the position of Arg1. If this is not possible, then the parser will descend to the next node, V0, and attempt to relativise Arg2. In the context shown above, this kind of search will predict the "minimal expulsion" strategy, with null displacement preferred.

Consider first a null-displacement example, (10.a), repeated below as (13):

(13)
[Mary ga sinseihin wo t_loc=i kaihatusita] kaisya_i ga tubureta.
Mary NOM new product ACC developed company NOM went bankrupt
"The company where Mary developed the new product went bankrupt."
At the point when kaisya ("company") is found in the input, the first node to be considered will be the top S-node. The parser considers the relativisation corresponding to this node (i.e. adverbial relativisation), which is found to be plausible, since a company is a plausible location for Mary to have developed a new product. Thus no difficulty is predicted, and indeed, this sentence does not cause conscious processing difficulty. Now consider (10.b), repeated again as (14):

(14)
Yamasita ga [t_nom=i yuuzin wo houmonsita] siriai_i ni tegami wo kaita.
Yamasita NOM friend ACC visited acquaintance DAT letter ACC wrote
As before, the parser first builds the transitive clause with houmonsita ("visited") as the main verb. This time, null displacement is not a possibility, since siriai ("acquaintance") is not a plausible location or time. This means that the processor considers the next node down as a lowering site. This node will be the constituent covering the object and verb, yuuzin wo houmonsita. Accordingly, the subject argument, Yamasita ga, is displaced, and a relative clause structure is built with an empty subject position. This analysis remains grammatical throughout the parse, and the structure will be unproblematic for the processor. Finally, consider the (10.c) example, repeated below as (15):

(15)
Yamasita ga yuuzin wo [pro_nom t_acc=i houmonsita] kaisya_i de mikaketa.
Yamasita NOM friend ACC visited company LOC saw
"Yamasita saw his friend at the company he visited."
(15) is complicated by the fact that the string is not only locally but also globally ambiguous. The other reading is one in which the main clause contains two empty arguments, and the initially built clause remains intact as an adjunct relative. The null context strongly disfavours this reading, in which two uncontrolled gaps appear in the matrix clause, but it is possible to create a prior context which provides discourse control for both of these arguments, as in the question in (16.a) below. In this case, the utterance is considerably easier to process:

(16)
a. Anata wa doko no kaisya de Piitaa wo mikaketa no?
   you TOP where GEN company LOC Peter ACC saw Q
   "At which company did you see Peter?"

b. pro_nom pro_acc [t_loc=i Yamasita ga yuuzin wo houmonsita] kaisya_i de mikaketa.
   Yamasita NOM friend ACC visited company LOC saw
   "I saw him at the company where Yamasita visited his friend."
At the point where houmonsita ("visited") is attached, the parser will have built a simple transitive clause with both nominative and accusative arguments overt. On encountering the noun kaisya, the first option to be considered is the adjunct relativisation corresponding to null displacement. This analysis is not implausible, since a company is a reasonable location for Yamasita to meet his friend. Let us say that the parser initially adopts this analysis. At the point when the final verb mikaketa ("saw") is encountered, neither its nominative nor its accusative argument is overtly present in the main clause. In the null context, this means that there are two uncontrolled arguments. However, raising the subject and object from the lower clause rectifies this situation. The two empty arguments in the lower clause are now both controlled: the accusative argument, pro_acc, is grammatically controlled by the head noun of the relative, kaisya, and the nominative argument, pro_nom, is pragmatically controlled by the matrix subject Yamasita ga. The explanation of the difficulty is that this raising of the two arguments cannot be derived via tree-lowering. This is because, by the time the disambiguating final verb, mikaketa, is encountered in the input, the relevant node for lowering will no longer be accessible, since it will be embedded inside the relative clause.
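The top-down search just described can be summarised in a small sketch. The code below is our own toy rendering, assuming a plausibility oracle that stands in for the interaction with semantic and discourse knowledge; it is not the implementation described in the paper.

    # A toy rendering of the top-down search over accessible lowering sites.
    # Each site is paired with the displacement it entails; `acceptable` is
    # an oracle standing in for plausibility and grammaticality checks.

    def choose_lowering_site(sites, acceptable):
        """sites: accessible nodes in top-down order; returns the first
        node whose associated relativisation is acceptable, or None."""
        for node, displacement in sites:
            if acceptable(displacement):
                return node, displacement
        return None

    # Sites for the schematic clause [S Arg1 [V' Arg2 V0]]:
    SITES = [("S", "null"), ("V'", "single"), ("V0", "double")]

    # Example (14): 'acquaintance' is not a plausible location (null fails),
    # but is a plausible visitor, so the search stops at V'.
    print(choose_lowering_site(SITES, lambda d: d == "single"))
    # -> ("V'", 'single')

On the bottom-up search, the same three options would be tried in the reverse order, wrongly selecting double displacement first, as discussed above.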
8 Easy Double Displacements

The present analysis predicts that double displacement should be possible if, at the initial point of reanalysis (i.e. where the immediately post-clausal noun is encountered), the need to perform lowering consistent with double displacement is obvious. This is in contrast to both Gorrell (1995a,b) and Weinberg (1993), both of whom propose models which allow the displacement of an overt subject but not an overt object.22 Mazuka and Itoh (1995) give the following example, which reportedly causes no conscious processing difficulty, despite the fact that both the subject and object have to be displaced:

Footnote 22: Though, as we have seen above, if the lowering of a verb is permitted, Gorrell's model will allow the displacement of an overt object.
(17)
Hirosi ga aidoru kasyu wo [pro_nom t_acc=i kakusita] kamera_i de totta.
Hirosi NOM popular singer ACC hid camera with photographed
"Hiroshi photographed the popular singer with the camera he was hiding."
We assume, as before, that the overt nominative and accusative arguments are initially structured as arguments of the verb kakusita ("hid"). On encountering the head noun kamera, the parser first considers the null-displacement option, which is found to be implausible ("the camera where/when Hirosi hid the popular singer"). The single-displacement option is similarly ruled out ("the camera which hid the popular singer"). Finally, the double-displacement option is considered, and is found to be plausible ("the camera which (somebody) hid"). This option is adopted, and the remaining processing proceeds without trouble.

A similar effect can be seen if we consider topicalisation. In Japanese, topicalised elements, which are given an overt morphological marker wa, almost invariably occur in the matrix clause,23 though they may control a "gap" at any level of embedding. Below we reproduce the double-displacement example (10.c) with the nominative-marked argument topicalised. This is reported to be considerably easier to process than the non-topicalised version.

Footnote 23: Though see Kuroda (1988) for some limited exceptions.

(18)
Yamasita wa yuuzin wo [pro_nom t_acc=i houmonsita] kaisya_i de mikaketa.
Yamasita TOP friend ACC visited company LOC saw
"(As for) Yamasita, (he) saw his friend at the company he visited."
In the top-down search described above, we hypothesised that, on reaching the head noun kaisya ("company"), the processor first attempts to form a relative clause consistent with null displacement, with a locative relativisation reading. However, in (18), the parser can immediately eliminate this option, since it would involve a topicalised phrase, Yamasita wa, appearing in a subordinate clause, which is ungrammatical. The next option to be tried will be the single-displacement option, in which the constituent covering yuuzin wo houmonsita ("visited the friend") is lowered. However, this may be discounted on the grounds of plausibility, since "company" is not a plausible subject for "visited". The parser is then left with no choice but to go along with the double-displacement option, which eventually turns out to be correct. Thus (18) is correctly predicted to be easier than its non-topicalised counterpart, (10.c). However, on the bottom-up search, no difference would be predicted, since the V0 node will be the first node chosen as a prospective lowering site in both cases.
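The interplay of grammatical and plausibility filters in (18) can be made concrete with a small extension of the earlier sketch. Again this is our own illustration, with invented feature names:

    # A toy filter for (18): grammatical constraints (a wa-marked topic may
    # not end up inside the relative clause) are checked before plausibility.
    # The option names and features are ours, purely for illustration.

    def viable(option):
        if option["topic_in_relative_clause"]:
            return False    # topicalised phrases must stay in the matrix clause
        return option["plausible"]

    OPTIONS = [  # considered in top-down order, as in section 7.1
        {"name": "null",   "topic_in_relative_clause": True,  "plausible": True},
        {"name": "single", "topic_in_relative_clause": False, "plausible": False},
        {"name": "double", "topic_in_relative_clause": False, "plausible": True},
    ]

    print(next(o["name"] for o in OPTIONS if viable(o)))   # -> 'double'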
9 Explaining differences in search strategies

9.1 Three types of lowering

As discussed above, the use of informational monotonicity at the level of purely structural relations may result in non-monotonic behaviour at the level of "secondary" linguistic dependencies.
In this section we discuss three different scenarios in which lowering may apply, and in each case, we discuss the consequences for the dependencies between head and satellites. Note that the lowering operation discussed in this paper is what we may call retrospective lowering; that is, where the disambiguating word is preceded by the constituent that it lowers. Another possible operation, which we do not discuss here, is anticipatory lowering. In this case, the disambiguating word will precede the lowered constituent, and all the terminal nodes dominated by the lowered constituent are dangling nodes (i.e. they do not yet dominate lexical material).

1. head extension: A head projection is extended through the insertion of material at an intermediate point. Example: post-modifier attachment.

       XP        X'              XP
       |        /  \             |
       X'  +   X'   YP   ==>     X'
       |            |           /  \
       X            Y          X'   YP
                               |    |
                               X    Y

2. satellite detachment: A satellite projection is broken. Example: John knows the truth hurts, where the satellite projection between the DP dominating the truth and the matrix VP is broken. (A toy rendering of this schema appears after the list.)

        XP         ZP              XP
       /  \   +   /  \    ==>     /  \
      X    YP    YP   Z          X    ZP
                                     /  \
                                    YP   Z

3. head detachment: A head projection is split into two separate projections. Examples in the discussion of the Japanese data.

         XP            XP                 XP
        /  \          /  \               /  \
       YP   XP   +   WP   XP    ==>     YP   XP
           /  \          /  \               /  \
          ZP   XP       XP   W             WP   XP
               |                               /  \
               X                             XP    W
                                            /  \
                                           ZP   XP
                                                 |
                                                 X
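As a concrete (and deliberately simplified) illustration, schema 2 can be rendered as tree surgery on nested lists. This sketch is ours, not the paper's implementation, and it ignores the description-based (D-theory) encoding of trees:

    # Toy tree surgery for schema 2 (satellite detachment), with trees as
    # nested lists [label, child1, child2].  Our own encoding.

    def detach_satellite(xp, zp_label, z):
        """[XP X YP] + Z  ==>  [XP X [ZP YP Z]]"""
        label, head, satellite = xp
        return [label, head, [zp_label, satellite, z]]

    # 'John knows the truth hurts': the DP 'the truth' is detached from the
    # matrix VP and becomes the subject of the embedded clause.
    vp = ["VP", "knows", ["DP", "the truth"]]
    print(detach_satellite(vp, "S", "hurts"))
    # -> ['VP', 'knows', ['S', ['DP', 'the truth'], 'hurts']]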
9.2 Head-final and head-initial languages

Now let us consider the consequences of performing each of the three lowering operations in head-initial and head-final languages. We consider in particular the number of dependencies which must be broken in order to perform lowering on the above structures. By "dependency", we mean a (licensing) relation between a satellite and its head, both of which have been encountered in the input so far. In other words, we do not consider a dependency to exist between a head and its satellite if either of the two has not yet been encountered in the input.24

Footnote 24: Note that, in modular theories such as GB, there may be a number of licensing relations between a single head and a single satellite. In this case we do not count each licensing relation as a separate dependency, but treat the entire complex of licensing relations as one dependency.

Let us consider head extension first. On standard X-bar assumptions, this corresponds to the attachment of a modifier. In a head-final language, in which only pre-modifiers, and no post-modifiers, exist, this can only occur if the word which heads the projection to be extended has not yet been encountered in the input. Note that this cannot occur via retrospective lowering as we have defined it. In a head-initial language which allows post-modifiers, like English, however, head extension will be possible via retrospective lowering. In this case the lowering operation will add one dependency between the head and the newly adjoined constituent.

Now consider satellite detachment. In a head-final language, all satellites precede the head. This means that, once the head has been attached, none of the satellites will be accessible in the sense defined above, since none will dominate the head (i.e. the last word to be incorporated). So in a head-final language, satellite detachment will only be possible in cases where the licensing head has not yet been reached in the input, and therefore will not break a dependency. In a head-initial language, by contrast, any satellite which is preceded by its head will remain accessible when the phrase has been completed, and subsequent detachment of the satellite will result in breaking the dependency between that satellite and its head.

Finally, consider head detachment. This is very likely to be found in a head-final language. It corresponds to the case where a constituent, call it XP, has been completed, but the word subsequently found in the input (call it W) requires one of the nodes on X's head projection. This node is attached to W's left, and replaced with the root node of W's subtree projection. The word W, requiring a constituent on its left, may be a postposition, for example, in a head-final language. Head detachment may break any number of dependencies. In a head-final language, at the point where a constituent has been built with a head and all its satellites, but no further attachments have been made, all nodes on the head projection of that constituent will be accessible. If a head-detaching lowering operation subsequently has to be performed, then we will assume that the processor attempts to lower at a node consistent with breaking the smallest number of dependencies possible, and this will coincide with Inoue's "minimal expulsion strategy". It will be seen that the minimal expulsion preference can be derived via a preference to lower at the highest node possible, thus maintaining intact the largest number of dependencies between the head and its satellites. The reanalysis of an initially built clause as a relative
in the examples we have been discussing above may be seen as an instance of schema 3 above (abstracting away from syntactic details), where the word W corresponds to the post-clausal head noun, and XP is the clausal (verbal) projection.

In English, on the other hand, head extension and satellite detachment will be employed far more often. Since, as far as the number of broken dependencies is concerned, nothing hinges on the choice of lowering site for either of these, the processor may be following a different strategy, which may include a preference to lower nodes which have been created as recently as possible (i.e. assuming a right-branching structure, to choose a low site). Considerations such as these may well underlie the differing search strategies between the two languages.
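The dependency arithmetic behind the minimal expulsion preference can be shown in a few lines. This sketch assumes the schematic head-final clause from section 7.1; it is our own illustration of the counting argument, not code from the paper:

    # In a head-final clause the verb has been read, so every overt argument
    # already stands in a dependency with it.  Lowering at a site displaces
    # the arguments above that site, severing their dependencies.

    ARGS_ABOVE = {               # arguments displaced by lowering at each site
        "S":  [],                # null displacement
        "V'": ["Arg1"],          # single displacement
        "V0": ["Arg1", "Arg2"],  # double displacement
    }

    for site in ["S", "V'", "V0"]:    # top-down order
        print(site, "breaks", len(ARGS_ABOVE[site]), "dependencies")
    # S breaks 0, V' breaks 1, V0 breaks 2: the top-down search tries sites
    # in order of increasing damage, reproducing Inoue's minimal expulsion.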
10 Limitations of the system

10.1 Interaction with discourse context and world knowledge

Gorrell (1995a, ch. 5) argues that a model incorporating structural determinism must allow weak interaction of non-syntactic knowledge, in the sense of Crain and Steedman (1985). That is to say that, at each point in processing, it must be possible for discourse information and real-world knowledge to be used in order to choose among different alternative analyses proposed by the syntactic processor. We saw Japanese examples in the previous section where it was crucial for the parser to be able to judge the plausibility of particular candidates for relativisation in order for the correct predictions to obtain. The constraint of informational monotonicity requires that the parser must decide on the appropriate candidate and commit to that analysis before the next word is considered. In order for the plausibility of a particular analysis to be checked, semantic interpretation must be possible, and so we must assume that the structure corresponding to that analysis has been built. This means that the processor must be able to entertain structure at two different levels of commitment: a tentative level for checking the plausibility of alternatives, and an irrevocable level for asserting relations into the database. Apart from any conceptual objections there may be to such an architecture, the model faces problems in cases where the implausibility of an initially preferred analysis forces retrospective reanalysis. Consider the examples in (19):

(19)
a. John saw the man with a telescope.
b. John saw the man with a moustache.
The constraint of incrementality means that, at the point at which with is encountered in the input in these examples, the parser must decide on an argument or adjunct reading. It is known (Frazier and Rayner, 1982) that there is a preference for the instrumental reading, in which the PP is attached as an argument to the verb. This is predicted in the present model because of the preference for argument over adjunct attachment. In the case of (19.a),
this analysis remains plausible, since a telescope is a reasonable instrument with which to see a man. In the case of (19.b), however, the instrumental analysis is eventually found to be implausible, since moustaches are, in general, rather poor optical instruments, and with a moustache will have to be reattached as an adjunct of the NP a man. We show the two analyses in (20) and (21).

(20)
                 VP
        _________|_____________
       V         DP            PP
       |         |              |
      saw      a man    with a telescope

(21)

        VP
       /  \
      V    DP
      |   /  \
    saw  DET  NP
          |  /  \
          a NP   PP
             |    |
             N   with a moustache
             |
            man
On encountering the word moustache in (19.b), the parser will have to reanalyse from the structure in (20) to that in (21). But this will result in an incoherent description, since in (20) we have prec(NP,PP), and in (21) we have dom(NP,PP), against the exclusivity condition.25

Footnote 25: The reader can verify that a binary-branching variant of (20) will suffer the same problem.

A possible solution to this is to compromise word-by-word incrementality, and allow the preposition with and determiner a to be buffered until the head noun is reached, but this does not really help, since unbounded lexical material may have to be processed before the head noun is reached: John saw the man with the stylish old-fashioned .... moustache.
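The exclusivity condition invoked here is easy to state as a check over a D-theory-style description. The following sketch is our own; the relation names follow the prec/dom notation used above:

    # A minimal coherence check: a description is incoherent if some pair of
    # nodes stands in both precedence and dominance (the exclusivity
    # condition).  Our own sketch, using the prec/dom notation from the text.

    def coherent(dom, prec):
        """dom, prec: sets of (a, b) node pairs asserted so far."""
        return all((a, b) not in dom and (b, a) not in dom for (a, b) in prec)

    prec = {("NP", "PP")}                  # asserted when (20) is built
    print(coherent(set(), prec))           # True:  (20) alone is coherent
    print(coherent({("NP", "PP")}, prec))  # False: adding dom(NP,PP) for (21)
                                           # makes the description incoherent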
10.2 Competition effects in post-modifier attachment preferences

Gibson et al (1993) give evidence for competition effects in English post-modifier attachments, which are not easily captured in a simple top-down or bottom-up search. Consider the following examples, from Gibson et al (1993), where in each case agreement constraints allow only one of the three possible attachment sites for the relative clause.
(22)
a. the lamps near the paintings of the house that was damaged in the flood.
b. the lamps near the painting of the houses that was damaged in the flood.
c. the lamp near the paintings of the houses that was damaged in the flood.
Intuitive evidence, as well as off-line and on-line experimental data reported in Gibson et al, suggests a preference for the low attachment exemplified in (22.a). However, the second choice is not the middle attachment site, (22.b), which would be predicted on a purely bottom-up search strategy, but rather the high attachment in (22.c), with (22.b) as the least preferred analysis, causing conscious processing difficulty. Gibson et al explain this in terms of a competition effect where post-modifier attachment is sensitive to two opposing criteria: a recency effect, which favours low attachment, and a preference to attach to "the elements bearing the core semantic content of the clause, i.e. the components of the head verbal θ-grid of a clause, consisting of a verbal role assigner and its role-receivers" (Gibson et al, p. 9). In examples such as (22), this predicts a competition effect between the lowest site, which will be favoured by the recency criterion, and the highest site, which occurs in the minimal constituent capable of being assigned a θ-role from outside.

Note that, in this case also, a commitment to the attachment of the relative clause will have to be made at the point where that is received. If subsequent (agreement) information shows this to be mistaken, then the parser will be forced to reanalyse. The bottom-up search will commit the parser to the lowest attachment site, corresponding to the structure of (22.a). However, the reader can verify that subsequent reanalysis to a higher site will be impossible to derive via tree-lowering (or any other coherence-preserving reanalysis procedure). However, intuitive evidence suggests that, though (22.c) is indeed harder to process than (22.a), it is only (22.b) that causes a conscious garden path effect.26

Gibson et al have also shown that the Spanish equivalent of (22) exhibits the same order of attachment preferences as the English one, i.e. LOW > HIGH > MID. This is despite the fact that, as Cuetos & Mitchell (1988) and Mitchell & Cuetos (1991) show, Spanish readers have a preference to attach high in examples with two possible attachment sites. This suggests that the recency effect, favouring low attachment, may take precedence over a preference for high attachment when the relevant "distance" crosses a certain threshold.

Footnote 26: Weinberg (1993) proposes that adjuncts are initially attached high, and may subsequently be lowered through the addition of dominance relations. If an adjunct is lowered in this way, structural coherence will be preserved just so long as precedence relations are not asserted. In the present implementation, as well as in Gorrell's original model, precedence relations are obligatorily asserted. Note that, for the examples under discussion, Weinberg's system will predict an early closure preference HIGH > MID > LOW.
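One way to see how two opposing criteria can yield the order LOW > HIGH > MID is to score the sites numerically. The weights below are purely illustrative (Gibson et al give no such numbers); the point is only that any weighting in which the θ-grid bonus outweighs one step of recency, but not two, produces this order:

    # Illustrative cost model for the three attachment sites in (22).
    # Weights are invented for exposition; only their relative size matters.

    SITES = {
        "low":  {"recency": 0, "in_theta_grid": False},  # most recent site
        "mid":  {"recency": 1, "in_theta_grid": False},
        "high": {"recency": 2, "in_theta_grid": True},   # NP in the verbal θ-grid
    }

    def cost(name, w_recency=1.0, w_theta=1.5):
        site = SITES[name]
        return w_recency * site["recency"] + (0.0 if site["in_theta_grid"] else w_theta)

    print(sorted(SITES, key=cost))   # -> ['low', 'high', 'mid']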
11 Non-monotonicity of semantic interpretation and secondary relations

Gorrell's original model, in common with approaches such as Pritchett (1992), assigns great importance to the syntactic level of representation in explaining garden path phenomena, and,
as such, does not concern itself with issues of semantic interpretation. Our implementation inherits this concentration on syntax. Ultimately, however, for a processing model to be plausible, we must show that the parser is capable of building semantically interpretable structures. We believe that the best approach for the model described here is that suggested in the original D-theory paper (Marcus et al (1983)), in which the dominance relations at each point in processing are "strengthened" to immediate dominance, to form an interpretable "default" tree. If this is done, then we can view the semantic interpreter as taking a "snapshot" of the default tree as each word is read. This means that, in the case of infinite local ambiguity, even though the description will represent an infinite set of partial trees, the semantic interpreter only considers one of those trees at any one point in processing.

This view of semantic interpretation is similar in spirit to certain computational approaches, such as that of Shieber and Johnson (1993), which uses TAGs to pack together trees which share the same recursive structure. In this system, a semantic tree is built up simultaneously with the syntactic tree in the Synchronous TAG formalism. Default semantic values can be obtained from the semantic tree at each point in processing by assuming that no further TAG adjunctions are to be performed. Another similar approach is that of Thompson et al (1991), in which an atomic-category context-free grammar is compiled into a strongly equivalent TAG.

Vijay-Shanker (1992) presents a version of TAG which crucially relies on the manipulation of descriptions of trees in the D-theory sense. Despite its apparent similarity to the model described here, however, Vijay-Shanker's approach is actually quite different. In his system, what are thought of as auxiliary nodes in the standard TAG formalism are replaced by pairs of nodes related by the domination relation. An auxiliary tree can be adjoined to such a pair of nodes by identifying its root node with the "higher" (i.e. the dominator) of the pair of nodes and identifying its foot node with the "lower" (i.e. the dominated) of the pair. A default tree can then be derived by identifying the higher with the lower of all such pairs of nodes (this does not contradict the domination relation between the pair, since dominance is reflexive, so that we can have [dom(A,B) ∧ A = B]). Vijay-Shanker notes that such a formalism provides a natural framework for incorporating feature structures into TAGs, since the "identification" of two nodes can be accompanied by the unification of their feature structures. Constraints on adjunction can also be derived; for example, obligatory adjunction at a pair of nodes can be ensured by giving the pair incompatible feature structures, so that identification of the two nodes is impossible. However, once two nodes have been identified, further material can only be adjoined at that site if the unification of the two feature structures is undone. In other words, to preserve monotonicity at the feature-structure level, we can only derive a default tree if we are sure that no further adjunctions are to be performed; otherwise we run the risk of having to unravel previous unifications. This indicates that, if we were to introduce more features than simply maximal category and bar information onto nodes in our model, such features would have to be manipulated in a non-monotonic fashion.
This is related to the point that monotonicity holds at the level of primary relations, while secondary relations (such as those that would be handled by most of the feature attributes) are not so constrained. A similar point can be made in relation to semantic interpretation. In terms of the "snapshot" metaphor, this means that the semantic representation derived at one snapshot may differ fairly radically from previous snapshots (they will not necessarily be related by the entailment relation, for example). Let us consider another "easy" single-displacement Japanese example:

(23)
Mary ga [t_nom=i banana wo tabeta] neko_i wo sikatta.
Mary NOM banana ACC ate cat ACC scolded
"Mary scolded the cat that ate the banana."
At the point where tabeta is received, we may assume that a meaning representation something like the following has been computed:
∃x.[banana′(x) ∧ eat′(m′, x)]

However, by the end of the sentence, the meaning has changed to the following:
∃x.∃y.[banana′(x) ∧ cat′(y) ∧ eat′(y, x) ∧ scold′(m′, y)]

Thus, while at one point the constant m′ (denoting the individual Mary) has been applied to the function eat′, at a later point it is replaced by an existentially quantified variable, and m′ is applied instead to the function scold′. This presupposes that the processor must be capable of being fairly destructive at the semantic level.

The type of model described here, in which syntax is given a fairly privileged place in the processing architecture, may be contrasted with models which are concerned primarily with incrementally extracting semantic representations, and in which logical forms are built directly, without the explicit construction of a purely syntactic level of representation (e.g. Pulman (1986), Milward (1994, 1995)). Such models derive logical forms entirely non-destructively by using logical devices such as higher-order abstraction. There is also a strong contrast with models such as Crocker (1992), which incorporates a separate thematic level of representation in addition to a phrase-structure level within the syntactic processor. Reanalysis (modelled as Prolog-style backtracking) within the phrase-structure module is predicted to result in negligible processing costs, while reanalysis within the thematic module (as would be required to process (23) above) is predicted to cause conscious difficulty.

Let us consider Milward's semantics-based model (1994, 1995). The grammar for this system contains a rule of state-application, which is similar to the standard functional application rule of categorial grammar, as well as a rule of state-prediction, which may be seen as a form of subordination operation. In a case where, for example, a constituent of type X is needed in order to complete a sentence, but a constituent of type Y is actually found, the rule "subordinates" Y under the (still incomplete) X, and predicts a syntactic type which takes Y as an argument to its left in order to form an X. The following is intended as an illustration of a possible instance of the rule.27

Footnote 27: Here, the "hat" prefix "^" is intended to represent the fact that the argument following it has not yet been found in the input, and Y\X is the syntactic type of a function from a Y on the left to an X.
       S                         S
      / \                       / \
     A   ^X     +     Y  ==>   A   X
                                  / \
                                 Y   ^Y\X
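To make the schema concrete, here is a minimal executable rendering of state-prediction. The encoding is our own (Milward's actual system uses typed lambda terms and richer categories):

    # Toy encoding of state-prediction: when an X is expected but a Y is
    # found, Y is subordinated under X and the new expectation is Y\X
    # (a function from a Y on the left to an X).  Our own encoding.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Under:
        arg: str      # Y
        result: str   # X
        def __str__(self):
            return f"{self.arg}\\{self.result}"

    def state_predict(expected, found):
        """Subordinate `found` under `expected`; return the new expectation."""
        return Under(arg=found, result=expected)

    # After 'Mary thinks', an s is expected; the np 'John' triggers
    # prediction of np\s, i.e. a verb phrase (cf. (24) below).
    print(state_predict("s", "np"))   # -> np\s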
Recall example (6), repeated below as (24):

(24)
Mary thinks John .....
In the case of (24) above, the rule will subordinate the embedded subject NP under the expected clausal argument, and predict a category of type np\s (i.e. a category which requires an NP to its left in order to become a sentence, for example a verb phrase). There are some interesting parallels with the tree-lowering operation; for example, prediction is also used in the attachment of post-modifiers. The difference is that, in Milward's model, the need to build a semantic representation non-destructively results in a greater degree of non-determinism. For example, on finding a transitive verb, such as likes, in the input, the parser has to choose whether to project only the object NP, or the object NP plus a VP modifier. This corresponds to deciding whether to apply state-application (projecting only the NP), or to apply state-prediction (projecting the modifier as well).28 In the case of tree-lowering (and other TAG-inspired systems), the default unmodified structure can be built initially, and the post-modifier can be adjoined later as necessary. The difference can be seen clearly when we consider verbs such as know, which may take either a sentential or an NP (in our terms, DP) internal argument. In Milward's model, assuming a preference for the direct object reading, the parser will initially attach the post-verbal NP as a direct object, via state-application. A subsequent tensed verb in the input will then force the parser to backtrack, and project an S node. The post-verbal NP will subsequently be subordinated to the embedded subject position via prediction. In contrast, the present approach initially attaches the DP, which is subsequently subordinated via an explicit parsing operation, and without the use of backtracking.29

Retaining monotonicity at the level of primary structural relations is therefore achieved only at the cost of considerable non-monotonicity at the level of secondary relations. If Gorrell is correct that the human parser does indeed work in this way, this implies that the level of representation at which such relations are expressed plays a vital role in processing. On this view, the primary relations would be seen as an anchor which is used to ground the interpretation of a fluid and capricious semantic representation.

Footnote 28: In fact, state-prediction offers other choices as well (see Milward (1995) for details).

Footnote 29: It may be possible to reduce the non-determinism of Milward's approach through the use of disjunctive categories. This would make it similar to approaches in which category labels may be left underspecified until the disambiguating information is found (cf. Weinberg 1993) (David Milward, personal communication).
12 Other Computational Approaches to Human Syntactic Reanalysis

Suzanne Stevenson's competition-based model (1993, 1994) derives from the constraints of a connectionist architecture a number of very interesting predictions for syntactic processing. The model is similar to that proposed here in the sense that reanalysis is not seen as qualitatively different from simple attachment. In fact Stevenson's constraint that reanalysis should only involve nodes on the right edge of the tree, which follows from the space constraints inherent in the connectionist architecture, leads to a reanalysis operation which is very similar to tree-lowering constrained by accessibility, as defined here. The decay of network activation predicts an empirically supported recency preference in reanalysis as well as attachment, which means that the "bottom-up" search strategy which we have discussed here does not have to be stipulated. However, it is not clear how the model would perform in processing head-final languages, where, assuming an incremental processing regime, the recency preference would presumably predict the opposite of the "top-down", dependency-preserving search strategy motivated for Japanese in this paper. That is to say that a strategy which takes more grammatical information into account may be required in addition to the simple recency preference which is predicted in a decay-of-activation model. Also, the constraints of Stevenson's model may in fact be too strong. For example, phrase structure cannot be postulated without explicit evidence from the input, so that, in a sentence such as Mary thinks John likes oranges, processing difficulty is predicted on the input of John, because thinks does not select for an NP, and therefore no attachment site can be postulated from explicit evidence in the input up to that point in processing. On the other hand, in a model such as we have been discussing in this paper, which does not forbid non-lexically-driven prediction, this problem need not arise, as we have seen.

Richard Lewis (1993) presents a comprehension model, nl-soar, which incorporates a syntactic reanalysis component. If, on the attachment of an incoming word, an inconsistency is detected within the local maximal projection to which the incoming word is attached, then nl-soar's "snip" operator can break a previous attachment within this maximal projection and reattach it elsewhere in the tree. This operation is more powerful than tree-lowering in the sense that the phrase detached by the snip operator does not have to be reattached in a position which is dominated by the projection of the incoming, disambiguating word. This results in an impressive range of correct predictions. In particular, in a sentence such as Is the block on the table red?, the disambiguating word red can trigger the snip operator to detach the PP on the table, which has previously been attached as the complement of the copula, and reattach it as an adjunct of block, correctly predicting this sentence to cause no processing difficulty. Lowering will not account for this, since the post-reanalysis position of on the table is not dominated by the projection of the disambiguating word red. However, nl-soar does overgenerate on a class of examples such as The psychologist told the woman that he was having trouble with to leave
and The boy put the book on the table on the shelf, which both involve reattaching material into a preceding phrase; this is not possible in Gorrell's model, and also cannot be generated via tree-lowering.
13 Conclusions

In this paper we have explored the consequences of combining informational monotonicity with word-by-word incrementality. Given the high degree of commitment which the parser has to make before proceeding from one word to the next, it is vital that the optimal analysis should be chosen at each stage in processing, and thus a heavy responsibility is given to the search strategy for finding the required solution. Given this constraint, the range of examples which can be handled is surprisingly wide. However, as shown in the previous section, the information available to the processor at the point where the search has to be undertaken is often inadequate, and this results in the failure to process examples which do not seem to cause conscious processing difficulty to humans.

The monotonic approach fits well with the idea of a clear-cut binary division of reanalysis into conscious and unconscious, but does not support the notion that utterances may be graded by their difficulty on a sliding scale. This, for example, is why it is difficult to account for the preference order LOW > HIGH > MID in the example of Gibson et al in section 10.2. In the monotonic system, the bottom-up search fixes on the lowest attachment site, and rigidly holds to it, predicting all other sites to be difficult. A similar phenomenon of "graded difficulty" can be seen in the following "single displacement" Japanese examples:

(25)
a. Mary ga banana wo tabeta.
   Mary NOM banana ACC ate
   "Mary ate a banana."

b. Mary ga [t_nom=i banana wo tabeta] neko_i wo sikatta.
   Mary NOM banana ACC ate cat ACC scolded
   "Mary scolded the cat that ate the banana."

c. Mary ga [t_nom=j [t_nom=i banana wo tabeta] neko_i wo sikatta] kodomo_j wo hometa.
   Mary NOM banana ACC ate cat ACC scolded child ACC praised
   "Mary praised the child who scolded the cat which ate the banana."
As we have mentioned, native speakers do not report conscious difficulty at the disambiguating word neko in (25.b). However, Inoue mentions that in examples such as (25.c) there is a slight sense of difficulty at the second disambiguating word, kodomo, despite the fact that the application of tree-lowering with the top-down search has no trouble in reanalysing here. Inoue reports that the sense of difficulty increases with each reanalysis when similar examples with deeper embedding are considered, presumably as a consequence of the subject argument becoming more and more distant from each successive point of reanalysis. It is exactly this graded scale of difficulty which is hard to capture in a monotonic model.

Though degrees of difficulty during reanalysis may be detectable using experimental techniques (reading time, eye-tracking and event-related potentials), there has so far been no
reliable experimental evidence for a binary distinction, despite the initial intuitive appeal. Thus, it seems that we have to consider a more complex architecture than the two-level core parser with higher-level resolver. Graded difficulty may be captured, for example, in a more modular approach, such as that of Crocker (1992), where the relative difficulty of reanalysis is explained in terms of backtracking within or across subsystems of the processor, or by introducing the concept of processing load (Gibson (1991)).

What we believe the research reported here shows is that, as both Gorrell (1995a,b) and Pritchett (1992) argue, configurational information does play an important role in the degree of difficulty which the processor experiences in performing reanalysis. In particular, we may assume the existence of a processing correlate to the grammatical notion of subordination (corresponding to tree-lowering in the present implementation), which can be applied relatively cheaply. Exactly how the various other factors interact with this is a question for future research.
References

Abney, S. P. (1987): Licensing and Parsing. Proceedings of NELS 17, p.1-15, University of Massachusetts, Amherst

Abney, S. P. (1989): A computational model of human parsing. Journal of Psycholinguistic Research 18, p.129-144

Bever, T. (1970): The cognitive basis for linguistic structures. In J.R. Hayes (ed) Cognition and the Development of Language, p.279-360, New York: Wiley

Cornell, T.L. (1994): On Determining the Consistency of Partial Descriptions of Trees. Proceedings of ACL, p.163-170

Crain, S. and M. Steedman (1985): On not being led up the garden path: The use of context by the psychological parser. In D. Dowty, L. Karttunen, and A. Zwicky (eds) Natural Language Processing: Psychological, Computational and Theoretical Perspectives. Cambridge: Cambridge University Press

Crocker, M. W. (1992): A Logical Model of Competence and Performance in the Human Sentence Processor. PhD thesis, Dept. of Artificial Intelligence, University of Edinburgh, Edinburgh, UK

Crocker, M. W. (1994): On the Nature of the Principle-Based Sentence Processor. In Clifton, Frazier and Rayner (eds) Perspectives on Sentence Processing. New York: Lawrence Erlbaum

Cuetos, F. & Mitchell, D.C. (1988): Cross-linguistic differences in parsing: Restrictions on the use of the Late Closure strategy in Spanish. Cognition 30, p.73-105
Fodor, J.D. & Inoue, A. (1994): The diagnosis and cure of garden paths. In V. Teller (ed), special edition of Journal of Psycholinguistic Research 23.4, p.405-432

Frazier, L. (1987): Syntactic processing: Evidence from Dutch. Natural Language and Linguistic Theory 5.4, p.519-559

Gibson, E. (1991): A Computational Theory of Human Linguistic Processing: Memory Limitations and Processing Breakdown. PhD thesis, Carnegie Mellon University

Gibson, E., N. Pearlmutter, E. Canesco-Gonzalez and G. Hickok (1993): Cross-linguistic Attachment Preferences: Evidence from English and Spanish. (ms. submitted to Cognition)

Gorrell, P. (1987): Studies of Human Syntactic Processing: Ranked-Parallel versus Serial Models. PhD thesis, University of Connecticut

Gorrell, P. (1995a): Syntax and Parsing. Cambridge University Press, UK

Gorrell, P. (1995b): Japanese Trees and the Garden Path. (to appear in Mazuka & Nagai (eds))

Gunji, T. (1987): Japanese Phrase Structure Grammar. Dordrecht: Reidel

Inoue, A. (1991): A comparative study of parsing in English and Japanese. PhD thesis, University of Connecticut

Inoue, A. and J.D. Fodor (1995): Information-paced parsing of Japanese. (in Mazuka & Nagai (eds))

Joshi, A.K., L.S. Levy, and M. Takahashi (1975): Tree Adjunct Grammars. Journal of Computer and System Sciences 10, p.136-163

Joshi, A.K., K. Vijay-Shanker, and D. Weir (1991): The Convergence of Mildly Context-Sensitive Grammar Formalisms. In P. Sells, S.M. Shieber, and T. Wasow (eds) Foundational Issues in Natural Language Processing, MIT Press, Cambridge, MA

Keenan, E. and B. Comrie (1977): Noun phrase accessibility and Universal Grammar. Linguistic Inquiry 8, p.63-100

Kuroda, S-Y. (1988): Whether we agree or not: a comparative syntax of English and Japanese. In W. J. Poser (ed) Papers from the Second Workshop on Japanese Syntax, CSLI Publications, Stanford, CA

Lewis, R. L. (1993): An Architecturally-based Theory of Human Sentence Comprehension. PhD thesis, Carnegie Mellon University. (Technical Report CMU-CS-93-226, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA)
Marcus, M., D. Hindle, and M. Fleck (1983): D-theory: Talking about talking about trees. Association for Computational Linguistics 21, p.129-136

Mazuka, R., and K. Itoh (1995): Can Japanese be led down the garden path? (in Mazuka and Nagai (eds))

Mazuka, R., and Nagai (eds) (1995): Japanese Syntactic Processing. LEA

Mazuka, R., K. Itoh, S. Kiritani, S. Niwa, K. Ikejiri, and K. Naitoh (1989): Processing of Japanese garden-path, center-embedded, and multiply-left-embedded sentences: Reading time data from an eye movement study. Annual Bulletin of the Research Institute of Logopedics and Phoniatrics 23, p.187-212

Milward, D.R. (1994): Dynamic Dependency Grammar. Linguistics and Philosophy 17, p.561-605

Milward, D.R. (1995): Incremental interpretation of categorial grammar. Proceedings of the 7th European ACL, Dublin, Ireland, p.119-126

Mitchell, D.C. (1989): Verb Guidance and other lexical effects in parsing. Language and Cognitive Processes 4 (3/4) SI, p.123-154

Mitchell, D.C. & Cuetos, F. (1991): The origins of parsing strategies. Conference Proceedings: Current Issues in Natural Language Processing, University of Texas at Austin, TX

Partee, B., A. ter Meulen and R. E. Wall (1993): Mathematical Methods in Linguistics. Dordrecht: Kluwer Academic Publishers

Pickering, M. and Shillcock, R. (1992): Processing Subject Extractions. In H. Goodluck and M. Rochemont (eds) Island Constraints: Theory, Acquisition and Processing, Kluwer Academic Publishers, Dordrecht, p.295-320

Pritchett, B. L. (1992): Grammatical Competence and Parsing Performance. Chicago, IL: University of Chicago Press

Pulman, S. G. (1986): Grammars, parsers and memory limitations. Language and Cognitive Processes 1(3), p.197-225

Schabes, Y., A. Abeille, and A.K. Joshi (1988): New parsing strategies for tree adjoining grammars. In Proceedings of the 12th International Conference on Computational Linguistics

Shieber, S. and M. Johnson (1993): Variations on Incremental Interpretation. Journal of Psycholinguistic Research 22, No. 2, p.287-318

Shibatani, M. (1990): The Languages of Japan. CUP: Cambridge, UK
Stabler, E. P. (1994a): Parsing for incremental interpretation. (Ms. UCLA)

Stabler, E. P. (1994b): Syntactic Preferences in Parsing for Incremental Interpretation. (Ms. UCLA)

Steedman, M. J. (1989): Grammar, interpretation and processing from the lexicon. In Marslen-Wilson, W. (ed) Lexical Representation and Process, MIT Press, Cambridge, MA

Stevenson, S. (1993): A Competition-Based Explanation of Syntactic Attachment Preferences and Garden Path Phenomena. In Proceedings of the 31st ACL, p.266-273

Stevenson, S. (1994): Competition and Recency in a Hybrid Network Model of Syntactic Disambiguation. Journal of Psycholinguistic Research 23, 4, p.295-321

Sturt, P. (1994): Unconscious Reanalysis in an Incremental Processing Model. Ms. Centre for Cognitive Science, Edinburgh

Thompson, H., M. Dixon and J. Lamping (1991): Compose-Reduce Parsing. In Proceedings of the 29th ACL, p.87-97

Vijay-Shanker, K. (1992): Using Descriptions of Trees in a Tree Adjoining Grammar. Computational Linguistics 18(4), p.481-517

Weinberg, A. (1993): Parameters in the theory of Sentence Processing: Minimal Commitment Theory goes East. Journal of Psycholinguistic Research 22, No. 3, p.339-364
3

Incremental Interpretation of Categorial Grammar

David Milward

1 Introduction 68
2 Applicative Categorial Grammar 70
3 AB Categorial grammar with Associativity (AACG) 73
4 An Incremental Parser 74
5 Parsing Lexicalised Grammars 78
6 Conclusion 80

Edinburgh Working Papers in Cognitive Science, Vol. 11: Incremental Interpretation, pp. 67-83. D. Milward and P. Sturt, eds. Copyright © 1995 David Milward.
Abstract
The paper describes a parser for Categorial Grammar which provides fully word by word incremental interpretation. The parser does not require fragments of sentences to form constituents, and thereby avoids problems of spurious ambiguity. The paper includes a brief discussion of the relationship between basic Categorial Grammar and other formalisms such as HPSG, Dependency Grammar and the Lambek Calculus. It also includes a discussion of some of the issues which arise when parsing lexicalised grammars, and the possibilities for using statistical techniques for tuning to particular languages.1
1 Introduction

There is a large body of psycholinguistic evidence which suggests that meaning can be extracted before the end of a sentence, and before the end of phrasal constituents (e.g. Marslen-Wilson 1973, Tanenhaus et al. 1990). There is also recent evidence suggesting that, during speech processing, partial interpretations can be built extremely rapidly, even before words are completed (Spivey-Knowlton et al. 1994)². There are also potential computational applications for incremental interpretation, including early parse filtering using statistics based on logical form plausibility, and interpretation of fragments of dialogues (a survey is provided by Milward and Cooper, 1994, also included in this volume, henceforth referred to as M&C). In the current computational and psycholinguistic literature there are two main approaches to the incremental construction of logical forms. One approach is to use a grammar with `non-standard' constituency, so that an initial fragment of a sentence, such as John likes, can be treated as a constituent, and hence be assigned a type and a semantics. This approach is exemplified by Combinatory Categorial Grammar, CCG (Steedman 1991), which takes a basic CG with just application, and adds various new ways of combining elements together³. Incremental interpretation can then be achieved using a standard bottom-up shift-reduce parser, working from left to right along the sentence. The alternative approach, exemplified by the work of Stabler on top-down parsing (Stabler 1991), and Pulman on left-corner parsing (Pulman 1986), is to associate a semantics directly with the partial structures formed during a top-down or left-corner parse. For example, a syntax tree missing a noun phrase, such as the following

        s
       / \
     np    vp
    John  /  \
         v    np^
       likes
1 This paper appears in the Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, University College Dublin, 1995, pp. 119-126. The research reported here was supported by the UK Science and Engineering Research Council, Research Grant RR30718. I am grateful to Patrick Sturt, Carl Vogel, and the reviewers for comments on an earlier version.
2 Spivey-Knowlton et al. reported 3 experiments. One showed effects before the end of a word when there was no other appropriate word with the same initial phonology. Another showed on-line effects from adjectives and determiners during noun phrase processing.
3 Note that CCG doesn't provide a type for all initial fragments of sentences. For example, it gives a type to John thinks Mary, but not to John thinks each. In contrast the Lambek Calculus (Lambek 1958) provides an infinite number of types for any initial sentence fragment.
can be given a semantics as a function from entities to truth values i.e. λx. likes(john,x), without having to say that John likes is a constituent. Neither approach is without problems. If a grammar is augmented with operations which are powerful enough to make most initial fragments constituents, then there may be unwanted interactions with the rest of the grammar (examples of this in the case of CCG and the Lambek Calculus are given in Section 2). The addition of extra operations also means that, for any given reading of a sentence, there will generally be many different possible derivations (so-called `spurious' ambiguity), making simple parsing strategies such as shift-reduce highly inefficient. The limitations of the parsing approaches become evident when we consider grammars with left recursion. In such cases a simple top-down parser will be incomplete, and a left-corner parser will resort to buffering the input (so won't be fully word-by-word). M&C illustrate the problem by considering the fragment Mary thinks John. This has a small number of possible semantic representations (the exact number depending upon the grammar) e.g.

    λP. thinks(mary,P(john))
    λP.λQ. Q(thinks(mary,P(john)))
    λP.λR. (R(λx.thinks(x,P(john))))(mary)
The second representation is appropriate if the sentence finishes with a sentential modifier. The third allows there to be a verb phrase modifier. If the semantic representation is to be read off syntactic structure, then the parser must provide a single syntax tree (possibly with empty nodes). However, there are actually any number of such syntax trees corresponding to, for example, the first semantic representation, since the np and the s can be arbitrarily far apart. The following tree is suitable for the sentence Mary thinks John shaves but not for e.g. Mary thinks John coming here was a mistake.

        s
       / \
     np    vp
    Mary  /  \
         v     s
      thinks  / \
            np    vp^
           John
M&C suggest various possibilities for packing the partial syntax trees, including using Tree Adjoining Grammar (Joshi 1987) or Description Theory (Marcus et al. 1983). One further possibility is to choose a single syntax tree, and to use destructive tree operations later in the parse⁴.

4 This might turn out to be similar to one view of Tree Adjoining Grammar, where adjunction adds into a pre-existing well-formed tree structure. It is also closer to some methods for incremental adaptation of discourse structures, where additions are allowed to the right-frontier of a tree structure (e.g. Polanyi and Scha 1984). There are however problems with this kind of approach when features are considered (see e.g. Vijay-Shanker 1992).
The approach which we will adopt here is based on Milward (1992, 1994). Partial syntax trees can be regarded as performing two main roles. The first is to provide syntactic information which guides how the rest of the sentence can be integrated into the tree. The second is to provide a basis for a semantic representation. The first role can be captured using syntactic types, where each type corresponds to a potentially infinite number of partial syntax trees. The second role can be captured by the parser constructing semantic representations directly. The general processing model therefore consists of transitions of the form:

    Syntactic type_i          Syntactic type_i+1
    Semantic rep_i       →    Semantic rep_i+1
This provides a state-transition or dynamic model of processing, with each state being a pair of a syntactic type and a semantic value. The main difference between our approach and that of Milward (1992, 1994) is that it is based on a more expressive grammar formalism, Applicative Categorial Grammar, as opposed to Lexicalised Dependency Grammar. Applicative Categorial Grammars allow categories to have arguments which are themselves functions (e.g. very can be treated as a function of a function, and given the type (n/n)/(n/n) when used as an adjectival modifier). The ability to deal with functions of functions has advantages in enabling more elegant linguistic descriptions, and in providing one kind of robust parsing: the parser never fails until the last word, since there could always be a final word which is a function over all the constituents formed so far. However, there is a corresponding problem of far greater non-determinism, with even unambiguous words allowing many possible transitions. It therefore becomes crucial to either perform some kind of ambiguity packing, or language tuning. This will be discussed in the final section of the paper.
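The word-by-word discipline of this dynamic model can be pictured as a tiny nondeterministic machine. The sketch below (Python, an illustration rather than part of the original paper; the transitions argument stands in for the two rules defined in Section 4) makes the discipline explicit: there are no empty moves, so each input word maps the current set of states to a new set.

    # A minimal sketch of the state-transition model: each state is a pair
    # of a syntactic type and a semantic value, and each word maps a state
    # to a set of successor states. `transitions` is a placeholder for the
    # State-Application/State-Prediction rules defined later.
    def parse(words, start_state, transitions):
        states = [start_state]
        for w in words:
            # every transition consumes exactly one word; no empty moves
            states = [s2 for s in states for s2 in transitions(s, w)]
        return states   # an empty list means the fragment has no analysis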
2 Applicative Categorial Grammar

Applicative Categorial Grammar is the most basic form of Categorial Grammar, with just a single combination rule corresponding to function application. It was first applied to linguistic description by Ajdukiewicz and Bar-Hillel in the 1950s. Although it is still used for linguistic description (e.g. Bouma and van Noord, 1994), it has been somewhat overshadowed in recent years by HPSG (Pollard and Sag 1994), and by Lambek Categorial Grammars (Lambek 1958). It is therefore worth giving some brief indications of how it fits in with these developments. The first directed Applicative CG was proposed by Bar-Hillel (1953). Functional types included a list of arguments to the left, and a list of arguments to the right. Translating Bar-Hillel's notation into a feature based notation similar to that in HPSG (Pollard and Sag 1994), we obtain the following category for a ditransitive verb such as put:

    [ s, l⟨np⟩, r⟨np, pp⟩ ]
The arguments to the left are gathered under the feature l, and those to the right (an np and a pp, in that order) under the feature r. Bar-Hillel employed a single application rule, which corresponds to the following:

    Ln ... L1  [ X, l⟨L1, ..., Ln⟩, r⟨R1, ..., Rn⟩ ]  R1 ... Rn   ⇒   X

The result was a system which comes very close to the formalised dependency grammars of Gaifman (1965) and Hays (1964). The only real difference is that Bar-Hillel allowed arguments to themselves be functions. For example, an adverb such as slowly could be given the type⁵

    [ s, l⟨np⟩, r⟨ [ s, l⟨np⟩, r⟨⟩ ] ⟩ ]
An unfortunate aspect of Bar-Hillel's first system was that the application rule only ever resulted in a primitive type. Hence, arguments with functional types had to correspond to single lexical items: there was no way to form the type np\s⁶ for a non-lexical verb phrase such as likes Mary. Rather than adapting the Application Rule to allow functions to be applied to one argument at a time, Bar-Hillel's second system (often called AB Categorial Grammar, or Ajdukiewicz/Bar-Hillel CG, Bar-Hillel 1964) adopted a `Curried' notation, and this has been adopted by most CGs since. To represent a function which requires an np on the left, and an np and a pp to the right, there is a choice of the following three types using Curried notation:

    np\((s/pp)/np)
    (np\(s/pp))/np
    ((np\s)/pp)/np

Most CGs either choose the third of these (to give a vp structure), or include a rule of Associativity which means that the types are interchangeable (in the Lambek Calculus, Associativity is a consequence of the calculus, rather than being specified separately). The main impetus to change Applicative CG came from the work of Ades and Steedman (1982). Ades and Steedman noted that the use of function composition allows CGs to deal

5 The reformulation is not entirely faithful here to Bar-Hillel, who used a slightly problematic `double slash' notation for functions of functions.
6 Lambek notation (Lambek 1958).
with unbounded dependency constructions. Function composition enables a function to be applied to its argument, even if that argument is incomplete e.g.

    s/pp + pp/np  →  s/np

This allows peripheral extraction, where the `gap' is at the start or the end of e.g. a relative clause. Variants of the composition rule were proposed in order to deal with non-peripheral extraction, but this led to unwanted effects elsewhere in the grammar (Bouma 1987). Subsequent treatments of non-peripheral extraction based on the Lambek Calculus (where standard composition is built in: it is a rule which can be proven from the calculus) have either introduced an alternative to the forward and backward slashes, i.e. / and \ for normal args, ↑ for wh-args (Moortgat 1988), or have introduced so called modal operators on the wh-argument (Morrill et al. 1990). Both techniques can be thought of as marking the wh-arguments as requiring special treatment, and therefore do not lead to unwanted effects elsewhere in the grammar. However, there are problems with having just composition, the most basic of the non-applicative operations. In CGs which contain functions of functions (such as very, or slowly), the addition of composition adds both new analyses of sentences, and new strings to the language. This is due to the fact that composition can be used to form a function, which can then be used as an argument to a function of a function. For example, if the two types n/n and n/n are composed to give the type n/n, then this can be modified by an adjectival modifier of type (n/n)/(n/n). Thus, the noun very old dilapidated car can get the unacceptable bracketing, [[very [old dilapidated]] car]. Associative CGs with Composition, or the Lambek Calculus, also allow strings such as boy with the to be given the type n/n, predicting very boy with the car to be an acceptable noun. Although individual examples might be possible to rule out using appropriate features, it is difficult to see how to do this in general whilst retaining a calculus suitable for incremental interpretation. If wh-arguments need to be treated specially anyway (to deal with non-peripheral extraction), and if composition as a general rule is problematic, this suggests we should perhaps return to grammars which use just Application as a general operation, but have a special treatment for wh-arguments. Using the non-Curried notation of Bar-Hillel, it is more natural to use a separate wh-list than to mark wh-arguments individually. For example, the category appropriate for relative clauses with a noun phrase gap would be:

    [ s, l⟨⟩, r⟨⟩, wh⟨np⟩ ]
It is then possible to specify operations which act as purely applicative operations with respect to the left and right argument lists, but more like composition with respect to the wh-list. This is very similar to the way in which wh-movement is dealt with in GPSG (Gazdar et
al. 1985) and HPSG, where wh-arguments are treated using slash mechanisms or feature inheritance principles which correspond closely to function composition. Given that our arguments have produced a categorial grammar which looks very similar to HPSG, why not use HPSG rather than Applicative CG? The main reason is that Applicative CG is a much simpler formalism, which can be given a very simple syntax-semantics interface, with function application in syntax mapping to function application in semantics⁷,⁸. This in turn makes it relatively easy to provide proofs of soundness and completeness for an incremental parsing algorithm. Ultimately, some of the techniques developed here should be able to be extended to more complex formalisms such as HPSG.
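As an illustration of this division of labour (a toy sketch, not an implementation from the paper), the following treats the l and r lists purely applicatively while letting rightward application inherit the argument's wh-list, in the spirit of GPSG/HPSG slash inheritance; the example categories and the helper function are invented for the purpose.

    # Toy categories: (result, l_args, r_args, wh_args), all tuples of names.
    # Application to the right consumes the first r-argument but *inherits*
    # the argument's wh-list, mimicking slash percolation.
    def apply_right(functor, argument):
        res, l, r, wh = functor
        ares, al, ar, awh = argument
        if not r or r[0] != ares or al or ar:
            return None                       # argument must be saturated
        return (res, l, r[1:], wh + awh)      # wh-dependencies percolate up

    body = ('s', (), ('vp',), ('np',))        # clause body still missing an np
    vp = ('vp', (), (), ())                   # a complete verb phrase
    print(apply_right(body, vp))              # ('s', (), (), ('np',))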
3 AB Categorial Grammar with Associativity (AACG)

In this section we define a grammar similar to Bar-Hillel's first grammar. However, unlike Bar-Hillel, we allow one argument to be absorbed at a time. The resulting grammar is equivalent to AB Categorial Grammar plus associativity. The categories of the grammar are defined as follows:

1. If X is a syntactic type (e.g. s, np), then [ X, l⟨⟩, r⟨⟩ ] is a category.

2. If X is a syntactic type, and L and R are lists of categories, then [ X, l L, r R ] is a category.

Application to the right is defined by the rule⁹:
74 David Milward 2 66 X 66 lL 4
hR1iR
r
3 77 77 + R1 5
)
2 3 66 X 77 66 lL 77 4 5 rR
Application to the left is de ned by the rule: 2 3 66 X 77 L1 + 66 l L1 L 77 4 rR 5
h i )
2 3 66 X 77 66 lL 77 4 5 rR
The basic grammar provides some spurious derivations, since sentences such as John likes Mary can be bracketed as either ((John likes) Mary) or (John (likes Mary)). However, we will see that these spurious derivations do not translate into spurious ambiguity in the parser, which maps from strings of words directly to semantic representations.
4 An Incremental Parser Most parsers which work left to right along an input string can be described in terms of state transitions i.e. by rules which say how the current parsing state (e.g. a stack of categories, or a chart) can be transformed by the next word into a new state. Here this will be made particularly explicit, with the parser described in terms of just two rules which take a state, a new word and create a new state10. There are two unusual features. Firstly, there is nothing equivalent to a stack mechanism: at all times the state is characterised by a single syntactic type, and a single semantic value, not by some stack of semantic values or syntax trees which are waiting to be connected together. Secondly, all transitions between states occur on the input of a new word: there are no `empty' transitions (such as the reduce step of a shift-reduce parser). The two rules, which are given in Figure 111, are dicult to understand in their most general form. Here we will work upto the rules gradually, by considering which kinds of rules we might need in particular instances. Consider the following pairing of sentence fragments with their simplest possible CG type: Mary thinks: s/s Mary thinks John: s/(npns) Mary thinks John likes: s/np Mary thinks John likes Sue: s This approach is described in greater detail in Milward (1994), where parsers are speci ed formally in terms of their dynamics. 11 Li, Ri , Hi are lists of categories. li and ri are lists of variables, of the same length as the corresponding Li and Ri . 10
State-Application:

    [ Y, l⟨⟩, r ⟨[ X, l L0, r R0, h H0 ]⟩·R2, h⟨⟩ ]  :  F
        --"W"-->   [ Y, l⟨⟩, r R1·R2, h⟨⟩ ]  :  λr1. F(G(r1))

    where W: [ X, l L0, r R1·R0, h⟨⟩ ]  :  G

State-Prediction:

    [ Y, l⟨⟩, r ⟨[ X, l L1, r R0, h L1 ]⟩·R2, h⟨⟩ ]  :  F
        --"W"-->   [ Y, l⟨⟩, r R1·⟨…⟩·R2, h⟨⟩ ]  :  λr1.λh. F(λl1. h(λr. ((G(r1))(r))(l1)))

    where W: [ Z, l L1·L, r R1·R, h⟨⟩ ]  :  G

(… is the predicted functor category; Figure 3 gives an example instantiation.)

Figure 1: Transition Rules
If an embedded sentence such as John likes Sue is a mapping from an s/s to an s, this suggests that it might be possible to treat all sentences as mapping from some category expecting an s to that category i.e. from X/s to X. Similarly, all noun phrases might be treated as mappings from an X/np to an X. Now consider individual transitions. The simplest of these is where the type of argument expected by the state is matched by the next word i.e. s/np \Sue" ! s
where: Sue: np
This can be generalised to the following rule, which is similar to Function Application in standard CG12 X/Y \W" !X
where: W: Y
A similar transition occurs for likes. Here an npns was expected, but likes only provides part of this: it requires an np to the right to form an npns. Thus after likes is absorbed the state category will need to expect an np. The rule required is similar to Function Composition in CG i.e. X/Y \W" ! X/Z
where: W: Y/Z
Considering this informally in terms of tree structures, what is happening is the replacement of an empty node in a partial tree by a second partial tree i.e. X / \ U Y^
+
Y / \ V Z^
=>
X / \ U Y / \ V Z^
The two rules speci ed so far need to be further generalised to allow for the case where a lexical item has more than one argument (e.g. if we replace likes by a di-transitive such as gives or a tri-transitive such as bets). This is relatively trivial using a non-curried notation similar to that used for AACG. What we obtain is the single rule of State-Application, which corresponds to application when the list of arguments, R1 , is empty, to function composition when R1 is of length one, and to n-ary composition when R1 is of length n. The only change needed from AACG notation is the inclusion of an extra feature list, the h list, which stores information about which arguments are waiting for a head (the reasons for this will It diers in not being a rule of grammar: here the functor is a state category and the argument is a lexical category. In standard CG function application, the functor and argument can correspond to a word or a phrase. 12
Incremental Interpretation of Categorial Grammar 77 2 3 s 66 7 66 lhi 777 66 rhsi 77 4 5 hhi
\John" !
Q.Q
2 3 66 s 7 7 66 lhi 7 3 7 66 2 7 7 66 6 s 7 7 66 66 lhnpi 77 7 7 77i 7 66 rh 66 7 66 64 rhi 75 7 7 7 66 hhnpi 7 7 64 7 5 hhi
\likes" !
2 3 s 6 7 6 7 6 lhi 7 6 7 6 rhnpi 7 6 7 4 5 hhi Y.likes'(john',Y)
\Sue" !
2 3 s 7 6 6 7 6 lhi 7 6 7 6 rhi 7 6 4 7 5 hhi likes'(john',sue')
H. (H(john'))
Figure 2: Possible state transitions be explained later). The lexicon is identical to that for a standard AACG, except for having h-lists which are always set to empty. Now consider the rst transition. Here a sentence was expected, but what was encountered was a noun phrase, John. The appropriate rule in CG notation would be: X/Y \W" ! X/(ZnY)
where: W: Z
This rule states that if looking for a Y and get a Z then look for a Y which is missing a Z. In tree structure terms we have: X / \ U Y^
+
Z
=>
X / \ U Y / \ Z Z\Y^
The rule of State-Prediction is obtained by further generalising to allow the lexical item to have missing arguments, and for the expected argument to have missing arguments. State-Application and State-Prediction together provide the basis of a sound and complete parser13 . Parsing of sentences is achieved by starting in a state expecting a sentence, and applying the rules non-deterministically as each word is input. A successful parse is achieved if the nal state expects no more arguments. As an example, reconsider the string John likes Sue. The sequence of transitions corresponding to John likes Sue being a sentence, is given in Figure 2. The transition on encountering John is deterministic: State-Application cannot apply, and State-Prediction can only be instantiated one way. The result is a new state expecting an argument which, given an np could give an s i.e. an npns. The parser accepts the same strings as the grammar and assigns them the same semantic values. This is slightly dierent from the standard notion of soundness and completeness of a parser, where the parser accepts the same strings as the grammar and assigns them the same syntax trees. 13
78 David Milward The transition on input of likes is non-deterministic. State-Application can apply, as in Figure 2. However, State-Prediction can also apply, and can be instantiated in four ways (these correspond to dierent ways of cutting up the left and right subcategorisation lists of the lexical entry, likes, i.e. as hnpi hi or hi hnpi). One possibility corresponds to the prediction of an sns modi er, a second to the prediction of an (npns)n(npns) modi er (i.e. a verb phrase modi er), a third to there being a function which takes the subject and the verb as separate arguments, and the fourth corresponds to there being a function which requires an s/np argument. The second of these is perhaps the most interesting, and is given in Figure 3. It is the choice of this particular transition at this point which allows verb phrase modi cation, and hence, assuming the next word is Sue, an implicit bracketing of the string fragment as (John (likes Sue)). Note that if State-Application is chosen, or the rst of the State-Prediction possibilities, the fragment John likes Sue retains a at structure. If there is to be no modi cation of the verb phrase, no verb phrase structure is introduced. This relates to there being no spurious ambiguity: each choice of transition has semantic consequences; each choice aects whether a particular part of the semantics is to be modi ed or not. Finally, it is worth noting why it is necessary to use h-lists. These are needed to distinguish between cases of real functional arguments (of functions of functions), and functions formed by State-Prediction. Consider the following trees, where the npns node is empty. s / \ s/s s / \ np np\s^
s / \ np np\s / \ (np\s)/(np\s) np\s^
Both trees have the same syntactic type, however in the rst case we want to allow for there to be an sns modi er of the lower s, but not in the second. The headed list distinguishes between the two cases, with only the rst having an np on its headed list, allowing prediction of an s modi er.
5 Parsing Lexicalised Grammars When we consider full sentence processing, as opposed to incremental processing, the use of lexicalised grammars has a major advantage over the use of more standard rule based grammars. In processing a sentence using a lexicalised formalism we do not have to look at the grammar as a whole, but only at the grammatical information indexed by each of the words. Thus increases in the size of a grammar don't necessarily eect eciency of processing, provided the increase in size is due to the addition of new words, rather than increased lexical ambiguity. Once the full set of possible lexical entries for a sentence is collected, they can, if required, then be converted back into a set of phrase structure rules (which should correspond to a small subset of the rule based formalism equivalent to the whole lexicalised grammar), before being parsing with a standard algorithm such as Earley's (Earley 1970). In incremental parsing we cannot predict which words will appear in the sentence, so cannot
Incremental Interpretation of Categorial Grammar 79
2 3 66 s 77 66 lhi 7 3 777 66 2 66 6 s 7 7 66 66 lhnpi 77 777 \likes" 77i 77 66 rh 66 r hi 6 75 7 66 4 7 66 hhnpi 7 77 64 5 hhi
!
H. (H(john'))
2 3 s 66 7 7 66 lhi 7 2 3 7 66 7 7 66 7 66 s 7 7 7 3 66 7 66 2 7 7 7 66 7 66 6 s 7 7 7 7 7 66 7 66 6 7 6 lhnpi 7 7 7 7 66 ; np i 7 66 lh 6 7 7 6 7 7 rhi 7 66 7 66 6 7 5 4 7 7 66 7 hhi 66 7 7 7 66 rhnp; 6 7 7 i 7 66 rhi 7 66 7 7 7 66 2 7 3 66 7 7 7 66 7 66 7 s 7 6 7 7 66 6 7 7 66 7 7 6 7 l h np i 7 66 hh 6 7 7 66 7 ; np i 7 6 7 7 66 6 rhi 7 7 66 7 7 4 5 7 64 7 66 hhi 5 7 7 66 7 7 4 5 hhi
2 3 s 7 6 6 7 6 lhnpi 7 6 7 where W: 66 rhnpi 77 4 5 hhi Y.X.likes'(X,Y)
Y.K.(K(X.likes'(X,Y)))(john)
Figure 3: Example instantiation of State-Prediction use the same technique. However, if we are to base a parser on the rules given above, it would seem that we gain further. Instead of grammatical information being localised to the sentence as a whole, it is localised to a particular word in its particular context: there is no need to consider a pp as a start of a sentence if it occurs at the end, even if there is a verb with an entry which allows for a subject pp. However there is a major problem. As we noted in the last paragraph, it is the nature of parsing incrementally that we don't know what words are to come next. But here the parser doesn't even use the information that the words are to come from a lexicon for a particular language. For example, given an input of 3 nps, the parser will happily create a state expecting 3 nps to the left. This might be a likely state for say a head nal language, but an unlikely state for a language such as English. Note that incremental interpretation will be of no use here, since the semantic representation should be no more or less plausible in the dierent languages. In practical terms, a naive interactive parallel Prolog implementation on a current workstation fails to be interactive in a real sense after about 8 words14. What seems to be needed is some kind of language tuning15. This could be in the nature of This result should however be treated with some caution: in this implementation there was no attempt to perform any packing of dierent possible transitions, and the algorithm has exponential complexity. In contrast, a packed recogniser based on a similar, but much simpler, incremental parser for Lexicalised Dependency Grammar has O(n3 ) time complexity (Milward 1994) and good practical performance, taking a couple of seconds on 30 word sentences. 15 The usage of the term language tuning is perhaps broader here than its use in the psycholinguistic literature to refer to dierent structural preferences between languages e.g. for high versus low attachment (Mitchell et 14
80 David Milward xed restrictions to the rules e.g. for English we might rule out uses of prediction when a noun phrase is encountered, and two already exist on the left list. A more appealing alternative is to base the tuning on statistical methods. This could be achieved by running the parser over corpora to provide probabilities of particular transitions given particular words. These transitions would capture the likelihood of a word having a particular part of speech, and the probability of a particular transition being performed with that part of speech. There has already been some early work done on providing statistically based parsing using transitions between recursively structured syntactic categories (Tugwell 1995)16. Unlike a simple Markov process, there are a potentially in nite number of states, so there is inevitably a problem of sparse data. It is therefore necessary to make various generalisations over the states, for example by ignoring the R2 lists. The full processing model can then be either serial, exploring the most highly ranked transitions rst (but allowing backtracking if the semantic plausibility of the current interpretation drops too low), or ranked parallel, exploring just the n paths ranked highest according to the transition probabilities and semantic plausibility.
6 Conclusion The paper has presented a method for providing interpretations word by word for basic Categorial Grammar. The nal section contrasted parsing with lexicalised and rule based grammars, and argued that statistical language tuning is particularly suitable for incremental, lexicalised parsing strategies.
References Ades, A. & Steedman, M.: 1972, `On the Order of Words', Linguistics & Philosophy 4, 517-558. Bar-Hillel, Y.: 1953, `A Quasi-Arithmetical Notation for Syntactic Description', Language 29, 47-58. Bar-Hillel, Y.: 1964, Language & Information: Selected Essays on Their Theory & Application, Addison-Wesley. Bouma, G.: 1987, `A Uni cation-Based Analysis of Unbounded Dependencies', in Proceedings of the 6th Amsterdam Colloquium, ITLI, University of Amsterdam. Bouma, G. & van Noord, G.: 1994, `Constraint-Based Categorial Grammar', in Proceedings al. 1992). 16 Tugwell's approach does however dier in that the state transitions are not limited by the rules of StatePrediction and State-Application. This has advantages in allowing the grammar to learn phenomena such as heavy NP shift, but has the disadvantage of suering from greater sparse data problems. A compromise system using the rules here, but allowing reordering of the r-lists might be preferable.
Incremental Interpretation of Categorial Grammar 81 of the 32nd ACL, Las Cruces, U.S.A.
Earley, J.: 1970, `An Efficient Context-free Parsing Algorithm', ACM Communications 13(2), 94-102.
Gaifman, H.: 1965, `Dependency Systems & Phrase Structure Systems', Information & Control 8, 304-337.
Gazdar, G., Klein, E., Pullum, G.K., & Sag, I.A.: 1985, Generalized Phrase Structure Grammar, Blackwell, Oxford.
Hays, D.G.: 1964, `Dependency Theory: A Formalism & Some Observations', Language 40, 511-525.
Joshi, A.K.: 1987, `An Introduction to Tree Adjoining Grammars', in Manaster-Ramer (ed.), Mathematics of Language, John Benjamins, Amsterdam.
Lambek, J.: 1958, `The Mathematics of Sentence Structure', American Mathematical Monthly 65, 154-169.
Marcus, M., Hindle, D., & Fleck, M.: 1983, `D-Theory: Talking about Talking about Trees', in Proceedings of the 21st ACL, Cambridge, Mass.
Marslen-Wilson, W.: 1973, `Linguistic Structure & Speech Shadowing at Very Short Latencies', Nature 244, 522-523.
Milward, D.: 1992, `Dynamics, Dependency Grammar & Incremental Interpretation', in Proceedings of COLING 92, Nantes, vol 4, 1095-1099.
Milward, D. & Cooper, R.: 1994, `Incremental Interpretation: Applications, Theory & Relationship to Dynamic Semantics', in Proceedings of COLING 94, Kyoto, Japan, 748-754.
Milward, D.: 1994, `Dynamic Dependency Grammar', Linguistics & Philosophy 17, 561-605.
Mitchell, D.C., Cuetos, F., & Corley, M.M.B.: 1992, `Statistical versus linguistic determinants of parsing bias: cross-linguistic evidence'. Paper presented at the 5th Annual CUNY Conference on Human Sentence Processing, New York.
Moore, R.C.: 1989, `Unification-Based Semantic Interpretation', in Proceedings of the 27th ACL, Vancouver.
Moortgat, M.: 1988, Categorial Investigations: Logical & Linguistic Aspects of the Lambek Calculus, Foris, Dordrecht.
Morrill, G., Leslie, N., Hepple, M. & Barry, G.: 1990, `Categorial Deductions & Structural Operations', in Barry, G. & Morrill, G. (eds.), Studies in Categorial Grammar, Edinburgh Working Papers in Cognitive Science, 5.
Polanyi, L. & Scha, R.: 1984, `A Syntactic Approach to Discourse Semantics', in Proceedings of COLING 84, Stanford, 413-419.
Pollard, C. & Sag, I.A.: 1994, Head-Driven Phrase Structure Grammar, University of Chicago
Press & CSLI Publications, Chicago.
Pulman, S.G.: 1986, `Grammars, Parsers, & Memory Limitations', Language & Cognitive Processes 1(3), 197-225.
Spivey-Knowlton, M., Sedivy, J., Eberhard, K., & Tanenhaus, M.: 1994, `Psycholinguistic Study of the Interaction Between Language & Vision', in Proceedings of the 12th National Conference on AI, AAAI-94.
Stabler, E.P.: 1991, `Avoid the Pedestrian's Paradox', in Berwick, R.C. et al. (eds.), Principle-Based Parsing: Computation & Psycholinguistics, Kluwer, Netherlands, 199-237.
Steedman, M.J.: 1991, `Type-Raising & Directionality in Combinatory Grammar', in Proceedings of the 29th ACL, Berkeley, U.S.A.
Tanenhaus, M.K., Garnsey, S., & Boland, J.: 1990, `Combinatory Lexical Information & Language Comprehension', in Altmann, G.T.M. (ed.), Cognitive Models of Speech Processing, MIT Press, Cambridge, MA.
Tugwell, D.: 1995, `A State-Transition Grammar for Data-Oriented Parsing', in Proceedings of the 7th Conference of the European ACL, EACL-95, Dublin, this volume.
Vijay-Shanker, K.: 1992, `Using Descriptions of Trees in a Tree Adjoining Grammar', Computational Linguistics 18(4), 481-517.
4

Incremental Interpretation: Applications, Theory, and Relationship to Dynamic Semantics

David Milward and Robin Cooper

1 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2 Current Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3 Dynamic Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

Edinburgh Working Papers in Cognitive Science, Vol. 11: Incremental Interpretation, pp. 83-99. D. Milward and P. Sturt, eds. Copyright © 1995 David Milward and Robin Cooper.
Abstract
Why should computers interpret language incrementally? In recent years psycholinguistic evidence for incremental interpretation has become more and more compelling, suggesting that humans perform semantic interpretation before constituent boundaries, possibly word by word. However, possible computational applications have received less attention. In this paper we consider various potential applications, in particular graphical interaction and dialogue. We then review the theoretical and computational tools available for mapping from fragments of sentences to fully scoped semantic representations. Finally, we tease apart the relationship between dynamic semantics and incremental interpretation.1
1 Applications

Following the work of, for example, Marslen-Wilson (1973), Just and Carpenter (1980) and Altmann and Steedman (1988), it has become widely accepted that semantic interpretation in human sentence processing can occur before sentence boundaries and even before clausal boundaries. It is less widely accepted that there is a need for incremental interpretation in computational applications. In the 1970s and early 1980s several computational implementations motivated the use of incremental interpretation as a way of dealing with structural and lexical ambiguity (a survey is given in Haddock 1989). A sentence such as the following has 4862 different syntactic parses due solely to attachment ambiguity (Stabler 1991).

(1) I put the bouquet of flowers that you gave me for Mothers' Day in the vase that you gave me for my birthday on the chest of drawers that you gave me for Armistice Day.
Although some of the parses can be ruled out using structural preferences during parsing (such as Late Closure or Minimal Attachment (Frazier 1979)), extraction of the correct set of plausible readings requires use of real world knowledge. Incremental interpretation allows on-line semantic filtering, i.e. parses of initial fragments which have an implausible or anomalous interpretation are rejected, thereby preventing ambiguities from multiplying as the parse proceeds. However, on-line semantic filtering for sentence processing does have drawbacks. Firstly, for sentence processing using a serial architecture (rather than one in which syntactic and semantic processing is performed in parallel), the savings in computation obtained from on-line filtering have to be balanced against the additional costs of performing semantic computations for parses of fragments which would eventually be ruled out anyway from purely syntactic considerations. Moreover, there are now relatively sophisticated ways of packing ambiguities
Incremental Interpretation: Applications, Theory, and Relationship to Dynamic Semantics 85
during parsing (e.g. by the use of graph-structured stacks and packed parse forests (Tomita 1985)). Secondly, the task of judging plausibility or anomaly according to context and real world knowledge is a dicult problem, except in some very limited domains. In contrast, statistical techniques using lexeme co-occurrence provide a relatively simple mechanism which can imitate semantic ltering in many cases. For example, instead of judging bank as a nancial institution as more plausible than bank as a riverbank in the noun phrase the rich bank, we can compare the number of co-occurrences of the lexemes rich and bank1 (= riverbank) versus rich and bank2 (= nancial institution) in a semantically analysed corpus. Cases where statistical techniques seem less appropriate are where plausibility is aected by local context. For example, consider the ambiguous sentence, The decorators painted a wall with cracks in the two contexts The room was supposed to look run-down vs. The clients couldn't aord wallpaper. Such cases involve reasoning with an interpretation in its immediate context, as opposed to purely judging the likelihood of a particular linguistic expression in a given application domain (see e.g. Cooper 1993 for discussion). Although the usefulness of on-line semantic ltering during the processing of complete sentences is debatable, ltering has a more plausible role to play in interactive, real-time environments, such as interactive spell checkers (see e.g. Wiren (1990) for arguments for incremental parsing in such environments). Here the choice is between whether or not to have semantic ltering at all, rather than whether to do it on-line, or at the end of the sentence. The concentration in early literature on using incremental interpretation for semantic ltering has perhaps distracted from some other applications which provide less controversial applications. We will consider two in detail here: graphical interfaces, and dialogue. The Foundations for Intelligent Graphics Project (FIG)2 considered various ways in which natural language input could be used within computer aided design systems (the particular application studied was computer aided kitchen design, where users would not necessarily be professional designers). Incremental interpretation was considered to be useful in enabling immediate visual feedback. Visual feedback could be used to provide con rmation (for example, by highlighting an object referred to by a successful de nite description), or it could be used to give the user an improved chance of achieving successful reference. For example, if sets of possible referents for a de nite noun phrase are highlighted during word by word processing then the user knows how much or how little information is required for successful reference.3 Human dialogue, in particular, task oriented dialogue is characterised by a large numbers of self-repairs (Levelt 1983, Carletta et al. 1993), such as hesitations, insertions, and replacements. It is also common to nd interruptions requesting extra clari cation, or disagreements before the end of a sentence. It is even possible for sentences started by one dialogue participant to be nished by another. Applications involving the understanding of dialogues 2 Joint Councils Initiative in Cognitive Science/HCI, Grant 8826213, EdCAAD and Centre for Cognitive Science, University of Edinburgh. 3 This example was inspired by the work of Haddock (1987) on incremental interpretation of de nite noun phrases. 
Haddock used an incremental constraint based approach following Mellish (1985) to provide an explanation of why it is possible to use the noun phrase the rabbit in the hat even when there are two hats, but only one hat with a rabbit in it.
86 David Milward and Robin Cooper include information extraction from conversational databases, or computer monitoring of conversations. It also may be useful to include some features of human dialogue in manmachine dialogue. For example, interruptions can be used for early signalling of errors and ambiguities. Let us rst consider some examples of self-repair. Insertions add extra information, usually modi ers e.g. (2)
We start in the middle with ..., in the middle of the paper with a blue disc (Levelt 1983:ex.3)
Replacements correct pieces of information e.g. (3)
Go from left again to uh ..., from pink again to blue (Levelt 1983:ex.2)
In some cases information from the corrected material is incorporated into the nal message. For example, consider4 : (4)
a. The three main sources of data come, uh ..., they can be found in the references b. John noticed that the old man and his wife, uh ..., that the man got into the car and the wife was with him when they left the house c. Every boy took, uh ..., he should have taken a water bottle with him
In (a), the corrected material the three main sources of data come provides the antecedent for the pronoun they. In (b) the corrected material tells us that the man is both old and has a wife. In (c), the pronoun he is bound by the quanti er every boy. For a system to understand dialogues involving self-repairs such as those in (4) would seem to require either an ability to interpret incrementally, or the use of a grammar which includes self repair as a syntactic construction akin to non-constituent coordination (the relationship between coordination and self-correction is noted by Levelt (1983)). For a system to generate self repairs might also require incremental interpretation, assuming a process where the system performs on-line monitoring of its output (akin to Levelt's model of the human self-repair mechanism). It has been suggested that generation of self repairs is useful in cases where there are severe time constraints, or where there is rapidly changing background information (Carletta, p.c.). A more compelling argument for incremental interpretation is provided by considering dialogues involving interruptions. Consider the following dialogue from the TRAINS corpus (Gross et al., 1993): 4
Example (a) is reconstructed from an actual utterance. Examples (b) and (c) were constructed.
Incremental Interpretation: Applications, Theory, and Relationship to Dynamic Semantics 87
(5)
A: so we should move the engine at Avon, engine E, to ... B: engine E1 A: E1 B: okay A: engine E1, to Bath ...
This requires interpretation by speaker B before the end of A's sentence to allow objection to the apposition, the engine at Avon, engine E. An example of the potential use of interruptions in human computer interaction is the following: (6)
User: Put the punch onto ... Computer: The punch can't be moved. It's bolted to the oor.
In this example, interpretation must not only be before the end of the sentence, but before a constituent boundary (the verb phrase in the user's command has not yet been completed).
2 Current Tools 1. Syntax to Semantic Representation In this section we shall brie y review work on providing semantic representations (e.g. lambda expressions) word by word. Traditional layered models of sentence processing rst build a full syntax tree for a sentence, and then extract a semantic representation from this. To adapt this to an incremental perspective, we need to be able to provide syntactic structures (of some sort) for fragments of sentences, and be able to extract semantic representations from these. One possibility, which has been explored mainly within the Categorial Grammar tradition (e.g. Steedman 1988) is to provide a grammar which can treat most if not all initial fragments as constituents. They then have full syntax trees from which the semantics can be calculated. However, an alternative possibility is to directly link the partial syntax trees which can be formed for non-constituents with functional semantic representations. For example, a fragment missing a noun phrase such as John likes can be associated with a semantics which is a function from entities to truth values. Hence, the partial syntax tree given in Fig. 15 , The downarrow notation for missing constituents is adopted from Synchronous Tree Adjoining Grammar (Shieber & Schabes 1990). 5
88 David Milward and Robin Cooper s np John
vp v likes
np
Fig. 1
can be associated with a semantic representation, x. likes(john,x). Both Categorial approaches to incremental interpretation and approaches which use partial syntax trees get into diculty in cases of left recursion. Consider the sentence fragment, Mary thinks John. A possible partial syntax tree is provided by Fig. 2. s np Mary
v thinks
vp s np John
vp
Fig. 2
However, this is not the only possible partial tree. In fact there are in nitely many dierent trees possible. The completed sentence may have an arbitrarily large number of intermediate nodes between the lower s node and the lower np. For example, John could be embedded within a gerund e.g. Mary thinks John leaving here was a mistake, and this in turn could be embedded e.g. Mary thinks John leaving here being a mistake is surprising. John could also be embedded within a sentence which has a sentence modi er requiring its own s node e.g. Mary thinks John will go home probably6 , and this can be further embedded e.g. Mary thinks John will go home probably because he is tired. The problem of there being an arbitrary number of dierent partial trees for a particular fragment is re ected in most current approaches to incremental interpretation being either incomplete, or not fully word by word. For example, incomplete parsers have been proposed by Stabler (1991) and Moortgat (1988). Stabler's system is a simple top-down parser which does not deal with left recursive grammars. Moortgat's M-System is based on the Lambek Calculus: the problem of an in nite number of possible tree fragments is replaced by a corresponding problem of initial fragments having an in nite number of possible types. A complete incremental parser, which is not fully word by word, was proposed by Pulman (1986). This is based on arc-eager left-corner parsing (see e.g. Resnik 1992). To enable complete, fully word by word parsing requires a way of encoding an in nite number of partial trees. There are several possibilities. The rst is to use a language describing trees where we can express the fact that John is dominated by the s node, but do not 6 The treatment of probably as a modi er of a sentence is perhaps controversial. However, treatment of it
as a verb phrase modi er would merely shift the potential left recursion to the verb phrase node.
Incremental Interpretation: Applications, Theory, and Relationship to Dynamic Semantics 89
have to specify what it is immediately dominated by (e.g. D-Theory, Marcus et al. 1983). Semantic representations could be formed word by word by extracting `default' syntax trees (by strengthening dominance links into immediated dominance links wherever possible). A second possibility is to factor out recursive structures from a grammar. Thompson et al. (1991) show how this can be done for a phrase structure grammar (creating an equivalent Tree Adjoining Grammar (Joshi 1987)). The parser for the resulting grammar allows linear parsing for an (in nitely) parallel system, with the absorption of each word performed in constant time. At each choice point, there are only a nite number of possible new partial TAG trees (the TAG trees represents the possibly in nite number of trees which can be formed using adjunction). It should again be possible to extract `default' semantic values, by taking the semantics from the TAG tree (i.e. by assuming that there are to be no adjunctions). A somewhat similar system has recently been proposed by Shieber and Johnson (1993). The third possibility is suggested by considering the semantic representations which are appropriate during a word by word parse. Although there are any number of dierent partial trees for the fragment Mary thinks John, the semantics of the fragment can be represented using just two lambda expressions7 :
P. thinks(mary,P(john)) P. Q. Q(thinks(mary,P(john))) Consider the rst. The lambda abstraction (over a functional item of type e!t) can be thought of as a way of encoding an in nite set of partial semantic (tree) structures. For example, the eventual semantic structure may embed john at any depth e.g. thinks(mary,sleeps(john)) thinks(mary,possibly(sleeps(john))) etc. The second expression (a functional item over type e!t and t!t), allows for eventual structures where the main sentence is embedded e.g. possibly(thinks(mary,sleeps(john))) This third possibility is therefore to provide a syntactic correlate of lambda expressions. In practice, however, provided we are only interested in mapping from a string of words to a semantic representation, and don't need explicit syntax trees to be constructed, we can merely use the types of the `syntactic lambda expressions', rather than the expressions themselves. Two representations are appropriate if there are no VP-modi ers as in dependency grammar. If VPmodi cation is allowed, two more expressions are required: P. R. (R(x.thinks(x,P(john))))(mary) and P. R. Q. Q((R(x.thinks(x,P(john))))(mary)). 7
90 David Milward and Robin Cooper This is essentially the approach taken in Milward (1992) in order to provide complete, word by word, incremental interpretation using simple lexicalised grammars, such as a lexicalised version of formal dependency grammar and simple categorial grammar8.
2. Logical Forms to Semantic Filtering In processing the sentence Mary introduced John to Susan, a word-by-word approach such as Milward (1992) provides the following logical forms after the corresponding sentence fragments are absorbed: Mary Mary introduced Mary introduced John Mary introduced John to Mary introduced John to Sue
P.P(mary) x.y.intr(mary,x,y) y.intr(mary,john,y) y.intr(mary,john,y)
intr(mary,john,sue)
Each input level representation is appropriate for the meaning of an incomplete sentence, being either a proposition or a function into a proposition. In Chater et al. (1994) it is argued that the incrementally derived meanings are not judged for plausibility directly, but instead are rst turned into existentially quanti ed propositions. For example, instead of judging the plausibility of x.y.intr(mary,x,y), we judge the plausibility of 9(x,T,9(y,T,intr(mary,x,y)))9. This is just the proposition Mary introduced something to something using a generalized quanti er notation of the form Quanti er(Variable,Restrictor,Body). Although the lambda expressions are built up monotonically, word by word, the propositions formed from them may need to be retracted, along with all the resulting inferences. For example, Mary introduced something to something is inappropriate if the nal sentence is Mary introduced noone to anybody. A rough algorithm is as follows: 1. Parse a new word, Wordi 2. Form a new lambda expression by combining the lambda expression formed after parsing Wordi?1 with the lexical semantics for Wordi 3. Form a proposition, Pi , by existentially quantifying over the lambda abstracted variables. 4. Assert Pi . If Pi does not entail Pi?1 retract Pi?1 and all conclusions made from it10 . 5. Judge the plausibility of Pi . If implausible block this derivation. It is worth noting that the need for retraction is not due to a failure to extract the correct `least commitment' proposition from the semantic content of the fragment Mary introduced. The version of categorial grammar used is AB Categorial Grammar with Associativity. The proposition T is always true. See Chater et al. (1994) for discussion of whether it is more appropriate to use a non-trivial restrictor. 10 Retraction can be performed by using a tagged database, where each proposition is paired with a set of sources e.g. given (P!Q,fu4g), and (P,fu5g) then(Q,fu4,u5g) can be deduced. 8
9
Incremental Interpretation: Applications, Theory, and Relationship to Dynamic Semantics 91
This is due to the fact that it is possible to nd pairs of possible continuations which are the negation of each other (e.g. Mary introduced noone to anybody and Mary introduced someone to somebody). The only proposition compatible with both a proposition, p, and its negation, :p is the trivial proposition, T (see Chater et al. for further discussion).
3. Incremental Quanti er Scoping So far we have only considered semantic representations which do not involve quanti ers (except for the existential quanti er introduced by the mechanism above). In sentences with two or more quanti ers, there is generally an ambiguity concerning which quanti er has wider scope. For example, in sentence (a) below the preferred reading is for the same kid to have climbed every tree (i.e. the universal quanti er is within the scope of the existential) whereas in sentence (b) the preferred reading is where the universal quanti er has scope over the existential. (7)
a. A tireless kid climbed every tree. b. There was a sh on every plate.
Scope preferences sometimes seem to be established before the end of a sentence. For example, in sentence (a) below, there seems a preference for an outer scope reading for the rst quanti er as soon as we interpret child. In (b) the preference, by the time we get to e.g. grammar, is for an inner scope reading for the rst quanti er. (8)
a. A teacher gave every child a great deal of homework on grammar. b. Every girl in the class showed a rather strict new teacher the results of her attempt to get the grammar exercises correct.
This intuitive evidence can be backed up by considering garden path eects with quanti er scope ambiguities (called jungle paths by Barwise 1987). The original examples, such as the following, (9)
Statistics show that every 11 seconds a man is mugged here in New York city. We are here today to interview him
showed that preferences for a particular scope are established and are overturned. To show that preferences are sometimes established before the end of a sentence, and before a potential sentence end, we need to show garden path eects in examples such as the following: (10)
Mary put the information that statistics show that every 11 seconds a man is mugged here in New York city and that she was to interview him in her diary
Most psycholinguistic experimentation has been concerned with which scope preferences are made, rather than with the point at which the preferences are established (see e.g. Kurtzman and MacDonald, 1993). Given the intuitive evidence, our hypothesis is that scope preferences can sometimes be established early, before the end of a sentence. This leaves open the possibility that in other cases, where the scoping information is not of particular interest to the hearer, preferences are determined late, if at all.
3.1 Incremental Quantifier Scoping: Implementation

Dealing with quantifiers incrementally is a rather similar problem to dealing with fragments of trees incrementally. Just as it is impossible to predict the level of embedding of a noun phrase such as John from the fragment Mary thinks John, it is also impossible to predict the scope of a quantifier in a fragment with respect to the arbitrarily many quantifiers which might appear later in the sentence. Again the problem can be avoided by a form of packing. A particularly simple way of doing this is to use unscoped logical forms where quantifiers are left in situ (similar to the representations used by Hobbs and Shieber (1987), or to Quasi Logical Form (Alshawi, 1990)). For example, the fragment Every man gives a book can be given the following representation: (11)
λz.gives(<∀,x,man(x)>, <∃,y,book(y)>, z)
Each quantified term consists of a quantifier, a variable and a restrictor, but no body. To convert lambda expressions to unscoped propositions, we replace each lambda-abstracted argument with an empty existential quantifier term. In this case we obtain: (12)
gives(<∀,x,man(x)>, <∃,y,book(y)>, <∃,z,T>)
Scoped propositions can then be obtained by using an outside-in quantifier scoping algorithm (Lewin, 1990), or an inside-out algorithm with a free variable constraint (Hobbs and Shieber, 1987). The propositions formed can then be judged for plausibility. To imitate jungle path phenomena, these plausibility judgements need to feed back into the scoping procedure for the next fragment. For example, if every man is taken to scope outside a book after processing the fragment Every man gave a book, then this preference should be preserved when determining the scope for the full sentence Every man gave a book to a child. Thus, instead of doing all quantifier scoping at the end of the sentence, each new quantifier is scoped relative to the existing quantifiers (and to operators such as negation, intensional verbs, etc.). A preliminary implementation achieves this by annotating the semantic representations with node names, and recording which quantifiers are `discharged' at which nodes, and in which order.
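The packed representation is straightforward to state as a datatype. The following sketch is our own illustration (the class names and the string encoding of restrictors are assumptions, not the paper's notation): a quantified term is a (quantifier, variable, restrictor) triple with no body, and closing off a lambda expression replaces each abstracted variable with an empty existential term.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class QTerm:
        quant: str       # 'forall' or 'exists'
        var: str
        restrictor: str  # e.g. 'man(x)'; 'T' is the trivial restrictor

    @dataclass(frozen=True)
    class Pred:
        functor: str
        args: tuple      # QTerm arguments, left in situ

    def close_off(functor, fixed_args, lambda_vars):
        # Turn a lambda expression into an unscoped proposition by replacing
        # each lambda-abstracted variable with an empty existential term.
        extras = tuple(QTerm('exists', v, 'T') for v in lambda_vars)
        return Pred(functor, tuple(fixed_args) + extras)

    # (11): lambda z . gives(<forall,x,man(x)>, <exists,y,book(y)>, z)
    unscoped = close_off(
        'gives',
        [QTerm('forall', 'x', 'man(x)'), QTerm('exists', 'y', 'book(y)')],
        ['z'],
    )
    # (12): gives(<forall,x,man(x)>, <exists,y,book(y)>, <exists,z,T>)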
4 Dynamic Semantics

Dynamic semantics adopts the view that "the meaning of a sentence does not lie in its truth conditions, but rather in the way in which it changes (the representation of) the information of the interpreter" (Groenendijk and Stokhof, 1991). At first glance such a view seems ideally suited to incremental interpretation. Indeed, Groenendijk and Stokhof claim that the compositional nature of Dynamic Predicate Logic enables one to "interpret a text in an online manner, i.e., incrementally, processing and interpreting each basic unit as it comes along, in the context created by the interpretation of the text so far". Putting these two quotes together is, however, misleading, since it suggests a more direct mapping between incremental semantics and dynamic semantics than is actually possible. In an incremental semantics, we would expect the information state of an interpreter to be updated word by word. In contrast, in dynamic semantics the order in which states are updated is determined by semantic structure, not by left-to-right order (see e.g. Lewin, 1992, for discussion). For example, in Dynamic Predicate Logic (Groenendijk & Stokhof, 1991), states are threaded from the antecedent of a conditional into the consequent, and from the restrictor of a quantifier into the body. Thus, in interpreting, (13)
John will buy it right away, if a car impresses him
the input state for the evaluation of John will buy it right away is the output state from the antecedent a car impresses him. In this case the threading through semantic structure runs in the opposite order to the order in which the two clauses appear in the sentence. Some intuitive justification for the direction of threading in dynamic semantics is provided by considering appropriate orders for evaluating propositions against a database: the natural order in which to evaluate a conditional is first to add the antecedent, and then see if the consequent can be proven. It is only at the sentence level in simple narrative texts that the presentation order and the natural order of evaluation necessarily coincide. The ordering of anaphors and their antecedents is often used informally to justify left-to-right threading or threading through semantic structure. However, threading from left to right disallows examples of optional cataphora, as in example (13), and examples of compulsory cataphora, as in: (14)
Beside her, every girl could see a large crack
Similarly, threading from the antecedent of a conditional into the consequent fails for examples such as: (15)
Every boy will be able to see out of a window if he wants to
It is also possible to get sentences with `donkey' readings, but where the indefinite is in the consequent: (16)
A student will attend the conference if we can get together enough money for her air fare
This sentence seems to get a reading where we are not talking about a particular student (an outer existential), nor about a typical student (a generic reading). Moreover, as noted by Zeevat (1990), the use of any kind of ordered threading will tend to fail for Bach-Peters sentences, such as: (17)
Every man who loves her appreciates a woman who lives with him
For this kind of example it is still possible to use a standard dynamic semantics, but only if there is some prior level of reference resolution which reorders the antecedents and anaphors appropriately, for example by converting (17) into the `donkey' sentence: (18)
Every man who loves a woman who lives with him appreciates her
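To make the direction of threading concrete, here is a toy evaluator in the style of Dynamic Predicate Logic. It is a sketch under our own simplifying assumptions (a formula is a function from an input assignment to the list of output assignments it licenses), not code from any of the systems cited. The point to notice is that implies evaluates its antecedent first and threads each resulting state into the consequent, whatever the surface order of the clauses.

    def exists(var, domain):
        # Introduce a discourse referent: one output state per witness.
        return lambda s: [dict(s, **{var: d}) for d in domain]

    def pred(test):
        # A test: pass the state through unchanged, or fail.
        return lambda s: [s] if test(s) else []

    def conj(f, g):
        # Sequencing: thread each output state of f into g.
        return lambda s: [s2 for s1 in f(s) for s2 in g(s1)]

    def implies(f, g):
        # Thread states from the antecedent into the consequent: the input
        # state survives iff every output of f yields some output of g.
        return lambda s: [s] if all(g(s1) for s1 in f(s)) else []

    # "John will buy it right away, if a car impresses him": evaluation is
    # antecedent-first, the reverse of the order of presentation.
    cars = ['car1', 'car2']
    impresses = pred(lambda s: s['x'] == 'car1')  # stub: only car1 impresses him
    buys = pred(lambda s: s['x'] in cars)         # stub for buys(john, x)
    sentence = implies(conj(exists('x', cars), impresses), buys)
    print(sentence({}))  # [{}]: the (empty) input state survives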
When we consider threading of possible worlds, as in Update Semantics (Veltman, 1990), the need to distinguish between the order of evaluation and the order of presentation becomes more clear-cut. Consider trying to perform threading in left-to-right order during interpretation of the sentence John left if Mary left. After processing the proposition John left, the set of worlds is refined down to those worlds in which John left. Now consider processing if Mary left. Here we want to reintroduce some worlds, namely those in which neither Mary nor John left. However, this is not allowed by Update Semantics, which is eliminative: each new piece of information can only further refine the set of worlds.

It is worth noting that the difficulties in trying to combine an eliminative semantics with left-to-right threading apply to constraint-based semantics as well as to Update Semantics. Haddock (1987) uses incremental refinement of sets of possible referents. For example, the effect of processing the rabbit in the noun phrase the rabbit in the hat is to provide a set of all rabbits. The processing of in refines this set to rabbits which are in something. Finally, processing of the hat refines the set to rabbits which are in a hat. However, now consider processing the rabbit in none of the boxes. By the time the rabbit in has been processed, the only rabbits remaining in consideration are rabbits which are in something. This incorrectly rules out the possibility of the noun phrase referring to a rabbit which is in nothing at all. The case is exactly parallel to the earlier example of Mary introduced something to something being inappropriate if the final sentence is Mary introduced no one to anybody.

Although this discussion has argued that it is not possible to thread the states used by a dynamic or eliminative semantics from left to right, word by word, this should not be taken as an argument against the use of such a semantics in incremental interpretation.
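A toy version of the eliminative refinement just described, with invented rabbits and containers, shows where it breaks: once in has shrunk the candidate set to rabbits that are in something, the referent needed for none of the boxes has already been discarded.

    # rabbit -> what it is in (None = in nothing); the data are invented.
    rabbits = {'r1': None, 'r2': 'hat', 'r3': 'box'}

    candidates = set(rabbits)                             # after "the rabbit"
    candidates = {r for r in candidates
                  if rabbits[r] is not None}              # after "in"
    in_the_hat = {r for r in candidates
                  if rabbits[r] == 'hat'}                 # after "the hat": {'r2'}

    # "the rabbit in none of the boxes" should denote r1 (in nothing at all),
    # but r1 was eliminated at the word "in" and cannot be recovered:
    not_in_a_box = {r for r in candidates if rabbits[r] != 'box'}
    assert 'r1' not in not_in_a_box                       # the intended referent is lost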
What is required is a slightly more indirect approach. In the present implementation, semantic structures (akin to logical forms) are built word by word, and each structure is then evaluated independently using a dynamic semantics (with threading performed according to the structure of the logical form).
5 Implementation

At present there is a limited implementation, which performs a mapping from sentence fragments to fully scoped logical representations. To illustrate its operation, consider the following discourse: (19)
London has a tower. Every parent shows it ...
We assume that the first sentence has been processed, and concentrate on processing the fragment. The implementation consists of five modules:

1. A word-by-word incremental parser for a lexicalised version of dependency grammar (Milward, 1992), which takes fragments of sentences and maps them to unscoped logical forms.
INPUT: Every parent shows it
OUTPUT: λz.show(<∀,x,parent(x)>, <pro,y,T>, z)

2. A module which replaces lambda-abstracted variables with existential quantifiers in situ.
INPUT: output from 1
OUTPUT: show(<∀,x,parent(x)>, <pro,y,T>, <∃,z,T>)

3. A pronoun coindexing procedure which replaces pronoun variables with a variable from the same sentence, or from the preceding context.
INPUT: output(s) from 2, and a list of variables available from the context
OUTPUT: show(<∀,x,parent(x)>, w, <∃,z,T>)

4. An outside-in quantifier scoping algorithm based on Lewin (1990).
INPUT: output from 3
OUTPUT 1: ∀(x,parent(x),∃(z,T,show(x,w,z)))
OUTPUT 2: ∃(z,T,∀(x,parent(x),show(x,w,z)))

5. An `evaluation' procedure based on Lewin (1992), which takes a logical form containing free variables (such as the w in the logical forms above) and evaluates it using a dynamic semantics in the context given by the preceding sentences. The output is a new logical form representing the context as a whole, with all variables correctly bound.
INPUT: output(s) from 4, and the context ∃(w,T,tower(w) & has(london,w))
OUTPUT 1: ∃(w,T,tower(w) & has(london,w) & ∀(x,parent(x),∃(z,T,show(x,w,z))))
OUTPUT 2: ∃(w,T,∃(z,T,tower(w) & has(london,w) & ∀(x,parent(x),show(x,w,z))))

At present the coverage of module 5 is limited, and module 3 is a naive coindexing procedure which allows a pronoun to be coindexed with any quantified variable or proper noun in the context or the current sentence.
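Schematically, the five modules compose as below. This is our own shorthand for the module list above, with string placeholders standing in for real logical forms; only the control flow (one parse, a fan-out at coindexing and at scoping, and evaluation of each scoped form against the context) mirrors the implementation described.

    def parse_fragment(words):
        # Module 1 (stub): would return an unscoped lambda expression.
        return "lambda z. parsed(%s, z)" % " ".join(words)

    def quantify_in_situ(lam):
        # Module 2 (stub): replace the lambda-abstracted variable in situ.
        return lam.replace("lambda z. ", "").replace(", z)", ", <exists,z,T>)")

    def coindex_pronouns(lf, context_vars):
        # Module 3 (stub): one output per candidate antecedent variable.
        return [lf + " [pro := %s]" % v for v in context_vars]

    def scope(lf):
        # Module 4 (stub): one output per quantifier ordering.
        return ["scoping%d(%s)" % (i, lf) for i in (1, 2)]

    def evaluate(scoped_lf, context):
        # Module 5 (stub): evaluate the scoped form in the prior context.
        return "%s & %s" % (context, scoped_lf)

    def interpret(words, context, context_vars):
        lf = quantify_in_situ(parse_fragment(words))
        return [evaluate(s, context)
                for c in coindex_pronouns(lf, context_vars)
                for s in scope(c)]

    print(interpret(["Every", "parent", "shows", "it"],
                    "exists(w,T,tower(w) & has(london,w))", ["w"]))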
6 Conclusions

The paper described some potential applications of incremental interpretation. It then described the series of steps required in mapping from initial fragments of sentences to propositions which can be judged for plausibility. Finally, it argued that the apparently close relationship between the states used in incremental semantics and dynamic semantics fails to hold below the sentence level, and briefly presented a more indirect way of using dynamic semantics in incremental interpretation.
References

Alshawi, H. (1990). Resolving Quasi Logical Forms. Computational Linguistics, 16, p.133-144.

Altmann, G.T.M. and M.J. Steedman (1988). Interaction with Context during Human Speech Comprehension. Cognition, 30, p.191-238.

Barwise, J. (1987). Noun Phrases, Generalized Quantifiers and Anaphors. In P. Gärdenfors, Ed., Generalized Quantifiers, p.1-29. Dordrecht: Reidel.

Carletta, J., R. Caley and S. Isard (1993). A Collection of Self-repairs from the Map Task Corpus. Research Report HCRC/TR-47, University of Edinburgh.

Chater, N., M.J. Pickering and D.R. Milward (1994). What is Incremental Interpretation? ms. To appear in Edinburgh Working Papers in Cognitive Science.

Cooper, R. (1993). A Note on the Relationship between Linguistic Theory and Linguistic Engineering. Research Report HCRC/RP-42, University of Edinburgh.

Frazier, L. (1979). On Comprehending Sentences: Syntactic Parsing Strategies. Ph.D. Thesis, University of Connecticut. Published by Indiana University Linguistics Club.

Groenendijk, J. and M. Stokhof (1991). Dynamic Predicate Logic. Linguistics and Philosophy, 14, p.39-100.

Gross, D., J. Allen and D. Traum (1993). The TRAINS 91 Dialogues. TRAINS Technical Note 92-1, Computer Science Dept., University of Rochester.

Haddock, N.J. (1987). Incremental Semantic Interpretation and Incremental Syntactic Analysis. Ph.D. Thesis, University of Edinburgh.

Haddock, N.J. (1989). Computational Models of Incremental Semantic Interpretation. Language and Cognitive Processes, 4(3/4), Special Issue, p.337-368.
Hobbs, J.R. and S.M. Shieber (1987). An Algorithm for Generating Quantifier Scoping. Computational Linguistics, 13, p.47-63.

Joshi, A.K. (1987). An Introduction to Tree Adjoining Grammars. In A. Manaster-Ramer, Ed., Mathematics of Language. Amsterdam: John Benjamins.

Just, M. and P. Carpenter (1980). A Theory of Reading: From Eye Fixations to Comprehension. Psychological Review, 87, p.329-354.

Kurtzman, H.S. and M.C. MacDonald (1993). Resolution of Quantifier Scope Ambiguities. Cognition, 48(3), p.243-279.

Levelt, W.J.M. (1983). Monitoring and Self-Repair in Speech. Cognition, 14, p.41-104.

Lewin, I. (1990). A Quantifier Scoping Algorithm without a Free Variable Constraint. In Proceedings of COLING 90, Helsinki, vol 3, p.190-194.

Lewin, I. (1992). Dynamic Quantification in Logic and Computational Semantics. Research Report, Centre for Cognitive Science, University of Edinburgh.

Marcus, M., D. Hindle and M. Fleck (1983). D-Theory: Talking about Talking about Trees. In Proceedings of the 21st ACL, Cambridge, Mass., p.129-136.

Marslen-Wilson, W. (1973). Linguistic Structure and Speech Shadowing at Very Short Latencies. Nature, 244, p.522-523.

Mellish, C.S. (1985). Computer Interpretation of Natural Language Descriptions. Chichester: Ellis Horwood.

Milward, D.R. (1991). Axiomatic Grammar, Non-Constituent Coordination, and Incremental Interpretation. Ph.D. Thesis, University of Cambridge.

Milward, D.R. (1992). Dynamics, Dependency Grammar and Incremental Interpretation. In Proceedings of COLING 92, Nantes, vol 4, p.1095-1099.

Moortgat, M. (1988). Categorial Investigations: Logical and Linguistic Aspects of the Lambek Calculus. Dordrecht: Foris.

Pulman, S.G. (1986). Grammars, Parsers, and Memory Limitations. Language and Cognitive Processes, 1(3), p.197-225.

Resnik, P. (1992). Left-corner Parsing and Psychological Plausibility. In Proceedings of COLING 92, Nantes, vol 1, p.191-197.

Shieber, S.M. and M. Johnson (1993). Variations on Incremental Interpretation. Journal of Psycholinguistic Research, 22(2), p.287-318.
Shieber, S.M. and Y. Schabes (1990). Synchronous Tree-Adjoining Grammars. In Proceedings of COLING 90, Helsinki, vol 3, p.253-258.

Stabler, E.P. (1991). Avoid the Pedestrian's Paradox. In R. Berwick, S. Abney and C. Tenny, Eds., Principle-Based Parsing: Computation and Psycholinguistics. Kluwer.

Steedman, M. (1988). Combinators and Grammars. In R. Oehrle et al., Eds., Categorial Grammars and Natural Language Structures, p.417-442.

Thompson, H., M. Dixon and J. Lamping (1991). Compose-Reduce Parsing. In Proceedings of the 29th ACL, p.87-97.

Tomita, M. (1985). Efficient Parsing for Natural Language. Kluwer.

Veltman, F. (1990). Defaults in Update Semantics. In H. Kamp, Ed., Conditionals, Defaults and Belief Revision, DYANA Report 2.5.A, Centre for Cognitive Science, University of Edinburgh.

Wirén, M. (1990). Incremental Parsing and Reason Maintenance. In Proceedings of COLING 90, Helsinki, vol 3, p.287-292.

Zeevat, H. (1990). Static Semantics. In J. van Benthem, Ed., Partial and Dynamic Semantics I, DYANA Report 2.1.A, Centre for Cognitive Science, University of Edinburgh.
5
Incrementality in Discourse Understanding

Simon Garrod and Anthony Sanford

1 Immediacy, incrementality and modes of processing 100
2 Immediacy in relation to discourse comprehension 104
2.1 Immediate contextual recovery for pronouns and fuller noun-phrases 105
2.2 Immediacy in relation to information integration 110
3 Defining the relationship between immediacy and incrementality 115
Edinburgh Working Papers in Cognitive Science, Vol. 11: Incremental Interpretation, pp. 99-122. D. Milward and P. Sturt, eds. Copyright © 1995 Simon Garrod and Anthony Sanford.
Abstract
Twenty years ago in the Annual Review of experimental psycholinguistics, Johnson-Laird defined the fundamental problem as that of establishing what happens when we understand sentences: what mental operations occur, when they occur in relation to language perception, and in what order (Johnson-Laird, 1974). Psychologists thus have a long-standing interest in the time course of language processing, and it is against this background that we want to consider the issue of incrementality in discourse understanding.

The idea that human language comprehension is essentially incremental, continually adding to the interpretation as each word is encountered, is a very attractive one. In conversation, there is a strong impression that we know what our interlocutor is trying to say as it is being said; we may even be tempted, on occasion, to complete their sentence before they have finished speaking. Similarly when reading, there is a strong impression that the interpretation is being built up continuously as each word is encountered. So at the level of introspection the idea of incremental language comprehension is compelling. There is also a sound psychological reason for the language processing system to operate in this way, which comes from what is known about human memory and its poor capacity for dealing with uninterpreted information. Any well-adapted human language processing system should favour immediate interpretation when possible. However, although there is a large body of evidence to suggest that syntactic processing is essentially incremental (this is associated mainly with garden path phenomena studied extensively since Bever, 1970), comparable evidence for incremental semantic analysis, or incremental interpretation at the level of the discourse, is much harder to find.

So the principal aim of this paper is to review the evidence on the time course of discourse comprehension and see what light it can throw on the more general issue of incremental interpretation. In doing this we will come to the conclusion that it is important, from a psychological point of view, to distinguish two general modes of language processing: one concerned with building up an interpretation in an essentially incremental fashion, and the other a lower-level process concerned with matching patterns in the input against both local and global knowledge representations. We will argue that the latter mode of processing, which is immediate but not incremental, can operate at any level from syntactic analysis right through to discourse analysis. We start by drawing a distinction between immediate `recovery' of information and immediate `integration' and show how this can be applied at the level of sentence interpretation. We then review the evidence on immediacy of discourse comprehension in the light of this distinction and show how it may help to sort out a number of apparent contradictions in the findings.
1 Immediacy, incrementality and modes of processing

We have already indicated that immediacy of processing would seem to be well motivated from a psychological point of view, because of the severe memory constraints which apply to holding uninterpreted information. But there are a number of ways in which this immediate interpretation might operate, and it is important to specify these if we are to make sense of the experimental evidence on discourse understanding. The two main versions of immediacy that need to be considered are immediate `recovery' of information and immediate `integration' of information. To illustrate this distinction, consider the situation with syntactic parsing. As each word and phrase is encountered, it is generally assumed that the processor immediately recovers relevant syntactic information about the element in question, for example its syntactic category, morphological structure and so on. So syntactic parsing
is generally thought to be immediate at this low level of information recovery. However, most psychologists would want to argue that the immediacy of syntactic processing does not stop there: it must also involve some form of integration, to specify the various syntactic relations between the words and phrases encountered. In other words, there must be both a process of immediate recovery and a process of immediate integration into the current phrase marker if the system is to be described as a truly incremental processor. In general, research on the time course of parsing supports the view that syntactic processing is immediate and incremental in this stronger sense. For example, Frazier & Rayner (1982) demonstrated that readers confronted with certain kinds of structural syntactic ambiguity would commit themselves in the first instance to only one reading. Thus when encountering the ambiguous sentence fragment in (1), readers typically treat the prepositional phrase on the cart as attached to the verb loaded rather than to the noun phrase the boxes. (1)
Sam loaded the boxes on the cart ......
They were able to demonstrate this by measuring the reader's eye movements with the two versions of a sentence containing this ambiguous fragment shown below: (2) (3)
Sam loaded the boxes on the cart before lunch.
Sam loaded the boxes on the cart onto the van.
When presented with (3), readers encountered difficulty at the point where the prepositional phrase attachment was disambiguated. They spent substantially longer fixating this region of the sentence and were much more likely to refixate the ambiguous fragment shown in (1). Frazier and Rayner used this finding to argue that the syntactic parsing process is essentially incremental, with the processor opting to track the simplest structure whenever a potential syntactic ambiguity is encountered; in this case it is the structure involving minimal attachment.[1] From the present point of view, what is important about these findings is that they demonstrate a pressure towards immediate incremental analysis even at the risk of subsequent misunderstanding. At the same time, they illustrate the main pitfall of doing it this way, which arises from the problem of early commitment. The system will either be forced to track multiple alternative analyses or be forced to make an early and risky commitment to following one line of interpretation over the other. In fact Frazier & Rayner (1987) found other cases of local syntactic ambiguity, associated with lexical categorisation, that do not trigger such immediate commitment.

[1] This particular account of the result may be open to question, but it is beyond the scope of this paper to give a complete account of the issues surrounding incrementality in syntactic parsing (see other papers in the series).
For instance, when readers are given sentences such as (4) or (5) below, a different pattern of results emerges. (4) (5)
I know that the desert trains young people to be especially tough.
I know that the desert trains are especially tough on young people.
Here a potential syntactic ambiguity is present at the words desert trains, which could either be treated as noun plus verb (as in 4) or as adjective plus noun (as in 5). These sentences were compared with the unambiguous controls (4′ and 5′) below: (4′)
I know that this desert trains young people to be especially tough.
(5′)
I know that these desert trains are especially tough on young people.
Now if the processor always interpreted incrementally, even at the expense of making an early commitment, then readers should take just as long (if not longer) processing the ambiguous fragment desert trains in (4) and (5) as when processing the disambiguated fragment in (4′) and (5′). In fact quite the opposite pattern emerges. The readers spend more time fixating the words in the unambiguous case than in the ambiguous one, but then spend less time on the remainder of the sentence. This is consistent with a mechanism that holds off interpretation of the ambiguous information and attempts to look ahead for disambiguators before committing itself to an immediate incremental analysis.[2] As Frazier and Rayner point out, the extent of such delaying may be limited to one or two words, which will usually be quite sufficient to sort out this kind of syntactic category ambiguity. Nevertheless the result does point to a certain amount of flexibility in relation to the immediate and incremental analysis of syntactic structure.[3]

When we turn to the work on immediacy in relation to semantic processing, the picture turns out to be even more complicated. Readers do seem to be garden pathed when they encounter radical lexical ambiguities, but not when they encounter ambiguities of sense. Frazier and Rayner (1990) set up a contrast between sentences like (6 & 7) containing the ambiguous word pitcher (vase vs. baseball thrower) and sentences like (8 & 9) containing the word record, which is not lexically ambiguous but does admit different senses (disc vs. account). (6) (7)
Being so elegantly designed, the pitcher pleased Mary.
Throwing so many curve balls, the pitcher pleased Mary.
[2] This result is also consistent with an immediate interpretation account in a resource-unbounded parallel model. But on this account it would be assumed that the extra processing time for trains in (4′) and (5′) is due to the fact that trains is disambiguated at that point and hence attracts longer fixation (see Mitchell, 1994, for a discussion of how processing load increases at the point of disambiguation).

[3] Frazier and Rayner proposed that what might determine delayed processing is whether or not the analysis depends only on prestored information (e.g. syntactic categorisation) or requires actual computation. The assumption is that computing alternative interpretations is more costly than recovering prestored alternatives.
(8) (9)
After they were scratched, the records were carefully guarded.
After the political take-over, the records were carefully guarded.
In all cases the words were chosen so as to have one dominant and one subordinate meaning or sense. Hence, out of context, subjects tend to take pitcher to refer to a baseball thrower and record to refer to a disc. To establish whether or not the reader would immediately assign one interpretation over the other, they compared cases like (6), where the disambiguating phrase precedes the ambiguous word, with cases like (10), where it follows. (10)
Of course the pitcher pleased Mary, being so elegantly designed.
The results were quite striking. When the target word was lexically ambiguous (e.g. pitcher), there was generally a marked advantage in having the disambiguating phrase precede the target, and, if disambiguation was toward the non-preferred meaning, readers spent a little extra time fixating the word, but soon recovered. However, if the disambiguating phrase followed the target, readers spent longer overall and experienced considerable extra difficulty in cases where the overall context favoured the non-preferred meaning of the word. So for the lexically ambiguous cases the pattern of results is consistent with the standard garden path situation with syntactic ambiguity. When readers encounter a choice point in constructing an interpretation they track one and only one meaning, and in the absence of prior disambiguating context the meaning that they track corresponds to the dominant meaning of the word in question.

However, the process does not seem to be quite the same when the materials contain multiple-sense words like record or newspaper. In these cases, there was no detectable reading time difference overall between prior and post disambiguation conditions. Having no context to select between different senses of the target word did not seem to lead to immediate adoption of one sense or the other. However, there was some evidence that when prior context was available readers would immediately adopt the appropriate reading. This came from a small but reliable effect of dominance immediately following the reading of the target word: in a context which selected the subordinate sense, readers would take slightly longer to integrate the information.

Frazier and Rayner explain this result in relation to what they called the immediate partial interpretation hypothesis. According to this hypothesis the processor will generally operate in an immediate and incremental fashion but may delay its semantic commitments, if this does not result in either (a) a failure to assign any semantic value whatsoever to a word or major phrase, or (b) the need to maintain multiple incompatible values for a word, phrase or relation. To account for the difference between lexical and sense ambiguity they assumed that meanings and senses relate to different kinds of underlying mental representation. Whereas there is no single representation for the two lexemes underlying an ambiguous word, different senses of the same word can be represented in a single more abstract form, and so do not require maintenance of incompatible alternative semantic values at this level.
Whether or not Frazier and Rayner's account proves to be correct, the results of these studies illustrate some of the issues surrounding incremental processing. From a psychological point of view it would seem that there are two general constraints operating, both related to working memory limitations. The first is that of requiring some immediate interpretation for each element as it is encountered, to avoid holding uninterpreted material; the second, opposing constraint is that of only being able to track one interpretation at a time, to avoid holding multiple incompatible interpretations of the same material. This latter constraint presumably reflects the system's inability to simultaneously track alternative readings, and also guards against the risk of combinatorial explosion when trying to trace out all the possible alternative interpretations downstream of the initial ambiguity.

We are now in a position to consider how such constraints on immediacy and incrementality might be expected to affect processing beyond the sentence. In relation to the discussion of parsing and semantic analysis there are two important issues to consider: first, what might correspond to information recovery and information integration at the level of discourse understanding; and secondly, what constitutes a sufficient level of interpretation for an item to impose no special memory load, in effect, what degree(s) of partial interpretation may be possible at this level.
2 Immediacy in relation to discourse comprehension

Like syntactic parsing and semantic analysis, discourse processing can be viewed as requiring both recovery of information and integration of that information into an overall representation of the text. Garrod and Sanford (1994; see also Garrod, 1994) have argued that discourse comprehension is essentially a process of anchoring interpretations of the sentence and its fragments (i.e. noun-phrases, verb groups, etc.) into this representation. Recovery of information therefore corresponds to identifying the appropriate anchoring site in the representation, and integration corresponds to incorporating and linking (e.g. through various coherence relations) the current information in the sentence into that already represented at that site. The most straightforward examples of anchoring arise with anaphoric expressions such as pronouns or fuller definite noun phrases. Thus in the following example, discourse level interpretation requires the reader to identify the co-indexed items in the two sentences: (11) (12)
Bill wanted to lend Susan1 some money2.
She1 was hard up and really needed it2.
In effect the pronoun she identifies with the entity corresponding to Susan in the discourse representation, and the pronoun it identifies with the entity corresponding to some money. But full discourse level interpretation of sentence (12) also requires integrating this information with the prior representation. The reader needs to do this in order to be able to infer that
the reason for Bill's wanting to lend the money to Susan was his recognition of her parlous financial state (see Garrod, 1994, for a fuller discussion). In such a simple example it is easy to conceive of the recovery process and the integration process as, in principle, independent of each other. The pronoun she will identify Susan and it will identify the money solely on the basis of gender and number matching. But this is by no means always the case. For example, with the following variant of (11 and 12), the first pronoun is potentially ambiguous when considered in isolation. (13) (14) (15)
Bill1 wanted to lend his friend2 some money.
He2 was hard up and really needed it.
However, he1 was hard up and couldn't afford to.
So with examples like (13,14 & 13,15) the processor is presented with the same problem encountered with the lexical or sense ambiguities discussed above: choosing the appropriate interpretation depends upon information only available downstream of the pronoun. This means that making a referential commitment to the interpretation of a pronoun or fuller NP is very much like making a semantic commitment to one particular sense of a word like record. According to the immediate partial interpretation hypothesis, we would expect the whole process to be similar to that of sense selection: in the presence of strong prior evidence, the processor should immediately make a referential commitment, whereas in its absence the processor should retain a more abstract representation reflecting only the semantic content of the anaphoric description. This contrasts with an immediate referential commitment hypothesis, whereby the processor would always make an initial commitment to one referential interpretation, but at the risk of subsequently being garden pathed. So let us turn to the psychological evidence for immediacy in relation to anaphoric processing. First we consider the evidence for immediate recovery, in the sense given above, and then turn to the evidence on immediate integration.
2.1 Immediate contextual recovery for pronouns and fuller noun-phrases

A technique that has been widely used to draw inferences about when information is recovered during reading involves antecedent probe recognition. In the case of written material, the subject is presented with a text, usually one word at a time, and at a critical point a probe word is presented. They are then required to make a timed judgement of whether or not the probe matched a word in the prior text. In one of the earliest probe recognition studies, Dell, McKoon & Ratcliff (1983) used texts of the following kind:

A burglar surveyed the garage set back from the street. Several milk bottles were piled at the curb. The banker and her husband were on vacation.
The criminal/A cat slipped away from the street lamp.

At the critical point following either the anaphor the criminal or the non-anaphor a cat they presented the test word burglar for probe recognition. They found that recognition was primed immediately following criminal as compared to cat. They also obtained a similar enhancement for words drawn from the sentence in which the antecedent had occurred (e.g. garage). This finding, together with related findings from Gernsbacher (1989), suggests that the relevant antecedent information is recovered rapidly (at least within 250 msecs.) following exposure to an anaphor. Gernsbacher also demonstrated a similar pattern of results with proper name anaphors. In this case, she was able to show a reliable differential between positive priming for the antecedent and inhibition for a non-antecedent, relative to a point just before the anaphor. It would therefore seem that there is evidence for immediate recovery of contextual information, at least for explicit anaphors such as repeated names and definite descriptions.

In the case of pronouns, the situation is somewhat more complicated. In a spoken cross-modal version of the priming task, Shillcock (1982) demonstrated some early effects following presentation of an unambiguous pronoun, but only in terms of suppression of the non-antecedent control word. Gernsbacher (1989), on the other hand, did not find evidence for such a rapid priming difference in the case of unambiguous pronouns. The only clear difference in her study emerged at the end of the sentence containing the anaphor. So while there is a clear indication that the fuller anaphors immediately recover antecedent information, with pronouns this does not always seem to be the case.

However, this apparent contradiction in the findings for the fuller anaphors versus pronouns may have something to do with the nature of the probes that are used. Cloitre & Bever (1988) report a number of experiments which suggest that noun anaphors only immediately activate surface information about their antecedents, whereas pronouns immediately activate deeper conceptual information. The experiments compared priming effects using a number of different tasks. In general, they found that tasks which tapped recovery of conceptual information about the antecedent, such as category decision, produced earlier effects following the pronoun than the noun anaphors, whereas the opposite was true for a lexical decision task, which taps surface information. At the same time, secondary effects associated with conceptual properties of the antecedent, such as concreteness, emerged in the immediate responses following the pronoun but not the noun anaphors. So one possibility is that the different referential devices recover different types of information from the prior discourse, with pronouns having a privileged status in terms of access to conceptual information about the antecedent.

A recent set of experiments by Vonk, Hustinx & Simon (1992) also indicates that unambiguous pronouns may on occasion recover information related to the antecedent more rapidly than fuller anaphors. They used materials where subsequent reference was made to a character that was clearly established as the thematic subject of the preceding text. The following target sentence then contained either a pronoun or a definite description anaphor identifying this character.
Under these circumstances they found evidence for earlier recovery of information following the pronoun as opposed to the fuller description. This reinforces previous claims that topicalisation and antecedent focusing may play an important part in antecedent
recovery for pronouns as opposed to fuller anaphoric references (Sanford, Moar & Garrod, 1988; Gordon, Grosz & Gilliom, in press).

In relation to the first criterion of immediacy, immediate information recovery, anaphoric processing seems to behave in a similar fashion to other sentence-internal semantic processes such as sense selection. However, it is perhaps worth pointing out that antecedent priming studies are not without their problems. In particular, presenting texts in a piecemeal, word-by-word fashion is a poor simulation of the normal reading process and may well interfere with the time course of the sentence resolution. A second issue that turns out to be particularly important for the interpretation of pronouns is the degree to which they identify a focused antecedent in the discourse representation, hence the conflict between the Vonk et al. (1992) results and those from Gernsbacher (1989).

A less invasive procedure for establishing what is happening during reading is to track eye-movements, and there have been a few studies which have used this technique to look at recovery of contextual information. The first study we consider looked at the interpretation of unambiguous pronouns with antecedents either close in the text or far removed. By measuring the amount of time the reader spent fixating the pronoun and subsequent regions of the sentence, Ehrlich & Rayner (1983) were able to demonstrate an antecedent distance effect. When the antecedent was distant, readers spent a reliably longer time fixating the region immediately after the pronoun and for a few words beyond it, as compared to the other condition. This result is consistent with the idea that the pronoun immediately triggers access to its antecedent, but that recovery takes longer when the antecedent is distant, and so presumably out of focus.

The second eye-tracking study that has some bearing on the time course of antecedent recovery looked at the interpretation of definite description anaphors. This study, reported by Garrod, O'Brien, Morris & Rayner (1990) and based on an earlier study by O'Brien, Shank, Myers & Rayner (1988), explored the effects of role restriction constraints on the time taken to interpret the anaphors. Various contexts were constructed which could impose a potential restriction on the nature of an antecedent referent. An example set is shown below: (16) (17) (18) (19)
He assaulted her with his weapon
He stabbed her with his weapon
He assaulted her with his knife
He stabbed her with his knife
After a further intervening sentence subjects were then presented with one of the following target sentences, and their eye-movements were recorded: (20)
a. He threw the knife into the bushes, took her money and ran away.
b. He threw a knife into the bushes, took her money and ran away.
The basic question of interest was how the different types of contextual restriction on the antecedent knife might affect the subsequent fixation time for the reference to the knife in sentence (20.a). In sentences (18) and (19) the antecedent is explicitly introduced as a knife, whereas in sentence (17), as opposed to (16), the verb implicitly restricts the weapon to be knife-like. So one question that the study addressed was how these two forms of restriction might affect the amount of time the reader actually fixated the subsequent anaphor the knife. The study also had a control condition to establish any general lexical priming advantage from the context; hence the inclusion of the non-anaphoric matching NP a knife on half of the trials. An example of one material in all its conditions is shown in Table 1.
Table 1: Materials in Garrod, O'Brien, Morris and Rayner (1990)

All the mugger wanted was to steal the woman's money. But when she screamed, he [stabbed / assaulted] her with his [knife / weapon] in an attempt to quieten her down. He looked to see if anyone had seen him. He threw [the / a] knife into the bushes, took her money, and ran away.

Factors manipulated:
1. Restricting versus non-restricting context for the antecedent (i.e. stabbed v. assaulted).
2. Explicit matching of the antecedent to the target noun knife (i.e. knife v. weapon).
3. Target in a definite or indefinite NP (i.e. the... v. a...).

The resulting fixation durations on the critical noun-phrase are shown in Figure 1. With the non-anaphoric controls, there was only a reading advantage when the antecedent exactly matched the lexical specification of the target noun. Contexts containing either (18) or (19) led to shorter reading times than contexts containing either (16) or (17), but the implicit restriction from the verb had no effect whatsoever. However, with the anaphoric target sentences, fixation duration was equally reduced by either implicit restriction from the verb, as in sentence (17), or lexical specification of the antecedent, as in (18) and (19). So the only case where there was a reliably longer reading time was when neither restriction applied, as in (16). This experiment clearly demonstrates that an anaphor immediately recovers the contextual information in the antecedent.
Figure 1: Fixation durations (RT, msecs) on the critical noun-phrase for non-anaphoric and anaphoric targets, in restricting and non-restricting contexts with explicit and implicit antecedents.
Although there was a lexical priming effect observed for the non-anaphoric control, there was no effect associated with the role restriction imposed by the verb or any other part of the sentence. The role restriction effect observed in the anaphoric materials must therefore come from attempting to interpret a definite description which signals some coherent link between antecedent and anaphor.

In conclusion, both the priming studies and the few eye-tracking experiments reported to date indicate that the recovery of contextually relevant information occurs at the time of encountering a fuller anaphor. In the case of pronouns the evidence is not quite so clear-cut. The priming studies indicate that the form of the antecedent may not be accessed as rapidly with pronouns as with the fuller anaphors, but at the same time they suggest that deeper conceptual information can be recovered more rapidly. However, the eye-tracking experiment by Ehrlich & Rayner (1983) would also indicate that, for pronouns, immediacy is subject to focusing constraints on the antecedent. In relation to our opening discussion it would therefore seem that discourse interpretation proceeds immediately with respect to recovering relevant antecedent information, or identifying the site in the discourse representation where subsequent information is to be integrated. However, this still leaves
open the issue of incrementality in the stronger sense of immediate integration. We turn to this issue in the next section.
2.2 Immediacy in relation to information integration

Establishing that contextually relevant information has been recovered immediately following an anaphor does not license the stronger assumption that this information is immediately integrated into the sentence interpretation. The point is particularly important in relation to resolving sentences containing pronouns, where the overall coherence of the interpretation seems to play such an important part. One attempt to establish the immediacy of integration is reported in a study by Garrod & Sanford (1985) using a spelling error detection procedure. They presented materials of the following form, and measured the time readers took to detect the spelling error on the critical verb:

A dangerous incident in the pool

Elizabeth was an inexperienced swimmer and wouldn't have gone in if the male lifeguard hadn't been standing by the pool. But as soon as she got out of her depth she started to panic and wave her hands about in a frenzy.

(a) Within seconds Elizabeth senk (sank) into the pool
(b) Within seconds Elizabeth jimped (jumped) into the pool
(c) Within seconds the lifeguard senk into the pool
(d) Within seconds the lifeguard jimped into the pool

The passages always introduced two characters, described as being in different states. For instance, in this example the main character Elizabeth is described as out of her depth in the pool, while the subsidiary character the lifeguard is standing by the pool. The passages then continued with one of four critical sentences containing misspelled verbs depicting actions either consistent or inconsistent with what was known about each character in the story. Hence, given the current state of Elizabeth, it is consistent for her to sink at this point but not to jump, while the opposite is true for the lifeguard. Furthermore, the consistency or inconsistency is only apparent in relation to the whole anaphor-verb complex, in that it is not the jumping or sinking which is anomalous but the fact that Elizabeth should jump or the lifeguard sink given the current state of the discourse scenario.

The subjects read the materials one sentence at a time while monitoring for spelling errors. We reasoned that immediate and incremental resolution, if it occurred, should be reflected in increased latency for detection of the spelling error on the inconsistent verb. We found just such consistency effects in the case where the anaphors were proper names or definite descriptions, as in the example above. However, when the full anaphors were replaced with unambiguous pronouns, a more complicated pattern of results emerged.
When the pronoun referred to the main character or thematic subject of the passage (e.g. Elizabeth in this example), readers took longer to detect a misspelling on the inconsistent verb, but when it referred to the subsidiary character (e.g. the lifeguard) there was no evidence of such an effect at all. These results suggested that the nature of any incremental resolution depends on both the linguistic form of the anaphor and the focus state of the discourse representation at that point, on the assumption that the thematic subject, but not the subsidiary character, would be in focus at that point.

Although these results are consistent with immediate integration of the sentence in relation to the discourse representation, the procedure is not without its problems. Readers take a long time to detect spelling errors and only manage to read the material at about half the normal rate. Consequently it is not clear to what extent the spelling error is being responded to immediately the verb is encountered. To overcome these inherent methodological limitations, Garrod, Freudenthal and Boyle (1994) carried out an eye-tracking study based on the earlier experiment. With the eye-tracking procedure it is possible to present the same materials containing contextually consistent or inconsistent verbs and measure the point in the sentence at which the reader first detects the anomaly. We reasoned that this should appear as a first-pass increase in fixation duration, either on the verb itself or on the immediate post-verb region of the sentence, together with evidence of increased regressions to earlier regions. Apart from the full anaphor versus pronoun contrast, we also manipulated the degree to which the pronoun was disambiguated by gender and number. So there was a condition where the antecedents were gender differentiated, as in Garrod & Sanford (1985), and a matching condition where they were not (e.g. we changed "Elizabeth" in the above example to "Alexander"). The critical question is how early the inconsistency can be detected in the eye-movement record. An example set of materials for the various pronoun conditions is shown below:

A dangerous incident in the pool

Elizabeth1/Alexander2 was an inexperienced swimmer and wouldn't have gone in if the male lifeguard3 hadn't been standing by the pool. But as soon as she1/he2 got out of her1/his2 depth she1/he2 started to panic and wave her1/his2 hands about in a frenzy.

Within seconds she1 sank into the pool (+F+G+C)
Within seconds she1 jumped into the pool (+F+G-C)
Within seconds he3 jumped into the pool (-F+G+C)
Within seconds he3 sank into the pool (-F+G-C)
Within seconds he2 sank into the pool (+F-G+C)
Within seconds he2 jumped into the pool (+F-G-C)

(+F = matches focused antecedent, +G = gender differentiated, +C = consistent verb. The `-' conditions represent the converse situation.)
Considering first the results from the pronoun conditions, there was strong evidence for very early detection of inconsistency, but only in the case where the pronoun was both gender disambiguating and maintained reference to the focused antecedent (conditions +F+G+/-C). The magnitudes of the consistency effects, represented as differences in fixation time between the consistent and inconsistent verb conditions in msecs. per character, are shown in Figure 2b. Turning to the conditions with the fuller anaphors, a rather different pattern of results emerged. Here there was no evidence of early detection of verb inconsistency (see Figure 2a). However, in all anaphor conditions there were marked effects of consistency appearing in the second-pass fixations for the verb and subsequent regions. So while it is clear that the readers all ultimately detect the anomaly, it seems that it is only when the pronoun identifies the focused thematic subject of the passage that there is an immediate attempt to integrate the contextual information.

Taken together, these results indicate that the immediate resolution of the pronouns comes about through an interaction between the syntactic (gender) information in the pronoun and the focus state of the prior discourse representation. If an antecedent is focused and the pronoun uniquely identifies it through gender matching, then the system makes an early commitment to the full interpretation. When either of these conditions does not hold, commitment is delayed until after the verb has been encountered. Presumably this arrangement makes best use of a coherence checking mechanism to fix the final interpretation of the pronoun.

The outcome for the fuller anaphors is much more surprising. On the one hand, there is clear evidence from the antecedent priming literature discussed above that the fuller anaphors immediately recover some information about their antecedents; on the other hand, this does not seem to lead to an immediate commitment on the part of the processing system to that particular referential interpretation. One possible explanation for the apparent discrepancy comes from considering the degree to which the fuller forms presuppose that particular interpretation. Full definite descriptions, unlike pronouns, only occasionally take their meaning from explicitly introduced antecedents. Fraurud (1990) found that over 60% of definites in a large corpus of written text were first mentions without discourse antecedents. This arises because in many cases the definite identifies an implicitly defined role in the discourse situation (see Garrod & Sanford, 1982; 1990; 1994). The difference in presupposition between the pronoun and the fuller definite can be illustrated with the materials discussed above, where the target sentences containing the fuller descriptions would be perfectly acceptable without explicit discourse antecedents. Thus the following amended version of the example material is a well-formed text with the definite description but not with the associated pronoun:

Elizabeth wouldn't have gone in if her sister had not been standing by the pool. But as soon as she got out of her depth she started to panic and wave her hands about in a frenzy. Within seconds, the lifeguard/he* jumped into the pool.
Figure 2: Consistency effects (msec./char.) at the verb and post-verb regions: (a) full anaphors (proper names and definite descriptions); (b) pronouns (+gender/+focus, +gender/-focus, -gender). From Garrod, Freudenthal & Boyle (1994).
The passage can be understood and is perfectly coherent even when the fuller description does not have an explicit antecedent in the prior text. This is clearly not the case with the pronoun version. So one possible consequence of the differences in contextual presupposition might be in terms of the requirements for immediate commitment to one particular referential interpretation. A second, related issue concerns the degree to which it is possible to formulate a partial interpretation for the pronoun, as opposed to the fuller definite, which would satisfy the immediate interpretation constraint discussed in the previous section. If there is no effective partial interpretation for the pronoun, then holding it in memory while attempting to process the rest of the sentence may impose a heavy memory load on the system.

The fact that it is possible to differentiate experimentally between immediate recovery and immediate integration of antecedent discourse information motivates drawing a distinction between anaphor bonding and anaphor resolution. As Sanford (1985; also Sanford & Garrod, 1989) suggests, anaphors may immediately set up bonds with potential antecedents without necessarily forcing a commitment to referential resolution at that stage. Consider the following sentence-pair: (21)
Sailing to Ireland was eventful for Harry. It ( the boat?) sank without trace.
The second sentence sounds odd with the pronoun even though a potential referent, the boat, is easily inferred from the context. It gives a compelling impression that it was Ireland that sank without trace, even though this is ruled out in the ultimate interpretation (the "sounds-like" effect). Sanford, Garrod, Lucas and Henderson (1983) showed that reading times were longer for such sentences than for their counterparts in which bonding is ruled out by gender and number cues, and where no sounds-like effect is reported:

(22) Being arrested was embarrassing for Andy. They (the police?) took him to the station in a van.
A sentence may be said to be "bond enabling" if there is a suitably foregrounded element which can serve as a false antecedent for a pronoun, by virtue of a match in number and gender. Now it might be supposed that (21) creates a problem as compared to (22) because there is an incorrect but immediate resolution of the pronoun it, constituting a semantic garden path. But as Sanford (1984) showed, a similar problem does not arise with sentence-pairs like (23) and (24):

(23) Sailing to Ireland was eventful for Jim. It was a really windy day.
(24) John had a fearful headache. It was Dr. Brown who had prescribed the wrong pills.
In order to accommodate the difference between the effects of (21) and (23), the distinction between bonding and resolution was proposed. To explain the sounds-like effect in (21) we
have to assume that the possibility of coreference was entertained at some point. But because there is no such effect with sentences like (23), immediate false reference resolution seems to be ruled out. On encountering the pronoun, and even later in most of the materials, the processor would have no way of knowing that (21) and (23) were different syntactic and semantic forms. Our argument was that early bonding takes place when a pronoun is encountered that has a suitable (but possibly false) antecedent. The bond simply associates the pronoun with the possible antecedent word, without assigning any specific semantic relation to the association, hence the term bond. If the predicate then indicates that the pronoun is being used coreferentially, as in (21), the association is tested as a probable site for instantiation as an anaphor. If it does not, as in (23), then further processing of the bond does not occur. So it is only in the former case that the sounds-like effect occurs. Of course, under normal circumstances where an anaphoric relation is intended, bonding will facilitate resolution by providing early identification of the locus for the relation.

So, in processing terms, bonding amounts to locating where in the representation relevant information may be found, whereas resolution involves commitment to one particular interpretation at that point in the process. Such a commitment would in effect pipe the relevant contextual information through to the processing system and so enable it to integrate subsequent information in the sentence directly into the discourse representation. In the same way that the syntactic processor may be loath to always make early commitments, as in the Frazier & Rayner (1987, 1990) experiments, it seems that the sentence resolver may also be loath to make such immediate referential commitments except under rather special conditions. With respect to the whole question of incrementality of discourse processing, this would indicate that truly incremental processing is possible but may be subject to strategic manipulation. The results also raise broader-ranging questions about the extent to which evidence for immediacy should always be taken as indicative of incrementality in processing. Below we end by considering this more general question in the light of the evidence from syntactic parsing, semantic processing and processing at the level of the discourse as a whole. (A schematic sketch of the bonding/resolution distinction is given directly below.)
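The bonding/resolution distinction can be made concrete with a small schematic sketch in Python. The code below is purely illustrative and entirely our own rendering: the entity representation, the function names and the exact commitment conditions are hypothetical simplifications of the verbal model, not an implementation proposed in the literature.

    def bond(pronoun, discourse_entities):
        """Bonding: associate the pronoun with possible antecedents that
        match in gender and number, without assigning any specific
        semantic relation and without committing to reference."""
        return [entity for entity in discourse_entities
                if entity["gender"] == pronoun["gender"]
                and entity["number"] == pronoun["number"]]

    def resolve(candidates, predicate_signals_coreference, antecedent_focused):
        """Resolution: commit to one interpretation only when the
        predicate indicates coreferential use and the antecedent is
        both focused and uniquely identified; otherwise delay
        commitment until coherence can be checked."""
        if (predicate_signals_coreference and antecedent_focused
                and len(candidates) == 1):
            return candidates[0]   # early commitment to full interpretation
        return None                # commitment delayed

On this sketch, the sounds-like effect in (21) corresponds to a bond being formed (Ireland matches it) and then tested when the predicate signals coreference, while in (23) the predicate never triggers the test, so the bond is simply left unprocessed.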
3 Defining the relationship between immediacy and incrementality

We are now in a better position to say something about the general question of how immediacy in processing relates to incrementality. We can start with the two general psychological processing constraints identified at the beginning of this paper: the constraint that the system should avoid trying to maintain uninterpreted information, and the competing constraint that it should avoid having to track multiple incompatible interpretations of the sentence as a whole. Whereas evidence for tracking only one interpretation at a time is well documented in a range of garden-path phenomena, less is known about the operation of the first constraint on the depth of the immediate analysis. The main issue raised by this constraint concerns the criterion for interpretation: just how complete does the interpretation of any unit of input have to be to satisfy it? At the syntactic
level this would seem to be moderately straightforward, but even here there may be problems. For example, Perfetti (1990) has argued that the parser may on occasion only immediately compute partial interpretations which leave certain aspects of the syntactic representation vague. Thus he argues that local constituents such as prepositional phrases may be constructed incrementally without necessarily committing the system to remote attachments for the phrases at that time. This would enable the processor to avoid having to opt for one particular high-level syntactic analysis in the absence of clear syntactic triggers, while complying with the second processing constraint against tracking multiple incompatible interpretations.

Questions about partial or incomplete interpretation are particularly relevant to processing at the semantic and discourse levels, since it is often not clear what constitutes a complete interpretation at these levels. There are quite a few studies which give evidence for incomplete or shallow semantic processing. For example, consider the extent to which the full meaning of a word may be ignored when constructing sentence meaning. One well-known example of such partial processing is the "Moses Illusion" (Erickson and Mattson, 1981), in which readers routinely fail to notice the anomaly in the question How many animals of each sort did Moses put on the ark? (It wasn't Moses, it was Noah.) Barton and Sanford (1993) investigated several constraints controlling detection of a related type of anomaly:

(25) Suppose that there is an airplane crash with many survivors who were European. Where should they be buried?
More than half of the subjects failed to notice this anomaly. In contrast, many more noticed it in the following version:

(26) Suppose that there was an airplane crash with many survivors. Where should they be buried?
Not only does this result show that processing of the expression "survivors" is shallow (i.e., it is partial or incomplete), it also shows that when the context sentence contains information that is relevant to answering the question, then deeper analysis does not take place. In the present case, that information is present in (25), where we are told that the survivors are European, but not in (26) (see Barton and Sanford for other conditions supporting this analysis). So depth of processing depends in part on the overall goals of the comprehension process. In this case the aim is to determine where the people should be buried. Consequently, when the salient information is available it is processed. However, when it is not available, the earlier information must be re-analysed at a deeper level in order to find an answer. Barton and Sanford (1993) also tested predictions based on the idea that the contribution which a word makes to a discourse representation is very much a function of context. In a context where the term survivors is highly salient, it is to be expected that the impact of the word itself (or the amount of processing of its semantic representation) will be low. In one
experiment, they compared a scenario in which death and survival are only weakly implied, as in a bicycle race crash, with the standard aircrash scenario, and readers were considerably more successful at detecting the anomaly. So it is clear that the scenario has a strong influence on the degree of interpretation. However, the results of a further experiment suggest that its effect is not strictly incremental. Barton and Sanford varied the order of presentation of the background information relative to the anomalous target item. If the scenario's effect is incremental, then one would expect order of presentation to be important, since it would only be once the scenario had been established that the full interpretation of the anomalous item should be blocked. The following comparisons were made, and the detection rates are given for each of them:
Early scenario, passive VP: 26% ("When an aircraft crashes, where should the survivors be buried?")
Early scenario, active VP: 44% ("When an aircraft crashes, where should you bury the survivors?")
Late scenario, passive VP: 31% ("Where should the survivors be buried after an aircrash?")
Late scenario, active VP: 37% ("Where should you bury the survivors of an aircrash?")
The results showed no overall tendency for reduced detection of the anomaly when the scenario came late. So they concluded that a strictly incremental build-up of constraints was not taking place. One initially curious result which they also found was that the anomaly in the noun phrase the surviving dead was less likely to be detected than in the simpler survivors, while phrases such as the surviving injured, the surviving wounded, etc., had high probabilities of detection (Sanford and Barton, in prep.). The explanation given was that the reference to the dead in this context fits expectations so well that it satisfies some criterion of comprehension sufficient to block analysis of the rest of the noun phrase. With reference to injured, the fit is not quite so good, so the analysis is less likely to be blocked. This would imply that a full semantic representation of the noun phrase may not be computed before integration into the existing discourse model. As a consequence, there may be occasions when readers do not check the internal coherence of the phrase itself. Of course, it might be assumed that these examples are unusual and do not reveal much about normal processing. But it would be difficult to show that the present results are the outcome of unusual processing. Furthermore, shallow processing of the Moses Illusion type is well established (e.g. Reder and Kusbit, 1990), and there are several other examples of
incomplete processing which have been reviewed elsewhere (e.g., Sanford and Garrod, 1994), including further cases where pragmatics appears to override local semantics, as in the so-called "depth-charge sentences" of Wason and Reich (1979):

(27) No head injury is too trivial to be ignored.
(28) No missile is too small to be banned.
In (27) the usual reading is "however trivial a head injury, it should not be ignored". In (28) the reading is "however small a missile, it should be banned". A little reflection will show that these two readings are incompatible with one another, and that the first case is an instance where local semantics has been overridden by pragmatics (i.e. situational expectancies). What all of this suggests is that semantic processing can be shallow, but never so shallow that it does not matter what the word in question is. We suggest that the effects parallel the bonding/semantic-integration distinction discussed with respect to reference. If the superficial semantics of a word fit top-down constraints, then it is not processed in great detail. Such superficial semantics might be based on whether the word is relevant to the contextual domain (paralleling a "bond"). If the focal structure of the sentence, or other requirements of coherence, forces it, then the bond is used as a site of elaboration. The account implies that much processing is in fact shallow, yet comprehension clearly takes place. This is no dilemma if it is assumed that the normal structure of messages is tuned to shallow comprehension mechanisms. In summary, the evidence is for immediate or early shallow analysis, with very selective and often delayed elaborated interpretation.

Elsewhere, we have argued that the language comprehension process is designed to anchor utterances to knowledge of specific situations (Sanford & Garrod, 1981; Garrod & Sanford, 1983; 1990; Sanford & Garrod, 1994; Sanford & Moxey, in press). For instance, stereotypic knowledge about the treatment of head injuries or the banning of missiles should be accessed by the kinds of statements studied by Wason and Reich. Our working hypothesis is that it is this sort of mapping which enables messages to be understood, and that such mappings are primary in the sense that all else depends upon them (see Sanford, 1987, for a broader discussion). It therefore makes sense for the processor to identify such background-knowledge anchors as early as possible. But because there is no rule about what it takes in an input to identify an appropriate piece of situation-specific knowledge, there is at present no way of clearly specifying how immediacy should apply at this level of analysis.
Final Conclusion

This paper started out promising general conclusions about incrementality in language processing based on evidence from the study of discourse comprehension. Two general points emerge, one methodological and the other more theoretical. The first, methodological, point is
that evidence for immediacy does not always license the stronger conclusion about incrementality in processing. As we have seen, comprehension can occur through immediate recovery of prestored information without necessarily requiring the information to be combined at that point with what has come before. So evidence of immediacy does not necessarily constitute evidence for incrementality. The second, theoretical, point comes from the observation that human language processing may often be incomplete and that partial interpretation can occur at almost any level of analysis. This has consequences for how one might want to evaluate an essentially incremental processing system. Most theories start out with the assumption that to understand something is to give it a full interpretation, and the question of incrementality simply concerns the order in which this full interpretation is built up relative to the order of the expressions in an utterance. Hence evidence that a listener makes partial commitments is taken as evidence against a strictly incremental processing system. But of course this assumes that those partial commitments will at some time be converted into full commitments, and as we have seen this may not always be the case.
References

Barton, S.B. & Sanford, A.J. (1993) A case study of anomaly detection: Shallow semantic processing and cohesion establishment. Memory and Cognition, 21(4), 477-487.

Bever, T. (1970) The cognitive basis for linguistic structures. In J.R. Hayes (Ed.) Cognition and the Development of Language. New York: Wiley. 279-360.

Cloitre, M. & Bever, T.G. (1988) Linguistic anaphors, levels of representation and discourse. Language and Cognitive Processes, 3, 293-322.

Dell, G.S., McKoon, G. & Ratcliff, R. (1983) The activation of antecedent information during the processing of anaphoric reference in reading. Journal of Verbal Learning and Verbal Behaviour, 22, 121-132.

Ehrlich, K. & Rayner, K. (1983) Pronoun assignment and semantic integration during reading: Eye-movements and immediacy of processing. Journal of Verbal Learning and Verbal Behaviour, 22, 75-87.

Erickson & Mattson (1981) From words to meaning: A semantic illusion. Journal of Verbal Learning and Verbal Behaviour, 20, 540-552.

Fraurud, K. (1990) Definiteness and the processing of NPs in natural discourse. Journal of Semantics, 7, 395-434.

Frazier, L. & Rayner, K. (1982) Making and correcting errors during sentence comprehension: Eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14, 178-210.

Frazier, L. & Rayner, K. (1987) Resolution of syntactic category ambiguities: Eye movements in parsing lexically ambiguous sentences. Journal of Memory and Language, 26, 505-526.

Frazier, L. & Rayner, K. (1990) Taking on semantic commitments: Processing multiple meanings vs multiple senses. Journal of Memory and Language, 29, 181-201.

Garrod, S. (1994) Resolving pronouns and other anaphoric devices: The case for diversity in discourse processing. In C. Clifton, Jr., L. Frazier and K. Rayner (Eds.) Perspectives on Sentence Processing. Englewood, N.J.: LEA. 339-359.

Garrod, S., Freudenthal, D. & Boyle, E. (1994) The role of different types of anaphor in the on-line resolution of sentences in a discourse. Journal of Memory and Language, 33, 39-68.

Garrod, S., O'Brien, E.J., Morris, R.K. & Rayner, K. (1990) Elaborative inferencing as an active or passive process. Journal of Experimental Psychology: Learning, Memory and Cognition, 16, 250-257.

Garrod, S. & Sanford, A.J. (in press) Resolving sentences in a discourse context: How discourse representation affects language understanding. In M. Gernsbacher (Ed.) Handbook of Psycholinguistics.

Garrod, S. & Sanford, A.J. (1982) Bridging inferences in the extended domain of reference. In A. Baddeley & J. Long (Eds.) Attention and Performance IX. Hillsdale, N.J.: LEA. 331-346.

Garrod, S. & Sanford, A.J. (1985) On the real-time character of interpretation during reading. Language and Cognitive Processes, 1, 43-61.

Garrod, S. & Sanford, A.J. (1990) Referential processing in reading: Focusing on roles and individuals. In D.A. Balota, G.B. Flores d'Arcais & K. Rayner (Eds.) Comprehension Processes in Reading. Hillsdale, N.J.: Lawrence Erlbaum Associates. 465-486.

Gernsbacher, M.A. (1989) Mechanisms that improve referential access. Cognition, 32, 99-156.

Gordon, P.C., Grosz, B.J. & Gilliom, L.A. (in press) Pronouns, names and the centering of attention in discourse. Cognitive Science.

Johnson-Laird, P.N. (1974) Experimental psycholinguistics. Annual Review of Psychology, 25.

O'Brien, E.J., Shank, D.M., Myers, J.L. & Rayner, K. (1988) Elaborative inferences during reading: Do they occur on-line? Journal of Experimental Psychology: Learning, Memory and Cognition, 14, 410-420.

Perfetti, C.A. (1990) The co-operative language processors: Semantic influences in an autonomous syntax. In D.A. Balota, G.B. Flores d'Arcais & K. Rayner (Eds.) Comprehension Processes in Reading. Hillsdale, N.J.: Lawrence Erlbaum Associates. 205-228.

Reder, L.M. & Kusbit, G.W. (1990) Locus of the Moses illusion: Imperfect encoding, retrieval, or match? Journal of Memory and Language, 30, 385-406.

Sanford, A.J. (1985a) Aspects of pronoun interpretation. In G. Rickheit & H. Strohner (Eds.) Inferences in Text Processing. North Holland: Elsevier Science Publishers. 183-205.

Sanford, A.J. (1985b) Pronoun reference resolution and the bonding effect. In G. Hoppenbrouwers, P. Seuren and A. Weijters (Eds.) Meaning and Lexicon. Dordrecht: Foris.

Sanford, A.J. & Garrod, S. (1981) Understanding Written Language: Explorations in Comprehension Beyond the Sentence. Chichester: J. Wiley & Sons.

Sanford, A.J. & Garrod, S. (1989) What, when and how: Questions of immediacy in anaphoric reference resolution. Language and Cognitive Processes, 4, 263-287.

Sanford, A.J., Garrod, S., Lucas, A. & Henderson, R. (1983) Pronouns without explicit antecedents. Journal of Semantics, 2, 303-318.

Sanford, A.J. & Garrod, S. (1994) Selective processing and text comprehension. In M. Gernsbacher (Ed.) Handbook of Psycholinguistics. New York: Academic Press. 699-717.

Sanford, A.J., Moar, K. & Garrod, S. (1988) Proper names as controllers of discourse focus. Language and Speech, 31, 43-56.

Sanford, A.J. & Moxey, L.M. (in press) Aspects of coherence in written language: A psychological perspective. In T. Givon and M. Gernsbacher (Eds.) Coherence in Spontaneous Text. Philadelphia, PA: John Benjamins.

Shillcock, R. (1982) The on-line resolution of pronominal anaphora. Language and Speech, 24, 4.

Vonk, W., Hustinx, L. & Simons, W. (1992) The use of referential expressions in structuring discourse. Language and Cognitive Processes, 7, 301-335.

Wason, P. & Reich, R.S. (1979) A verbal illusion. Quarterly Journal of Experimental Psychology, 31, 591-597.
6 Incremental Syntax in a Holistic Language Model

David Tugwell

1 Models and Data  124
2 Syntactic Formalism: Date-Stamped Information States  125
3 Analysis of Syntactic Constructions  127
3.1 Wh-Questions  127
3.2 Raising  135
3.3 Tough Movement  138
3.4 Passive  140
3.5 Further Wh-Constructions  141
3.6 Contrasting Clause Element Order in Germanic Languages  144
3.7 Left-Branching  145
3.8 Free Word Order  146
3.9 Coordinate Structures  147
4 Deriving a Language Model  151
5 Implementation  154
6 Summary  155

Edinburgh Working Papers in Cognitive Science, Vol. 11: Incremental Interpretation, pp. 123-156. D. Milward and P. Sturt, eds. Copyright (c) 1995 David Tugwell.
Abstract

This paper broadly sketches an approach to constructing a practical system of natural language interpretation on the basis of pre-analyzed texts. The language model employed in this system is termed "comprehensive" in that it draws on statistical information from various aspects of language experience and is not restricted to modelling linguistic competence. The formal representation of the interpretation of a text constitutes a word-by-word build-up of the deep dependency relations of the sentence. I shall discuss the question of how a Markovian probabilistic language model can be derived from a hand-parsed corpus and how this language model may then be employed in a practical processing system. The main focus of the paper, however, will be on the adequacy of the incremental language model for syntactic description, and in particular how it may be able to explain and make predictions about much-studied linguistic phenomena such as that-trace and "subjacency violation" effects, tough-movement and so forth. One area where dynamic grammars have been argued to offer advantages is in the analysis of coordinated constructions, and these also will be briefly looked at. This research is very much work in progress and many aspects of the model will need to be fleshed out or possibly reformulated. I only hope that the paper may be of some interest as an alternative approach to language modelling and the study of syntax. All comments and criticisms will be most gratefully received. (This research was financed by a research studentship from the ESRC. Many thanks for comments and discussion to Chris Brew, Matt Crocker and Patrick Sturt.)
1 Models and Data

I shall assume that an interesting and useful process to model is that of the human interpretation of natural language. I shall further assume that a useful component in such a model would be a two-place function which assigns a probability to every input/interpretation pair, where the input is taken from the countably infinite set of word strings (it need not be assumed that the words belong to a finite vocabulary, nor need the strings be sentences) and the interpretation is taken from the countably infinite set of possible representations in whatever is our chosen formalism. It is this probabilistic function which will be termed the language model. (There is no attempt being made here to model the process of language acquisition. It seems feasible to develop a theory about how birds fly before having a theory of how they learn to fly, but not vice versa.)

To estimate the values of this language model, I propose to employ the methodology of Data-Oriented Parsing (cf. Bod, 1992). In this approach, instead of any hand-drawn rules, the model relies entirely for its knowledge of language on information derived from a pre-analyzed (i.e. hand-parsed) corpus. The model then has the potential to acquire not only knowledge of the language at the abstract level of the combination potentials of "word classes", but also information about the frequencies of particular constructions, the frequencies of individual words, and about a wide range of collocational patterns. The adopted methodology places the strong constraint that there is no way in which the linguist can intervene to make sure the model makes the right predictions by postulating their own constraints, rules or conventions. The only input the linguist makes to the model is in parsing the test data, which is naturally-occurring "communicative" language. The speculation is that if the system is able to generalize this finite amount of test data in a suitable way, it will result in a language model which can make predictions about any input string whatsoever. Although judgements on the (degree of) deviancy of marginally usable isolated sentences cannot be put into the model, these limiting cases of linguistic expression will be extremely useful as a test for the model. This paper will mainly consist of thought experiments on how we could expect the model to behave in these limiting conditions.
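As a rough illustration of how such a probabilistic function might be estimated from a pre-analyzed corpus, consider the following minimal sketch in Python. It assigns probabilities by simple relative frequency; the corpus format, the notion of a "transition type" and all of the names here are illustrative assumptions, not definitions from this paper (which deliberately leaves the formal details open):

    from collections import Counter

    def estimate_transition_model(parsed_corpus):
        """parsed_corpus: an iterable of derivations, each a sequence of
        (state_abstraction, word, action) triples taken from the
        hand-parsed analyses. Returns a relative-frequency estimate of
        the probability of an action given a state and word."""
        joint = Counter()
        context = Counter()
        for derivation in parsed_corpus:
            for state, word, action in derivation:
                joint[(state, word, action)] += 1
                context[(state, word)] += 1

        def probability(state, word, action):
            seen = context[(state, word)]
            if seen == 0:
                return 0.0   # unseen context: a real model would smooth here
            return joint[(state, word, action)] / seen

        return probability

Note that, in keeping with the methodology, nothing in this estimate is hand-stipulated: whatever the model "knows" about good and bad transitions comes entirely from the counts.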
2 Syntactic Formalism: Date-Stamped Information States

The first task in the Data-Oriented Parsing approach is to come up with a formalism which can be used as the representation of syntactic interpretation in the language model. Obviously, we cannot expect to rely on a competence grammar for this task, as they are designed for a different purpose and with a different methodology in mind. What we might require from an ideal formalism for this task is that it:

1. Can represent the basic dependency structure (deep structure) of language; put simply: who did what to who, when, where, why, how, etc.?
2. Has a wide coverage of data: it must be able to represent interpretations of all text allowing of an interpretation, no matter how fragmentary or "ill-formed".
3. Is easily coupled with a simple probabilistic model, the one in mind here being a Markov model.
4. Should allow as incremental an interpretation as possible (especially important if the language model is to be a component in a speech recognition system).
5. Is as non-arbitrary as possible (i.e. the same interpretation of a string should not be representable in more than one way in the formalism) and as simple to use as possible.

In an attempt to satisfy all of these constraints, I adopt a formalism which represents the word-by-word incremental growth of the information state which is the interpretation of the sentence. (This seems to be essentially the same approach as Hausser (1989) adopts in his Left-Associative Grammar, in that it takes the "time linearity" of language as being its fundamental characteristic and foregoes a hierarchical surface structure. Hausser uses a conventional hand-drawn rule-based methodology, however, rather than a corpus-trained probabilistic one, and there do seem to be other differences in detail. The information states described here correspond to Hausser's "semantic hierarchies".) The information state is represented as a hierarchical feature structure (its precise formal specification is yet to be finalized, although this should have little bearing on the overall approach). The derivation of the structure is explicitly and unambiguously represented by each piece of information being marked with the state (i.e. the word in the string) at which it was added. (Information may only be added, not removed.) The feature structure may be graphically represented as a tree structure, as for the simple sentence in Figure 1, although it should be remembered that this is not a parse tree and the order of features is purely arbitrary.
[Figure 1 (tree diagram): a past-tense s headed by give, with date-stamped daughters n(nom, sbj) "the man", n(ind_obj) "the dog", n(obj) "a bone" and n(time) "today".]
Figure 1: "The man gave the dog a bone today"

In this diagram, atomic features are represented as subscripts, while feature-bearing features are represented as daughter nodes. The system of features is still far from finalized, but it includes the major phrasal categories (sentence, noun phrase, prepositional phrase, adjectival phrase, x = adverbial phrase), hd (head of a phrase), the abstract case nom (nominative), various "function-roles" (identifying subject, complements and adjuncts), tenses, right down to features for individual lexemes. ("Empty words" such as determiners and pronouns should really be broken down into features, but I shall represent them as having a "lexical value" in this paper to make the structures more convenient to read.) The "date-stamp" superscript shows at which state in the derivation the feature was added, and in the diagram is shared by all atomic features of a particular node unless indicated to the contrary. (For example, in Figure 1 the tense feature past was added to the matrix s at state 3.) The information structure at a given word in the sentence is easily recovered from the completed information structure by stripping off all information added at subsequent states. It can be seen then that, using an equivalent feature representation (and temporarily ignoring date-stamping), the word-by-word formation of Figure 1 is:

"The"               s: {n: {nom, det:the}}
"The man"           s: {n: {nom, det:the, hd:man}}
"The man gave"      s: {past, hd:give, n: {nom, sbj, det:the, hd:man}}
"The man gave the"  s: {past, hd:give, n: {nom, sbj, det:the, hd:man}, n: {ind_obj, det:the}}
...
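The same build-up can be rendered in Python as nested, date-stamped feature structures. This is only one possible encoding, under assumptions of our own (each feature is stored with the state at which it was added); it is not the paper's formal specification, which is explicitly left unfinalized:

    def add_features(node, state, **features):
        """Add features to a node, date-stamping each with the state
        (word position) at which it was added. Information may only
        ever be added, never removed or changed."""
        for name, value in features.items():
            assert name not in node, "information may only be added"
            node[name] = (value, state)

    # Word-by-word formation of "The man gave the dog a bone today":
    s, subj = {}, {}
    add_features(subj, 1, nom=True, det="the")         # "The": a nominative NP opens
    add_features(s, 1, n1=subj)                        # ...inside the sentence node
    add_features(subj, 2, hd="man")                    # "man"
    add_features(s, 3, past=True, hd="give")           # "gave": tense and head of s
    add_features(subj, 3, sbj=True)                    # ...and the NP becomes subject
    ind_obj = {}
    add_features(ind_obj, 4, ind_obj=True, det="the")  # "the": indirect object opens
    add_features(s, 4, n2=ind_obj)
    # ... "dog", "a bone" and "today" continue in the same fashion.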
Turning back to our original list of requirements, we can ask how well the incremental information-state formalism fulfills them.

1. The formalism does represent deep grammatical relations (in fact it represents these in contrast to surface relations).
2. It has a broad coverage: if a string is in any way interpretable, then this "deep level" interpretation can be represented in the formalism. There are no "rules" as such as to how an analysis can be made. The job of the model is to derive probabilistic rules from the data; they are not built into the formalism.
3. The formalism allows us to recover the information state after the addition of each word. If we can calculate the probability of any particular word describing a transition between any two information states, then we can adopt the Markovian assumption to calculate the probability of the whole derivation for a string of words. This means simply that the probability of the derivation will be the product of the probabilities of the transitions that formed it.
4. Incrementality of processing is made possible, both by the incremental nature of the formalism and by the fact that using the Markov model we will always be able to calculate the probability of the "analysis so far" at any particular stage in the parse. (A minimal sketch of these two points is given at the end of this section.)
5. The non-arbitrariness of the formalism is rather harder to demonstrate without evidence of practical use. It is certainly true that if one ignores the date-stamping, then the logical structure of the interpretation is represented straightforwardly and unambiguously. (On the other hand, decisions about the state at which particular features are added to the derivation may sometimes be difficult to make, and in such cases we may have to rely on conventions of some sort.)

As mentioned above, the formalism does not directly represent surface structure. It is, however, the case that the date-stamping showing the derivation of the deep structural relations in some sense carries the essential information which the hierarchical surface structure does in grammars which use a level of surface (phrase) structure. Even from those working within this paradigm, doubts have been cast as to the necessity of surface structure:

"Perhaps phrase structure itself ... is the nonobservable linguistic construct that enjoys the widest acceptance in current theoretical work. Surely the evidence for it is far less direct, robust, and compelling than that for phonological structure..., logical predicate-argument structure ... or underlying grammatical relations. But for all that a theory that successfully dispenses with a notion of surface constituent structure is to be preferred (other things being equal, of course), the explanatory power of such a notion is too great for many syntacticians to be willing to relinquish it." (Pollard & Sag 1994, pp. 9-10)

In the remainder of this paper, I shall attempt to argue that not only is it possible to get by without surface structure, but that doing so may offer clear and satisfying explanations of many syntactic phenomena.
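As the promised illustration of points 3 and 4 (under the same assumed dictionary encoding as the earlier sketch, and again with names of our own choosing), the Markovian probability calculation and the recovery of earlier information states might look as follows. Note that the simple stripping function glosses over the extra bookkeeping needed for lowered elements, which keep their original date-stamps:

    import math

    def derivation_probability(transition_probabilities):
        """Markovian assumption: the probability of a whole derivation is
        the product of the probabilities of the transitions that formed
        it. Summing logs avoids numerical underflow on long strings."""
        return math.exp(sum(math.log(p) for p in transition_probabilities))

    def state_at(node, n):
        """Recover the information state after word n by stripping all
        information date-stamped later than n. Nodes map feature names
        to (value, stamp) pairs; embedded phrases are recursed into."""
        return {name: ((state_at(value, n) if isinstance(value, dict) else value),
                       stamp)
                for name, (value, stamp) in node.items()
                if stamp <= n}

Because the probability of the "analysis so far" is just the product over the transitions seen so far, the same calculation supports fully incremental scoring.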
3 Analysis of Syntactic Constructions

3.1 Wh-Questions

The central issue in accounts of questions involving wh-movement is how to relate the displaced wh-phrase with its canonical position in the sentence. This is typically achieved using "traces" or some other manner of co-reference to link the surface realisation and the canonical position. (An exception is Blevins (1994), who argues for there being no movement of wh-phrases, although as he retains phrase structure this leads to unorthodox parse trees with multiply crossing branches.)
Figure 2 shows the derivation of a simple wh-question in the incremental dependency formalism.

[Figure 2 (tree diagram): a wh, inverted, past-tense s headed by put, with date-stamped daughters n(nom, sbj) "you", n(wh, obj) "what" and p(loc) "in the tea".]

Figure 2: "What did you put in the tea?"
s`wh;inv ;past ! HH`````` ! ! HHH ````` ! !4!! `` 7 H hdput n3nom;sbj n5obj p HH Hwh;loc HHH HH 2
2
4
hd3you
det5the hd6tea
hd7in
n1wh;obj 7 hd1what
Figure 3: "What did you put the tea in?"

The wh-phrase waits without case or role as before, but when the preposition is met, the wh-phrase is removed as a feature of the matrix clause and reinterpreted as a feature of the locative prepositional phrase, where it is assigned a function-role. This type of reinterpretation will be referred to as "lowering" and will feature largely in accounts of any type of movement. It remains the case that earlier states in the derivation are unambiguously recoverable by removing information added later. (For example, since the wh-phrase in Figure 3 is older than the prepositional phrase that contains it, we know that before state 7 it must have been a feature of the matrix clause. We will not in general allow lowering of elements into pre-existing phrases, as this would result in ambiguity as to the state at which they were lowered and hence ambiguity in the sequence of transitions.) The alternatively worded, but semantically identical, sentence in Figure 4 results in an information structure which is entirely the same as the previous one, except for the derivational information recorded in the date-marking. The language model will therefore still be able to distinguish these structures and will in general assign them differing probabilities.
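The lowering operation itself can be sketched as follows, again under the assumed dictionary encoding and with hypothetical names: the waiting element is removed from the matrix clause and re-attached inside a phrase created at the current state, keeping its original date-stamp (which is how one can later tell that it is older than the phrase containing it):

    def lower(matrix, feature_name, new_phrase, new_phrase_name, state):
        """Reinterpret a waiting element (e.g. a caseless wh-phrase) as a
        feature of a newly created embedded phrase. The element keeps
        its original stamp; the new phrase is stamped with the current
        state. Lowering only ever targets phrases created at this very
        state, never pre-existing ones, so the sequence of transitions
        remains unambiguous."""
        value, original_stamp = matrix.pop(feature_name)
        new_phrase[feature_name] = (value, original_stamp)
        matrix[new_phrase_name] = (new_phrase, state)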
[Figure 4 (tree diagram): the same information structure as Figure 3, but with date-stamps reflecting the order "In what did you put the tea?".]
\That-trace" Eect Now that the basic approach to wh-questions has been established, we can take a look at a number of syntactic phenomena which have been the subject of much investigation. We start with the \that-trace" eect, in which the following set of data has to be accounted for: (5) (6) (5a) (6a)
Who did he say liked olives? *Who did he say that liked olives? Who did he say she liked? Who did he say that she liked?
Why does the presence of a complementizer at the front of the embedded clause with a subject gap result in an unacceptable sentence, when it does not with an object gap? We shall first examine the derivation of the acceptable sentence with a subject gap and no complementizer in Figure 5.

[Figure 5 (tree diagram): a wh, inverted, past-tense s headed by say, with n(nom, sbj) "he" and an embedded s(wh, obj, past) headed by like, containing the lowered n(wh, nom, sbj) "who" and n(obj) "olives".]

Figure 5: "Who did he say liked olives"

In sentence (5), when we come to the fourth state (i.e. upon meeting the fourth word, say), a finite sentence complement is formed, and the wh-phrase is moved into it and marked with nominative case. The training data will prove this to be a typical transition for all verbs which take a finite sentential complement, and it will occur whenever there is a subject gap in an embedded clause, whether due to wh-movement as here, topicalization, relative movement and so forth.

The derivation of sentence (6), however, cannot go through in the same manner, as Figure 6 shows. Suppose the same actions occur at the fourth word say and an embedded clause is formed; then at the next state the complementizer that cannot be added to the embedded finite clause, as it is always exclusively clause-initial (indeed its only function is to mark the beginning of a finite clause). If, on the other hand, the caseless wh-phrase is lowered into the newly-formed embedded clause at state 5 (as would happen in the case of an object gap such as in sentence (6a)), then there will be no way the wh-phrase can acquire nominative case and become subject of the finite verb. Neither the complementizer nor the verb will ever be involved in transitions assigning case. It is therefore unavoidable that at least one of the transitions involved in the derivation in Figure 6 will be "bad", i.e. unseen or vanishingly rare in the training corpus. Given that the Markov model we are assuming calculates the probability of the whole interpretation as the product of the probabilities of the individual transitions, the probability assigned to the whole structure will in consequence also be very low.
[Figure 6 (tree diagram): as Figure 5, but with comp(that) added at state 5 in the embedded clause, leaving the lowered wh-phrase "who" with no way to acquire nominative case.]

Figure 6: *"Who did he say that liked olives"

To recap then, we can say that the reason sentence (6) cannot go through is that by the time we get to that, the embedded clause has already been formed. It is therefore bad for much the same reason that any sentence with a complementizer in non-clause-initial position is bad, so the following two sentences share the same derivational infelicity:

(6b) *Mary, he said that liked olives.
(6c) *He said Mary that liked olives.
There is, however, nothing in the explanation given for the infelicity of the "that-trace" construction in Figure 6 which would rule out the derivation of its echo-question variant shown in Figure 7.
[Figure 7 (tree diagram): a past-tense echo-question s headed by say, with comp(that) initiating the embedded clause and the wh-subject "who" produced in situ with nominative case.]
Figure 7: "He said that who liked olives?"

All the words here are associated with transitions of commonly-seen types. The that-complementizer initiates the finite clause; the subject who is produced ready-supplied with nominative case, as it would be in initial subject position. Hence, notwithstanding the slight markedness associated with all echo questions, there is no jarring effect produced as there is in (6). Happily, the explanation given for that-trace infelicity predicts without further adjustment the following set of data:

(5b) Who did he think that Jane said liked olives?
(5c) *Who did he think that Jane said that liked olives?
The reader can confirm this on the basis of the previous examples. Evidence that this approach is on the right track comes from consideration of the co-ordination data:

(5d) *Which man does Jane think [[is awful] but [Mary wants to marry]]?
(5e) Which man does [Jane think is awful] but [Mary want to marry]?
(5f) *Which man does Jane think [[Mary wants to marry] but [is awful]]?
(5g) Do you know which man [Jane thinks is cool], but [Mary doesn't want to marry]?
(5h) Which man does [Mary want to marry], but [Jane think is awful]?
(5i) Which man does Jane think [[will marry Mary], but [could do better]]?
(5j) *Which man does Jane know [[will marry Mary], but [her mother thinks is a disaster]]?
The picture that emerges from these examples can be summarized as:

1. Whenever a finite clause beginning with a subject gap co-ordinates with anything else (i.e. a clause with an object gap, or even one with an embedded subject gap, as in (5j)), the result is infelicitous.
2. Clauses with object gaps may combine with clauses with subject gaps as long as these are embedded inside the conjunct.

It is not the place here to set out the approach to coordination in the model in any great detail, but suffice it to say that it essentially involves the idea of "rewinding" the derivation to the state previous to where the first conjunct begins and then continuing from that state by adding on the second conjunct (a sketch follows below). Since wh-phrases wait as caseless features whether they will eventually become subject or object gaps, it should be clear that the model predicts the second of these points, i.e. the coordination of phrases with subject and object gaps, simply because they are not distinguished until the gap site itself. But how does the model then explain the first observation, that initial subject-gapped phrases are impossible to coordinate with anything but their own kind? It is only necessary to remember that the derivation of a subject-gapped clause requires that the clause be initiated and nominative case assigned on the transition corresponding to the matrix verb before the gap. So, for example in sentence (5d), at the fifth state (think) there will already be an embedded clause formed with a nominative wh-noun phrase lowered into it. If we retrace our steps to state 5 and try to continue with an object-gapped (or even an embedded subject-gapped) conjunct, there will be no felicitous continuation.
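The rewinding idea can be rendered schematically as follows, as an illustration only: the representation is the hypothetical one from the earlier sketches, and the mechanism is not spelled out in detail in the paper.

    def rewind(node, n):
        """Strip all information date-stamped later than n, recovering
        the information state as it stood after word n."""
        return {name: ((rewind(value, n) if isinstance(value, dict) else value),
                       stamp)
                for name, (value, stamp) in node.items()
                if stamp <= n}

    def coordinate(first_conjunct_result, conjunct_start, second_conjunct):
        """Rewind the derivation to the state just before the first
        conjunct began, then continue from that state word by word with
        the second conjunct, given as a list of transition functions,
        each mapping an information state to its successor."""
        state = rewind(first_conjunct_result, conjunct_start - 1)
        for transition in second_conjunct:
            state = transition(state)
        return state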
Who or Whom?

It will be noted that I have been assuming a variety of English which uses who in all positions, and have not treated varieties where whom is used in objective positions. An analysis of the use of who(m) in these latter varieties, however, does appear to lend weight to one side of an old dispute which centres on the relative correctness of the following contrastive pair:

(5k) I met the woman who he thought liked olives.
(5l) I met the woman whom he thought liked olives.
The general consensus among traditional grammarians and prescriptivists is that the use of the nominative who is correct in such sentences, since the subject gap in the embedded finite clause is in a position where one would use the nominative. The use of whom is generally viewed as a "hypercorrect" mistake (for discussion, see Stuurman 1990, pp. 261-266). The opposing view, however, that the use of whom here is essentially correct, is persuasively argued by Otto Jespersen (1927, pp. 197-201). His main source of evidence for this view is the preponderance of whom in this position by earlier writers (for whom the who/whom distinction was a part of their everyday spoken language, and whose decisions can therefore be relied on as being more instinctual rather than having been drilled into them at school).

It will be seen from the previous analysis of sentence (5) that the present approach does not require that the interrogative pronoun have nominative case in such positions (indeed, as the analysis of tough-movement will show, it explicitly requires that it not have nominative case). The analysis thus supports Jespersen's position that if a version of English is assumed that restricts who to nominative positions, then whom is correct in the above examples. (Jespersen's own rationalization for the infelicity of sentences akin to our sentence (5l) is that "The form whom is used because... the speech-instinct would be bewildered by the contiguity of two nominatives, as it were two subjects in the same clause." (p. 199). This is in essence the same argument as that proposed in the present paper; the section on tough-movement gives further evidence for this "multi-nominative bewilderment".) There is of course nothing "incorrect" in modern English about who in such examples, but they are necessarily analyzed according to the present system as (initially) non-nominative.
Subjacency Violations

The second set of well-trodden data I will look at involves "subjacency-violation" type phenomena. For example, how might the language model characterize and explain the striking infelicity of sentence (8)?

(8) *What did you meet someone who liked?
A possible analysis is given in Figure 8:

[Figure 8 (tree diagram): a wh, inverted, past-tense s headed by meet, with n(nom, sbj) "you" and n(obj, wh) "someone" modified by a relative clause headed by like, into which the wh-phrase "what" has been lowered.]
Figure 8: *"What did you meet someone who liked?"

Once again, as we are assuming a Markovian probabilistic language model, the problem must be traced back to at least one of the words having a transition in the derivation of a type uncommonly associated with it. The most likely suspect here would appear to be the transition at the fifth word, the relative pronoun who. (That it is the wh-lowering at who that is infelicitous, rather than the initial wh-lowering at someone, can be seen in the contrast between (9a) What did you meet a connoisseur of? and (9b) *What did you meet a connoisseur who liked?: in (9a), wh-lowering occurs at connoisseur and yet it is still felicitous. The felicity of wh-lowering into nouns seems to be dependent to some extent on the particular noun in question, hence the "picture noun" phenomena. The infelicity of (9b) must therefore be put down to the transition at the relative pronoun.) The actions at this transition are:
- A relative clause modifier, containing a nominative relative, is added to the noun phrase.
- The caseless wh-phrase is lowered from the noun phrase into the clause.

The first of these actions will be absolutely typical for a subject relative pronoun, but the second action will be very seldom associated with it in any successful sentence. (Patrick Sturt (p.c.) suggests that such a restriction may help the communicative efficiency of the language: in the sentence (9c) Who did you advise the detective who interviewed you to check up on?, the language model knows not to attempt vainly to lower the wh-phrase into the relative clause looking for the wh-gap and end up "garden pathing". If the language did commonly employ sentences such as (8), it might result in an increase in the number of garden paths. A similar argument can be made for avoiding extraction from subjects, as this would also involve the parser making a choice, whether to lower into the subject or not, from which it might not be able to recover.) The model should therefore give a very low probability to this kind of transition, and thus the probability of the whole derivation should be low.

It might be thought that a simpler explanation for the badness of sentence (8) would be to conjecture restrictions on the features wh (for wh-phrases) and rel (for relatives) co-occurring on any one phrase. Evidence that there is no such restriction is provided by the fact that there is nothing wrong with the echo-question variant of the sentence, whose derivation is shown in Figure 9. As the wh-phrase is not lowered into the relative clause, but rather produced in situ, the previous problem with the aberrant transition does not arise. (This applies similarly to "quiz questions": compare (9d) Q: In Hamburg, Mozart met the composer Schulz, who had written an opera which featured which of his favourite tunes? A: The Nutcracker Waltz? with (9e) Q: Which of his favourite tunes did Mozart meet in Hamburg the composer Schulz who had written an opera which featured? A: !?)
[Figure 9 (tree diagram): a past-tense echo-question s headed by meet, with n(obj, wh) "someone" modified by a relative clause headed by like, in which the wh-phrase "what" is produced in situ as object.]
Figure 9: "You met someone who liked what?"
It is important to note that even if the training data contains no examples corresponding to Figures 8 or 9 (as is quite likely), the information it has about transitions in actually-occurring sentences should lead it to categorize the former as bad, without finding fault with the latter. (It will have had to see some examples of echo questions or multiple questions, though, otherwise it would think that wh-words occurred only sentence-initially.)
3.2 Raising

We turn now to the phenomenon of "subject-raising", where an element turns up as the apparent surface subject of a clause, although it is the "logical" or "deep structure" subject of an embedded clause. How is the present formalism able to deal with this, representing as it does only the latter "deep" relations in a direct fashion?

[Figure 10 (tree diagram): a present-tense s headed by appear, with a clausal subject s(inf, sbj) headed by like, into which the nominative n "John" has been lowered as subject.]
Figure 10: "John appears to like olives"

Figure 10 shows the analysis of a simple subject-raising sentence. The "surface subject", the nominative noun phrase John, is not marked as subject by the raising verb appears, but is instead lowered into the clausal subject. I will assume that non-finite clauses do not place requirements on the case of their subjects, unlike finite ones, and in particular do not object to nominative subjects. (The contrast between (10a) *I believe he to be a good singer and (10b) I believe him to be a good singer must then be accounted for by the transition for he in (10a) introducing a clause which is explicitly marked as finite, rather than by the non-finite verb not allowing a nominative subject.) I shall assume that copula constructions are also best analyzed as involving raising (or what is perhaps best described in the present formalism, from the opposite viewpoint, as "subject lowering"): in a canonical copula sentence (11), the nominative John is lowered into a small clause where it is marked as subject. (The copula examples are taken from Heycock, 1994.)

(11) John was the culprit.
[Figure 11 (tree diagram): a past-tense s headed by be, with a small-clause subject containing n(nom, sbj) "John" and n(pred) "the culprit".]
Figure 11: "John was the culprit"

(12) The culprit was John.

Sentence (12) is an "inverse copula construction", and of the many differences between the two, perhaps the most characteristic is their ability to appear in it-clefts:

(11a) It's John that's the culprit.
(12a) *It's the culprit that's John.

I shall assume that the propositional content of the two sentences is identical, but that in the canonical case the subject of the small clause has been raised, while in the case of the inverse construction the predicate has been raised, as in Figure 12.
[Figure 12 (tree diagram): a past-tense s headed by be, with a small clause whose subject "John" is added at state 4 and whose nominative predicate "the culprit" has been raised.]
Figure 12: "The culprit was John"

The transition that ensures that this predicate-raising derivation can go through is the third one, was, where an embedded small clause is created and the nominative noun phrase is lowered into it as its predicate. (Not all raising verbs would seem to be associated with such a "predicate lowering" transition, as the following contrast shows: (11b) John seems the natural successor; (12b) *The natural successor seems John.) As the "raised" nominatives must agree with the verb, whether they will end up as subject or predicate of the small clause, this also explains why agreement is always with the initial constituent. Note the contrast with fronted adjectives:
(11c) They are the problem.
(12c) The problem is them.
(11d) They are happy.
(12d) Happy are they.
Dummy-it

Continuing on the topic of raising verbs, the following analysis (Figure 13) is proposed where a dummy-it noun phrase appears as surface subject.

[Figure 13 (tree diagram): a present-tense s headed by appear, with the dummy n(it, nom) and a clausal subject s(sbj, pres) "that John likes olives".]
Figure 13: "It appears that John likes olives"

The dummy noun phrase it has nominative case, thus fulfilling the agreement requirements of the verb, but as appears is a raising verb it is not assigned the subject role. The finite clausal subject assumes the role of subject of appears. The dummy-it is left without a function-role and therefore plays no part in the interpretation of the sentence. The dummy noun phrase it can be lowered by "subject-lowering" in a fashion identical to other noun phrases, and in Figure 14 this occurs twice. (This example is due to Gazdar.) Considering that verbs like appear and seem typically take a clausal subject, something that remains to be explained is why sentences such as (13a) are impossible, especially when a finite clause subject is apparently possible in (13b):

(13a) *That John likes olives appears.
(13b) That John likes olives appears to be true.

It is sufficient to note, however, that in the present model raising verbs of this kind are never associated with transitions where a subject role is assigned (which is why they are able to function as subject-raising verbs). In (13a) it is therefore impossible for the initial that-clause to be assigned the subject role. In (13b), of course, the initial that-clause is lowered and made the subject of an embedded sentence.
[Figure 14 (tree diagram): a present-tense s headed by seem, with dummy-it lowered twice through embedded infinitival clauses headed by tend and bother, an experiencer phrase to John, and the finite clausal subject "that Fido runs".]
Figure 14: "It seems to John to tend to bother Mary that Fido runs."
3.3 Tough Movement

The treatment of tough-movement is similar in many respects to that of raising. The surface subject cannot be marked as subject of the tough-adjective, as the deep-level subject is in fact a clause. To take the example of sentence (15), it is not John who can be difficult, it is to understand John.

(15) John can be difficult to understand.
The nominative surface subject is therefore lowered into a non-subject position in the to-clause, where it can be subsequently role-marked or lowered as necessary. This is set out in Figure 15.

[Figure 15 (tree diagram): an s with modal can, headed by be, whose small clause has a clausal subject s(sbj, inf) headed by understand (with the nominative "John" lowered in as object) and the predicate adjective difficult.]

Figure 15: "John can be difficult to understand"

The associated attributive use of the adjective is analyzed as introducing a relative clause with a non-subject gap, as in Figure 16.
[Figure 16 (tree diagram): a past-tense s headed by be, with n(nom, sbj) "Julie" and a predicate noun phrase "a (very difficult) girl" modified by an infinitival relative clause headed by resist, into which the relative object has been lowered.]
Figure 16: "Julie was a very difficult girl to resist."

As was the case with subject-raising, the lowered noun phrase in these tough-movement constructions still bears the nominative case it needed to function as "surface subject" of the finite matrix clause. Given the point made earlier about Jespersen's "multi-nominative bewilderment", we might wonder if this might prove problematic were it to be subsequently lowered into another finite clause. The following pattern of data suggests that this might indeed be the case:

(15a) It is difficult to imagine John kissing Mary.
(15b) It is difficult to imagine John would kiss Mary.
(15c) John is difficult to imagine kissing Mary.
(15d) Mary is difficult to imagine John kissing.
(15e) *John is difficult to imagine would kiss Mary.
(15f) ?*Mary is difficult to imagine John would kiss.

The first two examples show that when the deep-subject clause appears in situ, the embedded clause (kiss(John, Mary)) can be either finite or non-finite. (15c) and (15d) show that both subject and object can be extracted from this non-finite clause, which is what we would expect, there being no "interference" due to the extracted item being nominative. Sentence (15e) shows that it is impossible to extract the subject from a finite clause. If we consider the discussion of wh-movement and how this derivation would have to go through, we will remember that the sentence-complement verb imagine would have to assign John nominative case. But John already has nominative case, so either John ends up with two 'nom' features or else the verb does not assign nominative case; in neither case will the transition be typical for the verb, and it should therefore have a low probability.
Sentence (15f) shows that, in addition, extraction from non-subject position of a finite clause is also blocked.24 Here we can note that after we reach the transition for John in the derivation, there will be two nominative noun phrases in the same clause, a situation which is generally a rarity in the language and which the probabilistic model should penalize to some extent. The point is brought out most clearly by contrasting the felicity of (15g), where an ordinary non-nominative wh-phrase has been extracted from the finite clause and the surface subject is dummy-it, with (15h), where the surface subject is the wh-phrase itself. The salient difference between the two is that the wh-phrase in the latter is nominative when the word imagine is reached.
(15g) Who is it difficult to imagine Mary would kiss.
(15h) *Who is difficult to imagine Mary would kiss.
24. Or at least bad to some extent; my intuitions here seem a little variable. Pollard & Sag (1994) have an example parallel to (15f) on p. 168 which is unstarred. But the two examples to come, (15g) and (15h), seem to show a robust contrast.
3.4 Passive
In the passive voice, we have once again a construction where the surface subject of a clause is not the logical subject. The difference between passive and raising constructions, however, is that the passivized head verb explicitly marks the surface subject with a function-role, as can be seen in figure 17.
Figure 17: "John was heavily criticized by Mary"
We can see that the nominative John is not given a function-role by was, and it is the passivized head verb that marks it explicitly as object. The role of subject is then assigned to the noun phrase fronted by the preposition by.25
25. It will be noted that the phrase headed by by is represented as a noun phrase, with the preposition a feature on that noun phrase (playing rather a similar role to a complementizer in a sentence). This seems the most natural analysis where the preposition is devoid of semantic content, as is the case here, and the function-role is more directly associated with the noun phrase itself. It also ties in nicely with the possibility of passivizing noun phrases out of certain prepositional phrases, as is shown in the next example.
Another question is to explain how the surface subject can be the deep subject of embedded non-finite clauses, as in:
(17a) Peter was thought to have won.
(17b) Peter was thought capable of winning.
This is achieved by the embedded clause forming at the transition for the passivized head verb and the surface subject being lowered into it to become its subject, as shown in figure 18.26 Here Peter is consecutively the surface subject of two passive verbs, and only receives its function-role at state 8. This derivation also shows how phrases may be passivized out of prepositional phrases, but only if the preposition is lacking in semantic content and the phrase can be given a function-role directly.
26. The passivized verb cannot explicitly mark it as subject of the embedded clause, but we may surmise that it must act as surface subject of this clause, as it has no competitors for the role.
Figure 18: "Who was Peter thought to have been shouted at by?"
The fact that passivization out of the subject position of finite clauses is impossible (as (17c) shows) can once again be traced back to the problem of Peter ending up with two nominative cases, one from its role as surface subject and a further one assigned by the sentence-complement verb thought.
(17c) *Peter was thought had won.
3.5 Further Wh-Constructions
Fused Relatives
In these constructions the relativizers have a position in the body of the noun phrase and start a relative clause at the same time. It should be noted that in sentence (20) the relative clause comes into being at the transition for whichever. As we saw previously, the complementizer must initiate the clause it is in, and therefore (20a) is impossible (unlike (20b), where the relative clause is initiated at that).
(20) I will buy whichever cake Mary wants.
Figure 19: "I want what Mary wants."
(20a) *I will buy whichever cake that Mary wants.
(20b) I will buy the cake that Mary wants.
Figure 20: "I will buy whichever cake Mary wants."
Embedded Questions
Indirect wh-clauses are treated in essentially the same way as wh-questions proper, the difference being that there is no wh feature instantiated on the clause itself, and this feature is consequently not propagated upwards as it would be in echo or multiple wh-questions. We are now in a position to look at an example involving wh-movement, tough-movement and an indirect question all at the same time, and this is shown in figure 22. A similar derivation will be possible for the corresponding sentence with an attributive adjective, as in:
(22a) Which problems are they difficult students to know what to say to about?
Figure 21: "I know who he said liked olives"
Figure 22: "Which problems are these students difficult to know what to say to about?"
It will be seen that at states 9 and 10 of the derivation in figure 22 there are three noun phrases in one clause, none of which has a function-role. There would seem to be quite a severe limitation on the number of roleless phrases that can be present at the same time in any successful communication. As will be shown in the following section, centre-embedded and cross-serial dependency constructions give rise to the same phenomenon of an indefinite number of roleless phrases, while multiply right-branching and left-branching structures do not. In right-branching structures the role of the embedded phrase will already have been assigned even before it is completed, while in multiply left-branching structures there is a "telescoping" effect, with a repeated lowering of already-formed structure (see the section after next for details).
3.6 Contrasting Clause Element Order in Germanic Languages
It may be instructive to take a brief look at how languages may differ markedly in their surface realisation of parallel deep structures, and how the incremental formalism can represent this not by way of differing hierarchical surface structures, but rather by differing derivation orders for the deep structure. The example chosen is a doubly embedded subordinate clause in English, Dutch and German.27
Figure 23: "... because I saw Cecilia help Henk feed the hippos."
?PPPPPP PP 1 ?7 ? 2 compreason hdzien nnom;sbj s8obj;inf PPPP PPPP hd1omdat hd2ik 8 3 s9obj;inf hdhelpen nsbj HHH 3 H hd 7
8
Cecilia
hd9voeren n4sbj 9
n5obj 9
PPPP
Figure 24: "... omdat ik Cecilia Henk de nijlpaarden zag helpen voeren."
27. The example sentences are taken from Steedman (1985), where they are further attributed to Riny Huybregts.
28. The Dutch and German examples are only tentative and await a fuller consideration of a wider range of data. Case-marking in these derivations, for example, is not treated.
Figure 25: "... weil ich Cecilia Henk die Nilpferde füttern helfen sah."
In both Dutch and German, the noun phrases wait without function-roles until the verbs are encountered at the end of the sentence; after the last of the noun phrases is encountered, the information state is as in figure 26 for both languages. After this they are successively lowered into the newly-forming clauses and given roles, but in contrasting orders. There is no need to store the waiting noun phrases in stacks, since they are already date-stamped, and so the most recent (in the case of German) and the most ancient (for Dutch) can be identified and appropriately combined.
Figure 26: "... omdat ik Cecilia Henk de nijlpaarden"
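To make the date-stamping point concrete, the following is a minimal sketch, entirely my own illustration rather than the paper's implementation, in which the waiting roleless phrases of figure 26 are represented as (state, phrase) pairs and a verb transition simply selects by time stamp:

    # Waiting roleless noun phrases from the state in figure 26,
    # tagged with the state at which each entered the derivation.
    waiting = [(3, "Cecilia"), (4, "Henk"), (6, "de nijlpaarden")]

    def select_np(waiting, order):
        """Pick the waiting noun phrase the next verb transition combines
        with: the most recent for German-style centre-embedding, the most
        ancient for Dutch-style cross-serial dependencies."""
        if order == "recent":
            return max(waiting, key=lambda np: np[0])
        return min(waiting, key=lambda np: np[0])

    print(select_np(waiting, "recent"))   # (6, 'de nijlpaarden'): German 'fuettern' combines first
    print(select_np(waiting, "ancient"))  # (3, 'Cecilia'): Dutch 'zag' lowers the oldest phrase

No stack discipline is needed: the date-stamps alone recover both the German and the Dutch combination orders.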
3.7 Left-Branching
Multiply left-branching structures are characterized by a repeated reinterpretation of constituents already formed, giving a kind of "telescoping" effect. However, as each transition is of a common type (and all phrases have function-roles), no sudden drop in acceptability is predicted.
Figure 27: "Whose dog does Peter's friend's sister feed?"
3.8 Free Word Order
In languages with a highly developed and explicit case system, the order of constituents typically plays less of a role in syntax and often expresses discourse functions such as focus. Although free ordering is often restricted to the relative ordering of constituents, it can also extend to scrambling at the word level. In the interests of effective communication this is usually limited, but when other concerns are more important (such as poetic style), word order can be bewilderingly free, as the following extract of Latin verse demonstrates.29
29. Horace (Carmina, Odes I, 5), quoted in Ross (1967), p. 50.
Figure 28: "Quis multa gracilis te puer in rosa perfusus liquidis urget odoribus grato, Pyrrha, sub antro?"
(1) Quis multa  gracilis te  puer in rosa perfusus liquidis
    what many-a slender  you boy  on rose drenched liquid
    urget        odoribus grato,     Pyrrha, sub antro?
    press (with) scents   delightful Pyrrha  in a-cave
'What slender boy, drenched with perfumes, / Is making love to you, Pyrrha, / On a heap of roses, in a delightful cave?'
There is no way to draw a boundary between this and "normal language". It is reasonable to expect that much the same process of interpretation is taking place (although possibly here with more false starts), and it is therefore essential that the formalism can cover instances such as this. It should not be forgotten either that crossed dependencies of this kind are a relatively common feature even in English, and that configurational, "uncrossed" language is only a special instance of the general case.
3.9 Coordinate Structures
Finally, I shall take a brief look at how coordination is to be handled in the present framework. This will mainly address clausal coordination, although at the end of the section I shall also examine the need for coordination at lower levels. As touched upon in a previous section, the general procedure in a dynamic grammar such as the one presented here30 is to "rewind" the actions of the first conjunct to a previous state and begin the second conjunct from there. The two conjuncts are then best imagined as being in some sense parallel structures which share a common anchor in the state where the coordination starts. So in the first example:
(29) John likes olives, but hates stoning them.
we derive the sequence of states as normal, but when we get to the second conjunct, beginning at hates, we rewind back to state 1 (the state after John has been added) and continue from there. The most intuitive way to represent these parallel structures would perhaps be in three dimensions, but as the formalism is very much a two-dimensional structure, I shall attempt to use a reentrancy notation to represent the same thing. The second conjunct clause is represented as a feature on the first clause. The boxed figure one indicates that the conjunct clause contains the information present up to and including state 1 of the first clause. The superscript above the box indicates at what state this reentrancy was created.
30. For a fuller discussion of coordination in dynamic grammars, see Milward (1994).
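To give a concrete picture of this rewind operation, the following sketch, again my own illustration with invented representations, models a derivation as a sequence of states and a conjunct as a parallel branch anchored at an earlier state, the shared prefix being referenced rather than copied, much as in the reentrancy notation of figure 29:

    # A derivation as a sequence of information states (crudely modelled
    # here as the list of items added so far).
    derivation = [
        [],                            # state 0
        ["John"],                      # state 1: after 'John'
        ["John", "likes"],             # state 2
        ["John", "likes", "olives"],   # state 3
    ]

    def start_conjunct(derivation, anchor_state):
        """Begin a second conjunct that shares everything up to
        anchor_state; the shared prefix is referenced, not copied."""
        return {"anchor": anchor_state,
                "shared": derivation[anchor_state],
                "new": []}

    # 'but hates stoning them' rewinds to state 1 and continues from there:
    conjunct = start_conjunct(derivation, anchor_state=1)
    conjunct["new"].extend(["hates", "stoning", "them"])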
Figure 29: "John likes olives, but hates stoning them."
The above example would typically be treated as a simple example of verb phrase coordination, and it might seem that the representation given here is a complicated way of getting around the fact that the formalism does not recognize any kind of verb phrase to coordinate. That this is not so, and that the present representation has advantages of coverage and generality, can be seen by considering figure 30, where the two elements being coordinated are a noun phrase, most sports, on the one hand and a negative adverb concatenated with a gerundive clausal object, [not] [playing golf], on the other. It will be seen, however, that they are both possible continuations from the second state of the derivation.31
Figure 30: "John enjoys most sports, but not playing golf."
It can even happen that the first "conjunct" is completely empty, i.e. the whole of the first clause is copied (or alternatively a reentrancy link is made to its final state) and the second conjunct consists merely of modifying adjuncts, as in the following example, figure 31. It should be borne in mind that such examples are far from uncommon; indeed they abound in English of all kinds. It seems that when a verb is shared by two clauses, as it is in figure 30, it must have the same basic lexical meaning in both and generally the same complement pattern (though, as that example demonstrates, the complements do not necessarily have to be of the same basic category).
31. In point of fact, adding not to state 2 in the first conjunct would not be felicitous in modern English. The rules of derivation are not always identical in the initial clause and in a conjunct clause, but they are nevertheless sufficiently common and regular to be learnt from a corpus.
Figure 31: "John plays football, but never with Mary."
There also seems to be a tendency for the adjunct pattern of the verb to be similar, although this appears to be a weaker constraint. The following example, figure 32, shows a case where the second conjunct outweighs the first by two adjuncts to nil.
Figure 32: "John eats olives and sometimes anchovies using his silver anchovy spoon."
In the next example, figure 33, the conjuncts are not only not constituents, they are not even concatenations of whole constituents (each being composed of half a noun phrase and a time adverbial); furthermore, the two words of which each conjunct is formed belong to entirely different clauses.32 However, given the approach to describing coordinated structures adopted here, it is possible to see the coordination as an example of the same basic operation that would give rise to a simple coordination of verb phrases. It is a common feature of English that conjuncts introduced without a conjunction will typically themselves introduce conjuncts, whereas those introduced by a conjunction typically do not. This gives rise to the characteristic A, B, C and D pattern.
32. This represents the reading of the sentence (which I find possible, at any rate) where the time adjuncts are attached to the highest clause.
Figure 33: "John thought Mary wanted a lion yesterday and tiger today."
It has been a mystery, however, that this pattern is also seen when the conjuncts are seemingly conjoined at different levels, as in the following examples:
(34) John has eaten the olives, drunk the wine, and is starting on the trifle.
(34a) ??John has eaten the olives, drunk the wine.
(34b) John enjoyed riding, shooting pheasant, and especially adored hunting badgers.
(34c) ??John enjoyed riding, shooting pheasant.
Such examples present a problem33 for theories which would analyse these as examples of constituent coordination, where it would seem as though the structure were A[1,2] and B[3], with the first two clauses conjoined at a lower level to form the first conjunct and this then coordinated with the third clause. But if this is so, why is this first conjunct unacceptable on its own? The state-based approach does not face the same problem, however, for it can analyse them as a construction containing three separate conjuncts (hence the A, B and C pattern), but where the latter two conjuncts share slightly different amounts of the first. This can be imagined as a tree where, instead of two branches emerging from the main trunk at exactly the same point (as in a more regular coordination), one branches off slightly before the other. This is represented in figure 34.
Figure 34: "John has eaten the olives, drunk the wine, and is starting on the trifle"
Other types of coordinate structures, such as right-node raising, gapping and parasitic gaps, have yet to be examined in detail, although it is expected that they will similarly involve some sharing of information between conjoined structures.
33. The simplest solution would be to declare them ungrammatical, but given the methodology adopted here this is not a possibility, since they are perfectly easy to interpret.
Sub-Clausal Coordination
It might seem from the discussion of coordination given above that a sentence such as:
(34a) John adores fish and chips.
would have to be analyzed as a clausal coordination, which would have an interpretation pretty similar to that of:
(34b) John adores fish and John adores chips.
In actual fact, the formalism will have to recognise that sentence (34a) is ambiguous between an interpretation with clausal coordination and a reading where fish and chips is represented as some conjunction of nominal heads. This difference is brought out when the sentence is negated:
(34c) John doesn't adore fish or chips.
(34d) John doesn't adore fish and chips.
Similar lower-level coordinations will have to be available for other items.
4 Deriving a Language Model
This paper has involved a thought experiment about what sort of behaviour we may expect from a corpus-trained language model, if the corpus has been parsed with the formalism described. The reality is that in such a bewilderingly complex system as language, predictions about how a particular system will perform should not be taken at face value. Although one can argue about the kinds of distinctions the model could in theory make, the only way to confirm them is to construct the language model and employ it in a working system.
Supposing we have a sufficiently large corpus34, hand-parsed with the formalism described, how might we derive from it a probabilistic model which can be used to predict the interpretation of a novel string of words? A simple-minded approach would be to take each word (i.e. word-type) in the corpus and count in a 2-dimensional matrix (with prior states down one axis and successor states down the other) how often a particular transition occurred.35 Having this information, it would be simple to calculate the probability of each word having a particular state-transition: simply count the number of times the word occurs with the transition and divide by the number of times it occurs in total. Knowing these probability distributions, we could apply the Markov assumption: the probability of any interpretation for any string is simply the product of all the transition probabilities that make it up. The optimal interpretation would then be the one that maximized this product, and would thus be the interpretation which the model predicted.
34. See the next section for some discussion of "sufficiently large".
35. The date-stamping could be normalized (e.g. the present state could be called state 0, the previous one state -1, the one before that state -2, etc.) to avoid redundancy.
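As an illustration of this counting scheme, the following sketch, my own rather than anything in the paper, uses an invented corpus representation in which each parsed sentence is a list of (word, transition) pairs; it sets out the relative-frequency estimate and the Markov product, computed in log space to avoid underflow:

    import math
    from collections import Counter, defaultdict

    def train(corpus):
        """Relative-frequency estimate of P(transition | word)."""
        word_totals = Counter()
        pair_counts = defaultdict(Counter)
        for sentence in corpus:
            for word, transition in sentence:
                word_totals[word] += 1
                pair_counts[word][transition] += 1
        def prob(word, transition):
            total = word_totals[word]
            return pair_counts[word][transition] / total if total else 0.0
        return prob

    def score(interpretation, prob):
        """Markov assumption: the log-probability of an interpretation is
        the sum of the log transition probabilities that make it up."""
        total = 0.0
        for word, transition in interpretation:
            p = prob(word, transition)
            if p == 0.0:
                return float("-inf")  # unseen transition: impossible under pure counting
            total += math.log(p)
        return total

The zero-probability branch is exactly where the sparseness problem discussed next makes itself felt.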
The reason why this is not a practical method is not so much that the number of states is countably infinite, but rather that the data will always be much too sparse, no matter how large the text chosen. This is because the states contain far too much information (even down to the identity of each lexeme occurring previously in the sentence), most of which is quite irrelevant to the effect of the new word on the state and which should therefore be abstracted away. So what information is relevant? We might suppose it will include:
1. The new information added.
2. The interaction of this information with that already present, including both syntactic-feature and lexical-feature co-occurrence.
3. The distance (in number of elapsed states) at which this interaction occurs.
To look at this from a less abstract perspective, let us take figure 35 as an example.
(35) He saw someone he knows well today.
This sentence is potentially at least triply ambiguous, but in a neutral context the majority of people would surely come to the interpretation given here (i.e. with low attachment of well and high attachment of today) without even being aware of the other possible readings. Imagine that we have successfully analyzed the first five words and have reached the sixth, well. Suppose in addition that its use as a manner adverb is the only one we have to consider here. It could be placed:
- In the matrix sentence, where it would modify the head verb see at a distance of 4 states, or
- In the embedded sentence, where it would modify know at a distance of 1 state.
Figure 35: "He saw someone he knows well today"
The latter co-occurrence (know/well) would almost certainly be more common in any sufficiently large corpus, and this would be reinforced by the advantage of proximity. At the next word, today, it would be hoped that the collocation see/today (or perhaps the more general "lexeme/case-role" collocation see/ntime) would be so much more likely than its rival know/today (or know/ntime) as to overcome the disadvantage of being more remote in time, and thus favour placing the temporal modifier in the matrix clause.
Figure 36: "He saw someone today he knows well"
He saw someone today he knows well.
Whether the derivation of the discontinuous (36) will be assigned a higher probability by the model than the continuous (35) will depend on whether the advantages gained by today being closer to the head verb it is modifying are offset by the cost occasioned by the extra distance of the relative clause from its modified head noun.36 As the length of the relative clause increases, and the distance of the temporal adjunct gets further and further from the matrix verb, the probability of the model preferring the discontinuous variant over the continuous one will increase. Thus the following data can be successfully characterized, as can other varieties of heavy-element shift.
(36a) He saw someone today he first met while picking potatoes last year for that grumpy old farmer who only pays five pounds a crate.
(35a) ??He saw someone he first met while picking potatoes last year for that grumpy old farmer who only pays five pounds a crate today.
36. This cost will of course depend on the nature of the text on which the model is trained. Such discontinuous constructions are much scarcer in formal written English than in natural conversation, for example.
To reiterate, the calculation of transition probabilities is a composite affair, involving the probability of some basic transition type, modified by information about lexical co-occurrences (if available) and further modified by effects of the distance between the new information and its related elements. None of these need be estimated by ad-hoc methods: they can all be calculated on the basis of information in the pre-parsed corpus. One exceedingly important element in these calculations, which I shall not go into here in any depth, is that of finding methods of smoothing transition-probability information across members of the same word-class in order to combat the problem of data-sparseness. At the extreme, this will also apply to providing estimated transition probabilities for words completely unknown to the model.37
37. For optimal results these estimations should take into account the morphology of the unknown word.
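By way of illustration only, such a composite calculation might be set out as below; the functional form, the distance penalty and the back-off constant are all assumptions of this sketch, not proposals of the paper:

    import math

    def composite_score(p_transition_type, p_lexical, p_class, distance,
                        decay=0.5, backoff=0.3):
        """Combine a basic transition-type probability, a co-occurrence
        factor (backing off from lexical to word-class information when
        the lexical pair is unseen) and a distance penalty, in log space."""
        p_cooc = p_lexical if p_lexical is not None else backoff * p_class
        penalty = decay * (distance - 1)  # nearer interactions are preferred
        return math.log(p_transition_type) + math.log(p_cooc) - penalty

    # Attaching 'well' (figure 35): know at distance 1 vs. see at distance 4,
    # with illustrative probabilities favouring the know/well collocation.
    low  = composite_score(0.1, p_lexical=0.02,  p_class=None, distance=1)
    high = composite_score(0.1, p_lexical=0.001, p_class=None, distance=4)
    assert low > high   # collocation and proximity both favour low attachment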
5 Implementation
The language model is an abstract relation between interpretations and word strings, and could be incorporated into a processing system in a number of ways. As the model is a Markov process, the Viterbi algorithm can be used to calculate the interpretation which gives the best score according to the language model. It may turn out to be more practical to use an n-best beam-search technique, however, although this would not be guaranteed to find the best-scoring interpretation.
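A schematic rendering of the n-best beam-search option follows; this too is a sketch under assumptions, with the transitions function, which would enumerate the possible next information states for a word together with their log-probabilities, left hypothetical:

    def beam_search(words, initial_state, transitions, n=10):
        """Keep only the n best-scoring partial interpretations per word;
        unlike Viterbi, this is not guaranteed to find the best overall."""
        beam = [(0.0, initial_state)]            # (log score, state)
        for word in words:
            candidates = [
                (score + logp, next_state)
                for score, state in beam
                for next_state, logp in transitions(state, word)
            ]
            beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:n]
            if not beam:
                return None                      # no interpretation found
        return max(beam, key=lambda c: c[0])     # best surviving interpretation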
Although a system using the formalism presented in this paper is still to be implemented, a comparable system using a similar state-transition formalism has been implemented, and is described in Tugwell (1994) and (1995a). That model was trained on a hand-parsed corpus of 18,800 running words taken from the Brown corpus (light fiction, section N). When tested on unseen sentences (< 15 words) from the same source, the system produced a parse for roughly half of the sentences, and about half of these parses were judged "correct" (i.e. they agreed with an independently made hand-parse). This does demonstrate that such a system can start producing results even when trained on only a relatively tiny corpus. Obviously the amount of data available is an important, and eventually a limiting, factor, but it seems that optimizing the data-smoothing methods will be of more immediate importance.
6 Summary
The approach to language modelling outlined in this paper is an attempt to combine the Data-Oriented Parsing approach of corpus-training with an incrementally derived formalism representing deep structural relations. These are used to create a Markovian probabilistic model of language interpretation. Although each of these components can be justified in its own right, it may be argued that it is in their combination that they are seen to best advantage. A major goal of the approach taken here is to produce a practical and useful system for the automatic analysis of language. At the same time, it may be hoped that the comprehensive nature of the model could enable it to explain and predict linguistic data in a simpler and more concise fashion than more restricted theories.
References
Blevins, James P. 1994. Derived Constituency Order in Unbounded Dependency Constructions. Journal of Linguistics, 30, 349-409.
Bod, Rens 1992. A Computational Model of Language Performance: Data Oriented Parsing. COLING-92.
Hausser, Ronald 1989. Computation of Language: An Essay on Syntax, Semantics and Pragmatics in Natural Man-Machine Communication. Springer-Verlag, Berlin.
Heycock, Caroline 1994. The Internal Structure of Small Clauses: New Evidence from Inversion. To appear in Proceedings of NELS 25.
Jespersen, Otto 1927. A Modern English Grammar on Historical Principles, Vol. III. London: George Allen & Unwin.
Milward, David 1994. Non-constituent Coordination: Theory and Practice. COLING-94.
Pollard, Carl & Sag, Ivan A. 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press and CSLI Publications, Chicago.
Ross, John Robert 1967. Constraints on Variables in Syntax. PhD thesis, MIT. [Reprinted in 1986 as Infinite Syntax!, New Jersey: Ablex.]
Steedman, Mark 1985. Dependency and Coordination in the Grammar of Dutch. Language, 61, 3, 523-568.
Stuurman, Fritz 1990. Two Grammatical Models of Modern English. London: Routledge.
Tugwell, David 1994. A Probabilistic Parser for English Based on Texts Hand-Parsed with a State-Transition Grammar. MSc thesis, Centre for Cognitive Science, University of Edinburgh.
Tugwell, David 1995a. A State-Transition Syntax for Data-Oriented Parsing. Proceedings of the 7th EACL, Dublin, pp. 272-277.
Tugwell, David 1995b. Syntax as Information-State Transitions in a Corpus-Trained Language Model. Proceedings of the 4th International Conference on the Cognitive Science of Natural Language Processing, Dublin City University.