A computational model of human coreference judgements
Peter Wiemer-Hastings
[email protected] DePaul University School of Computer Science, Telecommunications, and Information Systems 243 S. Wabash Chicago IL 60604
Carlo Iacucci
[email protected] University of Edinburgh Division of Informatics 2 Buccleuch Place Edinburgh EH8 9LW Scotland
Abstract
Although much effort has been directed towards creating theoretical models of coreference resolution and towards developing algorithms to perform coreference, no one has previously attempted to directly account for human judgements of coreference acceptability. Gordon and Hendrick (1997, 1998) have collected human data and developed a formal theory of human coreference resolution, which they call Discourse Prominence Theory (DPT). This paper describes our computational implementation of DPT and our evaluation of that implementation with respect to the human data.
Introduction
Coreference resolution has long been the focus of research in discourse aspects of natural language processing (e.g., Grosz, 1977; Walker, Joshi, & Prince, 1998). Recently, it has become the focus of intense practical work as a subtask of information extraction (DARPA, 1995). Although the main theories of discourse representation have their roots in psycholinguistic concerns, there have been no previous attempts to directly account for human judgements of coreference acceptability with a computational model.
Gordon and Hendrick (1997) (hereafter, G&H97) performed a series of experiments to determine human coreference acceptability judgements in a variety of contexts. The main aim of that paper was to evaluate Chomsky’s claims (Chomsky, 1981) about the principles which govern when coreference is allowed. Gordon and Hendrick concluded that some of Chomsky’s claims were supported by the human data, but others were not (Gordon & Hendrick, 1997, p. 338). Gordon and Hendrick (1998) (hereafter, G&H98) present a theoretical model of human coreference acceptability judgements which they call Discourse Prominence Theory. This theory is based on Discourse Representation Theory (Kamp & Reyle, 1993; van Eijck & Kamp, 1997). Although Gordon and Hendrick claim that their model accounts for human data better than Chomsky’s principles, the model is not specified in sufficient detail for data on specific sentences to be inferred from it, and therefore their claims cannot be directly validated.
In this paper, we describe a computational implementation of Discourse Prominence Theory. Because the model was underspecified, we were forced to make many decisions about significant implementation details. We created three different variations of the basic implementation in order to evaluate the effect of our various decisions. This paper describes the background of the relevant theories of discourse representation, the related work in coreference resolution, our set of implementations, and our evaluations of them.
Related work
Theories of discourse and coreference
The best-performing current algorithms and systems for pronoun resolution (Hobbs, 1978; Lappin & Leass, 1994; Mitkov, 1998) tend to be based on finely tuned heuristic methods rather than being directly motivated by psychological results. However, a great deal of research in computational linguistics, starting with the classical work of Grosz and Sidner (Grosz, 1977; Sidner, 1979), has been concerned with focusing effects (Anderson, Garrod, & Sanford, 1977; Sanford & Garrod, 1981). Centering theory is currently the most popular model of salience (Grosz, Joshi, & Weinstein, 1995; Walker et al., 1998). Although mainly motivated by corpus work, it has given rise to a considerable body of psycholinguistic research (Hudson, 1988; Brennan, 1995; Gordon, Grosz, & Gillion, 1993), as well as a number of computational models (Brennan, Friedman, & Pollard, 1987; Walker et al., 1998; Strube & Hahn, 1999; Tetreault, 1999), although none of these models performs as well as the best heuristic methods.[1] The version of centering theory proposed by Guindon (1985) is perhaps the most closely motivated by psychological research. Poesio (2001) proposes a model of pronoun resolution that ties in results from the literature on focusing with results concerning incrementality and underspecification in pronoun resolution (Corbett & Chang, 1983; Garrod, Freudenthal, & Boyle, 1993). Poesio and Stevenson (2002) analyze the predictions of current computational models of pronoun resolution in light of psychological research.
[1] This work on focusing has been applied to production as well as interpretation; see, e.g., McKeown (1983), Kibble and Power (2000), and Henschel, Cheng, and Poesio (2000).
Discourse Representation Theory
As mentioned above, Gordon and Hendrick’s model is a modified version of Discourse Representation Theory (DRT). In this section, we give a very brief introduction to DRT so that the reader can appreciate the general operation of the model described here. DRT was developed by Kamp (1981) and extended by Kamp and Reyle (1993). The goal of DRT is to support the computation of a semantic representation of a discourse from a syntactic representation of the utterances in the discourse. A key attribute of DRT is that, by taking account of the structure of a discourse, it provides a natural mechanism for accounting for the accessibility of referents in discourse. For example, in the discourse “If Mary had a car, she would drive it. It is really fast.”, DRT supports the coreference of the first “it” with the car, but does not allow the second “it” to refer to the car.
Syntax tree to rules
As previously mentioned, the basic processing of DRT consists of converting syntactic structures to semantic representations which are called Discourse Representation Structures (DRSs). The processing starts by putting the parse tree of a sentence into a DRS, as shown in Figure 1. The box represents the universe of the discourse, and the narrow section at the top of the box contains variables that represent any discourse referents that are introduced in the discourse.[2] The construction of the DRS continues with the application of rules which dismantle the syntactic structure and replace it with semantic conditions. When the parse tree is completely gone, the DRS is complete. Figure 2 shows a standard DRT construction rule that allows the processing of noun phrases which contain proper nouns (compare with CR.PN in G&H98). The rule applies when its right-hand side matches some substructure in the parse tree. To apply the rule, the indicated conditions are added (substituting for num and gen as appropriate), and the matched parse tree substructure is replaced with the reduced structure shown. For the example above, the new DRS structure after application of the proper noun rule is shown in Figure 3. Note that the discourse referent, x, acts as a variable here. When introducing a new discourse referent, a new variable name should be used.
[2] The syntactic structure here includes elements of Chomskyan syntax in order to be consistent with G&H98, but other DRT references commonly forgo references to IP and I’ nodes.
[Figure 2: A DRS rule for processing proper nouns. An NP node with the features number: num and gender: gen that dominates a proper noun (PN) name becomes the discourse referent x, with the conditions name(x), num(x), and gen(x) added to the DRS.]
[Figure 3: The DRS after applying the proper noun rule. The DRS contains the referent x and the conditions lisa(x), singular(x), and female(x); the subject NP in the parse tree is replaced by x.]
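The proper noun rule described in this section can be illustrated with a small sketch. This is not the authors’ Prolog code; the `DRS` class, the `apply_proper_noun_rule` function, and the dictionary layout of the NP node are hypothetical names chosen for illustration, and the sketch covers only the single rule of Figure 2.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class DRS:
    """A minimal DRS: a universe of referents plus a list of conditions."""
    referents: list = field(default_factory=list)
    conditions: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def fresh_referent(self):
        # Each new discourse referent gets a previously unused variable name.
        ref = f"x{next(self._ids)}"
        self.referents.append(ref)
        return ref


def apply_proper_noun_rule(drs, np_node):
    """Replace an NP that dominates a proper noun by a new referent.

    np_node is a hypothetical dict such as
    {"cat": "NP", "number": "singular", "gender": "female", "pn": "Lisa"}.
    """
    ref = drs.fresh_referent()
    drs.conditions.append((np_node["pn"].lower(), ref))  # name(x)
    drs.conditions.append((np_node["number"], ref))      # num(x)
    drs.conditions.append((np_node["gender"], ref))      # gen(x)
    return ref  # stands in for the matched NP in the reduced tree


drs = DRS()
subject = {"cat": "NP", "number": "singular", "gender": "female", "pn": "Lisa"}
x = apply_proper_noun_rule(drs, subject)
print(drs.referents, drs.conditions)
```

For the subject of “Lisa visited her brother.”, the sketch introduces one referent and the three conditions corresponding to lisa(x), singular(x), and female(x).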
The Gordon and Hendrick approach
Human experiments
G&H97 presents a series of six experiments which were designed to determine the conditions under which human participants judge coreference to be acceptable. That paper focussed on coreference within sentences, so some of the discourse-based influences mentioned in the introduction did not come into play. The participants were presented with sentences of the form “John’s roommates met him at the restaurant.” They were asked to indicate whether or not they thought that the two bold-faced items could refer to the same person. The six experiments focussed on different syntactic aspects which might affect the human judgements. In this implementation, we only evaluated the model with respect to Experiment 1 in G&H97. In this experiment, 36 sentences were presented to the participants. The data were analyzed with respect to two features: 1) the presence or absence of a c-command relation (Reinhart, 1981) between the antecedent and the coreferring expression, and 2) whether the sequence of coreferents was Name-Name, Name-Pronoun, or Pronoun-Name. According to Gordon and Hendrick, the acceptability data from the humans were somewhat ambivalent with respect to the principles of Chomsky’s Binding Theory (Chomsky, 1981). They were, in general, supportive of Principle B, but partially conflicted with Principle C (Gordon & Hendrick, 1997, pp. 338–339). Gordon and Hendrick describe their alternative to Chomsky’s principles in G&H98, claiming that their model accounts better for the overall human acceptability judgements. In the next section, we describe their model.
[Figure 1: The initial DRS for the sentence, “Lisa visited her brother.” The DRS contains the parse tree: IP dominates an NP (number: singular, gender: female; PN Lisa) and I’; I’ dominates I and VP; VP dominates TV visited and an NP (number: singular, gender: male; PossPro her, N brother).]
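For readers unfamiliar with the c-command relation used in the analysis of Experiment 1, the following sketch shows one common formulation: a node A c-commands a node B if neither dominates the other and the first branching node above A dominates B. The `Node` class and the simplified tree for “John’s roommates met him” are assumptions made for illustration, not the syntactic analysis used in G&H97.

```python
class Node:
    """A parse tree node that keeps a pointer to its parent."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def dominates(a, b):
    """True if a is a proper ancestor of b."""
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False


def c_commands(a, b):
    """True if a c-commands b under the first-branching-node definition."""
    if a is b or dominates(a, b) or dominates(b, a):
        return False
    node = a.parent
    while node is not None and len(node.children) < 2:
        node = node.parent  # skip non-branching ancestors
    return node is not None and dominates(node, b)


# Simplified, hypothetical tree for "John's roommates met him".
john = Node("NP:John's")
subject = Node("NP", john, Node("N:roommates"))
him = Node("NP:him")
sentence = Node("IP", subject, Node("VP", Node("V:met"), him))

print(c_commands(john, him))     # the possessor is buried inside the subject
print(c_commands(subject, him))  # the whole subject NP is a sister of VP
```

In this tree the possessor “John’s” does not c-command “him”, whereas the full subject NP does.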
Discourse Prominence Theory
Discourse Prominence Theory (DPT) is built within the general framework of DRT. It diverges in the specific mechanisms that it proposes for coreference resolution. Gordon and Hendrick describe three basic principles which form the core of their approach (G&H98, p. 416). We summarize the first two principles here:
• Pronouns primarily refer to entities that have already been mentioned in a discourse. Names primarily introduce new entities into the discourse.
• The syntactic and sequential structure of a sentence determines the prominence of discourse entities, which facilitates coreference by pronouns and impedes coreference by proper names.
The third principle comes into play for such syntactic structures as fronted adjuncts. Because this was not the focus of Experiment 1 in G&H97, it does not affect the implementation described here.
The principles are made more specific in two ways: First, an ordering is imposed on the discourse referents in the DRS based on “syntactic prominence”. Syntactic prominence can be intuitively understood as being related to the item’s height in the parse tree. The lower the item in the tree, the lower the prominence. Second, DRS construction rules for potentially referring expressions specify how the set of discourse referents should be searched. The DRS rule which processes pronouns searches for potential coreferents in order of discourse prominence, and introduces a new discourse referent if none is found. The DRS rule for proper names always adds a new discourse referent. An additional rule (called CR.EQ) applies for proper names. It is triggered when one discourse variable refers to a proper name, and that proper name is already associated with another variable. CR.EQ searches the list of discourse referents in reverse order for that other referent with the same name and then it unifies the two. Additional detail can be found in G&H98, pp. 402–407.
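The search behaviour of the three rules just described can be sketched as follows. This is a simplified illustration rather than the formal rules of G&H98: the function names, the dictionary layout of a referent, the exact-match test on number and gender, and the choice to append new referents at the end of the prominence-ordered list are all assumptions.

```python
def process_pronoun(referents, number, gender):
    """Search referents (most prominent first); add a new one if none fits."""
    for ref in referents:
        if ref["number"] == number and ref["gender"] == gender:
            return ref["var"]
    new = {"var": f"x{len(referents) + 1}", "name": None,
           "number": number, "gender": gender}
    referents.append(new)
    return new["var"]


def process_name(referents, name, number, gender):
    """The proper name rule always introduces a new discourse referent."""
    new = {"var": f"x{len(referents) + 1}", "name": name,
           "number": number, "gender": gender}
    referents.append(new)
    return new["var"]


def cr_eq(referents, var):
    """Look, in reverse order, for another referent with the same name.

    Returns the pair of variables to be unified, or None if the rule
    is not triggered.
    """
    target = next(r for r in referents if r["var"] == var)
    if target["name"] is None:
        return None
    for ref in reversed(referents):
        if ref is not target and ref["name"] == target["name"]:
            return (ref["var"], var)
    return None


refs = []
first = process_name(refs, "john", "singular", "male")
pronoun = process_pronoun(refs, "singular", "male")
other = process_pronoun(refs, "singular", "female")
second = process_name(refs, "john", "singular", "male")
print(first, pronoun, other, second, cr_eq(refs, second))
```

In the usage lines, the masculine pronoun resolves to the referent introduced by the first name, the feminine pronoun introduces a new referent, and the repeated name triggers the CR.EQ-style search.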
An implementation of DPT
Although DPT was formalized in G&H98 in more detail than described here, it was not specific enough to lead to a straightforward implementation. Because we had to make many implementation decisions, we evaluated three different but related versions of the computer model, which are described in this section. We start, though, by describing the core system.
The system was inspired by the approach of Bos (Bos & Gabsdil, 2000), which implements a DRS construction mechanism in conjunction with a fairly powerful inference engine. The Bos system not only performs coreference resolution, but also processes presuppositions. We comment on the evaluation of our system with respect to Bos’s system below.
Because the construction of DRSs is essentially a transformation of one representation into another, and because Prolog is well-suited for such transformations, we used Prolog for our implementation. This allowed us to have a very simple one-to-one mapping between the formal version of the DRS construction rules and the Prolog clauses in our implementation. We augmented the list of construction rules in G&H98 with the additional rules necessary to create a complete DRS for each of the sentences in Experiment 1 of G&H97, including rules for processing sentences and verb phrases, as well as for attaching prepositional phrases and relative clauses.
To minimize the “cognitive complexity” of our implementation, we made it take just one pass through the input parse tree, in a recursive, top-to-bottom, left-to-right sequence. For example, the clause that implements the construction rule for a sentence recursively calls the construction clause on the subject NP and then on the VP, and then returns the completed DRS. Our top-to-bottom traversal of the parse tree simplified the computation of prominence. Instead of explicitly computing the “n-command” relation defined in G&H98, we could simply order the discourse referents by their depth in the tree. During this pass through the parse tree, the construction rules for pronouns and for proper names search through an accumulated list of discourse referents as described above.
The construction rules in G&H98 did not directly address the constraints on the matching of number and gender information that come into play during coreference. We implemented a simple strategy which ensured that the constraints were satisfied for any potential coreferent.
The experimental sentences were designed to be unambiguous, so our implementation came up with a single interpretation for each. It calculated a score for each sentence based on the amount of search it had to do through the list of discourse referents and on the use of the additional CR.EQ rule for name-name unification. This score can be interpreted as the “difficulty” for our system of making the particular coreference. We evaluated the implementation by comparing these scores to the averaged acceptability ratings from the human data in G&H97. These ratings ranged from 0 to 1.0, where 1.0 means that all participants rated the coreference as acceptable.
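The idea of ordering discourse referents by tree depth during a single top-to-bottom, left-to-right pass can be sketched as below. The implementation described in this paper was written in Prolog; this Python version, its tuple-based tree encoding, the set of referring labels, and the function names are illustrative assumptions only.

```python
REFERRING_LABELS = {"PN", "Pro", "PossPro"}  # assumed labels for names and pronouns


def collect_referents(tree, depth=0, found=None):
    """Visit the tree top-down, left-to-right, recording (depth, word) pairs.

    A tree is a tuple (label, child, child, ...); a leaf word is a string.
    """
    if found is None:
        found = []
    label, *children = tree
    if label in REFERRING_LABELS:
        found.append((depth, children[0]))
    for child in children:
        if isinstance(child, tuple):
            collect_referents(child, depth + 1, found)
    return found


def by_prominence(found):
    """Shallower items are more prominent; ties keep left-to-right order."""
    return [word for _, word in sorted(found, key=lambda pair: pair[0])]


# Simplified version of the tree in Figure 1 (the empty I node is omitted).
tree = ("IP",
        ("NP", ("PN", "Lisa")),
        ("I'", ("VP", ("TV", "visited"),
                ("NP", ("PossPro", "her"), ("N", "brother")))))

print(by_prominence(collect_referents(tree)))
```

Because Python’s `sorted` is stable, items at the same depth stay in the order in which the traversal encountered them, which mirrors the left-to-right component of the pass.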
Three versions
In our original implementation, we did not create the additional CR.EQ rule because it seemed redundant: our original mechanism for checking number and gender constraints also checked whether the names were the same for name-name pairs. Our first model therefore simply checked the constraints of two potential coreferents to ensure that there were no contradictions between them. For example, [male(x1), singular(x1)] would match with [john(x3), male(x3), singular(x3)], but items with two different names, genders, or numbers would not match. This is the standard unification approach used in unification-based grammars. Although version 1 accomplished coreference with a minimum of machinery, it allowed some coreferences which people (in general) do not, specifically pronoun-name coreference.
A second version of the implementation also did not use the separate CR.EQ rule of G&H98, but used a variant of the unification algorithm described above. In this case, two referents would only unify if the first was the same as, or more specific than, the second (i.e., the second had no more constraints than the first, and no conflicting constraints). Thus, [male(x1), singular(x1)] (“him”) could serve as the referent for [john(x3), male(x3), singular(x3)], but not vice versa.
For version 3, we implemented the full DPT model (except for those construction rules which did not apply to our test set). This involved the addition of the CR.EQ rule, and allowing its use to affect the score for the coreference. Although this may seem like adding complexity to the implementation just to make the scores come out right, it could be argued that it is justified. It may be that humans treat the “implicit” constraints of number and gender differently than they treat explicit proper names. In fact, other DRT formalizations represent the two differently.
For each version of the system, the coreference score was computed during the search for antecedents. The score started at 1.0 for the first item on the discourse referent list, and was multiplied by 0.5 for each additional item that was tested. As mentioned above, pronouns and proper names were treated differently. For pronouns, the list was searched in order of prominence. For names, it was searched in reverse order. For version 3, the CR.EQ rule was treated essentially as one extra step in the search. Thus, instead of the score for these items starting at 1.0, it started at 0.5.
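The scoring scheme just described can be written down as a short sketch. The function name, its parameters, and the choice to return 0.0 when no antecedent is found are assumptions; the text does not say what score a failed search receives, and for simplicity the antecedent is identified directly rather than through constraint matching.

```python
def coreference_score(referents, antecedent, is_name, use_cr_eq=False):
    """Score a coreference by how far the search has to go.

    referents  -- discourse referents, most prominent first
    antecedent -- the referent the expression ends up coreferring with
    is_name    -- names search the list in reverse order
    use_cr_eq  -- version 3 only: CR.EQ counts as one extra search step
    """
    order = list(reversed(referents)) if is_name else list(referents)
    score = 0.5 if (is_name and use_cr_eq) else 1.0
    for candidate in order:
        if candidate == antecedent:
            return score
        score *= 0.5  # each additional item tested halves the score
    return 0.0  # assumption: no antecedent found


referents = ["x1", "x2", "x3"]  # ordered by prominence
print(coreference_score(referents, "x1", is_name=False))
print(coreference_score(referents, "x3", is_name=False))
print(coreference_score(referents, "x3", is_name=True))
print(coreference_score(referents, "x1", is_name=True, use_cr_eq=True))
```

With three referents, a pronoun that corefers with the most prominent one scores 1.0, while a name that must reach the most prominent referent through CR.EQ scores 0.125.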
Clearly, this is a rather limited subspace of the possibilities for implementing this score. Future work will evaluate other methods.
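The two constraint-matching strategies of versions 1 and 2, described earlier in this section, can be sketched as follows. Representing constraints as feature dictionaries and the function names `compatible` and `subsumes` are assumptions made for illustration. The argument order of `subsumes` follows the parenthetical definition in the text, where the second referent may have no more constraints than the first.

```python
def compatible(a, b):
    """Version 1: match unless a shared feature has conflicting values."""
    return all(a[key] == b[key] for key in a.keys() & b.keys())


def subsumes(first, second):
    """Version 2: the second has no extra and no conflicting constraints."""
    return second.keys() <= first.keys() and compatible(first, second)


him = {"gender": "male", "number": "singular"}
john = {"name": "john", "gender": "male", "number": "singular"}
mary = {"name": "mary", "gender": "female", "number": "singular"}

print(compatible(him, john), compatible(john, him))  # symmetric check
print(subsumes(john, him), subsumes(him, john))      # direction matters
print(compatible(john, mary))                        # conflicting name and gender
```

The symmetric check of version 1 accepts the pair in either order, which is why it also admits pronoun-name sequences, whereas the directional check of version 2 accepts only one order.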
Evaluation
As mentioned above, we evaluated the system by comparing the scores from the implementation with the human acceptability data from G&H97. Figure 4 shows a graph of the correlation between the three versions (in order) of the system and the human data. The correlation with human data for versions 2 and 3 was significant at the p < 0.01 level. The best match with the human data was for version 3, which reached a correlation of r = 0.77.
[Figure 4: Correlation between human judgements and system versions 1, 2, and 3 (axis range 0 to 1).]
As mentioned above, our implementation was informed by Bos’s work. We could not evaluate his system with respect to the G&H97 human data for three reasons:
• Because his system addresses more aspects of discourse processing than ours (presupposition and accommodation), it would be difficult to evaluate it on its coreference judgements alone.
• Although the system does compute a score for each different reading of a sentence, these scores are intended only for ranking, that is, picking the most preferred reading.
• The system in its current state does not process some of the syntactic structures in our test set.
Conclusions
Based on the results reported above, we believe that our implementation is, in effect, a computational replication (or reification) of the claims in G&H98. The relatively high correlation between our version 3 (which fully implemented DPT) and the human data supports their claims about the cognitive validity of their model. Unfortunately, because we do not have a parallel implementation of Chomsky’s Binding Theory, we cannot evaluate the claim that the DPT approach accounts better for the human data. Future work in this area will focus on comparing the model directly to the Binding Theory model and also on exploring the space of scoring methods. We will also work with Bos to enable evaluation of his system’s performance on this task. As a final note, this implementation was developed as part of an assignment for a project for third-year students at the University of Edinburgh. The scope of the implementation was appropriate for our needs, and it required understanding of the related literature. After the conclusion of the course, the implementation will be available on request from the first author.
Acknowledgments
Many thanks to Peter Gordon for help with collecting information on his model. Thanks to Johan Bos for information about his DRT implementation and DRT in general. And thanks to Massimo Poesio for information about related work in coreference resolution.
References
Anderson, A., Garrod, S., & Sanford, A. (1977). The accessibility of pronominal antecedents as a function of episode shifts in narrative text. Quarterly Journal of Experimental Psychology, 35, 427–440.
Bos, J., & Gabsdil, M. (2000). First-Order Inference and the Interpretation of Questions and Answers. In Poesio, M., & Traum, D. (Eds.), Proceedings of Gotalog 2000. Fourth Workshop on the Semantics and Pragmatics of Dialogue, Gothenburg Papers in Computational Linguistics 00-5, pp. 43–50.
Brennan, S. E. (1995). Centering Attention in Discourse. Language and Cognitive Processes, 10, 137–167.
Brennan, S., Friedman, M., & Pollard, C. (1987). A Centering Approach to Pronouns. In Proc. of the 25th ACL, pp. 155–162.
Chomsky, N. (1981). Principles and parameters in syntactic theory. In Hornstein, N., & Lightfoot, D. (Eds.), Explanation in Linguistics: The Logical Problem of Language Acquisition. Longman, London.
Corbett, A., & Chang, F. (1983). Pronoun disambiguating: Accessing potential antecedents. Memory and Cognition, 11, 283–294.
DARPA (1995). Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann Publishers, San Francisco.
Garrod, S. C., Freudenthal, D., & Boyle, E. (1993). The role of different types of anaphor in the on-line resolution of sentences in a discourse. Journal of Memory and Language, 32, 1–30.
Gordon, P., & Hendrick, R. (1997). Intuitive knowledge of linguistic coreference. Cognition, 62, 325–370.
Gordon, P., & Hendrick, R. (1998). The representation and processing of co-reference in discourse. Cognitive Science, 22 (4), 389–424.
Gordon, P. C., Grosz, B. J., & Gillion, L. A. (1993). Pronouns, names, and the centering of attention in discourse. Cognitive Science, 17, 311–348.
Grosz, B. J. (1977). The Representation and Use of Focus in Dialogue Understanding. Ph.D. thesis, Stanford University.
Grosz, B. J., Joshi, A. K., & Weinstein, S. (1995). Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics, 21 (2), 202–225. (The paper originally appeared as an unpublished manuscript in 1986.)
Guindon, R. (1985). Anaphora resolution: Short-Term Memory and Focusing. In Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, pp. 218–227, Chicago, IL.
Henschel, R., Cheng, H., & Poesio, M. (2000). Pronominalization Revisited. In Proc. of 18th COLING, Saarbruecken.
Hobbs, J. R. (1978). Resolving Pronoun References. Lingua, 44, 311–338.
Hudson, S. (1988). The Structure of Discourse and Anaphor Resolution: The Discourse Center and the Roles of Nouns and Pronouns. Ph.D. thesis, University of Rochester.
Kamp, H. (1981). A theory of truth and semantic representation. In Groenendijk, J., Janssen, T., & Stokhof, M. (Eds.), Formal methods in the study of language, No. 135, pp. 277–322. Mathematical Centre, Amsterdam.
Kamp, H., & Reyle, U. (1993). From discourse to logic: Introduction to model theoretic semantics of natural language, formal logic and discourse representation theory. Kluwer Academic Press, Dordrecht, the Netherlands.
Kibble, R., & Power, R. (2000). An integrated framework for text planning and pronominalization. In Proc. of the International Conference on Natural Language Generation (INLG), Israel.
Lappin, S., & Leass, H. J. (1994). An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics, 20 (4), 535–562.
McKeown, K. R. (1983). Focus Constraints on Language Generation. In Proc. of IJCAI, pp. 582–587, Karlsruhe.
Mitkov, R. (1998). Robust pronoun resolution with limited knowledge. In Proc. of the 18th COLING, pp. 869–875, Montreal.
Poesio, M. (2001). Utterance Processing and Semantic Underspecification. Lecture Notes. CSLI, Stanford, CA. To appear.
Poesio, M., & Stevenson, R. (2002). Salience: Computational Models and Psychological Evidence. Cambridge University Press, Cambridge and New York. To appear.
Reinhart, T. (1981). Definite NP anaphora and c-command domains. Linguistic Inquiry, 12, 605–636.
Sanford, A. J., & Garrod, S. C. (1981). Understanding Written Language. Wiley, Chichester.
Sidner, C. L. (1979). Towards a computational theory of definite anaphora comprehension in English discourse. Ph.D. thesis, MIT.
Strube, M., & Hahn, U. (1999). Functional Centering–Grounding Referential Coherence in Information Structure. Computational Linguistics, 25 (3), 309–344.
Tetreault, J. R. (1999). Analysis of Syntax-Based Pronoun Resolution Methods. In Proc. of the 37th ACL, pp. 602–605, University of Maryland. ACL.
van Eijck, J., & Kamp, H. (1997). Representing Discourse in Context. In van Benthem, J., & ter Meulen, A. (Eds.), Handbook of logic and language, pp. 179–237. Elsevier Science B.V., New York.
Walker, M. A., Joshi, A. K., & Prince, E. F. (Eds.). (1998). Centering Theory in Discourse. Clarendon Press / Oxford.