Experiments with Annotating Discourse Relations in the Hindi Discourse Relation Bank
Umangi Oza†, Rashmi Prasad‡, Sudheer Kolachina†, Suman Meena§, Dipti Misra Sharma†, and Aravind Joshi‡
† Language Technologies Research Center, Indian Institute of Information Technology, Hyderabad, India
§ Center for Language, Literature and Cultural Studies, Jawaharlal Nehru University, New Delhi, India
‡ Institute for Research in Cognitive Science/Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
Abstract In the Hindi Discourse Relation Bank (HDRB) project, we are developing a large corpus annotated with discourse relations, such as causal, temporal, contrastive and conjunctive relations. Adopting the lexically grounded approach of the Penn Discourse Treebank (PDTB), we annotate the argument structure of both explicit and implicit discourse relations, as well as the senses of relations. We describe our initial annotation experiments, which have led to the discovery of additional connective classes and the development of a modified sense classification scheme. We also present some distributional results from our initial annotations, and propose some insightful cross-linguistic generalizations by comparisons with the discourse relation distributions of English texts in the PDTB. Finally, we present an additional study of the properties of some connectives that belong to the class of discourse adverbials.
1 Introduction
For many NLP applications, such as Question Answering, Text Summarization, and Language Generation, among others, the information obtained from a sentence-level analysis is insufficient. To enable research and applications beyond the sentence level, corpora annotated with sentence-level syntactic and semantic relations, such as the Penn Treebank and Propbank, are now being followed by efforts to annotate relations at the level of discourse. The Penn Discourse Treebank (Prasad et al., 2008) is one such recent resource that contains lexically-grounded annotations of relations between abstract objects in discourse, such as eventualities and propositions (Asher, 1993). The lexically-grounded approach to discourse relations, and discourse structure in general, is due to Webber and Joshi (1998). Examples of such discourse relations are causal, contrastive, temporal, and elaboration relations, and in the PDTB, they can be realized in one of three ways: (a) as explicit connectives, which are “closed class” expressions drawn from well-defined grammatical classes; (b) as alternative lexicalizations (AltLex), which are non-connective expressions that cannot be defined as explicit connectives; and (c) as implicit connectives, which are essentially implicit discourse relations “inferred” between adjacent sentences not related by an explicit connective, and for which the annotator “inserts” a connective that best expresses the inferred relation (Martin, 1992; Knott, 1996). The abstract object relata of a discourse relation are called the relation’s arguments, named Arg1 and Arg2, according to syntactic and linear-order conventions in the PDTB. Each relation is assumed to have
Proceedings of ICON-2009: 7th International Conference on Natural Language Processing Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2009
two and only two arguments, and argument annotation follows the “minimality principle” in that only as much is selected as the argument text span as is minimally necessary to interpret the relation. (Any additional text deemed as relevant to the arguments’ interpretation is annotated as supplementary material.) Examples (1), (2), and (3) show PDTB annotations of explicit, AltLex, and implicit relations, respectively. Arg1 is enclosed in square brackets, and Arg2 is enclosed in braces. Discourse relations are underlined. (1)
[By most measures, the nation’s industrial sector is now growing very slowly]. Factory payrolls fell in September. So did the Federal Reserve Board’s industrial-production index. Yet, {many economists aren’t predicting that the economy is about to slip into recession.}
(2)
[Under a post-1987 crash reform, the Chicago Mercantile Exchange wouldn’t permit the December S&P futures to fall further than 12 points for an hour.] {That caused a brief period of panic selling of stocks on the Big Board.}
(3)
[The voters, as well as numerous Latin American and East European countries that hope to adopt the Spanish model, are supporting the direction Spain is taking.] IMPLICIT = SO {It would be sad for Mr. Gonzalez to abandon them to appease his foes.}
Each discourse relation is assigned a sense label based on a hierarchical sense classification developed in the PDTB project.1 For example, “concession” is the sense assigned to the explicit connective in Example (1), while “result” is the sense assigned to the AltLex and implicit relation in Examples (2) and (3), respectively. When no discourse relation can be inferred between adjacent sentences, either an entity-based coherence relation (called EntRel) or the absence of a relation (called NoRel) is marked between the sentences.

Following the completion of the PDTB, which contains annotations of English texts (namely, Wall Street Journal articles), further interest in cross-linguistic studies in discourse relations has led to the initiation of similar discourse annotation projects in several languages, such as Chinese (Xue, 2005), Czech (Mladova et al., 2008), and Turkish (Zeyrek and Webber, 2008).2 In this paper, we describe our ongoing work on the creation of a Hindi Discourse Relation Bank (HDRB) following the approach of the PDTB.3 The size of the HDRB corpus is 200K words and it is drawn from a 400K corpus on which Hindi syntactic dependency annotation is being independently conducted (Begum et al., 2008). All the source corpus texts are taken from the Hindi newspaper Amar Ujala. They comprise news articles from several domains, such as politics, sports, films, etc. Since HDRB is a substantial subset of the corpus selected for syntactic dependency annotation, it is expected to provide a valuable enriched resource for NLP applications aiming to combine sentence and discourse-level processing, as well as for linguistic research, including cross-linguistic studies.

In the rest of this paper, we begin by presenting our approach for annotating discourse relations and their arguments in Hindi (Section 2). We describe our classification of explicit connectives, the naming convention for arguments, and the scope of the implicit connective annotation. Section 3 describes our proposals for modifying the PDTB sense classification scheme. In Section 4, we present and discuss the results of some preliminary annotations we have carried out to date, including distributions of different relation types and senses. Section 5 presents an additional study of discourse adverbials, since they are considered to be anaphoric connectives, with the possibility of long-distance arguments. We conclude in Section 6 by describing the future direction and scope of the project.

1 each discourse relation and its arguments (Prasad et al., 2007). Attributions are not the focus of this paper, although we plan to annotate them in a later phase of the project.
2 Discourse Relations and their Arguments
2.1 Explicit Connectives
Annotation of relations realized by explicit connectives involves tagging the relation as “Explicit” and marking the text spans associated with the connective and its two arguments. As in PDTB, discourse connectives in HDRB are drawn from a set of grammatical classes, and are further constrained to be “closed class” elements. In addition to the three major grammatical classes in the PDTB - subordinating conjunctions, coordinating conjunctions, and adverbials - we recognize three other classes. The full set of classes identified so far in HDRB is described below, with examples.4

2 The PDTB is available from the Linguistic Data Consortium (http://www.ldc.upenn.edu, Catalog No. LDC2008T05). For details on the PDTB approach and guidelines, the reader is referred to the PDTB website (http://www.seas.upenn.edu/~pdtb) and papers therein.
3 The HDRB annotation is being carried out by researchers at IIIT, Hyderabad. Initial guidelines for the annotation as reported in this paper were developed in collaboration with the PDTB group at the University of Pennsylvania.
Coordinating Conjunctions: These are lexical items that conjoin clauses or phrases of the same syntactic status. They occur clause-initially, e.g., aOr (and), Eк\t (but), and include discontinuous paired forms, e.g., n к vl..bESк (not only..but also).
(4) [s\G к s\gWn an к { h\] Eк\t {EvcArDArA eк hF { h।}
‘[There are many groups in the Sangh] but {there is just one ideology.}’

Subordinating Conjunctions: These are lexical items conjoining finite adverbial clauses to their matrix clause. They typically occur clause-initially, although a few can occur clause-medially as well. As with coordinating conjunctions, the subordinating conjunctions have both single (e.g., kyo\Eк (because) in Ex. 5) and paired forms (e.g., agr..to (if .. then) in Ex. 6), but unlike the coordinating conjunctions, one of the elements in a paired subordinating conjunction can be implicit.
(5) [aAя dFyA яlAyA gyA { h] kyo\Eк {m rF vq gA\W { h।}
‘[Today the lamp has been lit] because {it is my birthday}.’
(6) agr {кoи aAps кh Eк nmк Cow do} to [aAp BF nhF\ Cow \g ।]
‘If [one were to ask you to quit taking salt] then {even you would not quit}.’

Sentential Relatives: These are relative pronouns that conjoin a relative clause with its matrix clause. As the name suggests, only relatives that modify verb phrases are treated as discourse connectives, and not those that modify noun phrases. Some examples are Eяss (so that), Eяsк кArZ (because of which). Example 7 illustrates a sentential relative formed with Eяss , which conveys a goal relation between “running to the dispensary” and “getting proper treatment”.5
(7) [sArA кAm Cowкr vh us EcEwyA кo uWAкr dvA Gr кF aor BAgA] Eяss {usкA shF ilAя EкyA яA sк ।}
‘[Dropping all his work, he picked up the bird and ran towards the dispensary] so that {it could be given proper treatment}.’

Adverbials: These are adverbial and prepositional phrases that are argued to function as anaphoric discourse connectives (Webber et al., 2003). That is, while one of their arguments is the clause they structurally modify, their other argument needs to be resolved anaphorically in the prior discourse. Some examples of these are so (so), EPr (then), nhF\ to (otherwise), vA-tv m \ (in fact), tBF (just then) etc. More often than not, adverbials in Hindi contain an explicit anaphoric expression providing the referential link with the connective’s Arg1 in the prior discourse, such as isк bA&яd (in spite of this), isF trh (similarly), isк alAvA (in addition) etc. Example 8 illustrates an annotation of the adverbial isк alAvA.
(8) [dAnvF lhro\ к кArZ a\XmAn к pE[cmF tV pr tVFy vn-ptF prF trh bbA d ho gyF।] isк alAvA {m\ g кF cÓAno\ кo BF nкsAn haA { h।}
‘[The coastal vegetation on the west coast of the Andaman has been completely destroyed due to wild waves]. In addition, {the coral reefs have also been damaged}.’

4 We have consulted several grammar descriptions for determining the set of grammatical classes from which connectives are drawn, such as Kachru (2006) and Sahay (2007). However, we note that there is considerable disagreement among linguists on how to make these classifications, both in Hindi as well as other languages. Consequently, in our annotation scheme, these classifications are not provided as part of the annotation of any connective. They are used primarily for expository purposes, and for convenience in describing the syntax of connectives to the annotators. Indeed, the same is true of PDTB as well. Users of PDTB and HDRB can therefore partition the set of connectives according to whichever typological theory they adhere to.
Subordinators: This class includes postpositions (Ex. 9), verbal participles (Ex. 10), and suffixes that introduce non-finite clauses with an abstract object interpretation (Ex 11). (9) {bA к яAn } к bAd [ uho\n us lwк кo apn pAs blAyA।]
‘After {Baa left} [he called the boy to him].’

5 Similar sentential relatives and their associated relative pronouns are found in English as well, although they are not annotated in the PDTB.
(10)
... {х lt } he [yh Bl яAtA { h Eк yEd usкA Em/ BF us apn EхlOn кo hAT lgAn nhF\ dtA to us EкtnA brA lgtA।]
‘... while {playing} [he forgets that if his friend too didn’t let him touch his toy, then he would feel very bad too].’ (11)
{bA кF bAt\ sn} кr [mn - hF - mn gA\DFяF bht lE>яt he।] ‘Upon {hearing Baa’s words}, [Gandhiji felt very ashamed].’
Some instances of subordinators are not discourse connectives, such as when they denote the manner of an action (Ex. 12). However, our preliminary annotation experiments suggest that distinguishing the discourse and non-discourse usage of subordinators is a difficult task, and we have decided to annotate these in a later phase of the project. (12)
uho\n bA кA hAT pкwA aOr drvAя tк хF\c кr l gy ।
‘[Lit.] He caught Baa’s hand and took her to the door by dragging her.’ Particles: Particles such as BF, hF act as discourse connectives in Hindi. BF is an emphatic inclusive particle that can be used to suggest the inclusion of verbs, entities, adverbs, as well as adjectives. The instances of such particles which indicate the inclusion of verbs are taken as discourse connectives (Ex. 13) while others are not (Ex. 14).6
(13) [log is dono\ dшo\ к bFc bx^t Er[t к pErZAm к !p m \ dх rh { h\।] {кшmFrF log iss eк rAяnFEtк sbк} BF {l rh { h\।}
‘[People see this as a consequence of the improving relation between the two countries]. {The Kashmiris are} also {learning a political lesson from this}.’
(14) usn кC BF nhF\ хAyA।
‘He didn’t eat anything.’

6 BF can also occur as a modifier of connectives like EPr, and to, when it serves to change their sense from that of their unmodified forms. For example, the connective to (then) has a conditional sense but to BF has a concessive sense.

2.2 Arguments of Discourse Relations
In PDTB, the assignment of the Arg1 and Arg2 labels to a discourse relation’s arguments is syntactically driven, in that the Arg2 label is assigned to the argument with which the connective is syntactically associated, while the Arg1 label is assigned to the ‘other’ argument. In HDRB, however, the Arg1/Arg2 label assignment is semantically driven, in that it is based on the sense of the relation to which the arguments belong. Thus, each sense definition for a relation specifies the sense-specific semantic role of each of its arguments, and stipulates one of the two roles to be Arg1, and the other, Arg2. For example, the ‘cause’ sense definition, which involves a causal relation between two eventualities, specifies that one of its arguments is the cause, while the other is the effect, and further stipulates that the cause will be assigned the label Arg2, while the effect will be assigned the label Arg1. The effect of this convention can be seen with Examples (15) and (16). The connectives in both the sentences have the ‘cause’ sense, but while the cause appears after the effect in Example (15), it appears before the effect in Example (16). According to the PDTB convention, Arg2 for both examples would be the second clause since it is syntactically associated with the connective, and Arg1 would be the other argument, in this case, the first clause. Thus, with the syntax-based convention, the labels would be ordered as Arg1-Arg2 in both examples. In HDRB, on the other hand, with the sense-specific labelling convention for ‘cause’, these argument label orderings are Arg1-Arg2 (cause after effect) for Example (15), and Arg2-Arg1 (cause before effect) for Example (16).
(15) prEtyoEgtA к bAd sonl n aOaA btAyA Eк [EvяtA к !p m \ яb usкA nAm pкArA gyA to us кC dr tк хd pr EvшvAs nhF\ haA], kyo\Eк {vh mAn кr cl rhF TF Eк yh prEtyoEgtA E'?s { h।}
‘After the competition, Sonal said that [when her name was announced as the winner, she could not believe herself for some time], because {she was thinking that the competition was fixed}.’
(16) ' {шn EXsAinro\ кA кhnA { h Eк sbs >yAdA nкl yA corF monopolF EXsAin кF hotF { h। {EXsAinr in bAto\ кo bхbF яAnt {h\} isEly [ки bAr @yAn nhF\ dt {h\।]
‘Fashion designers say that the most prevalent thefts or copies are of monopoly designs. {Designers know this fact very well} so [it does not matter to them many times].’
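The contrast between the two labelling conventions for Examples (15) and (16) can be illustrated with a minimal sketch. It assumes a small, hypothetical role table covering only the two stipulations stated in this paper ('cause' and 'goal'); the names ROLE_LABELS, Span, hdrb_labels, pdtb_labels and the role name "enabling_situation" are illustrative and are not part of the HDRB guidelines or tools.

```python
from dataclasses import dataclass

# Sense-specific stipulations of which semantic role receives which label.
# Only the two conventions stated in the text are encoded; the dictionary
# layout and the role name "enabling_situation" are assumptions.
ROLE_LABELS = {
    "cause": {"cause": "Arg2", "effect": "Arg1"},
    "goal": {"goal": "Arg2", "enabling_situation": "Arg1"},
}


@dataclass
class Span:
    text: str
    role: str  # semantic role of the span within the relation


def hdrb_labels(sense, spans_in_linear_order):
    """Sense-driven convention: the label follows the semantic role."""
    mapping = ROLE_LABELS[sense]
    return [mapping[span.role] for span in spans_in_linear_order]


def pdtb_labels(spans_in_linear_order, connective_clause_index):
    """Syntax-driven convention: the clause hosting the connective is Arg2."""
    return [
        "Arg2" if i == connective_clause_index else "Arg1"
        for i in range(len(spans_in_linear_order))
    ]


# Example (15): the effect precedes the cause.
ex15 = [
    Span("she could not believe herself for some time", "effect"),
    Span("she was thinking that the competition was fixed", "cause"),
]
# Example (16): the cause precedes the effect.
ex16 = [
    Span("Designers know this fact very well", "cause"),
    Span("it does not matter to them many times", "effect"),
]

for name, spans in (("15", ex15), ("16", ex16)):
    # In both examples the connective is attached to the second clause.
    print(name, "PDTB-style:", pdtb_labels(spans, 1),
          "HDRB-style:", hdrb_labels("cause", spans))
```

Under these assumptions the syntax-driven function returns the same ordering for both examples, while the sense-driven function reverses the ordering for Example (16), which is the behaviour described above.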
Apart from giving meaning to the argument labels, our semantics-based convention has the added advantage of simplifying the sense classification scheme.7 This is discussed further in Section 3.

2.3 Implicit Discourse Relations
The HDRB annotation of implicit discourse relations largely follows the PDTB scheme. In particular, adjacent pairs of sentences not related by an explicit connective are annotated with implicit relations that are nevertheless inferred between the two sentences. The only difference is that while implicit relations in PDTB are annotated only between paragraph-internal adjacent sentences, we also annotate such relations across sentences separated by paragraph boundaries. For each pair of adjacent sentences not related by an explicit connective, the annotator has four options, to be considered in the order given below.

(a) If a discourse relation is inferred between the sentences, an attempt is made to insert an implicit connective that best expresses the relation. The relation is tagged as “Implicit” and the implicit connective is inserted. Insertable connectives are drawn primarily from the set of explicit connectives, but annotators are free to select alternative expressions as well. Example (17) shows an implicit connective inserted to express the causal relation between the two sentences.
(17) {is g m к sAr EхlAwF sEcn t dSкr s BF mhAn { h\।} Implicit=isEly [inкo ?lFn boSX кrnA EкsF к bs кF bAt nhF\।]
‘{All players in this game are greater than even Sachin Tendulkar} so [it is not possible for anyone to get them clean bowled.]’

(b) If a discourse relation is inferred but insertion of an implicit connective leads to a perception of “redundancy” in the expression of the relation, it suggests that the second sentence of the pair contains an alternatively lexicalized non-connective expression. The relation is tagged as “AltLex” (Alternative Lexicalization), and the AltLex expression is marked. An AltLex expression is any expression that doesn’t belong to any of the grammatical classes identified for explicit connectives or is not a closed class element. Example (18) illustrates an AltLex annotation.
(18) {bA\`lAdш m \ кAnn - &yv-TA кF hAlt m \ sDAr haA { h।} AltLex [isF vяh s BArt n sMm ln m \ шAEml hon кA ' {slA EкyA { h।]
‘{Bangladesh’s judiciary has seen an improvement}. That is why [India has decided to participate in the conference.]’

(c) If no discourse relation is inferred, an attempt is made to identify an entity-based relation across the two sentences. In an entity-based relation, the only relational inference made by the reader is that the second sentence identifies one or more entities from the previous sentence, and provides a further description about this entity (or entities). Such a relation is tagged as “EntRel” (Entity Relation). Example (19) illustrates an EntRel annotation, where the only purpose of the second sentence is to provide the reader with some additional information about “Jha’s second film”. No discourse relation, such as a causal or contrastive relation, is inferred between the sentences.
(19) [E'Sm mho(sv m \ prкAш JA кF nи E'Sm aphrZ кA BF prFEmyr honA { h।] EntRel {g\gAяl к bAd JA кF yh EкsF alg Evqy pr bnF dsrF E'Sm { h।}
‘[Prakash Jha’s latest film Apaharan will be premiered at the film festival.] {This is Jha’s second film on a different subject after Gangajal.}’

(d) If neither a discourse relation nor an EntRel relation is inferred, the relation is tagged as “NoRel” (No Relation).

7 We note, however, that the PDTB syntax-based convention for argument naming was due to the fact that the annotation of arguments and the annotation of senses were done in two “separate” phases. In retrospect, therefore, this convention was a practical design decision in PDTB and not one based on any deeper principle.
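The ordered options (a) to (d) amount to a small decision procedure over the annotator's judgements. The following is a minimal sketch under that reading; the three boolean inputs stand for human judgements, and the names RelationType and classify_adjacent_pair are hypothetical, not part of any HDRB annotation tool.

```python
from enum import Enum


class RelationType(Enum):
    IMPLICIT = "Implicit"
    ALTLEX = "AltLex"
    ENTREL = "EntRel"
    NOREL = "NoRel"


def classify_adjacent_pair(relation_inferred: bool,
                           insertion_redundant: bool,
                           entity_link: bool) -> RelationType:
    """Apply options (a)-(d) in order to one pair of adjacent sentences.

    relation_inferred:   the annotator infers a discourse relation
    insertion_redundant: inserting a connective feels redundant because the
                         second sentence already lexicalizes the relation
    entity_link:         the second sentence only adds a description of an
                         entity mentioned in the first sentence
    """
    if relation_inferred:
        if insertion_redundant:
            return RelationType.ALTLEX   # option (b)
        return RelationType.IMPLICIT     # option (a)
    if entity_link:
        return RelationType.ENTREL       # option (c)
    return RelationType.NOREL            # option (d)


# Judgements corresponding to Examples (17), (18) and (19).
print(classify_adjacent_pair(True, False, False).value)
print(classify_adjacent_pair(True, True, False).value)
print(classify_adjacent_pair(False, False, True).value)
```

The ordering matters: an entity link is only considered once no discourse relation has been inferred, so a pair that exhibits both is still tagged as Implicit or AltLex.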
3 Senses of Discourse Relations
Broadly, we follow the PDTB sense classification in that we take it to be a hierarchical classification, with the four top level sense classes of “Temporal”, “Contingency”, “Comparison”, and “Expansion”. Further refinements to the top class level are provided at the second type level and the third subtype level. However, we have made some significant changes and additions to the PDTB sense classification. We focus below only on these points of departure from the PDTB, which were partly motivated by general considerations for capturing additional senses, and partly by language-specific considerations. The HDRB sense hierarchy shown below reflects the modifications we have made to the PDTB sense scheme.

1. Class: Comparison
• Contrast
• Concession
• Similarity
• Pragmatic Contrast – Epistemic, Speech-Act, Propositional
• Pragmatic Concession – Epistemic, Speech-Act, Propositional

2. Class: Contingency
• Cause
• Goal
• Condition
• Pragmatic Cause – Epistemic, Speech-Act, Propositional
• Pragmatic Condition – Epistemic, Speech-Act, Propositional

3. Class: Temporal
• Synchronous
• Asynchronous

4. Class: Expansion
• Conjunction
• Instantiation
• Exception
• Restatement – Specification, Generalization, Equivalence
• Alternative – Conjunctive, Disjunctive, Chosen Alternative

Eliminating argument-specific labels: As mentioned above, the PDTB sense scheme defines two levels below the top class level - the type level and the subtype level. The tags at the type level are meant to express further refinements of the relations’ semantics, while the tags at the subtype level are meant to reflect the semantic contribution of the arguments. In addition to the scheme’s deviation from the purpose of specifying labels for the senses of relations (and not arguments), we also found that the purported function of
the subtype level was true for only some of the senses. That is, while some of the subtype distinctions do represent the arguments’ relative semantic roles, others continue to be refinements of the relations’ semantics. For HDRB, our goal was to modify the classification to avoid these inconsistencies. On closer examination, we found that the PDTB argument-related subtype labels were simply expressing variation in the “order” of the arguments. For example, the only difference between the “Reason” and “Result” subtypes under “Contingency.Cause” is the linear order of the arguments expressing the cause and effect: with “Reason”, the effect appears before the cause, whereas with “Result”, the cause appears before the effect. The same holds true for all the other argument-related subtype labels. In HDRB, we eliminate these argument ordering labels from the sense hierarchy. All levels in the HDRB sense hierarchy thus have the purpose of specifying the semantics of the relation to different degrees of granularity. The relative ordering of the arguments is instead specified in the definition of the type-level senses, and is inherited by the more refined senses at the subtype level.

Restricted back-offs: In the PDTB, annotators were allowed to back off to higher levels in the hierarchy when they found it difficult to identify the more refined senses at the lower levels. Thus, for example, they could select “Comparison” at the class level instead of “Comparison.Contrast” or “Comparison.Concession” if they were unable to disambiguate between “Contrast” and “Concession”. In HDRB, however, such back-offs are allowed only up to the type level. We enforce this constraint since we believe that class-level senses are too coarse-grained to be useful. Note that this guideline is consistent with the fact that the argument-ordering specifications are provided for the senses at the type level.
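As an illustration of the restricted back-off rule, the sense hierarchy listed in this section can be encoded as a nested mapping and used to check labels. This is a minimal sketch: the dot-separated label format extends the “Comparison.Contrast” notation used above, and the names HDRB_SENSES, is_valid_hdrb_label and back_off are hypothetical.

```python
# Class -> type -> list of subtypes, transcribed from the hierarchy above.
PRAGMATIC = ["Epistemic", "Speech-Act", "Propositional"]
HDRB_SENSES = {
    "Comparison": {
        "Contrast": [], "Concession": [], "Similarity": [],
        "Pragmatic Contrast": PRAGMATIC,
        "Pragmatic Concession": PRAGMATIC,
    },
    "Contingency": {
        "Cause": [], "Goal": [], "Condition": [],
        "Pragmatic Cause": PRAGMATIC,
        "Pragmatic Condition": PRAGMATIC,
    },
    "Temporal": {"Synchronous": [], "Asynchronous": []},
    "Expansion": {
        "Conjunction": [], "Instantiation": [], "Exception": [],
        "Restatement": ["Specification", "Generalization", "Equivalence"],
        "Alternative": ["Conjunctive", "Disjunctive", "Chosen Alternative"],
    },
}


def is_valid_hdrb_label(label: str) -> bool:
    """Accept Class.Type or Class.Type.Subtype; reject class-only labels."""
    parts = label.split(".")
    if len(parts) not in (2, 3):
        return False  # a bare class such as "Comparison" is too coarse
    cls, typ = parts[0], parts[1]
    if cls not in HDRB_SENSES or typ not in HDRB_SENSES[cls]:
        return False
    return len(parts) == 2 or parts[2] in HDRB_SENSES[cls][typ]


def back_off(label: str) -> str:
    """Back off to the type level, which is the coarsest level allowed."""
    if not is_valid_hdrb_label(label):
        raise ValueError(f"not an HDRB sense label: {label!r}")
    cls, typ = label.split(".")[:2]
    return f"{cls}.{typ}"


print(is_valid_hdrb_label("Comparison"))
print(is_valid_hdrb_label("Contingency.Cause.Reason"))
print(back_off("Expansion.Restatement.Equivalence"))
```

In this encoding a PDTB-style label such as “Contingency.Cause.Reason” is rejected because the argument-ordering subtypes are not part of the hierarchy, and a class-only label is rejected because back-off stops at the type level.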
Uniform treatment of pragmatic relations: Pragmatic relations in HDRB are broadly based on the distinction made in PDTB between semantic and pragmatic relations (Sanders et al., 1992). Discourse relations are viewed as semantic when they relate the propositional content of the arguments. They are pragmatic when their relations have to be inferred from the propositional content of the arguments. However, in HDRB, we replace the PDTB pragmatic senses with a uniform three-way classification. Each pragmatic sense specified at the type level is further distinguished into three subtypes: “epistemic”, “speech-act”, and “propositional”. Epistemic and speech-act inferences are based on Sweetser’s (1990) analysis of polysemous connectives in terms of conceptual domains. An epistemic interpretation is obtained when the relation involves a conclusion (expressed in one argument) based on some observation (expressed in the other argument). For example, in the epistemic reading of “John loved Mary, because he came back”, the relation does not express real-world causality, but rather the causality that obtains between a premise (i.e., John’s coming back) and a conclusion (i.e., John’s love for Mary). Speech-act interpretations obtain when the relation is between a speech-act and the speaker’s justification for performing it. For example, in “What are you doing tonight, because there’s a good movie on.” the fact that there’s a good movie on is the reason why the question is being asked. In both the epistemic and speech-act interpretations, we can view the relation as a pragmatic one, since they involve the inference of a modality - epistemic (e.g., conclude (speaker, X)) or speech-act (e.g., ask (speaker, X)) - that takes scope over the propositional content of one of the arguments (X). In addition to these two, however, we also recognize a third novel type of inference, namely one which involves the inference of a complete proposition. The relation is then taken to hold between this inferred proposition and the propositional content of one of the arguments. Example (20) illustrates “pragmatic concession” of the “propositional” subtype. Here, Arg1 raises the expectation that “the driver should not be punished” which is then denied by the proposition inferred from Arg2 that “the driver should be punished.” (20)
[inm\ s eк X~Aivr n hETyAro\ кF яAnкArF hon pr mAml s hAT хF\c ElyA TA।] l Eкn {adAlt n кhA Eк agr usn us vkt pEls кo scnA d dF hotF to DmAкo\ кA qXy\/ hF EvPl ho яAtA।}
‘[One of the drivers denied his involvement in the issue in spite of his knowledge about the weapons]. But {the court said that had he informed the police on time, the blast could have been prevented}.’ The “Goal” sense: Under the “Contingency” class, we have added a new type “Goal”, which
applies to relations where the situation described in one of the arguments is the goal of the situation described in the other argument (which enables the achievement of the goal). The argument describing the goal is marked as Arg2, and the other argument is marked Arg1 (Ex. 21). In PDTB, goal relations were subsumed by the “Result” subtype under “Contingency.Cause”, but we believe that distinguishing between “Cause” and “Goal” leads to important consequences, for example, in the way questions are formulated over the relation. (21)
sBAq кA aArop { h Eк [rAяd a@y" rAZA кo] isEly [EVкV dnA cAht { h\], Eяss {cAr GoVAl m \ v srкArF gvAh n bn sк ।}
‘Subhash has alleged that [the RJD chief wants to give a ticket to Rana] so that {he does not become a government witness in the fodder scam trial}.’
4 Results of Some Initial Annotation
Based on the new guidelines as described above, we annotated 35 texts from the HDRB corpus. The texts had an average of approximately 250 words. We annotated both explicit connectives and implicit relations, as well as their senses. A total of 602 relation tokens were annotated. Here we present some useful distributions we were able to derive from our initial annotation, and discuss them in light of cross-linguistic comparisons of discourse relations.

Types and Tokens of Discourse Relations: Table 1 shows the overall distribution of the different relation types, i.e., Explicit, AltLex, Implicit, EntRel, and NoRel. The second column reports the number of unique expressions used to realize the relation (Explicit, Implicit and AltLex), while the third column reports the total number of tokens and relative frequencies. For comparison purposes, similar distributions are provided for PDTB as well, as reported in Prasad et al. (2008). These distributions show some interesting similarities and differences between how discourse relations are realized in Hindi and English. (a) First, given that Hindi has a much richer morphological paradigm than English, one would have expected that it would have fewer “explicit connectives”. Compared to English, Hindi has a greater incidence of grammatical and semantic/discourse functions being marked morphologically, i.e., as affixes, postpositions, or particles.
Relations   HDRB Types   HDRB Tokens (%)   PDTB Types   PDTB Tokens (%)
Explicit    49           189 (31%)         100          18459 (45%)
Implicit    35           185 (31%)         102          16224 (40%)
AltLex      25           37 (6%)           NA           624 (1%)
EntRel      NA           140 (23%)         NA           5210 (13%)
NoRel       NA           51 (9%)           NA           254 (1%)
Total       109          602               202          40600
Table 1: Distribution of Discourse Relations

Thus, we would not expect there to be any kind of lexical marking for the class of functions realized morphologically in one of these ways. For many grammatical functions, this is indeed what is observed, e.g., for case, tense, and aspect. However, in the case of discourse relations, this expectation is not borne out. While discourse relations are often marked morphologically with what we have called “subordinators”, other lexical strategies seem to be employed equally often, as seen in the discourse relation marking with conjunctions and adverbs. Even in the small data set that we have annotated so far, we have found 49 unique explicit connective types, which is roughly half the number reported for the 1 million words annotated in English texts in PDTB.8 It is expected that we will find more unique types as we annotate additional data. (b) Second, the percentages of explicit and implicit connective tokens are the same in HDRB (31%), whereas slightly fewer implicit connectives were recorded in the PDTB (40% implicit vs. 45% explicit). However, rather than suggesting a difference between the two languages, this is probably due to a design difference between the two projects: whereas implicit relations are annotated between “all” adjacent sentences in HDRB, they were not annotated across paragraph boundaries in PDTB for practical reasons. (c) Third, the percentage of AltLex relations is higher in HDRB (6% compared to 1% in PDTB), showing that Hindi makes greater use of cohesive links with the prior discourse. Further studies are needed to characterize the forms and functions of AltLex expressions in both English and Hindi.9

8 Note that the explicit connectives reported in Table 1 do not include the class of subordinators, which are morphological realizations of discourse relations. Subordinators are not annotated in PDTB either.
9 Current PDTB reports do not provide the numbers for the “types” of AltLex.
Senses of Discourse Relations: We also examined the distributions for each sense class in HDRB and computed the relative frequency of the relations realized explicitly and implicitly. Cross-linguistically, one would expect languages to be similar in whether or not a relation with a particular sense is realized explicitly or implicitly, since this choice lies in the domain of semantics and inference, rather than syntax. Thus, we were interested in comparing the sense distributions in HDRB and PDTB. Table 2 shows these distributions for the top class level senses. Again, the table gives PDTB sense distributions for comparison purposes. Also, here we counted the AltLex relations together with explicit connectives, since we were interested in looking at how often relations required inferences that could not rely on any explicit indicators of the relation. The table shows that sense distributions in HDRB are indeed similar to those found in PDTB. That is, the chances of “Expansion” and “Contingency” relations being explicit are lower compared to “Comparison” and “Temporal” relations. In sum, the syntactic realization of discourse relations across the two languages shows mostly similar patterns, but there are also some important differences with respect to how the different types of relations - explicit connectives, implicit connectives, and AltLex - are realized. Future comparative analysis upon completion of the corpus will provide more robust data on the different linguistic strategies used across the two languages for realizing discourse relations.
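The comparison of explicit and implicit realization per sense class can be recomputed directly from the class-level counts reported in Table 2. The sketch below is illustrative only: the counts are copied from the table (with AltLex counted as explicit, as stated above), while the names TABLE2, explicit_share and shares are hypothetical.

```python
# (explicit, implicit) token counts per class-level sense, from Table 2.
TABLE2 = {
    "HDRB": {
        "Contingency": (57, 41),
        "Comparison": (68, 21),
        "Temporal": (43, 23),
        "Expansion": (64, 94),
    },
    "PDTB": {
        "Contingency": (3857, 4185),
        "Comparison": (5562, 2832),
        "Temporal": (3700, 950),
        "Expansion": (6645, 8861),
    },
}


def explicit_share(explicit: int, implicit: int) -> float:
    """Percentage of relations of one sense class that are realized explicitly."""
    return 100.0 * explicit / (explicit + implicit)


shares = {
    corpus: {sense: explicit_share(*counts) for sense, counts in rows.items()}
    for corpus, rows in TABLE2.items()
}

for corpus, rows in shares.items():
    # List sense classes from least to most often explicit.
    for sense, pct in sorted(rows.items(), key=lambda item: item[1]):
        print(f"{corpus:5s} {sense:12s} {pct:5.1f}% explicit")
```

Sorting by explicit share places “Expansion” and “Contingency” below “Comparison” and “Temporal” in both corpora, which is the pattern discussed above.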
Sense Class    HDRB Expl. (%)   HDRB Impl. (%)   PDTB Expl. (%)   PDTB Impl. (%)
Contingency    57 (58.2%)       41 (41.8%)       3857 (48%)       4185 (52%)
Comparison     68 (76.5%)       21 (23.5%)       5562 (66%)       2832 (34%)
Temporal       43 (65.2%)       23 (34.5%)       3700 (80%)       950 (20%)
Expansion      64 (40%)         94 (60%)         6645 (43%)       8861 (57%)
Total          232              179              19764            16828

Table 2: Distribution of Class Level Senses

5 Additional Exploration of Discourse Adverbials

One of the major challenges for automatic discourse processing is determining where the arguments of connectives lie. As seen in Example (1), arguments of connectives need not always be structurally adjacent. In the PDTB, for example, Prasad et al. (2008) demonstrate that of the explicit connectives that establish relations between arguments across different sentences (as opposed to within a single sentence), 30% have their arguments in non-adjacent sentences. Effective algorithms for the discourse parsing task may benefit from extracting meaningful argument distributions of connectives. In our work, we were interested in exploring whether certain connectives were more likely to have structurally adjacent arguments than others.

To explore the behavior of discourse adverbials in Hindi, we chose two discourse adverbials and annotated additional instances in order to collect 50 instances of each. One connective was a contrastive adverbial, इसके बावजूद (nevertheless), whereas the other was a conjunctive adverbial, इसके अलावा (in addition). We found that इसके अलावा had non-adjacent LHArgs in 8 (16%) out of 50 cases. Examples (22) and (23) illustrate adjacent and non-adjacent arguments of इसके अलावा, respectively.

(22)
[दानवी लहरों के कारण अंडमान के पश्चिमी तट पर तटीय वनस्पती पूरी तरह बर्बाद हो गयी।] इसके अलावा {मूंगे की चट्टानों को भी नुकसान हुआ है।}
‘[The coastal vegetation on the west coast of the Andaman has been completely destroyed due to wild waves]. In addition, {the coral reefs have also been damaged}.’ (23)
[राहा शुरू से ही औपचारिकताओं को अनदेखी कर रहे थे।] तमाम आयल पीएसयू के बीच ओएनजीसी अकेली ऐसी कंपनी है, जिसने चालू वित्त वर्ष के लाभ, लाभांश आदि के लिये सहमति पर हस्ताक्षर भी नहीं किये हैं। इसके अलावा {आयल पीएसयू समीक्षा बैठक में भी भाग लेने से राहा ने इंकार कर दिया था।}
‘[Raha was avoiding the formalities from the beginning itself.] Out of all oil PSUs, ONGC is the only company which has not even signed the agreement on the profit, loss, etc. of this fiscal year. In addition, {Raha had also refused to participate in the PSU review meeting}.’

In contrast, इसके बावजूद always took adjacent arguments, as shown in Example (24).

(24)
[धूल भरी आंधी ने उनकी और शहरवासियों की उम्मीदों पर पानी फेर दिये।] {इसके बावजूद लोगों का उत्साह कम नहीं हुआ।}
Although the study reported here is based on a very small sample, it does suggest that it is possible to reliably discriminate the behavior of different discourse adverbials based on the connective alone. Our future work upon completion of the corpus will explore the argument distributions of adverbials in greater depth.
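The adjacency comparison above can be expressed as a small tally over annotated relations. The sketch below is illustrative only: the record type `AdverbialRelation`, its sentence-index fields, and the function `non_adjacent_rates` are hypothetical names and do not reflect the HDRB annotation format. The synthetic records simply reproduce the counts reported in this section (8 of 50 non-adjacent for one connective, 0 of 50 for the other), with the connectives written in an ad hoc romanization.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AdverbialRelation:
    """One annotated adverbial relation (hypothetical record layout)."""
    connective: str
    arg1_sentence: int  # index of the sentence containing Arg1
    arg2_sentence: int  # index of the sentence containing the connective and Arg2

    @property
    def is_adjacent(self):
        # Arg1 is adjacent when it lies in the sentence immediately before Arg2.
        return self.arg2_sentence - self.arg1_sentence == 1


def non_adjacent_rates(relations):
    """Return, per connective, the fraction of relations with a non-adjacent Arg1."""
    totals = defaultdict(int)
    non_adjacent = defaultdict(int)
    for rel in relations:
        totals[rel.connective] += 1
        if not rel.is_adjacent:
            non_adjacent[rel.connective] += 1
    return {conn: non_adjacent[conn] / totals[conn] for conn in totals}


# Synthetic records that mirror the reported counts; the sentence indices are invented.
IN_ADDITION = "iske alAvA (in addition)"
NEVERTHELESS = "iske bAvajUd (nevertheless)"
relations = (
    [AdverbialRelation(IN_ADDITION, 0, 2)] * 8      # Arg1 two sentences back
    + [AdverbialRelation(IN_ADDITION, 0, 1)] * 42   # Arg1 in the previous sentence
    + [AdverbialRelation(NEVERTHELESS, 0, 1)] * 50  # always adjacent
)
rates = non_adjacent_rates(relations)

if __name__ == "__main__":
    for connective, rate in rates.items():
        print(f"{connective}: {rate:.0%} non-adjacent")
```

A per-connective rate of this kind is one simple way a parser could set a prior on how far back to search for Arg1, although with 50 instances per connective any such estimate remains tentative.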
6 Summary and Future Work

This paper has reported on the Hindi Discourse Relation Bank (HDRB) project, in which discourse relations, their arguments, and their senses are being annotated. A major goal of our work was to investigate the extent to which the underlying framework of the Penn Discourse Treebank (PDTB) and its guidelines could be adapted for the discourse annotation of Hindi texts. Our work on adapting the PDTB scheme to Hindi discourse annotation has led to the identification of new syntactic categories for explicit connectives, and to general and language-driven modifications to the sense classification. From our study of our initial annotations, we found that (a) there does not seem to be an inverse correlation between the usage frequency of explicit connectives and the morphological richness of a language, although there does seem to be an indication of an increased use of cohesive devices in such a language; (b) sense distributions in PDTB and HDRB were similar, confirming the expectation that there are no cross-linguistic “semantic” differences; and (c) discourse adverbials showed some evidence of long-distance arguments, suggesting that anaphora resolution procedures for such connectives will have to be developed for discourse processing of Hindi texts. Our future goal is to complete the discourse annotation of a 200K word corpus, which will account for half of the 400K word corpus that is also being annotated for syntactic dependencies. We also plan to extend the annotation scheme to include attributions.
7 Acknowledgements

This work was partially supported by NSF grants EIA-02-24417, EIA-05-63063, and IIS-07-05671.
References

Nicholas Asher. 1993. Reference to Abstract Objects in Discourse. Kluwer, Dordrecht.

Rafiya Begum, Samar Husain, Arun Dhwaj, Dipti Mishra Sharma, Lakshmi Bai, and Rajeev Sangal. 2008. Dependency annotation scheme for Indian languages. In Proceedings of IJCNLP-2008.

Yamuna Kachru. 2006. Hindi. John Benjamins Publishing Co., Amsterdam.

Alistair Knott. 1996. A Data-driven Methodology for Motivating a Set of Coherence Relations. Ph.D. thesis, Department of Artificial Intelligence, University of Edinburgh.

James R. Martin. 1992. English text: System and structure. Benjamins, Amsterdam.

Lucie Mladova, Sarka Zikanova, and Eva Hajicova. 2008. From sentence to discourse: Building an annotation scheme for discourse based on Prague Dependency Treebank. In Proceedings of LREC-2008.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Aravind Joshi, and Bonnie Webber. 2007. Attribution and its annotation in the Penn Discourse Treebank. Traitement Automatique des Langues, Special Issue on Computational Approaches to Document and Discourse, 47(2).

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of LREC-2008.

Chaturbhuj Sahay. 2007. Hindi Padvigyaan. Kumar Prakashan, Agra.

Ted J. M. Sanders, Wilbert P. M. Spooren, and Leo G. M. Noordman. 1992. Toward a taxonomy of coherence relations. Discourse Processes, 15:1–35.

Eve Sweetser. 1990. From etymology to pragmatics: Metaphorical and cultural aspects of semantic structure. Cambridge University Press.

Bonnie Webber and Aravind Joshi. 1998. Anchoring a lexicalized tree-adjoining grammar for discourse. In Proceedings of the ACL/COLING Workshop on Discourse Relations and Discourse Markers.

Bonnie Webber, Aravind Joshi, Matthew Stone, and Alistair Knott. 2003. Anaphora and discourse structure. Computational Linguistics, 29(4):545–587.

Nianwen Xue. 2005. Annotating discourse connectives in the Chinese Treebank. In Proceedings of the ACL Workshop on Frontiers in Corpus Annotation II: Pie in the Sky.

Deniz Zeyrek and Bonnie Webber. 2008. A discourse resource for Turkish: Annotating discourse connectives in the METU corpus. In Proceedings of IJCNLP-2008.