A Symbolic Approach for Automatic Detection of Nuclearity and Rhetorical Relations among Intra-sentence Discourse Segments in Spanish

Iria da Cunha1, Eric SanJuan2, Juan-Manuel Torres-Moreno2,3, M. Teresa Cabré1, and Gerardo Sierra4

1 Institut Universitari de Lingüística Aplicada - Universitat Pompeu Fabra, C/ Roc Boronat, 138, 08018 Barcelona, Spain
2 Laboratoire Informatique d'Avignon - Université d'Avignon et des Pays de Vaucluse, 339 chemin des Meinajaries, BP91228, 84911 Avignon Cedex 9, France
3 École Polytechnique de Montréal - Département de génie informatique, CP 6079 Succ. Centre Ville, H3C 3A7 Montréal (Québec), Canada
4 Instituto de Ingeniería - Universidad Nacional Autónoma de México, Torre de Ingeniería, Basamento, Ciudad Universitaria, Mexico D.F. 04510, Mexico
{iria.dacunha,teresa.cabre}@upf.edu, {eric.sanjuan,juan-manuel.torres}@univ-avignon.fr, [email protected]

Abstract. Nowadays automatic discourse analysis is a very prominent research topic, since it is useful for developing several applications, such as automatic summarization, automatic translation, information extraction, etc. Rhetorical Structure Theory (RST) is the most widely employed theory. Nevertheless, there are not many studies on this subject in Spanish. In this paper we present the first system that assigns nuclearity and rhetorical relations to intra-sentence discourse segments in Spanish texts. To carry out the research, we analyze the learning corpus of the RST Spanish Treebank, a corpus of manually-annotated specialized texts, in order to build a list of lexical and syntactic patterns marking rhetorical relations. To implement the system, this list of patterns and a discourse segmenter called DiSeg are used. To evaluate the system, it is applied to the test corpus of the RST Spanish Treebank. Automatic and manual rhetorical analyses of each sentence are compared by means of recall and precision, obtaining positive results.

Keywords: Nuclearity, Rhetorical Relations, Intra-sentence Discourse Segments, Rhetorical Structure Theory, Corpus, Symbolic Approach, Spanish.

1 Introduction

Nowadays, several examples of indispensable and useful Natural Language Processing (NLP) tools exist: grammar checkers, stemmers, syntactic parsers,

This work has been partially financed by the Spanish projects RICOTERM (FFI2010-21365C03-01) and APLE (FFI2009-12188-C05-01), and the Mexican CONACYT project 82050.

A. Gelbukh (Ed.): CICLing 2012, Part I, LNCS 7181, pp. 462–474, 2012. © Springer-Verlag Berlin Heidelberg 2012


semantic parsers, among several others. These diverse linguistic levels of data processing have allowed the development of intelligent and useful applications. Nevertheless, as Hovy (2010) mentions, one of the most difficult phenomena to process is discourse structure. Most discourse NLP tools are based on Rhetorical Structure Theory (RST) (Mann and Thompson, 1988). This is a language-independent theory based on the idea that a text can be segmented into Elementary Discourse Units (EDUs) linked by means of nucleus-satellite or multinuclear rhetorical relations. In the first case, the satellite gives additional information about the other element, the nucleus, on which it depends (e.g. Result, Condition or Concession). In the second case, several elements, all nuclei, are connected at the same level, that is, there are no elements dependent on others and they all have the same importance with regard to the intentions of the text author (e.g. Contrast, List or Sequence). RST has been used to develop several applications, such as automatic summarization, information extraction (IE), text generation, question answering, automatic translation, sentence compression, coherence evaluation, etc. (Taboada and Mann, 2006). Nevertheless, most of these works have been developed for English, Portuguese or Japanese. This is due to the fact that, at present, RST parsers are available only for these three languages. We consider it necessary to build an RST Spanish parser, useful for the development of several applications related to computational linguistics. The rhetorical analysis of a text by means of RST includes three phases: a) segmentation, b) detection of relations, and c) building of a hierarchical rhetorical tree of the text. Figure 1 exemplifies these three phases. The relation annotation process is as follows: once a text is segmented, rhetorical relations between EDUs are detected.
The order of relation detection is the following: 1) EDUs inside the same sentence are linked in a binary way, 2) sentences inside the same paragraph are linked, and 3) paragraphs are linked. This paper focuses on phases b) and c) of the analysis, but specifically on the first level of relation detection, that is, at the intra-sentence level. For phase a), a discourse RST segmenter for Spanish called DiSeg (da Cunha et al., 2010, 2012) is used. In this paper, the development of the first automatic

a) Discourse segmentation: [Mi posición es favorable a la adhesión turca,] [pero pienso que deben valorarse previamente las implicaciones y la forma de afrontar esa ampliación.] [I support Turkish adhesion,] [but I also suggest prior evaluation of the implications and the way to face up to this extension.]

b) Detection of rhetorical relations: [Mi posición es favorable a la adhesión turca,]N_Antithesis [pero pienso que deben valorarse previamente las implicaciones y la forma de afrontar esa ampliación.]S_Antithesis

c) Building of the rhetorical tree: (tree diagram not reproduced here)

Fig. 1. Example of the three discourse analysis phases


system assigning nuclearity and rhetorical RST relations to intra-sentence discourse segments in Spanish texts is presented. We consider that this system constitutes an important step towards developing a complete discourse parser for Spanish. In Section 2, related work is presented. In Section 3, the resources and tools used in this work are explained. In Section 4, the corpus analysis is detailed. In Section 5, the system implementation is shown. In Section 6, experiments and evaluation are presented. In Section 7, conclusions and future work are discussed.

2 Related Work

Discourse RST segmenters exist for English (Soricut and Marcu, 2003; Tofiloski et al., 2009), Brazilian Portuguese (Maziero et al., 2007), French (Afantenos et al., 2010) and Spanish (da Cunha et al., 2010, 2012). They require some syntactic analysis of the sentences. The segmenters developed by Soricut and Marcu (2003), Maziero et al. (2007), Tofiloski et al. (2009) and da Cunha et al. (2010, 2012) rely on a set of linguistic rules. The one by Afantenos et al. (2010) relies on machine learning techniques: it learns rules automatically from thoroughly annotated texts.

Regarding RST corpora, nowadays they exist for only four languages: English (Carlson et al., 2002; Taboada and Renkema, 2008), German (Stede, 2004), Portuguese (Pardo and Nunes, 2008; Pardo and Seno, 2005) and Spanish (da Cunha et al., 2011). These RST corpora represent an important step in RST research, and they have been very useful for developing several applications, such as information extraction, text generation, automatic summarization, etc. They differ in the number of included texts and words, the systematicity of the annotation, the heterogeneity of the text domains, the amount of double-annotated texts (to measure inter-annotator agreement), etc. Most of these corpora have been annotated using the annotation interface RSTtool (O'Donnell, 2000).

Finally, discourse RST parsers are available for some languages: English (Marcu, 2000; Soricut and Marcu, 2003; Subba and Di Eugenio, 2009), Japanese (Sumita et al., 1992) and Brazilian Portuguese (Pardo and Nunes, 2008). These discourse parsers use symbolic or statistical approaches. Some of them, despite some limiting assumptions, achieve near-human performance (see, e.g., Soricut and Marcu [2003] for sentence-level analysis). Nevertheless, at the moment, there is no discourse parser for Spanish.

3 Resources and Tools

In this section, the resources and tools that have been used in this work are presented.

3.1 DiSeg

DiSeg (da Cunha et al., 2010, 2012) is the only existing discourse segmenter for Spanish. It produces state-of-the-art results while requiring no full syntactic analysis, only shallow parsing with a reduced set of linguistic rules that insert segment boundaries into the sentences. Therefore it can be easily included in applications


requiring fast text analysis on the fly; this is why it has been chosen for this work. DiSeg segmentation criteria are very similar to those used by da Cunha and Iruskieta (2010):

a) All the sentences of the text are segmented as EDUs.
b) Intra-sentence EDUs are segmented, using the following criteria:
   b1) An intra-sentence EDU has to include a finite verb, an infinitive or a gerund.
   b2) Subject/object subordinate clauses or substantive sentences are not segmented.
   b3) Subordinate relative clauses are not segmented.
   b4) Elements in parentheses are only segmented if they follow criterion b1.

DiSeg performance was evaluated over a corpus of manually annotated texts, obtaining an F-score between 80% and 96% in experiments with a corpus containing medical texts, and an F-score of 91% with a corpus of texts about terminology.
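As an illustration only, this kind of shallow, rule-based segmentation can be sketched as follows; the marker list, the verb-suffix heuristic and the merging step are invented simplifications for this example, not DiSeg's actual rule set:

```python
import re

# Illustrative sketch only: the marker list and the verb-suffix heuristic
# are invented simplifications, not DiSeg's actual rules.
BOUNDARY_MARKERS = ["pero", "aunque", "ya que", "para"]

def has_verb(clause):
    # Crude stand-in for shallow parsing (criterion b1): look for common
    # Spanish finite/infinitive/gerund endings.
    return re.search(r"\b\w+(?:ar|er|ir|ando|iendo|aba|ía|an|en)\b", clause) is not None

def segment(sentence):
    """Insert a candidate boundary before each marker, then merge back any
    candidate segment containing no verb form (approximating b2/b3)."""
    pattern = r"\s(?=(?:%s)\b)" % "|".join(BOUNDARY_MARKERS)
    segments = []
    for part in re.split(pattern, sentence):
        if segments and not has_verb(part):
            segments[-1] += " " + part
        else:
            segments.append(part)
    return segments

print(segment("Mi posición es favorable a la adhesión turca, "
              "pero pienso que deben valorarse las implicaciones."))
```

On the Figure 1 sentence, this sketch produces two segments, splitting before pero.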

3.2 RST Spanish Treebank

The RST Spanish Treebank (da Cunha et al., 2011) is the only corpus of Spanish texts annotated with RST rhetorical relations. It is free for research purposes and can be consulted or downloaded through an on-line interface. This is the main reason for using this corpus in this research. The corpus contains texts from nine specialized domains (Astrophysics, Earthquake Engineering, Economy, Law, Linguistics, Mathematics, Medicine, Psychology and Sexuality). The segmentation criteria used are similar to those employed by DiSeg. It includes 52,746 words, 267 texts, 2,256 sentences and 3,349 discourse segments. 79% of the corpus was tagged by one person (the "training corpus"), from a team of 10 RST expert annotators. 31% of the corpus is double-annotated (the "test corpus"), with high agreement regarding EDUs, SPANs (that is, groups of related EDUs), nuclearity and relations. The annotation interface used was RSTtool. The list of rhetorical relations employed to manually annotate the texts is included in Table 1, together with the quantity of each relation in the training corpus of the RST Spanish Treebank. This is the corpus analyzed in order to detect linguistic patterns for relation detection, as explained in Section 4.

Table 1. Quantity of relations in the training corpus

Relation name   Type  Nº    %
Elaboration     N-S   508   24.71
Preparation     N-S   273   13.28
Background      N-S   152   7.39
Result          N-S   118   5.74
Means           N-S   108   5.2
Purpose         N-S   108   5.2
List            N-N   106   5.16
Joint           N-N   85    4.13
Circumstance    N-S   86    4.18
Interpretation  N-S   69    3.36
Antithesis      N-S   53    2.58
Cause           N-S   50    2.43
Sequence        N-N   44    2.14
Condition       N-S   44    2.14
Evidence        N-S   43    2.09
Contrast        N-N   41    1.99
Concession      N-S   40    1.95
Justification   N-S   38    1.85
Solution        N-S   20    0.97
Motivation      N-S   16    0.78
Reformulation   N-S   13    0.63
Conjunction     N-N   10    0.49
Evaluation      N-S   8     0.39
Summary         N-S   8     0.39
Disjunction     N-N   5     0.24
Enablement      N-S   5     0.24
Otherwise       N-S   3     0.15
Unless          N-S   2     0.10
TOTAL                 2056  100

As can be observed, the corpus contains more than 20 examples of most of the relations. The exceptions are the nucleus-satellite relations of Enablement, Evaluation, Summary, Otherwise and Unless, and the multinuclear relations of Conjunction and Disjunction, since these rhetorical relations are not as common in the language as the others.

4 Corpus Analysis

The first step in building a list of linguistic patterns useful for detecting discourse relations automatically was to build a table including all the identifiers of the texts of the training corpus and all the analyzed RST relations (Table 1). The second step was to manually analyze all the relations of these texts, observing all the possible lexical or syntactic markers indicating the presence of an RST relation (not only traditional discourse markers were considered). Then, these markers were included in the table. Table 2 shows a sample of this table (the complete table is too long to include here). The first column contains the relation name; the second and third columns show the text identifier (ID) (for example, ec00009 indicates text 9 of the economic subcorpus), together with the linguistic


markers found in each text. These markers are given in their Spanish form with an English translation, together with their number of occurrences in the corresponding text. Once all the linguistic markers were analyzed, they were divided into three categories:

1. Traditional discourse markers, such as primero ("first"), segundo ("second"), si ("if"), ya que ("since"), etc. The classification of Spanish discourse markers by Portolés (1998) was used.
2. Markers including lexical units, specifically substantives and verbs, such as metodología ("methodology"), procedimiento ("procedure"), usar ("to use"), etc.
3. Markers including verbal structures, such as para + INFINITIVO ("to + INFINITIVE"), siempre que + SUBJUNTIVO ("provided that + SUBJUNCTIVE"), etc.

Table 2. Example of the table including text identifiers, RST relations and linguistic markers

Rel./ID    ec00009                 li00035
Purpose    con el fin de (1)      para + INF (3) "to + INF"
           "with the aim of"
Cause      al + INF (1)           ya que (2) "since"
           "about to + INF"       debido a (1) "due to"
Sequence                          primero (1) "first"
Means                             metodología (1) "methodology"
                                  método (1) "method"
                                  procedimiento (1) "procedure"
                                  VB usar (2) "VB to use"
Result                            resultado (2) "result"
Condition                         siempre que + SUBJ (2) "provided that + SUBJ"
                                  si (2) "if"

Linguistic markers of Elaboration relations were not analyzed, since this is the most general and frequent relation in the language. As explained in Section 5, our algorithm includes a rule assigning the Elaboration relation to all EDUs that have not been categorized under any RST relation, so it was not necessary to analyze the linguistic markers of this relation. The Preparation relation was not analyzed either, since, in the corpus, Preparation satellites are always document titles, so they do not contain any marker (the only cue is that they are sentences without a verb). After the analysis, 778 general markers were detected. The training corpus includes 2,056 RST relations; this means that 37.9% of the relations in this corpus are marked. This percentage is higher than those reported in other works (20-25% for Spanish and English; see Taboada [2006] and Iruskieta and da Cunha [2010]). This is due to the fact that works about discourse markers and relations usually consider only traditional markers, whereas in our work any relevant lexical or syntactic unit can be considered a marker. Table 3 includes statistics about some marked relations in the training corpus.

Table 3. Some statistics about marked relations in the training corpus

Relation name   Type  Nº of linguistic markers  %
Background      N-S   75                        49.34
Purpose         N-S   95                        87.96
Result          N-S   55                        46.61
Means           N-S   44                        40.74
Antithesis      N-S   43                        81.13
Cause           N-S   32                        64
Sequence        N-N   31                        70.45
...             ...   ...                       ...
Reformulation   N-S   6                         46.15
Summary         N-S   2                         25
Disjunction     N-N   4                         80
Unless          N-S   0                         0
TOTAL                 2056                      100

It can be observed that some relations have a very high percentage of markers (for example, Purpose, Antithesis or Sequence), even if they do not have a high number of occurrences in the corpus (for example, Disjunction). This means that these relations quite frequently appear in texts evidenced by a linguistic marker, and they are the easiest ones to detect automatically. Nevertheless, there are some relations with a medium percentage of markers, such as Result, Means or Reformulation, that are more difficult to detect. The most extreme case is the relation Unless, which has 2 occurrences in the corpus and no markers. These cases are the most difficult to detect automatically. Hovy (2010) states that, given the lack of examples in a corpus, there are two possible strategies: a) to leave the corpus as it is, with few or no examples of some cases (but the problem will be the lack of training examples for machine learning systems), or b) to add low-frequency examples artificially to "enrich" the corpus (but the problem will be the distortion of the native frequency distribution and perhaps the confusion of machine learning systems). In the current state of our project, we have chosen the first option.

Another difficulty that has been detected is the possible ambiguity of some markers. As van Dijk (1984) explains, one of the problems of the semantics of natural connectors is that the same connector can express different connection types, and one connection type can be expressed by several connectors. Indeed, Versley (2011) carried out NLP research on the ambiguity of temporal discourse markers in English. In the analyzed corpus, some cases of ambiguous markers were found. For example, with regard to the third type of markers (markers including verbal structures), the verbal form GERUND can evidence three different relations: Circumstance (10 occurrences), Means (6 occurrences) and Result (2 occurrences).
In Section 5 it is explained how this problem is treated in this work.
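One of the strategies discussed in Section 5, choosing the relation most frequently signalled by an ambiguous marker, amounts to a simple frequency argmax; using the GERUND counts reported above:

```python
# Strategy: choose the relation most frequently signalled by the ambiguous
# marker in the training corpus. The counts are the GERUND figures reported
# above for the RST Spanish Treebank training corpus.
GERUND_COUNTS = {"Circumstance": 10, "Means": 6, "Result": 2}

def disambiguate(counts):
    return max(counts, key=counts.get)

print(disambiguate(GERUND_COUNTS))  # Circumstance
```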

5 System Implementation

In this section, the algorithm for detecting RST relations and nuclearity at the intra-sentence level is explained. The algorithm architecture has three main stages:

- Given a text, sentence segmentation (using a typical sentence segmenter).
- Given a sentence, discourse segmentation into EDUs (using the discourse segmenter DiSeg).
- Given a sentence segmented into EDUs, application of the following list of rules (in this order):
  PHASE 1: Traditional discourse markers (155 rules).
  PHASE 2: Markers including lexical units (77 rules).
  PHASE 3: Markers including verbal structures (41 rules).
  PHASE 4: Elaboration rule (1 rule).

Moreover, these rules have to be applied taking into account the following order:

1. Nucleus-Satellite (N-S) relation rules.
2. Satellite-Nucleus (S-N) relation rules.
3. Multinuclear (N-N) relation rules.

Let us see some examples of rules:

PHASE 1:
• NUCLEUS-SATELLITE (N-S) RULES:
  - sin embargo ("nevertheless") => Satellite of Antithesis of the previous EDU
• SATELLITE-NUCLEUS (S-N) RULES:
  - ya que ("since") => Satellite of Cause of the next EDU
• MULTINUCLEAR (N-N) RULES:
  - en primer lugar ("in the first place") => Nucleus of Sequence of the next EDU

PHASE 2:
• NUCLEUS-SATELLITE (N-S) RULES:
  - como alternativa ("as an alternative") => Satellite of Otherwise of the previous EDU
• SATELLITE-NUCLEUS (S-N) RULES:
  - esto es evidencia de que ("this is evidence that") => Satellite of Evidence of the previous EDU

PHASE 3:
• NUCLEUS-SATELLITE (N-S) RULES:
  - siempre que + SUBJUNTIVO ("provided that + SUBJUNCTIVE") => Satellite of Condition of the previous EDU
• SATELLITE-NUCLEUS (S-N) RULES:
  - aun + GERUNDIO ("although + GERUND") => Satellite of Concession of the previous EDU

The algorithm (implemented in Perl) reads each sentence from left to right, and the first rule found is applied. Rule order is important, because the most confident


rules are those included in Phase 1, that is, the rules based on traditional markers. After the application of the rules included in Phases 1-3, the Elaboration rule is applied in Phase 4. This rule assigns the category "Satellite of Elaboration" to all EDUs that have not been categorized in Phases 1-3. Each such EDU is linked with the previous EDU, which will be its Nucleus:

- if no relation was detected between 2 EDUs => Satellite of Elaboration of the previous EDU
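The phase cascade with the Elaboration default can be sketched as follows, assuming a toy rule format (pattern, relation, direction of attachment); the rules and the treatment of the leftmost EDU are simplifications for illustration, not the actual Perl implementation:

```python
import re

# Toy rule format: (pattern, relation, direction of the EDU this one attaches
# to). These few rules are illustrative stand-ins for the 155 + 77 + 41 rules
# of the actual system.
PHASES = [
    [  # Phase 1: traditional discourse markers (most confident)
        (re.compile(r"^sin embargo\b", re.I), "Antithesis", "previous"),
        (re.compile(r"^ya que\b", re.I), "Cause", "next"),
    ],
    [  # Phase 2: markers including lexical units
        (re.compile(r"\bmetodología\b", re.I), "Means", "previous"),
    ],
    [  # Phase 3: markers including verbal structures
        (re.compile(r"\bpara\s+\w+(?:ar|er|ir)\b", re.I), "Purpose", "previous"),
    ],
]

def label_edus(edus):
    """Apply the phases in order to each EDU; the first matching rule wins.
    Phase 4 (the Elaboration default) catches everything else."""
    labels = [("Nucleus", "")]  # leftmost EDU left unlabeled: a simplification
    for edu in edus[1:]:
        assigned = None
        for phase in PHASES:
            for pattern, relation, direction in phase:
                if pattern.search(edu):
                    assigned = (relation, direction)
                    break
            if assigned:
                break
        # Phase 4: Satellite of Elaboration of the previous EDU by default.
        labels.append(assigned or ("Elaboration", "previous"))
    return labels

print(label_edus([
    "Mi posición es favorable a la adhesión turca,",
    "sin embargo pienso que deben valorarse las implicaciones.",
]))
```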

The implemented rules are based on most of the linguistic markers detected. Some markers were not used because they were too general. For example, it was detected that the syntactic structure SUBJECT + VERB TO BE + OBJECT frequently marks the Background relation. Nevertheless, this structure is so common in the language that it was decided not to use it as a marker. To deal with the problem of marker ambiguity, three strategies could be used: a) to choose the relation with the highest number of markers of this type, b) to give the algorithm all the possible relations, or c) to develop more fine-grained strategies combining several markers to try to choose only one relation. In the current state of our project, in most cases, the first option is chosen. Nevertheless, in the future, strategy c) will be explored more deeply, along the same lines as Versley (2011). At the moment, we have developed a few rules of this kind. For example, the marker mientras ("while") can express Contrast, Circumstance or Condition. See the following examples:

a) [Mientras a mí me encanta bailar]EDU1 [a ella le gusta cantar.]EDU2 [While I love dancing,]EDU1 [she loves singing.]EDU2

b) [Mientras me dijo lo que quería escuchar]EDU1 [estuve tranquilo.]EDU2 [While he told me what I wanted to hear]EDU1 [I was calm.]EDU2

c) [Mientras me digas lo que quiero escuchar]EDU1 [todo irá bien]EDU2 [While you tell me what I want to hear]EDU1 [everything will be ok.]EDU2

On the one hand, example (a) contains two Nuclei of Contrast. On the other hand, EDU1 of example (b) is a Satellite of Circumstance of EDU2, while EDU1 of example (c) is a Satellite of Condition of EDU2. The three examples contain the same marker, but its meaning is different. In our rules, this marker is used as evidence that all these relations are possible, but verbal information is used as well to differentiate among them automatically. For example, in Spanish, if the meaning of the relation is Condition, the main verb in EDU1 should be a SUBJUNCTIVE form (such as "digas"). This information is given to our rules, in order to manage some cases of marker ambiguity.
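The mood-based refinement for mientras could be sketched as follows; the MOOD lexicon is a hypothetical stand-in for a real morphological analyzer (only the forms needed for the examples are listed), and the Circumstance fallback is an illustrative choice, since distinguishing Contrast from Circumstance would require further cues not modeled here:

```python
# The MOOD lexicon is a hypothetical stand-in for a real morphological
# analyzer; only the forms needed for the examples above are listed.
MOOD = {"digas": "subjunctive", "dijo": "indicative", "encanta": "indicative"}

def mientras_relation(edu1):
    """If an EDU opens with 'mientras' and contains a subjunctive verb,
    read the relation as Condition; otherwise fall back to Circumstance
    (an arbitrary illustrative choice)."""
    words = [w.strip(",.") for w in edu1.lower().split()]
    if not words or words[0] != "mientras":
        return "Elaboration"  # the default rule applies elsewhere
    if any(MOOD.get(w) == "subjunctive" for w in words[1:]):
        return "Condition"
    return "Circumstance"

print(mientras_relation("Mientras me digas lo que quiero escuchar"))  # Condition
print(mientras_relation("Mientras me dijo lo que quería escuchar"))   # Circumstance
```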

6 Experiments and Evaluation

After designing and implementing the algorithm, it was applied to a subcorpus of the test corpus of the RST Spanish Treebank. The mathematics corpus was used. It includes 48 research articles about mathematics, published in scientific journals. The texts are therefore very specialized, containing formulae, numbers, specialized phraseological units, named entities, etc. This means that the treatment of


this kind of texts is difficult. This mathematics test corpus contains 164 sentences. After our analysis, it was observed that the manually-annotated sentences can be divided into two groups: 110 sentences constitute a complete EDU, and 54 sentences are divided into several EDUs. In this second group, there are 39 sentences divided into 2 EDUs, 4 sentences divided into 3 EDUs, and 4 sentences divided into 4 EDUs. Regarding the automatically-annotated corpus (using DiSeg and our relation detection system), there are likewise 110 sentences that constitute a complete EDU and 54 sentences divided into several EDUs. In this second group, there are 41 sentences divided into 2 EDUs, 9 sentences divided into 3 EDUs, and 4 sentences divided into 4 EDUs. These statistics mean that the performance of the discourse segmenter DiSeg is quite high. Nevertheless, DiSeg performs an over-segmentation (4.3%) with regard to the manual segmentation. This difference is one of the limitations of the system presented here.

To evaluate the algorithm, it was applied to the original test corpus, and its results were compared with the manually-annotated corpus. The RST tree comparison methodology by Marcu (2000) was used. This methodology evaluates four elements (EDUs, SPANs, Nuclearity and Relations) by means of precision and recall measures, comparing a manual gold standard annotation with an automatic annotation. An on-line automatic tool for RST tree comparison, RSTeval (Maziero and Pardo, 2009), was used, where Marcu's methodology has been implemented (for four languages: English, Portuguese, Spanish and Basque). For each of the 164 sentence-tree pairs from the test corpus, precision and recall were measured separately. Afterwards, all those individual results were aggregated to obtain general results. Table 4 shows global results for the four categories.
The Segmentation (EDUs), SPANs and Nuclearity categories obtain recall and precision of 81.75%. The Relations category obtains slightly lower precision and recall (81.73%), but the difference is small. We think that these results can be considered acceptable. Nevertheless, we cannot compare them with the results of other systems for nuclearity and relation detection, since, as mentioned, such systems do not exist for Spanish.

Table 4. Results of the evaluation

Category    Precision  Recall
EDUs        81.75      81.75
SPANs       81.75      81.75
Nuclearity  81.75      81.75
Relations   81.73      81.73
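Conceptually, the per-category precision and recall amount to a set comparison between gold-standard and automatic units; the (start, end, label) values below are invented to illustrate the computation and are not corpus data:

```python
# Each annotation is modeled as a set of (start_edu, end_edu, label) units;
# the example values are invented for illustration, not corpus data.
def precision_recall(gold, auto):
    matched = len(gold & auto)
    precision = matched / len(auto) if auto else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

gold = {(1, 1, "Nucleus"), (2, 2, "Satellite_Purpose"), (3, 3, "Nucleus_List")}
auto = {(1, 1, "Nucleus"), (2, 2, "Satellite_Purpose"), (3, 3, "Satellite_Elaboration")}
p, r = precision_recall(gold, auto)
print(round(p, 2), round(r, 2))  # 0.67 0.67
```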

Here is an example of a sentence perfectly analyzed by the system:

[Se muestra cómo puede usarse la manipulación directa de las representaciones dinámicas generadas en este ambiente,]EDU1_Nucleus [para observar patrones de comportamiento]EDU2_Satellite_Purpose_of_the_previous_EDU [y formular conjeturas sobre las transformaciones bajo estudio.]EDU3_Nucleus_List_of_the_previous_EDU

[It is shown how direct manipulation of the dynamic representations generated in this environment can be used]EDU1_Nucleus

[to observe behavior patterns]EDU2_Satellite_Purpose_of_the_previous_EDU [and to formulate conjectures on the transformations that we are studying.]EDU3_Nucleus_List_of_the_previous_EDU

The obtained results can also be considered an extrinsic quantitative evaluation of the discourse segmenter DiSeg. Results show that, although segmentation results are quite good, there are some errors. Of the 164 sentences of the corpus, 26 were segmented incorrectly by DiSeg (that is, 15.85%). These segmentation errors are the cause of some errors of our relation detection system. After this quantitative analysis, a qualitative analysis was carried out, in order to observe the main limitations of the proposed system regarding nuclearity and relations. The main differences between manual and automatic annotations were analyzed. With regard to nuclearity, the main error is related to rule order, which we will have to optimize in the future. For example:

[Sin embargo, ante la amplitud de su obra nos hemos visto obligadas a escoger algunos temas]EDU1_Satellite_Antithesis [y hemos dejado de lado otros igualmente importantes.]EDU2_Nucleus

[However, because of his huge work we had to choose some subjects]EDU1_Satellite_Antithesis [and leave aside other equally important ones.]EDU2_Nucleus

In this example, EDU1 includes the marker sin embargo ("nevertheless"), so the system interprets it as a Satellite of Antithesis of EDU2. In reality, although this marker indicates Antithesis, here it relates the entire sentence (EDU1 + EDU2) to the previous sentence (which does not appear in the example). This is one of the limitations of intra-sentence analysis, which will be addressed in the future. In this case, EDU1 and EDU2 should be related by a List relation (marked by y, "and"). Regarding rhetorical annotations, the main reason for error is the aforementioned ambiguity of some markers included in the rules. For example, in the next sentence, the relation assigned by the system was Circumstance (marked in the rules by the GERUND verbal form at the beginning of a sentence), instead of Means, the relation assigned by humans:

[Generalizando la idea de palanca a la de un sólido "plano"]EDU1_Satellite_Circumstance [obtenemos la definición general de momento estático en un plano, la cual se traduce, fácilmente, en un vector área en el espacio tridimensional.]EDU2_Nucleus

[Generalizing the idea of the lever to that of a "plane" solid]EDU1_Satellite_Circumstance [we obtain the general definition of static moment in a plane, which translates, easily, into an area vector in three-dimensional space.]EDU2_Nucleus

7 Conclusions and Future Work

In this work, the first automatic system assigning nuclearity and rhetorical RST relations to intra-sentence discourse segments in Spanish texts has been presented. Precision and recall results are acceptable. We consider that this system constitutes an important step towards developing a complete discourse parser for Spanish. Moreover, we think that this work is a notable step for general RST research in Spanish, and that the system presented here will be useful for carrying out diverse research on RST in this language, both from a descriptive point of view (e.g. analysis


of texts from different domains or genres) and from an applied point of view (development of NLP applications, such as automatic summarization, automatic translation, sentence compression, IE, etc.). As future work, we plan to optimize the rules of the system, since some errors in the rules' performance have been detected. We also want to develop strategies to treat the ambiguity of some markers. With regard to evaluation, we would like to create a baseline against which to compare the results of the system. The final goal of our project is to develop the next stages of automatic discourse analysis (relating sentences and paragraphs), in order to build the first complete discourse parser for Spanish.

References

1. Afantenos, S., Denis, P., Muller, P., Danlos, L.: Learning recursive segments for discourse parsing. In: Proceedings of the Conference LREC 2010, pp. 3578–3584 (2010)
2. Carlson, L., Marcu, D., Okurowski, M.E.: RST Discourse Treebank. Linguistic Data Consortium, Pennsylvania (2002)
3. da Cunha, I., Torres-Moreno, J.-M., Sierra, G.: On the development of the RST Spanish Treebank. In: Proceedings of the Fifth Law Workshop (ACL 2011), pp. 1–10 (2011)
4. da Cunha, I., SanJuan, E., Torres-Moreno, J.-M., Lloberes, M., Castellón, I.: Discourse Segmentation for Spanish based on Shallow Parsing. In: Sidorov, G., Hernández Aguirre, A., Reyes García, C.A. (eds.) MICAI 2010. LNCS, vol. 6437, pp. 13–23. Springer, Heidelberg (2010)
5. da Cunha, I., SanJuan, E., Torres-Moreno, J.-M., Lloberes, M., Castellón, I.: DiSeg 1.0: The First System for Spanish Discourse Segmentation. Expert Systems with Applications 39(2), 1671–1678 (2012)
6. da Cunha, I., Iruskieta, M.: Comparing rhetorical structures of different languages: The influence of translation strategies. Discourse Studies 12(5), 563–598 (2010)
7. Hovy, E.: Annotation. A Tutorial. Presented at the 48th Annual Meeting of the Association for Computational Linguistics (2010)
8. Iruskieta, M., da Cunha, I.: Marcadores y relaciones discursivas en el ámbito médico: un estudio en español y euskera [Discourse markers and relations in the medical domain: a study in Spanish and Basque]. In: Bueno, J.L., et al. (eds.) Analizar datos > Describir variación: XXVIII Congreso Internacional AESLA, pp. 146–159. Universidade de Vigo, Servizo de Publicacións, Vigo (2010)
9. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: Toward a functional theory of text organization. Text 8(3), 243–281 (1988)
10. Marcu, D.: The Rhetorical Parsing of Unrestricted Texts: A Surface-based Approach. Computational Linguistics 26(3), 395–448 (2000)
11. Maziero, E., Pardo, T.A.S., Nunes, M.G.V.: Identificação automática de segmentos discursivos: o uso do parser PALAVRAS [Automatic identification of discourse segments: using the PALAVRAS parser]. Série de Relatórios do Núcleo Interinstitucional de Lingüística Computacional. Universidade de São Paulo, São Carlos (2007)
12. Maziero, E., Pardo, T.A.S.: Metodologia de avaliação automática de estruturas retóricas [A methodology for automatic evaluation of rhetorical structures]. In: Proceedings of the III RST Meeting (7th Brazilian Symposium in Information and Human Language Technology), Brazil (2009)
13. O'Donnell, M.: RSTTOOL 2.4 – A markup tool for rhetorical structure theory. In: Proceedings of the International Natural Language Generation Conference, pp. 253–256 (2000)


14. Pardo, T.A.S., Nunes, M.G.V.: On the Development and Evaluation of a Brazilian Portuguese Discourse Parser. Journal of Theoretical and Applied Computing 15(2), 43–64 (2008)
15. Pardo, T.A.S., Seno, E.R.M.: Rhetalho: um corpus de referência anotado retoricamente [Rhetalho: a rhetorically annotated reference corpus]. In: Anais do V Encontro de Corpora, São Carlos-SP, Brazil (2005)
16. Portolés, J.: Marcadores del discurso [Discourse markers]. Ariel, Barcelona (1998)
17. Soricut, R., Marcu, D.: Sentence Level Discourse Parsing using Syntactic and Lexical Information. In: Proceedings of the 2003 Conference of NAACL-HLT, pp. 149–156 (2003)
18. Stede, M.: The Potsdam commentary corpus. In: Proceedings of the Workshop on Discourse Annotation, 42nd Meeting of the ACL (2004)
19. Subba, R., Di Eugenio, B.: An effective discourse parser that uses rich linguistic information. In: Proceedings of the 2009 Conference of HLT-ACL, pp. 566–574 (2009)
20. Sumita, K., Ono, K., Chino, T., Ukita, T., Amano, S.: A discourse structure analyzer for Japanese text. In: Proceedings of the International Conference on Fifth Generation Computer Systems, pp. 1133–1140 (1992)
21. Taboada, M.: Discourse markers as signals (or not) of rhetorical relations. Journal of Pragmatics 38, 567–592 (2006)
22. Taboada, M., Mann, W.C.: Applications of Rhetorical Structure Theory. Discourse Studies 8(4), 567–588 (2006)
23. Taboada, M., Renkema, J.: Discourse Relations Reference Corpus [Corpus]. Simon Fraser University and Tilburg University (2008), http://www.sfu.ca/rst/06tools/discourse_relations_corpus.html
24. Tofiloski, M., Brooke, J., Taboada, M.: A Syntactic and Lexical-Based Discourse Segmenter. In: Proceedings of the 47th Annual Meeting of the ACL (2009)
25. van Dijk, T.A.: Texto y contexto (Semántica y pragmática del discurso) [Text and context (Semantics and pragmatics of discourse)]. Cátedra, Madrid (1984)
26. Versley, Y.: Multilabel Tagging of Discourse Relations in Ambiguous Temporal Connectives. In: Proceedings of the 8th International Conference on Recent Advances in Natural Language Processing (RANLP 2011), pp. 154–161 (2011)
