Discourse Segmentation for Sentence Compression

Alejandro Molina¹, Juan-Manuel Torres-Moreno¹, Eric SanJuan¹, Iria da Cunha², Gerardo Sierra³, and Patricia Velázquez-Morales⁴

¹ LIA-Université d'Avignon
² IULA-Universitat Pompeu Fabra
³ GIL-Instituto de Ingeniería UNAM
⁴ VM Labs
Abstract. Earlier studies have raised the possibility of summarizing at the sentence level. This kind of simplification helps to adapt textual content to a limited space, and sentence compression is therefore an important resource for automatic summarization systems. However, few studies consider sentence-level discourse segmentation for the compression task; to our knowledge, none in Spanish. In this paper, we study the relationship between discourse segmentation and compression for sentences in Spanish. We use a discourse segmenter and observe to what extent the passages deleted by human annotators fit the discourse structures detected by the system. The main idea is to verify whether automatic discourse segmentation can serve as a basis for identifying the segments to be eliminated in the sentence compression task. We show that discourse segmentation could be a solid first step towards a sentence compression system.
1 Introduction
Automatic summarizers have advanced to the point that they can identify, with remarkable precision, the sentences that contain the most essential information of a given text. However, a great deal of irrelevant information is included because it appears in the same high-scoring sentences, which wastes space in the final summary. Hence, a finer analysis is needed to prune the superfluous information while retaining what is relevant [1]. Sentence compression is intended to produce a grammatical, condensed sentence that preserves the important content. It represents an important resource for automatic summarization systems [2]. Indeed, some authors argue that this task could be a first step towards abstract summary generation [3]. Sentence compression is thus expected to improve summarization systems. Nevertheless, there is evidence that compressing the sentences of a summary individually can produce worse results than not compressing them at all [4]. We agree with [4], who hypothesizes that sentence compression systems need to take the context into account. In this paper, we investigate the relationship between discourse and sentence compression. First, we provide an overview of sentence compression
and discourse segmentation in Section 2. Second, we present our corpus and a discourse segmenter in Section 3. Then, we detail our analysis and the results of running the segmenter over our corpus in Section 4. Finally, we conclude and present future research directions in Section 5.
2 Related Work
In this section, we present related research in sentence compression and discourse segmentation, paying particular attention to studies related to discourse units.
2.1 Sentence Compression
There are some interesting early studies in sentence compression with regard to its applications. A "Telegraphic Text Reduction" method is described by [5]. Later, [1] presented a non-extractive summarization method capable of generating headlines of any size. More thorough work was carried out by [6], who showed that two of the most important operations performed by humans in abstract summarization are sentence reduction and sentence combination. The Noisy Channel model (commonly used in Statistical Machine Translation) was first adopted for sentence compression by [7]. The authors of this work consolidated a well-defined task set-up, introduced a standard corpus for sentence compression¹, and described an evaluation method against which future research could be compared. [8] argued that the lack of examples in the standard Ziff-Davis corpus is one cause of poor compressions. [9] proposed compressing sentences taking context into account. These authors did not use the Ziff-Davis corpus because their discourse constraints required context annotations. They instructed annotators to delete individual words from each sentence [10]. For the first time, a sentence compression corpus was annotated by humans considering the context. Nonetheless, the criteria used to elicit the compressions remain quite artificial for summarization: humans are more likely to drop long phrases in an abstract, as indicated by [11]. Recent studies have obtained good results by concentrating on clauses, instead of isolated elements, for the generation of deletion candidates. The algorithm of [12] first divides sentences into clauses prior to any elimination, and the compression candidates are then scored by Latent Semantic Analysis [13]. A critical problem in this last study was that the main subject of the sentence was sometimes removed. A "discourse chunking" technique was presented by [14] as an alternative to discourse parsing, with a direct application to sentence compression. They plausibly argued that, while document-level discourse parsing still poses a significant challenge, sentence-level discourse models have shown accuracies comparable to human performance [15]. Finally, we would like to mention some interesting studies in sentence compression in languages other than English. In [16] and [17] the authors show
¹ Using 1,067 (sentence, compressed sentence) pairs from the Ziff-Davis corpus. All of the compressions were extracted automatically from 4,000 articles.
interesting results based on different methods from statistical physics applied to documents in French. In [18] and [19] the authors evaluated sentence compression in Spanish as an optimization method for several automatic summarizers. In [20] a system for summarizing subtitles in Dutch and English is described. In [21] sentence simplification phenomena are studied for Portuguese.
2.2 Discourse Segmentation
The term "segmentation", in discourse theory, refers to dividing a sentence into several units while maintaining discourse sense. As [22] state: "Discourse segmentation is the process of decomposing discourse into Elementary Discourse Units (edus), which may be simple sentences or clauses in a complex sentence, and from which discourse trees are constructed". Thus, one sentence may constitute a single edu, but it can also contain several edus. Example 1 shows a sentence corresponding to a single edu, and Example 2 shows a sentence with three edus (marked with square brackets). This characteristic makes it possible to consider segmentation at the sentence level, i.e., to decompose a sentence into edus using only local information. In this work, we use the sentence-level segmentation approach.

Example 1. [The design and management of terminological databases poses theoretical and methodological problems.]
Example 2. [In today's society, there are two apparently contradictory trends:] [on the one hand, there is a growing need for harmonization at international level, due to continuous economic, political, social and cultural links and exchanges,] [but on the other hand, there is a recognition of diversity in all areas of human life.]
Discourse segmentation is the first stage of discourse parsing (the following two stages are the detection of rhetorical relations and the building of rhetorical trees). The fundamental theoretical framework for research on automatic discourse parsing is the Rhetorical Structure Theory (RST) of [23]. RST is a language-independent theory of textual organization which considers that a text can be divided into edus, which are either nuclei or satellites. Nuclei are segments providing the most important information with regard to the purposes of the author, while satellites are elements that depend on the nuclei and give additional information about them. Traditional research considers that all satellites can be eliminated without losing information [24]. However, this assumption is not entirely true, as discussed in [25]. Moreover, results differ depending on whether we consider whole-text relations or intra-sentence relations. For example, if a Condition satellite is eliminated (the first edu in Example 3), it becomes impossible to understand the meaning of the sentence.

Example 3. [If we want to develop an adequate discourse segmenter,] [it is necessary to have a syntactic parser.]
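For illustration, a nucleus/satellite structure of this kind can be represented with a minimal data structure. The following Python sketch is ours and purely illustrative (the class and field names are assumptions, not part of RST or of any segmenter); it encodes Example 3 and shows why naively dropping all satellites can lose essential meaning.

```python
# Illustrative only: a minimal representation of RST-style discourse units.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class EDU:
    text: str                       # surface text of the elementary discourse unit
    nuclearity: str                 # "nucleus" or "satellite"
    relation: Optional[str] = None  # rhetorical relation, e.g. "Condition"

@dataclass
class SegmentedSentence:
    edus: List[EDU] = field(default_factory=list)

    def drop_satellites(self) -> str:
        """Naive compression: keep only nuclei. As discussed above, this
        traditional assumption does not always preserve the meaning."""
        return " ".join(e.text for e in self.edus if e.nuclearity == "nucleus")

# Example 3 rendered in this representation:
s = SegmentedSentence([
    EDU("If we want to develop an adequate discourse segmenter,",
        nuclearity="satellite", relation="Condition"),
    EDU("it is necessary to have a syntactic parser.", nuclearity="nucleus"),
])
print(s.drop_satellites())  # the Condition is lost, illustrating the caveat
```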
Nowadays there are automatic discourse segmentation systems for several languages: English [22], Brazilian Portuguese [26], Spanish [27] and French [28]. All of them require some syntactic analysis of the sentences.
3 Methodology
In this section, we first detail the elicitation of the corpus. Then we describe a discourse segmenter for Spanish.
3.1 Corpus Centered in Context
Due to the lack of corpora for sentence compression in Spanish, we have created a multi-genre corpus. In our corpus, each sentence was manually compressed taking into account its context as well as the important information to be retained (as opposed to compressing sentences in isolation). Four genres were selected with the intention of representing widely different registers of language: Wikipedia sections, brief news, scientific abstracts and short stories. Each genre is represented by 20 texts, each composed of no more than 50 sentences. Brief news items were chosen from three news web sites in Spanish². Scientific abstracts were randomly selected from sites in specific areas such as Psychology, Natural Language Processing, Engineering, Economics and Law. The short stories are by Augusto Monterroso, Marco Denevi, Macedonio Fernández, Julio Torri and Rubén Darío. All of these texts are available on the Web. The corpus has 392 sentences in Wikipedia sections, 840 sentences in brief news, 270 sentences in scientific abstracts and 550 sentences in short stories. The texts in each genre were distributed between two annotators, who were required to read them carefully and to compress them, sentence by sentence, following the few simple instructions listed in Figure 1.
Do not rewrite. Do not change the order. Do not replace. New versions of the sentences should be grammatical. New versions must retain the original meaning of the sentence before being altered. New versions must retain the original meaning of the text before compressions. Fig. 1. Instructions for corpus annotation
3.2 DiSeg
After the elicitation and manual compression of the corpus, we applied a Spanish discourse segmenter called DiSeg [27] over it. This system detects discourse boundaries in sentences, producing RST discourse segments (edus).
² La Jornada (www.jornada.unam.mx), El Universal (www.eluniversal.com.mx) and Milenio (www.milenio.com).
This segmentation tool is based on a set of discourse segmentation rules using lexical and syntactic features. First, the text is preprocessed with sentence segmentation, POS tagging and shallow parsing modules using the FreeLing toolkit [29]. Then, an XML file is generated with discourse marker annotations. Finally, several rules are applied to this XML file. The rules are based on: discourse markers, such as "while" (mientras que), "although" (aunque) or "that is" (es decir), which usually mark the relations of Contrast, Concession and Reformulation, respectively; conjunctions, such as "and" (y) or "but" (pero); adverbs, such as "anyway" (de todas maneras); verbal forms, such as gerunds and finite verbs; and punctuation marks, such as parentheses or dashes. The precision of DiSeg was evaluated using as gold standard a corpus including medical texts (obtaining an F-score of 80%) and terminological texts (obtaining an F-score of 91%). The gold standard, as well as the segmenter, can be downloaded through the following link: http://daniel.iut.univ-metz.fr/DiSeg/.
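To make the rule-based idea concrete, the sketch below splits a sentence before a handful of the discourse markers mentioned above. It is a deliberately simplified illustration and not DiSeg's implementation: DiSeg additionally relies on FreeLing POS tags and shallow parsing, and its marker inventory and rules are far richer than this.

```python
# A simplified, marker-only segmenter sketch (not DiSeg itself).
import re

# A few markers named in the text; the real inventory is larger.
MARKERS = ["mientras que", "aunque", "es decir", "ya que", "cuando", "pero"]

def naive_segment(sentence: str) -> list[str]:
    """Split a sentence immediately before each discourse marker."""
    pattern = r"\b(?=(?:" + "|".join(map(re.escape, MARKERS)) + r")\b)"
    return [seg.strip() for seg in re.split(pattern, sentence) if seg.strip()]

print(naive_segment("Es necesario tener un analizador sintáctico "
                    "ya que los marcadores no bastan, aunque ayudan."))
# ['Es necesario tener un analizador sintáctico',
#  'ya que los marcadores no bastan,', 'aunque ayudan.']
```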
4 Experiments and Results
In this section we show the quantitative and qualitative analysis of our manually compressed corpus.
4.1 Quantitative Analysis
First, we parsed our corpus with DiSeg at the sentence level. Then, we extracted all of the passages that had been removed from an original sentence according to the rules described in Section 3.1. Finally, we classified them into three classes:

1. Human-deleted passages corresponding to edus detected by DiSeg.
2. Human-deleted passages not corresponding to edus detected by DiSeg:
   (a) passages with discourse sense;
   (b) passages without discourse sense.

Class 1 contains human-removed passages corresponding to DiSeg edus, that is, segments containing a discourse marker detected by the system. As expected, some passages containing a discourse marker do not match DiSeg edus exactly. However, most of the passages in Class 1 do: annotators tended to remove complete discourse passages rather than just the markers. Table 1 shows the average matching proportion between the passages in Class 1 and DiSeg edus (defined as (passage length)/(edu length)), as well as the percentage of passages in Class 1 that fully matched a DiSeg edu. Class 2 includes human-removed passages that do not correspond to DiSeg edus. This class is divided into two sub-classes. Class 2a includes elements that could be considered discourse segments (because they have a discourse sense) but are not detected by DiSeg, as they do not match the segmentation criteria of the system. Class 2b comprises elements with no discourse sense, that is, short units such as nouns, verbs, adverbs, adjectives, adverbial phrases, punctuation marks, etc. In Section 4.2 we show some examples extracted from our corpus.
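As a minimal sketch of this classification (our own illustration, with assumed function names and an assumed overlap test; the paper does not specify an algorithm), the matching proportion and the three-way labelling could be computed as follows:

```python
# Our illustration of the Section 4.1 statistics; the substring overlap test
# used to decide Class 1 membership is an assumption.
def matching_proportion(passage: str, edu: str) -> float:
    """(passage length)/(edu length), counting lengths in words."""
    return len(passage.split()) / len(edu.split())

def classify(passage: str, edus: list[str], has_discourse_sense: bool) -> str:
    """Label a human-deleted passage as Class 1, 2a or 2b."""
    if any(passage in edu or edu in passage for edu in edus):
        return "Class 1"                     # corresponds to a DiSeg edu
    return "Class 2a" if has_discourse_sense else "Class 2b"

edus = ["If we want to develop an adequate discourse segmenter,",
        "it is necessary to have a syntactic parser."]
deleted = "If we want to develop an adequate discourse segmenter,"
print(classify(deleted, edus, has_discourse_sense=True))  # Class 1
print(matching_proportion(deleted, edus[0]))              # 1.0 -> full match
```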
Table 1. Matching proportion between passages in Class 1 and DiSeg edus

             Average matching proportion   Full-match with edus
Wikipedia    0.91                          73%
News         0.92                          73%
Scientific   1.00                          100%
Stories      0.81                          57%

Table 2. Proportions of deleted content in the three classes

             Class 1 (%words)   Class 2a (%words)   Class 2b (%words)
Wikipedia    31.55              29.57               38.88
News         34.95              16.47               48.58
Scientific   30.26              17.26               52.48
Stories      20.68              9.06                70.26

Table 3. Nucleus-Satellite proportions in DiSeg segments

             Nuclei      Satellites
Wikipedia    8 (29%)     20 (71%)
News         9 (13%)     58 (87%)
Scientific   0 (0%)      8 (100%)
Stories      4 (40%)     6 (60%)
In Table 2 we show the distribution of the deleted content, in terms of words, across the three classes described above; for every genre, the highest percentage falls in Class 2b. The passages with discourse sense (Class 1 ∪ Class 2a) nevertheless seem important for compression, considering that they represent approximately half of the volume that annotators removed from the whole corpus. We observe some differences with regard to genre. As expected, encyclopedic and journalistic texts tend to contain more removable discourse passages, reflected in explanations or extra information. By contrast, literary texts express information in a more subtle way, and the removal of isolated elements, such as adjectives and adverbs, is more appropriate for this genre. Regarding scientific texts, we consider that the results are affected by our choice of abstracts: summaries tend to contain simple sentences, so we did not find as many segments as expected. In order to determine whether all satellites had been systematically deleted, we divided Class 1 into two types: Nuclei and Satellites. Table 3 shows the proportion of each segment type. Most of the eliminated edus were satellites; however, some nuclei were also deleted.
[Figure 2 is a bar chart (counts from 0 to 20) of the DiSeg segments in Class 1 classified by rhetorical relation: Cause, Circumstance, Joint, Result, Elaboration, Antithesis, Others, Condition, Purpose, Means, Grant, Unless.]

Fig. 2. Number of occurrences of DiSeg segments classified by relation
The number of occurrences of DiSeg segments (Class 1) classified by relation³ is shown in Figure 2. We observe that some relations are more likely to be removed than others. This information could perhaps be used for recognizing candidate elements to be deleted in sentence compression. In the experiments carried out over our corpus, we find that the most frequently eliminated passages correspond to: satellites of Cause (27.85%), satellites of Circumstance (25.32%), nuclei of Joint (20.25%), satellites of Result (13.92%), and satellites of Elaboration (12.66%).
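These observed proportions suggest a simple heuristic: score each detected edu by how often its (relation, nuclearity) pair was deleted in our corpus. The sketch below is our illustration of that idea, not a method evaluated in this paper; the dictionary simply restates the percentages above as priors.

```python
# Restating the reported deletion frequencies as priors is our illustration only.
DELETION_PRIOR = {
    ("Cause", "satellite"):        0.2785,
    ("Circumstance", "satellite"): 0.2532,
    ("Joint", "nucleus"):          0.2025,
    ("Result", "satellite"):       0.1392,
    ("Elaboration", "satellite"):  0.1266,
}

def rank_candidates(edus: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    """Sort (relation, nuclearity, text) triples by deletion prior, descending."""
    return sorted(edus, key=lambda e: DELETION_PRIOR.get(e[:2], 0.0), reverse=True)

candidates = [("Elaboration", "satellite", "designing improvements ..."),
              ("Cause", "satellite", "since the interactions among drugs ...")]
print(rank_candidates(candidates)[0][0])  # 'Cause' is the strongest candidate
```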
4.2 Qualitative Analysis
After the quantitative analysis of the manually compressed corpus, we carried out a qualitative analysis to understand which elements tend to be removed during compression. Throughout this section, we use the term "discourse marker" in a general way; we do not follow stricter classifications such as the one described by [30]. With regard to DiSeg segments (Class 1), we detect several cases. It is interesting to note that it is usually possible to assign a rhetorical relation and nuclearity (that is, whether an edu is a nucleus or a satellite) to an edu only by analyzing the discourse marker that it contains. In other words, it is not necessary to see the complete sentence. Example 4, showing a satellite of Cause, is an instance of this situation⁴.
³ In this work, we use the traditional rhetorical relations of [23]: Result, Cause, Condition, etc.
⁴ The examples in Spanish are real passages from our corpus. Translations are provided for the reader's understanding.
Example 4. [ya que se reducirían las interacciones entre fármacos, sus efectos adversos, y favorecería el cumplimiento de unos tratamientos que cada vez incluyen más pastillas.]
Example 4. [since the interactions among drugs would be reduced, their adverse effects, and it would help the fulfillment of some treatments that include more and more pills.]
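A lookup of this kind, from marker to candidate (relation, nuclearity) pairs, can be sketched directly in code. The table below is illustrative only, restricted to markers discussed in this paper; it is not DiSeg's actual rule set.

```python
# An illustrative marker table; ambiguous markers map to several candidates.
MARKER_TABLE = {
    "ya que":   [("Cause", "satellite")],
    "cuando":   [("Circumstance", "satellite"), ("Condition", "satellite")],
    "aunque":   [("Concession", "satellite")],
    "es decir": [("Reformulation", "satellite")],
}

def relations_for(edu: str) -> list[tuple[str, str]]:
    """Candidate (relation, nuclearity) pairs based only on the marker found;
    several candidates mean the full sentence is needed to disambiguate."""
    for marker, candidates in MARKER_TABLE.items():
        if marker in edu.lower():
            return candidates
    return []  # no marker: the relation cannot be read off locally

print(relations_for("ya que se reducirían las interacciones entre fármacos"))
# [('Cause', 'satellite')]
print(relations_for("cuando su uso está relacionado con contenidos inapropiados"))
# [('Circumstance', 'satellite'), ('Condition', 'satellite')] -> ambiguous
```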
In some cases, the discourse marker is ambiguous and could indicate more than one rhetorical relation. In Example 5, the discourse marker cuando ("when") could indicate Circumstance or Condition. In these cases, it is necessary to read the complete sentence in order to understand the rhetorical meaning of the edu.

Example 5. [Sin embargo, el uso de Internet a edades cada vez más tempranas representa no solamente una herramienta educativa útil,] [sino también puede constituir grandes peligros] [cuando su uso está relacionado con contenidos inapropiados para su adecuado desarrollo.]
Example 5. [However, the use of Internet at increasingly earlier ages represents not only a useful educational tool] [but it can also constitute great dangers] [when its use is related to contents that are inappropriate for their adequate development.]
Sometimes the marker is a gerund form, which is ambiguous as well. Examples 6, 7 and 8 show edus marked with a gerund, indicating different rhetorical relations: Result (ex. 6), Elaboration (ex. 7) and Means (ex. 8).

Example 6. [limitándose a reducir el factor de comportamiento sísmico que controla las resistencias de diseño.]
Example 6. [being limited to reducing the factor of seismic behavior that controls the resistances of design.]
Example 7. [diseñando mejoras para el equipo eléctrico traído del otro lado del océano gracias a las ideas de Edison.]
Example 7. [designing improvements for electrical equipment brought from the other side of the ocean thanks to Edison's ideas.]
Example 8. [hablando acerca de la prevención necesaria.]
Example 8. [talking about the necessary prevention.]
Most of the eliminated edus have an explicit discourse marker, such as ya que ("since") in Example 4 or cuando ("when") in Example 5. However, a few edus contain no discourse marker. In these cases, it is more difficult to assign a rhetorical relation to them. Example 9 is an instance of this situation.

Example 9. [se incluyeron además corredores entre las plantas hechos con tepujal, un material que ayuda a conservar la humedad en la tierra]
Example 9. [moreover, corridors among the plants made with tepujal, a material that helps keep the humidity in the land, were included]
In the majority of cases, the eliminated edu corresponds to a satellite (exs. 4-8), but sometimes it corresponds to a nucleus (ex. 9). This means that a satellite may not always be eliminated without a corresponding loss of the message. Furthermore, sometimes the nucleus is not essential for understanding the text, contrary to what traditional work on rhetorical analysis suggests [24]. With regard to passages removed by human annotators that do not correspond to edus detected by DiSeg, we find two cases: (a) units with discourse meaning and (b) units with no discourse meaning. In the first case (a), we detected three regularities. The first regularity includes cases in which a removed passage starts with a participle form, as shown in Example 10.

Example 10. [valorado en 40.000 dólares.]
Example 10. [valued at 40,000 dollars.]
The second regularity includes cases in which a removed passage corresponds to a relative clause, as shown in Example 11.

Example 11. [que agrupaba los vídeos más vendidos.]
Example 11. [that brought together the most sold videos.]
The third regularity contains cases in which a removed passage has a discourse marker but does not include a verb, as shown in Example 12.

Example 12. [a causa de la malnutrición durante la ocupación alemana.]
Example 12. [because of the malnutrition during the German occupation.]
The DiSeg segmentation criteria do not detect the passages of case (a) as edus. In spite of this, many of these segments were removed. We consider that the detection of these units would be useful for automatic compression tasks; thus, an adaptation of DiSeg in order to detect them is advisable. In case (b), we observe that human annotators eliminate some short units, such as adverbs (después, "after"), adjectives (relevantes, "relevant"), prepositional phrases (con sus estudios, "with their studies"), nominal phrases (el honor, "the honour"), etc. A more exhaustive analysis of the corpus would be necessary in order to detect the regularities that allow these short units to be eliminated in compression tasks. A naive sketch of how the three case (a) regularities might be detected is given below.
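The following Python sketch is a rough, surface-level illustration of the three regularities above (participle-initial passages, relative clauses, and marker-without-verb passages). The regular expressions and the marker list are our assumptions; an actual adaptation of DiSeg would rely on POS tags and shallow parsing rather than these heuristics.

```python
# Surface heuristics for the three case (a) regularities; illustrative only.
import re

PARTICIPLE     = re.compile(r"^\w+[ai]d[oa]s?\b")         # "valorado", "situadas"
RELATIVE       = re.compile(r"^(que|quien|cuyo|cuya)\b")  # relative clauses
MARKER_NO_VERB = re.compile(r"^(a causa de|debido a|gracias a)\b")  # verb absence not checked

def looks_removable(passage: str) -> bool:
    """True if the passage matches one of the observed regularities."""
    p = passage.strip().lower()
    return any(rx.match(p) for rx in (PARTICIPLE, RELATIVE, MARKER_NO_VERB))

for ex in ["valorado en 40.000 dólares.",
           "que agrupaba los vídeos más vendidos.",
           "a causa de la malnutrición durante la ocupación alemana."]:
    print(looks_removable(ex))  # True for all three examples above
```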
5 Conclusions and Future Work
We have presented an analysis of the relationship between discourse segmentation and sentence compression for sentences in Spanish. We have found that discourse segmentation could help in identifying the segments to be eliminated in the sentence compression task, and that DiSeg is able to detect passages to be removed. Our short-term future work is to adapt the discourse segmenter DiSeg in order to detect all
of the passages containing discourse sense. To do this, we will develop linguistic rules concerning relative clauses, participle forms and passages with other discourse markers. Our final goal is to create a sentence compression system based on this new, adapted version of DiSeg. We propose that our future sentence compression approaches must consider two granularities. First, at a coarse-grained level, irrelevant discourse segments must be removed; we have shown that some satellites can be deleted without disrupting the sentence. Second, at a fine-grained level, some short elements must be deleted, as long as grammaticality is not affected. However, deletion approaches for short elements are more prone to grammar or sense problems. In order to tackle the fine-grained level in future work, it will be necessary to have information about human agreement on short-element deletions.

Acknowledgments. We thank Yanin Molko and Adriana Careaga for their help with the corpus annotation. This work was partially supported by CONACyT grant 211963 and by the projects RICOTERM (FFI2010-21365-C03-01) and APLE (FFI2009-12188-C05-01).
References

1. Witbrock, M.J., Mittal, V.O.: Ultra-Summarization: A Statistical Approach to Generating Highly Condensed Non-Extractive Summaries. In: Proceedings of the 22nd International Conference on Research and Development in Information Retrieval, pp. 315–316. University of California, Berkeley (1999)
2. Jing, H.: Sentence reduction for automatic text summarization. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 310–315. Association for Computational Linguistics (2000)
3. Knight, K., Marcu, D.: Statistics-based summarization - step one: Sentence compression. In: Proceedings of the National Conference on Artificial Intelligence, pp. 703–710. AAAI Press / MIT Press, Menlo Park, Cambridge (2000)
4. Lin, C.: Improving summarization performance by sentence compression - a pilot study. In: Proceedings of the 6th International Workshop on Information Retrieval with Asian Languages, pp. 1–8 (2003)
5. Grefenstette, G.: Producing intelligent telegraphic text reduction to provide an audio scanning service for the blind. In: Working Notes of the AAAI Spring Symposium on Intelligent Text Summarization, pp. 111–118 (1998)
6. Jing, H., McKeown, K.: The decomposition of human-written summary sentences. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 129–136. ACM (1999)
7. Knight, K., Marcu, D.: Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence 139, 91–107 (2002)
8. Galley, M., McKeown, K.: Lexicalized Markov grammars for sentence compression. In: Proceedings of NAACL/HLT, pp. 180–187 (2007)
9. Clarke, J., Lapata, M.: Modelling compression with discourse constraints. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 1–11 (2007)
10. Clarke, J., Lapata, M.: Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research 31, 399–429 (2008)
11. Pitler, E.: Methods for sentence compression. Technical Report MS-CIS-10-20, University of Pennsylvania (2010)
12. Steinberger, J., Jezek, K.: Sentence compression for the LSA-based summarizer. In: Proceedings of the 7th International Conference on Information Systems Implementation and Modelling, pp. 141–148 (2006)
13. Dumais, S.: Latent semantic analysis. Annual Review of Information Science and Technology 38, 188–230 (2004)
14. Sporleder, C., Lapata, M.: Discourse chunking and its application to sentence compression. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 257–264. Association for Computational Linguistics (2005)
15. Soricut, R., Marcu, D.: Sentence level discourse parsing using syntactic and lexical information. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, NAACL 2003, vol. 1, pp. 149–156. Association for Computational Linguistics, Stroudsburg (2003)
16. Waszak, T., Torres-Moreno, J.M.: Compression entropique de phrases contrôlée par un perceptron. In: Journées Internationales d'Analyse Statistique des Données Textuelles (JADT 2008), Lyon, pp. 1163–1173 (2008)
17. Fernández, S., Torres-Moreno, J.M.: Une approche exploratoire de compression automatique de phrases basée sur des critères thermodynamiques. In: Actes de la Conférence sur le Traitement Automatique du Langage Naturel (2009)
18. Molina, A., da Cunha, I., Torres-Moreno, J., Velázquez-Morales, P.: La compresión de frases: un recurso para la optimización de resumen automático de documentos. Linguamática 2, 13–27 (2011)
19. da Cunha, I., Molina, A., Torres-Moreno, J., Velázquez-Morales, P.: Optimización de resumen automático mediante compresión de frases. In: Proceedings of the XXVIII Congreso Internacional de la Asociación Española de Lingüística Aplicada, AESLA 2010 (2010)
20. Daelemans, W., Höthker, A., Sang, E.: Automatic sentence simplification for subtitling in Dutch and English. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, pp. 1045–1048 (2004)
21. Aluísio, S., Specia, L., Pardo, T., Maziero, E., Fortes, R.: Towards Brazilian Portuguese automatic text simplification systems. In: Proceedings of the Eighth ACM Symposium on Document Engineering, pp. 240–248. ACM (2008)
22. Tofiloski, M., Brooke, J., Taboada, M.: A syntactic and lexical-based discourse segmenter. In: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, ACLShort 2009, pp. 77–80. Association for Computational Linguistics, Stroudsburg (2009)
23. Mann, W.C., Thompson, S.A.: Rhetorical structure theory: Toward a functional theory of text organization. Text 8, 243–281 (1988)
24. Marcu, D.: The rhetorical parsing of unrestricted texts: a surface-based approach. Computational Linguistics 26, 395–448 (2000)
25. da Cunha, I.: Hacia un modelo lingüístico de resumen automático de artículos médicos en español. Tesis 23. Institut Universitari de Lingüística Aplicada, Barcelona, Spain (2008)
26. Maziero, E., Pardo, T., Nunes, M.: Identificação automática de segmentos discursivos: o uso do parser PALAVRAS. Série de Relatórios do Núcleo Interinstitucional de Lingüística Computacional. Universidade de São Paulo, São Carlos (2007)
27. da Cunha, I., SanJuan, E., Torres-Moreno, J.M., Lloberes, M., Castellón, I.: Discourse Segmentation for Spanish Based on Shallow Parsing. In: Sidorov, G., Hernández Aguirre, A., Reyes García, C.A. (eds.) MICAI 2010, Part I. LNCS, vol. 6437, pp. 13–23. Springer, Heidelberg (2010)
28. Afantenos, S.D., Denis, P., Muller, P., Danlos, L.: Learning recursive segments for discourse parsing. CoRR abs/1003.5372 (2010)
29. Atserias, J., Casas, B., Comelles, E., González, M., Padró, L., Padró, M.: FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pp. 48–55 (2006)
30. Portolés, J.: Marcadores del discurso. Letras, Editorial Ariel. Ariel (2001)