University of Passau Faculty of Computer Science and Mathematics Chair of Digital Libraries and Web Information Systems Prof. Dr. Siegfried Handschuh
Master’s Thesis
Text Simplification for Information Extraction Christina Niklaus March 30, 2016
First Supervisor: Prof. Dr. Siegfried Handschuh Second Supervisor: Prof. Dr. Michael Granitzer Advisor: André Freitas
Abstract

We present a text simplification system that takes a natural language text as input and uses a set of hand-crafted transformation rules to create a version of the source text that is intended to be easier to process for subsequently applied open information extraction tasks. The idea is to structurally simplify each sentence of the input by separating out syntactic constituents that commonly supply no more than some incidental piece of information, and converting them into simpler stand-alone context sentences. In that way, the original sentence is reduced to one (or more, if appropriate) core sentences which contain nothing but the key message of the source, while the extracted context sentences provide additional background information about it. Hence, in combination with their associated context sentences, the truncated core sentences are expected to jointly reflect the original meaning of the input sentence, yet in a structurally simplified form that is likely to facilitate a successive task of extracting factual information from the source text.

Based on a linguistic analysis of hundreds of sentences from the English Wikipedia, we have identified seven types of grammatical constituents which customarily convey mere supplementary information and may therefore be eliminated without losing essential information: relative clauses, appositives, participial phrases, adjective and adverb phrases, prepositional phrases, lead noun phrases, as well as intra-sentential attributions. Beyond that, a variety of conjoined clauses and sentences incorporating particular punctuation are decomposed. The rules for disembedding these constructs and transforming them into separate sentences are detailed in this thesis.

Moreover, we have built a domain-independent text simplification corpus based on Wikipedia for evaluating the performance of our framework. This dataset has been constructed by carefully selecting a set of high-quality Wikipedia articles which span a broad range of subjects. By manually simplifying each sentence in this set, i. e. breaking it down into one or more core sentences and a set of accompanying context sentences, a gold reference of more than 1,300 aligned complex source and simplified target sentences has been compiled. On the basis of this test collection, we have then evaluated the quality of the output produced by our text simplification system using both a number of selected automatic evaluation measures and a detailed manual analysis. This examination has revealed that our framework almost without exception outperforms a state-of-the-art simplification approach which has been used as a baseline system, thus demonstrating improvements over prior work.

Finally, aiming at allocating each context sentence that has been separated out of the input to a class which specifies the type of content it describes, a taxonomy of context classes has been devised. Drawing on this classification, our previously gathered gold standard has been expanded by manually assigning each context sentence to its corresponding class label, resulting in a corpus of nearly 3,000 annotated sentences. A supervised classifier has then been trained on this dataset, thus building a model for automatically tagging previously unseen contextual sentences with their respective category.
Contents

List of Tables ix
List of Figures xiii
Acronyms xvi

I. Introduction and Motivation 1
1. Basic Concept of Text Simplification 5
2. Proposed Approach 9
2.1. Main Idea 9
2.2. Tools and Resources 9
2.3. Outline 10
II. Related Work 11
3. Text Simplification 15
3.1. Basic Principles 15
3.1.1. Lexical Simplification 15
3.1.2. Syntactic Simplification 18
3.2. Previous Work 22
3.2.1. Hand-crafted Rules 23
3.2.1.1. Early Work 24
3.2.1.2. Contemporary Systems 26
3.2.2. Data-driven Approaches 32
3.2.2.1. Training Corpora 32
3.2.2.2. Contemporary Systems 34
3.2.3. Hybrid Systems 39
3.2.4. Text Simplification as a Preliminary Step in Assisting other NLP Applications 42
3.2.5. Text Simplification Systems for Languages Other than English 44
3.2.6. Analysis of Strong and Weak Points of Hand-crafted and Data-driven Approaches 47
3.3. Evaluating Text Simplification Systems 48
3.3.1. Simplification Data Resources 48
3.3.2. Evaluation Methods 50
4. Text Summarization 53
4.1. Statistics-based Sentence Compression 54
4.2. Rule-based Sentence Compression 55
III. Framework for Syntax-driven Rule-based Sentence Simplification 61
5. Workflow 65
6. Preprocessing Modules: Segmentation and Analysis 67
7. Key Module: Transformation 69
7.1. Three-Stage Approach 70
7.2. Simplification Rules 70
7.2.1. Relative Clauses 70
7.2.2. Appositive Phrases 74
7.2.3. Participial Phrases 78
7.2.4. Adjective and Adverb Phrases 81
7.2.5. Prepositional Phrases 84
7.2.6. Lead Noun Phrases 91
7.2.7. Intra-Sentential Attributions 92
7.2.8. Conjoined Clauses 95
7.2.9. Punctuation 99

IV. Evaluation 103
8. Experimental Setup 107
8.1. Measures for Capturing the System's Performance 107
8.1.1. Automatic Evaluation Metrics 108
8.1.2. Manual Analysis 109
8.2. Evaluation Dataset 111
8.3. Baseline 111
9. Evaluation Results and Discussion 115
9.1. Results of the Automatic Evaluation 115
9.1.1. Shallow Features 115
9.1.2. Compression Quality 119
9.1.3. Closeness of the System's Output to the Reference Corpus 120
9.2. Results of the Manual Evaluation 120
9.2.1. Classification of the Output 120
9.2.2. Elimination of Sentences Providing a Particular Syntactic Structure from the Result Set 136
9.2.3. Informal Comparison to the Baseline 140
V. Context Classification 145
10. Taxonomy of Context Classes 149
10.1. Rhetorical Structure Theory as Basis 149
10.2. Proposed Taxonomy 152
10.2.1. Context Classes 155
10.2.1.1. Type 'Scope' 155
10.2.1.2. Type 'Motivation' 155
10.2.1.3. Type 'Result' 156
10.2.1.4. Type 'Mode' 156
10.2.1.5. Type 'Statement' 156
10.2.1.6. Type 'Paraphrasing' 156
10.2.1.7. Type 'Antithesis' 157
10.2.1.8. Type 'Source/Target' 157
10.2.1.9. Type 'Cohesion' 157
10.2.1.10. Discarded Rhetorical Structure Theory Relation Definitions 158
10.2.2. Application to our Wikipedia-based Test Set 158
11. Training a Classifier for Automatically Annotating Context Sentences 163

VI. Conclusion 171
12. Contributions 175
13. Summary of the Results and Scope for Improvement 179
14. Future Work 183

Bibliography 192
List of Tables

3.1. Research on text simplification as a preprocessing tool for assisting subsequently applied NLP applications 44
3.2. Research on syntactic simplification dealing with languages other than English 46
3.3. Strong and weak points of the two main text simplification approaches 48
4.1. Trimmer algorithm - stage 1 56
4.2. Trimmer algorithm - stage 2 56
4.3. Trimmer algorithm - stage 3 57
4.4. Syntax-based pruning heuristics applied in [27] 60
8.1. Classification guidelines 110
9.1. Number of sentences processed 116
9.2. Number of output sentences per input 116
9.3. Number of unchanged input sentences 116
9.4. Input word coverage (without punctuation) 117
9.5. Average sentence length (without punctuation) 117
9.6. Compression ratio 119
9.7. Precision, recall and F1 score obtained by our simplification framework 119
9.8. Precision, recall and F1 score obtained by the baseline system 120
9.9. Bleu scores (using a maximum n-gram order of 4) 120
9.10. "Baseball" (∑ 420 sentences) 121
9.11. "Google" (∑ 377 sentences) 121
9.12. "Mandela" (∑ 501 sentences) 121
10.1. Nucleus-satellite relations [57] 152
10.2. Multinuclear relations [57] 152
10.3. Distribution of context classes 162
11.1. Average scores achieved by the classifier 170
List of Figures

0.1. Text simplification as a preprocessing step for IE 3
3.1. Lexical simplification pipeline [69] 17
3.2. Example syntactic simplification operations 18
3.3. Syntactic simplification pipeline 20
3.4. Simplification of a relative clause [71] 24
3.5. The structure matched by the pattern (S (?a) (S (?b) (S (?c)) ) ) 25
3.6. Simplification of coordinated clauses [71] 26
3.7. Architecture of Siddharthan's text simplification system [70] 26
3.8. Example for a sentence split resulting in a broken pronominal link 27
3.10. Examples of sentence simplifications [71] 28
3.9. Simplification rules [71] 29
3.11. Regeneration issues and text cohesion [71] 30
3.12. Typed dependency representation of the sentence "The cat was chased by the dog." 31
3.13. Source dependency tree [73] 31
3.14. Target dependency tree [73] 31
3.15. Examples of aligned sentence pairs extracted from EW and SEW [21, 22] 33
3.16. Examples of source and target parse trees [20] 37
3.17. Examples of translation rules [20] 37
3.18. Grammar rules extracted from the sentence pair in figure 3.16. Each rule rewrites a pair of non-terminals into a pair of subtrees, shown in bracketed notation [20]. 38
3.19. Successive simplification steps [61] 41
3.20. Example sentence pairs from PWKP [87] 49
3.21. Example of sentences written at multiple levels of text complexity from the Newsela dataset [87] 50
4.1. Text simplification versus text summarization 54
5.1. Workflow of our text simplification framework 66
6.1. Representations generated by the analysis module on a sample sentence 67
7.1. Example simplification rules for non-restrictive relative clauses 72
7.2. Example of the simplification of a non-restrictive relative clause ('which') 73
7.3. Example of the simplification of a non-restrictive relative clause ('where') 74
7.4. Example of the simplification of a non-restrictive appositive phrase 75
7.5. Example of the simplification of a non-restrictive conjoined appositive phrase 76
7.6. Example of the simplification of a restrictive appositive phrase 77
7.7. Example of the simplification of a past participle in postmodification 78
7.8. Example of the simplification of a past participle in premodification 79
7.9. Example of the simplification of a present participle in postmodification 80
7.10. Simplification rule for adjective phrases 81
7.11. Example of the simplification of an adjective phrase 81
7.12. Simplification rule for adverb phrases 82
7.13. Example of the simplification of an adverb phrase 83
7.14. Example of the simplification of an adverb phrase 83
7.15. Most important simplification rules for offset prepositional phrases 85
7.16. Example of the simplification of offset prepositional phrases 85
7.17. Example of the simplification of prepositional phrases without segregation through punctuation 87
7.18. Example of the simplification of prepositional phrases without segregation through punctuation resulting in malformed core sentences 88
7.19. Example of the simplification of prepositional phrases acting as complements of verbs or adjectives 89
7.20. Example of the simplification of a prepositional phrase containing the preposition "to" 90
7.21. Simplification rule for lead noun phrases 91
7.22. Example of the simplification of a lead noun phrase 92
7.23. Example simplification rule for intra-sentential attributions 93
7.24. Example of the simplification of an intra-sentential attribution 93
7.25. Example of the simplification of an intra-sentential attribution with a premodifying PP 94
7.26. Example of the simplification of a subordinated clause with both prefix and infix conjunction 96
7.27. Example of the simplification of a subordinated clause with infix conjunction 97
7.28. Example of the simplification of a coordinated clause 98
7.29. Example of the simplification of sentences conjoined by a colon 100
7.30. Example of the simplification of sentences conjoined by a semicolon 100
8.1. Example output sentences produced by the baseline algorithm 113
9.1. Example output sentences presenting a reduced word coverage 118
9.2. Examples of perfectly resolved sentences incorporating only a single constituent to extract 122
9.3. Examples of perfectly resolved sentences incorporating a simple combination of two components to extract 123
9.4. Examples of perfectly resolved sentences incorporating a more complex structure 125
9.5. Examples of negative outcome 128
9.6. Syntax parse tree of a (semi)colon separating two full sentences 136
9.7. Examples of sentence structures that are eliminated 137
9.8. Examples of long sentences that are perfectly simplified 139
9.9. Examples of sentences that have been simplified by the baseline system 140
9.10. Concise output produced by the baseline system 142
9.11. Incoherent output produced by the baseline system 143
10.1. 'Concession' relation 150
10.2. 'Contrast' relation 150
10.3. Examples of a nucleus-satellite and a multinuclear relation [79] 150
10.4. A rhetorical structure tree [57] 150
10.5. Taxonomy of context classes 154
10.6. Examples of annotated context sentences 158
11.1. Evaluation of neural networks [16] 163
11.2. Evaluation results when using the subclass labels 164
11.3. Evaluation results when using the superclass labels 167
13.1. Example of dispensable prepositional phrases (PPs) that are mistakenly kept 180
13.2. Example of an erroneous extraction of a mandatory PP 180
Acronyms

ADJP adjective phrase
ADVP adverb phrase
API application programming interface
CD cardinal number
DRS Discourse Representation Structure
DT determiner
EAS Easy Access Sentence
EW English Wikipedia
FKGL Flesch-Kincaid Grade Level
FSG finite state grammar
GR grammatical relation
HMM Hidden Markov Model
IE Information Extraction
ILP integer linear programming
IN preposition or subordinating conjunction
IR Information Retrieval
NE named entity
NER Named Entity Recognition
NL natural language
NLP natural language processing
NNP proper noun, singular
NNPS proper noun, plural
NP noun phrase
PBMT Phrase Based Machine Translation
POS part-of-speech
PP prepositional phrase
PRP personal pronoun
PSET Practical Simplification of English Text
QA Question Answering
QG Question Generation
RB adverb
RE Relation Extraction
RST Rhetorical Structure Theory
S a simple declarative clause, i. e. one that is not introduced by a (possibly empty) subordinating conjunction or a wh-word and that does not exhibit subject-verb inversion
SBAR a clause introduced by a (possibly empty) subordinating conjunction
SEW Simple English Wikipedia
SMT Statistical Machine Translation
SRL Semantic Role Labeling
STSG synchronous tree substitution grammar
TAG tree adjoining grammar
VB verb, base form
VBG verb, gerund or present participle
VBN verb, past participle
VP verb phrase
Part I. Introduction and Motivation
"The ability to simplify means to eliminate the unnecessary so that the necessary may speak." Hans Hofmann (1880 - 1966)
To date, Information Extraction (IE) systems customarily operate directly on the original natural language (NL) text - which is usually linguistically rather complex - when trying to turn the unstructured information embedded in it into structured data [45]. The goal of the work presented in this thesis is to insert a simplification step in between (see figure 0.1) in which the input text is syntactically modified, resulting in more concise sentences that are supposed to be easier to process for subsequently applied IE systems, thus improving their performance. In other words: text simplification shall be used as a preprocessing step that enhances the quality of the output returned by an IE framework.
Figure 0.1.: Text simplification as a preprocessing step for IE

We will first point out the basic concept of text simplification in chapter 1 and then outline the main idea of our simplification approach in chapter 2.
1. Basic Concept of Text Simplification

Newsela (https://newsela.com/), an education technology startup, has set itself the goal of unlocking the written word by publishing daily news articles at five reading levels, engaging students in grades three to twelve with high-interest topics ranging from immigration and diplomacy to drones and animal extinction. For this purpose, each newspaper report is reformulated four times by professional editors, yielding a total of five different levels of linguistic complexity, each of which meets the readability requirements for children at a specific grade level. An example excerpted from this corpus is presented below.

original text (from the 'Los Angeles Times'):
Mavis Batey, renowned British code breaker, dies at 92
Batey, a college student studying German linguistics, became one of Bletchley Park's nimblest decoders. She decrypted a message that led to a stunning British victory over the Italian navy in the Mediterranean. She also was the first to crack the secret messages of the Abwehr, the German intelligence service, a breakthrough that helped ensure the success of the D-day landings. ... One day she was examining a message which was sent on the Italian navy's Enigma and noticed there was not one L in it. Knowing that the machine never encoded a letter as itself, she suspected that she was looking at a test message a lazy operator had made by repeatedly pressing the L key. ... She was a model for the code breaker played by actress Kate Winslet in "Enigma," the 2001 movie about Bletchley Park.
manually simplified version (addressing fourth graders):
Mavis Batey, who helped crack codes during World War II, dies at 92
Batey became one of Bletchley Park's best decoders. She was a college student. She broke the code on a message that led to a surprise British victory over the Italian navy. She also was the first to crack the secret messages of the Abwehr. The Abwehr was the German spy group. That breakthrough helped the allies successfully invade France on the beaches of Normandy. The invasion was called D-Day. It was one of the most important moments in the war. ... One day Batey was studying a message. It was sent on the Italian navy's Enigma. She noticed there was not one L in it. Enigma never encoded a letter as itself. So an L in a message was not an L when it was first written. She figured out that it was a test message in which the sender was lazy and just typed L over and over. ... In 2001 Kate Winslet played a character based on Batey in the movie "Enigma".

The former passage has been taken from an original newspaper article, whereas the latter depicts its corresponding simplified version, which has been created by hand by Newsela staff. This example of a parallel complex source and simple target text illustrates a variety of text simplification techniques, encompassing e. g. the decomposition of coordinated and subordinated clauses, the disembedding of relative clauses or appositive phrases, the conversion from passive to active voice, the avoidance of pre-posed adjuncts, the clarification of discourse relations and the substitution of difficult terms with simpler synonyms. Some examples of such operations, drawn from the previous excerpt, are listed hereafter.

• syntactic simplification:
– extraction of appositions:
Batey, a college student studying German linguistics, became one of Bletchley Park's nimblest decoders.
Batey became one of Bletchley Park's best decoders. She was a college student.
She also was the first to crack the secret messages of the Abwehr, the German intelligence service.
She also was the first to crack the secret messages of the Abwehr. The Abwehr was the German spy group.
– disembedding of a relative clause:
One day she was examining a message which was sent on the Italian navy's Enigma.
One day Batey was studying a message. It was sent on the Italian navy's Enigma.
– decomposition of coordinated clauses:
One day she was examining a message and noticed there was not one L in it.
One day Batey was studying a message. She noticed there was not one L in it.
– separation of a subordinated clause:
Knowing that the machine never encoded a letter as itself, she suspected that she was looking at a test message a lazy operator had made by repeatedly pressing the L key.
Enigma never encoded a letter as itself. She figured out that it was a test message in which the sender was lazy and just typed L over and over.
– conversion from passive to active voice:
She was a model for the code breaker played by actress Kate Winslet.
Kate Winslet played a character based on Batey.
• lexical simplification:
– term replacements:
∗ nimblest → best
∗ decrypt → break the code
∗ stunning → surprise
∗ intelligence service → spy group
∗ operator → sender
– explanation generations:
. . . the success of the D-day landings.
. . . successfully invade France on the beaches of Normandy. The invasion was called D-Day. It was one of the most important moments in the war.
The machine never encoded a letter as itself.
Enigma never encoded a letter as itself. So an L in a message was not an L when it was first written.

The syntactic and lexical modifications displayed above are supposed to make the original news report accessible to a much broader audience of readers. However, while Newsela makes use of a group of professional editors for manually re-writing texts, we aim at automating this simplification process by applying a set of hand-crafted transformation rules, as will be detailed in the next chapter.
2. Proposed Approach In the following sections, we point out the basic concept of our text simplification approach, as well as some tools and resources which we make use of extensively within our framework. Finally, the outline of the rest of this work is presented.
2.1. Main Idea This work focuses on the development and analysis of a framework for syntax-driven rule-based sentence simplification. The approach simplifies NL text by identifying components of a sentence which usually provide mere supplementary information that may be easily extracted without losing essential information. By applying a set of hand-crafted grammar rules, these constituents are disembedded and transformed into self-contained simpler context sentences. The aim is to split a sentence into one or more core sentences comprising those parts of the original sentence that convey central information, and zero or more context sentences consisting of phrases that provide only secondary information. In this manner, the subsequent task of extracting factual information contained in NL text shall be facilitated. In other words: text simplification shall be used as a preprocessing step in order to improve the performance of IE tasks. Since we focus on enhancing the results of a successive IE process rather than ameliorating the readability for human readers, our simplification system is restricted to syntactic simplification, encoded by a set of hand-crafted transformation rules, which have been defined in the course of a rule engineering process based on linguistic characteristics. The examination of whether a preceding syntactic simplification of linguistically complex sentences indeed eases the problem of accessing factual information from NL text is subject to future work.
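To make the intended output format concrete, the following sketch models the result of simplifying a single input sentence as a core/context split. This is purely illustrative; the class and field names are our own and do not stem from the thesis implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SimplificationResult:
    """Illustrative container for the output of simplifying one sentence:
    core sentences carry the key message, context sentences the
    supplementary information that was separated out."""
    source: str
    core_sentences: List[str] = field(default_factory=list)
    context_sentences: List[str] = field(default_factory=list)

# Hypothetical example mirroring the core/context split described above.
result = SimplificationResult(
    source="John, who was the CEO of a company, played golf.",
    core_sentences=["John played golf."],
    context_sentences=["John was the CEO of a company."],
)
print(result.core_sentences, result.context_sentences)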
2.2. Tools and Resources

The text simplification framework takes as input an NL text that is preprocessed using various software packages provided by the Stanford NLP Group in order to convert it into a version upon which the transformation rules for simplifying the sentences can then be applied. This includes the following tools:
• Stanford Parser [49] (http://nlp.stanford.edu/software/lex-parser.html)
• Stanford POS Tagger [80] (http://nlp.stanford.edu/software/tagger.html)
• Stanford Named Entity Recognizer [32] (http://nlp.stanford.edu/software/CRF-NER.html)

Beyond that, Stanford Phrasal [36] (http://nlp.stanford.edu/phrasal/), a statistical phrase-based machine translation system, is used in the evaluation of the framework's performance. A minimal usage sketch of the preprocessing tools is given below.
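As an illustration of this preprocessing step, the sketch below drives the three Stanford tools from Python via NLTK's CoreNLP wrapper. It assumes a Stanford CoreNLP server is running locally on port 9000; the framework itself uses the Java packages directly, so this is merely one convenient way to obtain comparable annotations, not the actual setup.

# Sketch of the preprocessing step, assuming a Stanford CoreNLP server is
# running locally (e.g. started with
# `java -mx4g edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000`).
from nltk.parse.corenlp import CoreNLPParser

parser = CoreNLPParser(url='http://localhost:9000')                      # constituency parses
pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')   # POS tags
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')   # named entities

sentence = "Obama's mother, Stanley Ann Dunham, was born in Kansas."
tokens = list(parser.tokenize(sentence))

tree = next(parser.raw_parse(sentence))  # nltk.Tree; input to the transformation rules
print(pos_tagger.tag(tokens))
print(ner_tagger.tag(tokens))
tree.pretty_print()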
2.3. Outline

The rest of this thesis is organized as follows: part II summarizes related work by elucidating previous attempts at both text simplification and text summarization; part III then provides an overview of the functionality of our simplification framework and details the rules for separating out negligible components into stand-alone contextual sentences; part IV describes the methodology used to build a domain-independent text simplification corpus on the basis of Wikipedia and provides an experimental analysis of the simplification framework using this dataset; part V outlines the classification of the detached contextual sentences into corresponding classes; and finally, part VI provides a conclusion and describes future work.
Part II. Related Work
The following chapter summarizes previous work on automatic text simplification. In addition, chapter 4 reviews selected literature on automatic text summarization, a closely related text-to-text rewriting task.
3. Text Simplification

Text simplification is defined as the process of reducing the linguistic complexity of NL text by utilizing a more readily accessible vocabulary and sentence structure, while still preserving the original information and meaning contained in it [74]. The main goal of simplification is to improve the readability of a text, i. e. to enhance the ease with which it can be understood [86], thus making information easier to comprehend for people with reduced literacy - such as children, non-native speakers or readers suffering from conditions which result in language deficits, like dyslexia, aphasia or deafness - and hence available to a broader audience [85]. In this regard, substituting difficult words, splitting long sentences, making discourse relations explicit, avoiding pre-posed adverbial clauses and presenting information in cause-effect order, among others, have been shown to effectively improve reading comprehension for language-impaired humans, as sentences become easier to process [74]. Thus, text simplification may be deployed as an assistive technology for specific target reader populations.

But human readers are not the only ones who may benefit from text simplification. In the form of a preprocessing tool in natural language processing (NLP) pipelines, it might also be of use to programs operating on NL data [91], e. g. machine translation or Information Retrieval (IR) tools, since linguistically complex sentences have been identified as a stumbling block for such systems [17]. It has previously been shown that applications like text summarization [51, 88], sentence fusion [31], Semantic Role Labeling (SRL) [82], Question Generation (QG) [41], biomedical IE [43] and Relation Extraction (RE) [60] can profit from text simplification (cf. section 3.2.4).
3.1. Basic Principles

Commonly, text simplification is achieved via sentence simplification (e. g. [86, 85, 91, 21, 75]), i. e. the simplification system handles only one sentence at a time, disregarding interactions across neighbouring sentences [17]. Given a sentence, the goal is to produce a simplified version of that sentence with simpler vocabulary and sentence structure while preserving the main ideas. At sentence level, reading difficulty can thus be traced back to the use of either a difficult vocabulary or a complex syntactic structure. Therefore, sentence simplification is classified into two categories: lexical simplification and syntactic simplification [91].
3.1.1. Lexical Simplification

Lexical simplification is the task of identifying and mitigating the vocabulary complexity of a text. This process may be performed by either term replacement or explanation generation [74]. While the former is carried out by substituting a word or phrase that is supposed to be difficult to understand with a more comprehensible synonym [89, 10, 6, 15, 24], the latter generates and inserts an explanatory text for such terms (e. g. a dictionary definition or a specification of hierarchical relations with regard to more common concepts) [29, 89], thus augmenting them with additional information providing some form of context. Hereby, the understandability of unfamiliar terms for an average reader is improved [69].

Typically, both lexical substitution and the expansion of difficult words are executed in a four-stage pipeline [69]. First of all, it has to be determined which of the terms in the input text are commonly referred to as "complex". So far, various strategies for predicting such unfamiliar terminology have been used in the literature. Following findings in psycholinguistic research, many approaches use a frequency-based score, building on the idea that the more common a word is in a large corpus (e. g. the Brown corpus [53] or the Reuters Health E-line newsfeed, http://www.reutershealth.com) the higher the probability that it is recognised by the reader [29, 6]. Whenever a term's score is below a pre-defined threshold, the term is considered unfamiliar and therefore requires simplification [29, 10, 89]. Other work in this field includes context information by calculating a score based on term co-occurrence patterns in a sentence, document, query session and the like [89, 10]. Zeng-Treitler et al. [90], for example, describe a contextual network algorithm which is based on the hypothesis that familiar terms are likely to be found in the context of other familiar terms. Moreover, simple word-based scores such as the number of syllables or the length of a word are sometimes incorporated as additional indicators for deciding whether or not a term is "difficult" [10, 6].

Once a word or phrase has been identified as complex, a list of potential substitutions or appropriate explanatory definitions, respectively, has to be retrieved. At its simplest, this can be done using Google's "define:" functionality [29], a dictionary with synonyms [6] or WordNet [6, 24, 15], a lexical database of English containing sets of synonyms (called "synsets") and the relations between them [59]. Besides, Belder et al. [6] propose a Latent Words Language Model that learns for every word a probabilistic set of synonyms and related terms (i. e. the latent words) and then estimates, based on this model, the synonyms that are relevant for the given word in the particular context. Zeng-Treitler et al. [89], in contrast, use both hierarchical (e. g. "is an example of" or "is a type of") and non-hierarchical (e. g. "is part of") relations from a corpus of medical terms, concepts, semantic types and relations to find a more comprehensible related term and generate a phrase which describes the connection between the original and the associated word.

A term may have multiple meanings and thus different relevant synonyms and definitions. In order to limit the candidate substitutions to those which will preserve the meaning of the sentence, the synonyms or explanations that have been determined may be refined by deleting those which do not make sense in the given context, using a form of word sense disambiguation [69] which is commonly carried out on the basis of context vector similarity measures [6, 10].
Finally, the remaining substitutions must be ranked according to their simplicity, e. g. based on the Kučera-Francis frequency [53] as obtained from the Oxford Psycholinguistic Database [24, 15], the length of the explanatory statement [29] or the context similarity between the input sentence on the one hand and the original word and its designated substitute on the other [10]. The simplest synonym or definition, respectively, is then used as a replacement for or explanation of the original term. This procedure is depicted in the example below (see figure 3.1).
Figure 3.1.: Lexical simplification pipeline [69]

(3.1.1) Input - input sentence:
The cat perched on the mat.

(3.1.2) Phase 1 - identification of complex terms:
input: The cat perched on the mat.
output: The cat perched on the mat. (with "perched" identified as complex)

(3.1.3) Phase 2 - synonym or explanation generation:
input: The cat perched on the mat.
output: perched → rested, sat, roosted, settled

(3.1.4) Phase 3 - word sense disambiguation:
input: perched → rested, sat, roosted, settled
output: perched → rested, sat, settled (discarding "roosted")

(3.1.5) Phase 4 - substitution ranking:
input: perched → rested, sat, settled
output: 1. sat 2. rested 3. settled

(3.1.6) Output - output sentence:
The cat sat on the mat.
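The four phases can be sketched in code as follows, using Brown corpus frequencies for identifying complex terms and WordNet for candidate generation. Word sense disambiguation is crudely approximated here by filtering candidates to the target part of speech, and the frequency threshold is an arbitrary illustration value; neither stands in for the context-vector methods cited above.

# Minimal sketch of the four-stage lexical simplification pipeline,
# assuming NLTK with the Brown corpus and WordNet installed
# (nltk.download('brown'), nltk.download('wordnet')).
from collections import Counter
from nltk.corpus import brown, wordnet as wn

FREQ = Counter(w.lower() for w in brown.words())
THRESHOLD = 10  # words rarer than this count as "complex" (illustrative value)

def complex_words(tokens):
    """Phase 1: frequency-based identification of complex terms."""
    return [t for t in tokens if FREQ[t.lower()] < THRESHOLD]

def candidates(word, pos=wn.VERB):
    """Phases 2 + 3: WordNet synonyms, filtered by part of speech
    as a crude stand-in for word sense disambiguation."""
    lemmas = {l.name().replace('_', ' ')
              for s in wn.synsets(word, pos=pos) for l in s.lemmas()}
    lemmas.discard(word)
    return lemmas

def rank(subs):
    """Phase 4: rank substitutions by corpus frequency (most common first)."""
    return sorted(subs, key=lambda w: FREQ[w.lower()], reverse=True)

tokens = "The cat perched on the mat".split()
for w in complex_words(tokens):
    ranked = rank(candidates(w))
    if ranked:
        print(w, '->', ranked[0])   # e.g. perched -> sat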
3.1.2. Syntactic Simplification

Syntactic simplification, on the contrary, ignores the lexical level of language entirely. Instead, it focuses on identifying grammatical complexities in a sentence and rewriting these structures into simpler ones [69]. Hence, syntactic simplification operations include but are not limited to splitting long sentences into their component clauses, resolving anaphora, making discourse relations explicit and transforming sentences which use the passive voice into their active counterparts. Corresponding examples are provided below in figure 3.2.
Figure 3.2.: Example syntactic simplification operations

(3.2.1) Example 1 - splitting long sentences:
Obama lived with his mother and sister in Hawaii for three years while his mother was a graduate student in anthropology at the University of Hawaii.
• Obama lived with his mother and sister in Hawaii for three years.
• This was while his mother was a graduate student in anthropology at the University of Hawaii.
(3.2.2) Example 2 - resolving anaphoric links:
Bill Clinton called for his supporters to endorse Obama, and he and Hillary Clinton gave convention speeches in his support.
Bill Clinton called for his supporters to endorse Obama, and Bill and Hillary Clinton gave convention speeches in Obama's support.

(3.2.3) Example 3 - making discourse relations explicit:
This was a historic moment, being the first time that a president mentioned gay rights or the word "gay" in an inaugural address.
This was a historic moment because it was the first time that a president mentioned gay rights or the word "gay" in an inaugural address.

(3.2.4) Example 4 - conversion from passive to active voice:
Obama and Joe Biden were formally nominated by former President Bill Clinton.
Former President Bill Clinton formally nominated Obama and Joe Biden.
As the foregoing examples demonstrate, syntactic simplification can be implemented by a set of text-to-text rewriting operations incorporating deletion, insertion, reordering and sentence splitting. The deletion operation removes parts of a sentence that contain only peripheral information in order to make it more succinct (cf. section 4), whereas insertion introduces supplementary information, usually in the form of function words that clarify the interrelation between phrases or clauses (see example 3.2.2 and example 3.2.3). The splitting operation, in addition, divides a long sentence into several shorter ones in order to decrease its complexity (see example 3.2.1). Finally, the reordering operation interchanges the order of the split sentences or of parts within a sentence in order to preserve text cohesion and discourse structure (see example 3.2.4) [85, 91].
Studies have shown that in this way, reading comprehension can be improved for readers with poor literacy. Particularly people with cognitive impairments arising from aphasia or deafness have trouble comprehending syntactically complex sentences [74]. For example, if confronted with passive constructions like the following: "The boy is pushed by the girl.", aphasics run into difficulties because they randomly assign the thematic roles of agent, theme and goal to the constituents present in the sentence. Accordingly, the aforementioned statement might be interpreted as "The boy pushed the girl." by someone with aphasia [37], resulting in a complete loss of meaning. Thus, by simplifying specific linguistic constructs like the one described above, systems that perform syntactic simplification may assist reader populations with poor reading ability.

Commonly, syntactic simplification is carried out in three consecutive steps [69], as illustrated in figure 3.3. First, the text is analyzed in order to identify its structure using flat representations (e. g. chunks or part-of-speech (POS) tags) [17, 70], as well as hierarchical ones such as constituency- [14, 44, 7] or dependency-based parse trees [11, 72, 73]. In the transformation stage, these representations are then modified according to a set of rewriting rules which specify the desired syntactic simplification operations (like sentence splitting, rearrangement or deletion of clauses). There are two notably different approaches to defining these transformation rules: many simplification systems use hand-crafted rewrite rules, while there are also techniques for automatically inducing them (for more details, see section 3.2.1 and section 3.2.2). Finally, a regeneration phase is executed, during which further adaptations may be made to the already remodelled text in order to improve its cohesion, relevance and readability. A toy implementation of these stages is sketched after the figure.

Figure 3.3.: Syntactic simplification pipeline
(3.3.1) Phase 1, analysis stage: the input "Obama's mother, Stanley Ann Dunham, was born in Kansas." is parsed into a structural representation in which the apposition "Stanley Ann Dunham" is attached to "Obama's mother", which in turn depends on "was born in Kansas".
(3.3.2) Phase 2, transformation stage: the apposition is detached and linked to its head by a copula, yielding two independent structures: "Obama's mother" - "was born in Kansas" and "Stanley Ann Dunham" - "is Obama's mother".
(3.3.3) Phase 3, regeneration stage: the two structures are realized as separate sentences:
• Obama's mother was born in Kansas.
• Stanley Ann Dunham is Obama's mother.
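To make the transformation stage tangible, the following toy rule reproduces the figure's appositive example on a Penn-style constituency tree: it detaches an NP of the form "NP , NP ," and regenerates the apposition as a copula sentence. This is our own illustration; the actual rule set of our framework is detailed in chapter 7.

# Toy version of the transformation stage from figure 3.3: detach an
# appositive NP (an NP directly followed by ", NP ,") and regenerate it
# as a copula sentence.
from nltk.tree import Tree

def split_apposition(tree):
    """Look for the pattern (NP (NP head) (, ,) (NP appositive) (, ,))
    and return (core sentence, context sentence) as strings."""
    for sub in tree.subtrees(lambda t: t.label() == 'NP' and len(t) >= 4):
        if [c.label() for c in sub[:4]] == ['NP', ',', 'NP', ',']:
            head, appos = sub[0].leaves(), sub[2].leaves()
            context = head + ['is'] + appos + ['.']
            # core sentence: original leaves minus ", appositive ,"
            removed = [',', *appos, ',']
            core, leaves, i = [], tree.leaves(), 0
            while i < len(leaves):
                if leaves[i:i + len(removed)] == removed:
                    i += len(removed)
                else:
                    core.append(leaves[i]); i += 1
            return ' '.join(core), ' '.join(context)
    return None

parse = Tree.fromstring(
    "(S (NP (NP (NP (NNP Obama) (POS 's)) (NN mother)) (, ,)"
    " (NP (NNP Stanley) (NNP Ann) (NNP Dunham)) (, ,))"
    " (VP (VBD was) (VP (VBN born) (PP (IN in) (NP (NNP Kansas))))) (. .))")
print(split_apposition(parse))
# ("Obama 's mother was born in Kansas .", "Obama 's mother is Stanley Ann Dunham .")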
3.2. Previous Work

The first attempts towards automatic simplification of text were made in the area of controlled language generation, primarily due to the interest of industries in creating better user manuals, i. e. ones that are less ambiguous and therefore easier to translate by machine translation software [74]. An example is IBM's EasyEnglish authoring tool [8], which is deployed as a preprocessing step for machine-translating IBM manuals. It aims at detecting ambiguity, complexity and ungrammatical structures, making suggestions for rephrasing the affected parts of a sentence. That way, the translatability of the underlying document is improved [74]. For example, given the following sentence: "Different system users may operate on different objects using the same application program.", EasyEnglish indicates that there is an ambiguous attachment of the verb phrase (VP) "using the same application program" and suggests two unambiguous alternatives for the author to choose from [8]:
1. Different system users may operate on different objects by using the same application program.
2. Different system users may operate on different objects that use the same application program.
Besides, in the early eighties, Boeing developed a grammar and style checker for assisting authors in the process of writing aircraft manuals [69]. It is based on a set of lexical and syntactic language rules restricting grammar, style and vocabulary
to a subset of the English language, thus standardising writing. The objective of this so-called Simplified Technical English (http://www.asd-ste100.org) is clear, unambiguous writing that improves readability for users of the maintenance documentation and thereby reduces the potential for misunderstanding. However, such systems do not revise or generate text automatically. Hence, they are considered controlled language generation aids rather than text simplification systems [74]. Nevertheless, building on their basic principles, these tools are regarded as the point of origin for research on the topic of text simplification in computational linguistics [74], which started in the second half of the nineties.

To date, there are two markedly different approaches to text simplification: those that use a set of hand-crafted transformation rules, and those that are based upon machine learning techniques. The first group of work (see section 3.2.1) applies a number of hand-written grammar rules to perform syntactic simplification operations, e. g. splitting conjoined clauses or separating out appositive phrases and relative clauses. In doing so, early text simplification systems employ shallow processing techniques (chunking and POS tagging) [17, 70, 71], while more recent approaches draw on hierarchical text representations, using constituency-based parses [44, 7, 15] or dependency parses [11, 72, 73]. The second group (see section 3.2.2), on the contrary, comprises data-driven approaches, often referred to as "monolingual translation" [75], since sentence simplification can be viewed as an English-to-English translation problem. Here, the model learns simplification rewrites automatically from examples of aligned complex source and simplified target sentences. Therefore, parallel corpora of simplified texts such as Simple English Wikipedia (SEW) are required [91, 22]. Within the class of machine translation approaches, two types of systems can be distinguished: some are predicated on Statistical Machine Translation (SMT) [91, 86, 21], while the others make use of quasi-synchronous grammars for the translation process [20, 85, 77, 26]. Recent approaches in the field of text simplification deal with hybrid systems that combine various techniques for handling different subtasks of the transformation procedure [75, 61].
3.2.1. Hand-crafted Rules

Research on automatic text simplification can be traced back to two groups whose work in the late nineties laid the groundwork for future improvements in this area [71]: on the one hand Chandrasekar and his colleagues at UPenn [17], on the other hand the group around Carroll working on the Practical Simplification of English Text (PSET) project [15]. Though they had rather different motivations - the former examined text simplification as a preprocessing step to improve parser performance, while the latter focused on simplifying newspaper articles to enhance their readability for people with aphasia [71] -, they followed a similar approach, namely the use of hand-crafted transformation rules for syntactically simplifying NL text.
3.2.1.1. Early Work

According to Chandrasekar et al. [17], long and complicated sentences represent a stumbling block for systems operating on NL data. Thus, arguing that structurally simplified sentences become easier to process for programs as there is little scope left for ambiguities in the attachment of constituents, they present a method for identifying components of a sentence that may be disembedded, and transforming them into stand-alone sentences. This process is conducted in two consecutive steps. First, a structural representation of the input sentence has to be generated, upon which a sequence of rules to identify and extract certain predetermined components is then applied. Those rules have been manually defined on the basis of so-called articulation points, i. e. points at which sentences may be logically split. These include beginnings and ends of phrases, punctuation marks, subordinating and coordinating conjunctions as well as relative pronouns. Initially, given a sentence, each word is assigned its corresponding POS tag. Then, chunks - groups of words consisting of a VP or noun phrase (NP) with some attached modifiers - are identified using a finite state grammar (FSG). Afterwards, these chunked sentences are simplified by repeatedly applying the previously defined syntactic simplification rules. An example rule for the simplification of a sentence incorporating a relative clause is shown below:

X_NP, RelPron Y, Z. → X_NP Z. X_NP Y.     (3.1)
This rule may be interpreted as follows: if a sentence starts with a noun phrase (X_NP), followed by a relative pronoun and an arbitrary sequence of words Y enclosed in commas which is succeeded by some words Z, then the sentence is split into two separate ones by transforming the embedded clause Y into a free-standing sentence consisting of the phrase X followed by Y, resulting in a shortened version of the original sentence, namely X followed by Z. By means of this rule, the simplification depicted in figure 3.4 may be performed.

separating out an embedded clause:
John, who was the CEO of a company, played golf.
John played golf. John was the CEO of a company.
Figure 3.4.: Simplification of a relative clause [71]

The sentences derived in this way are then recursively simplified until no further transformation is possible. A rough analogue of such a rule is sketched below.
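A string-level analogue of rule (3.1) can be written as a regular expression over a sentence whose leading NP is assumed to be known from chunking. This is our own approximation of Chandrasekar et al.'s articulation-point rules, and it inherits exactly the attachment and boundary problems discussed next.

# Rough string-level analogue of rule (3.1), assuming the leading NP has
# already been identified by chunking. Our own approximation, not the
# authors' implementation.
import re

RULE_3_1 = re.compile(
    r'^(?P<np>[A-Z][\w\' ]*?), '     # X_NP: leading noun phrase
    r'(?P<relpron>who|which|that) '  # relative pronoun
    r'(?P<y>[^,]+), '                # Y: embedded clause
    r'(?P<z>.+)$'                    # Z: rest of the sentence
)

def simplify(sentence):
    m = RULE_3_1.match(sentence.rstrip('.'))
    if not m:
        return [sentence]
    np, y, z = m.group('np'), m.group('y'), m.group('z')
    return [f"{np} {z}.", f"{np} {y}."]

print(simplify("John, who was the CEO of a company, played golf."))
# ['John played golf.', 'John was the CEO of a company.']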
Siddharthan [71] criticizes that such linear pattern-matching rules do not work very well in general. Regarding for example the following sentence:

A friend from London, who was the CEO of a company, played golf, usually on Sundays.

a decision on both the attachment and the boundaries of the relative clause has to be made, i. e. it has to be determined whether the phrase friend or rather London
is its antecedent, and whether it ends at the term golf or company. Due to such difficulties, Chandrasekar et al. [17] have proposed a second approach on top of that, using richer syntactic information in terms of both constituency- and dependency-based parse trees to operate on, while still defining by hand a set of rules that map from complex input sentence patterns to simplified ones.

Rather than focusing on systems dealing with NL data, the PSET project [15] was aimed at language-impaired readers of English newspaper text, in particular people suffering from aphasia. It has been demonstrated that structural complexities - most notably syntactic constructions that deviate from canonical subject-verb-object order, such as sentences in passive voice - may pose comprehension problems for these groups [15]. Therefore, emphasis has been laid on simplifying two syntactic constructs, namely the conversion from passive to active voice on the one hand and the splitting of coordinated clauses on the other hand [71]. In doing so, they roughly follow [17], making use of unification-based pattern matching of hand-written simplification rules over phrase structure trees that have been generated with the help of a probabilistic LR parser. An exemplary syntactic transformation rule is [71]:

(S (?a) (S (?b) (S (?c)) ) ) → (?a) (?c)     (3.2)

which matches structures like the one shown in figure 3.5: a parse tree whose root S dominates a sentence (?a), a conjunction (?b) and a further sentence (?c). The rule simply eliminates the conjunction (?b) that concatenates the two sentences (?a) and (?c), thus transforming them into separate ones. It may be used for performing the simplification illustrated below in figure 3.6 [71]; a toy matcher for this pattern is sketched after figure 3.5.

Figure 3.5.: The structure matched by the pattern (S (?a) (S (?b) (S (?c)) ) )
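The effect of rule (3.2) can be emulated with a small matcher. Note that PSET states the pattern over the trees of its probabilistic LR parser; the sketch below checks the flat Penn-style variant (S (S ...) (CC ...) (S ...)), which encodes the same clause-conjunction-clause configuration, using nltk.Tree as a stand-in.

# Matcher emulating rule (3.2) on a flat Penn-style tree: an S node that
# dominates a sentence (?a), a conjunction (?b) and a second sentence (?c)
# is split into two separate sentences.
from nltk.tree import Tree

def apply_rule_3_2(tree):
    for node in tree.subtrees(lambda t: t.label() == 'S'):
        if [c.label() for c in node] == ['S', 'CC', 'S']:   # (?a) conj (?c)
            a, c = node[0].leaves(), node[2].leaves()
            return [' '.join(a) + ' .', ' '.join(c) + ' .']
    return [' '.join(tree.leaves())]

parse = Tree.fromstring(
    "(S (S (NP (DT The) (NNS proceedings)) (VP (VBP are) (ADJP (JJ unfair))))"
    " (CC and)"
    " (S (NP (DT any) (NN punishment) (PP (IN from) (NP (DT the) (NN guild))))"
    " (VP (MD would) (VP (VB be) (ADJP (JJ unjustified))))))")
print(apply_rule_3_2(parse))
# ['The proceedings are unfair .', 'any punishment from the guild would be unjustified .']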
Besides this syntactic transformation component, PSET incorporates a second module for lexically simplifying input text, since uncommon words are often not readily accessible to aphasic readers [15]. Based on the approach proposed in [24], difficult words are replaced with simpler synonyms, using first WordNet [59] to identify terms that have an equivalent meaning and then word frequency statistics from the Oxford Psycholinguistic Database [65] to determine their relative difficulty [15]. Beyond that, their application of a pronoun resolution algorithm (based on CogNIAC [4]) which substitutes pronouns with their antecedent NP - thus helping aphasic readers to correctly resolve them - paved the way for studying issues of anaphoric and conjunctive text cohesion in more depth in later work (cf. in particular [71]).
splitting of coordinated clauses:
The proceedings are unfair and any punishment from the guild would be unjustified.
The proceedings are unfair. Any punishment from the guild would be unjustified.
Figure 3.6.: Simplification of coordinated clauses [71]

3.2.1.2. Contemporary Systems

Foregoing attempts at automatic text simplification have been substantially refined by Siddharthan's doctoral work [70, 71], which describes a pipelined architecture for a simplification framework composed of three modules: analysis, transformation and regeneration (see figure 3.7).
Figure 3.7.: Architecture of Siddharthan's text simplification system [70]

Following Chandrasekar et al. [17], given a text as input, it is first passed through an analysis stage that converts it into a representation that the subsequently invoked transformation and regeneration components can work with. Thus, as the simplification rules that are deployed within the transformation process operate on the level of individual sentences, the analyzer first needs to segment the input text into separate sentences. Thereupon, syntactic structures that may be simplified have to be marked up. In this regard, three subtasks have to be solved:
• resolving third person pronouns
• deciding clause attachment
• determining clause boundaries
These problems are tackled using shallow processing techniques in terms of POS tagging and noun chunking, which are favoured over deeper analyses like full parses, since the latter are supposed to be less robust and computationally more expensive. Syntactic transformations carried out at sentence level may potentially disrupt anaphoric-cohesive relations by breaking pronominal links (see the example in figure 3.8, where the pronoun he becomes difficult to resolve correctly in the simplified version, since the order in which the NPs are introduced into the discourse is changed by splitting the source sentence). In order to fix such broken connections in the discourse structure, a pronoun resolution algorithm is applied whose goal is to
identify the respective antecedent of each third person pronoun in a sentence. This is done on the basis of three features:
• agreement in number (singular or plural), person (first, second or third), gender (male, female or neuter) and animacy (animate or inanimate)
• grammatical function (e. g. subject, direct or indirect object)
• salience (taking into account a variety of factors such as sentence recency, subject emphasis or head noun emphasis; see [54] for more details)
Thus, while processing a sentence, each NP is annotated with information about the aforementioned aspects, thereby forming a co-reference class for each non-pronominal NP. Pronouns are then assigned to the most salient class that agrees in number, person, animacy and gender, as sketched below.
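The final assignment step can be illustrated as follows. The feature values and salience scores are invented for this sketch; in Siddharthan's system they are derived from the factors listed above (see [54]).

# Minimal sketch of the assignment step: each pronoun is resolved to the
# most salient co-reference class whose features agree. Feature values and
# salience scores are illustrative, not taken from the actual system.
PRONOUN_FEATURES = {
    'he':  {'number': 'sg', 'gender': 'male',   'animacy': 'animate'},
    'she': {'number': 'sg', 'gender': 'female', 'animacy': 'animate'},
    'it':  {'number': 'sg', 'gender': 'neuter', 'animacy': 'inanimate'},
}

def resolve(pronoun, classes):
    """classes: list of dicts with 'head', feature values and a 'salience' score."""
    feats = PRONOUN_FEATURES[pronoun]
    compatible = [c for c in classes
                  if all(c[k] == v for k, v in feats.items())]
    return max(compatible, key=lambda c: c['salience'])['head'] if compatible else None

classes = [
    {'head': 'Dr. Knudson', 'number': 'sg', 'gender': 'male',
     'animacy': 'animate', 'salience': 80},
    {'head': 'a parent', 'number': 'sg', 'gender': 'neuter',
     'animacy': 'animate', 'salience': 60},
    {'head': 'chromosome 13', 'number': 'sg', 'gender': 'neuter',
     'animacy': 'inanimate', 'salience': 40},
]
print(resolve('he', classes))   # Dr. Knudson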
syntactic transformation leading to a disrupted anaphoric cohesive relation:
Dr. Knudson found that some children with the eye cancer had inherited a damaged copy of chromosome No. 13 from a parent, who had necessarily had the disease. Under a microscope he could actually see that a bit of chromosome 13 was missing.
Dr. Knudson found that some children with the eye cancer had inherited a damaged copy of chromosome No. 13 from a parent. This parent had necessarily had the disease. Under a microscope he could actually see that a bit of chromosome 13 was missing.
Figure 3.8.: Example for a sentence split resulting in a broken pronominal link

The second issue addressed by the analyzer is the identification of the phrase to which a clause is attached. Since this question may be treated as an anaphora resolution problem as well, it is solved in a similar way to the pronoun resolution task described above. Consequently, based on salience, agreement and a syntactic filter, the head NP of a relative clause or appositive phrase is allocated to the most salient NP that matches in number, animacy and gender, considering that the phrase to which an apposition or relative clause refers is usually separated from it by no more than a PP. As pointed out above, the last subtask of the analysis module consists in detecting clause boundaries. On the supposition that a relative clause or appositive phrase customarily either extends to the end of a sentence or terminates with a comma, the decision is primarily made on the basis of punctuation, augmented with some heuristics that have been induced from examining hundreds of sample sentences by hand.

After the input text has been preprocessed in this manner, the second component of the text simplification framework, the transformation module, comes into play. It takes as input the representation produced by the analyzer and applies a set of hand-crafted grammar rules for simplifying the input sentence, using pattern matching techniques. For this purpose, seven transformation rules have been specified by hand, referring to the following grammatical structures: conjoined clauses (see rules
3.3, 3.4, 3.5), relative clauses (see rules 3.6 and 3.7) and appositive phrases (see rules 3.8 and 3.9). Figure 3.10 shows some example sentences upon which those rules have been deployed. They are applied recursively on a given sentence, until no further simplification is possible.
Figure 3.10.: Examples of sentence simplifications [71]

relative clause:
"The pace of life was slower in those days," says [51-year-old Cathy Tinsall]_1 from South London, [RC who#1 had five children, three of them boys].
1. "The pace of life was slower in those days," says 51-year-old Cathy Tinsall from South London.
2. Cathy Tinsall had five children, three of them boys.
(3.10.1) Example 1
appositive phrase: "There’s no question that some of those workers and managers contracted asbestos-related diseases," said Darrell Phillips, vice president of human resources for Hollingsworth & Vose. • "There’s no question that some of those workers and managers contracted asbestos-related diseases," said Darrell Phillips. • Darrell Phillips was vice president of human resources for Hollingsworth & Vose. (3.10.2) Example 2
Appending a third stage - the regeneration phase, which addresses the issue of text cohesion - can be seen as the main contribution of Siddharthan, who was in fact the first to study this question in detail. Cohesion has been defined by Halliday and Hasan [40] as the phenomenon where the interpretation of some element of discourse depends on the interpretation of another element, and the presupposing element cannot be effectively decoded without recourse to the presupposed element. Such cohesive relations may be either conjunctive or anaphoric, and both may be disrupted by syntactic transformations at the level of individual sentences. Thus, without resolving these discourse-level issues, reformulation operations could possibly result in a text that is harder to read - in the best case - or even alter its original meaning.
• prefix subordination:
  CC [clause1 X] , [clause2 Y] . → (a) X . (b) Y .    (3.3)
  where the conjunction CC matches one of though, although, when and because

• correlative if . . . then and subordinative if construct:
  [If] [clause1 X] [then | ,] [clause2 Y] . → (a) X . (b) Y .    (3.4)

• infix coordination and subordination:
  [clause1 X] [,]? CC [clause2 Y] . → (a) X . (b) Y .    (3.5)
  where the conjunction CC matches one of though, although, but, and, because, since, as, before, after and when

• relative clauses:
  V W_{NP#x} X [RC RELPR#x Y] Z . → (a) V W X Z . (b) W Y .    (3.6)
  V W_{NP#x} X [RC RELPR#x Y] . → (a) V W X . (b) W Y .    (3.7)

• appositive phrases:
  U V_{NP#x} W , [appos X#x Y] , Z . → (a) U V W Z . (b) V Aux X Y .    (3.8)
  U V_{NP#x} W , [appos X#x Y] . → (a) U V W . (b) V Aux X Y .    (3.9)

Figure 3.9.: Simplification rules [71]
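To illustrate how such a pattern-based rule operates, the following toy Python sketch applies a string-level rendering of rule 3.5 to the example from figure 3.6; Siddharthan's system matches against chunked parses rather than raw strings, so this is only an approximation:

    import re

    CONJ = r'(?:though|although|but|and|because|since|as|before|after|when)'
    INFIX = re.compile(rf'^(?P<x>.+?),? (?P<cc>{CONJ}) (?P<y>.+)\.$')

    def split_infix(sentence):
        """Toy version of rule 3.5: [clause1 X] [,]? CC [clause2 Y] . -> X . Y ."""
        match = INFIX.match(sentence)
        if not match:
            return [sentence]
        x, y = match.group('x'), match.group('y')
        return [x + '.', y[0].upper() + y[1:] + '.']

    print(split_infix('The proceedings are unfair and any punishment '
                      'from the guild would be unjustified.'))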
Therefore, Siddharthan has developed a number of strategies targeting the problem of preserving both the cohesion and the meaning of the input text. These encompass:
• introducing cue-words (like but, or, so) to preserve rhetorical relations between neighbouring sentences
• selecting determiners (definite versus indefinite) when rules duplicate NPs
• deciding sentence order when clauses are split
• generating referring expressions when a rule duplicates an NP, as reproducing the whole NP can make a text stilted (a stylistic issue)
• preserving pronominal links
The first two approaches address the problem of preserving conjunctive cohesion, whereas the last two concern anaphoric cohesion. Sentence ordering, in contrast, may affect both types of connectedness (see figure 3.11). Siddharthan [71] proposes the use of Rhetorical Structure Theory (RST) (cf. section 10.1) to maintain conjunctive cohesive relations, while anaphoric cohesion is addressed through a model of attentional state on the basis of either salience (cf. [54]) or centering (cf. [38, 39]).
Figure 3.11.: Regeneration issues and text cohesion [71]
In more recent years, Siddharthan has experimented with alternative structural representations of the input text that is to be simplified. Arguing that phrasal parse trees are too dependent on the grammar rules employed by the parser used in the respective case, and therefore require the definition of complicated rules which may incorporate a large number of variables, he has eventually settled on typed dependencies [72, 73]. Such a representation consists of a set of triplets, each comprising a relation type and two arguments, thus capturing the dependencies between the words of the input sentence (see figure 3.12).
det(cat-2, The-1)
nsubjpass(chased-4, cat-2)
auxpass(chased-4, was-3)
det(dog-7, the-6)
agent(chased-4, dog-7)
punct(chased-4, .-8)
Figure 3.12.: Typed dependency representation of the sentence "The cat was chased by the dog."
As opposed to phrase structure trees, this formalism allows for the specification of rewrite rules in rather compact form. For instance, to convert the sentence "The cat was chased by the dog." into active voice, the following transformation rule is required [73]:
• Match and Delete:
  – nsubjpass(??X0, ??X1)
  – auxpass(??X0, ??X2)
  – agent(??X0, ??X3)
• Insert:
  – nsubj(??X0, ??X3)
  – dobj(??X0, ??X1)

Figure 3.13.: Source dependency tree [73]
When applying these operations to the dependencies depicted in figure 3.12, the dependency tree shown in figure 3.14 is created.
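In code, this Match-and-Delete/Insert step might look as follows - an illustrative sketch operating on (relation, head, dependent) triples, not Siddharthan's implementation:

    # Dependencies as (relation, head, dependent) triples.
    def passive_to_active(deps):
        """Delete nsubjpass/auxpass/agent and insert nsubj/dobj instead."""
        by_rel = {rel: (head, dep) for rel, head, dep in deps}
        if not {'nsubjpass', 'auxpass', 'agent'} <= by_rel.keys():
            return deps  # rule does not match
        verb, patient = by_rel['nsubjpass']   # ??X0, ??X1
        _, agent = by_rel['agent']            # ??X0, ??X3
        kept = [d for d in deps if d[0] not in ('nsubjpass', 'auxpass', 'agent')]
        return kept + [('nsubj', verb, agent), ('dobj', verb, patient)]

    deps = [('det', 'cat-2', 'The-1'), ('nsubjpass', 'chased-4', 'cat-2'),
            ('auxpass', 'chased-4', 'was-3'), ('det', 'dog-7', 'the-6'),
            ('agent', 'chased-4', 'dog-7'), ('punct', 'chased-4', '.-8')]
    print(passive_to_active(deps))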
Figure 3.14.: Target dependency tree [73]

While generating sentences from transformed phrasal structure trees is a rather straightforward task, when using typed dependencies the order in which to traverse the target tree for creating the simplified output sentence has to be explicitly defined for those nodes where a rewrite rule leads to reordered subtrees. Hence, in those cases, rules for tree traversal order have to be specified in addition to deletion and insertion operations - otherwise the original word order can be used.
Besides, there are two node-level operations which have to be considered, namely lexical substitution, for ensuring in particular number and tense agreement, and node deletion, which removes a node from the tree and attaches any subtree to the parent node instead. Accordingly, a transfer rule has to specify five lists:
• Context: the transform only proceeds if this list of GRs can be unified with the input GRs (often equivalent to the Delete list)
• Delete: list of GRs to delete from the input
• Insert: list of GRs to insert into the input
• Ordering: list of nodes with the subtree order specified
• Node-operations: list of lexical substitutions and deletion operations on nodes
Thus, typed dependency representations make it possible to capture a reformulation by defining only a fairly small set of transformation rules, at the expense of a more complicated sentence generation procedure.

Recently, a number of text simplification frameworks that are adapted to languages other than English have been developed. Most of these approaches rely on hand-crafted transformation rules for simplifying the input NL text (see section 3.2.5).
3.2.2. Data-driven Approaches
In the more recent past, new avenues have been explored in the development of automatic text simplification systems. Motivated by the availability of corpora of simplified texts (e. g. the Simple English Wikipedia, SEW, http://simple.wikipedia.org), data-driven approaches based upon machine learning techniques have been established. In this connection, text simplification is regarded as a monolingual translation task where the source sentences need to be translated into simplified versions in the same language [74, 75]. For this to be achieved, a model must be trained on a set of examples of aligned complex source and simplified target sentences, thus automatically learning simplification rewrites. Therefore, parallel corpora of English sentences and their simplified counterparts are required.
3.2.2.1. Training Corpora
For the most part, contemporary data-driven systems have been trained (as well as evaluated) on datasets relying on SEW, a Wikipedia project consisting of articles that represent a simplified version of the corresponding traditional English Wikipedia (EW, http://en.wikipedia.org) entries. They are written in Simple English, a variant of the English language that uses easier vocabulary and sentence structure, according to the following general guidelines [21]:
• Use Basic English vocabulary and shorter sentences. This allows people to understand normally complex terms or phrases.
• Simple does not mean short. Writing in Simple English means that simple words are used. It does not mean readers want basic information. Articles do not have to be short to be simple; expand articles, add details, but use basic vocabulary.
That way, the information contained in a Wikipedia article shall be easier to comprehend for readers, especially for children and non-native speakers [10]. By pairing EW sentences with their SEW equivalents based on sentence-level TF*IDF similarity scores, Zhu et al. [91] and Coster and Kauchak [22] have created two parallel simplification corpora covering rewording, dropping, insertion, reordering and splitting (see examples in figure 3.15) - thus the full range of text simplification operations. The former, the so-called PWKP dataset [91], contains 108K aligned sentences (publicly available at: https://www.ukp.tu-darmstadt.de/data/sentence-simplification/simple-complex-sentence-pairs), while the one described in [22] is made up of 137K pairs (publicly available at: http://www.cs.pomona.edu/~dkauchak/simplification/). Up to the present, both of them have been widely used for training and evaluating sentence simplification models (e. g. in [21, 91, 85, 86]), as will be seen below.
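The TF*IDF-based pairing itself can be sketched roughly as follows, using scikit-learn; the similarity threshold is an invented value, not the one used in [91] or [22]:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def align(ew_sents, sew_sents, threshold=0.5):
        """Pair every SEW sentence with its most similar EW sentence."""
        vectorizer = TfidfVectorizer().fit(ew_sents + sew_sents)
        ew_vecs = vectorizer.transform(ew_sents)
        sew_vecs = vectorizer.transform(sew_sents)
        sims = cosine_similarity(sew_vecs, ew_vecs)
        return [(sew, ew_sents[sims[i].argmax()])
                for i, sew in enumerate(sew_sents) if sims[i].max() >= threshold]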
Figure 3.15.: Examples of aligned sentence pairs extracted from EW and SEW [21, 22]

reordering and rewording:
EW: In 1962, Steinbeck received the Nobel Prize for Literature.
SEW: Steinbeck won the Nobel Prize in Literature in 1962.
(3.15.1) Example 1

rewording:
EW: Greene agreed that she could earn more by breaking away from 20th Century Fox.
SEW: Greene agreed that she could earn more by leaving 20th Century Fox.
(3.15.2) Example 2
deletion and rewording:
EW: Alfonso Perez Munoz, usually referred to as Alfonso, is a former Spanish footballer, in the strike position.
SEW: Alfonso Perez is a former Spanish football player.
(3.15.3) Example 3

sentence splitting and insertion:
EW: Later he learned French, Latin, Greek, Hebrew, and English, and had an interest in Italian, Spanish and Lithuanian.
SEW: Later he learned French, Latin, Greek, Hebrew and English. He also had an interest in Italian, Spanish and Lithuanian.
(3.15.4) Example 4

insertion:
EW: Heat engines are often confused with the cycles they attempt to mimic.
SEW: Real heat engines are often confused with the ideal engines or cycles they attempt to mimic.
(3.15.5) Example 5
3.2.2.2. Contemporary Systems
The set of machine translation approaches operating on aligned complex-simple corpora can be subdivided into two classes that pursue different strategies: some are predicated on SMT [91, 21, 86], while the others make use of quasi-synchronous grammars for the translation process [26, 20, 85]. In SMT, a model is trained on the basis of large parallel corpora which are aligned at sentence level to find the best translation ẽ of a text in language f into a text in language e, by combining a translation model that finds the most likely translation p(f|e) with a language model that outputs the most likely sentence p(e) [86]:

    ẽ = arg max_{e∈e*} p(f|e) · p(e)    (3.10)
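This decomposition is simply Bayes' rule applied to p(e|f), with the denominator p(f) dropped since it is constant with respect to e:

    arg max_{e} p(e|f) = arg max_{e} [ p(f|e) · p(e) / p(f) ] = arg max_{e} p(f|e) · p(e)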
Following this approach, Zhu et al. [91] present a method for probabilistic syntax-based SMT sentence simplification which is able to perform four syntactic and lexical rewrite operations on the syntax parse trees of the input sentences, namely reordering, splitting, deletion and substitution. Their system consists of three components: a language model p(s), a translation model p(s|c) and a decoder. While the language model guarantees that the output of the simplification process is grammatical, the translation model encodes the probabilities of the aforementioned simplification operations, which are estimated by running an expectation maximization algorithm that maximizes p(s|c) over the training corpus, a subset of the PWKP dataset. Hence, learning the model is a supervised process with the parse tree of the complex sentence c as input and that of its simple counterpart s as the desired output. Finally, based on the previously generated simplification model, the decoder translates sentences into simpler alternatives by greedily selecting the branch in the source tree with the highest probability and searching for the simplification s that maximizes p(s) · p(s|c).

Subsequently developed approaches frequently use an extension of SMT: Phrase Based Machine Translation (PBMT), a form of SMT where the translation model aims at translating contiguous sequences of words, i. e. phrases, in one go, thus solving along the way part of the word ordering problem that would otherwise be left to the target language model in a traditional word-based SMT system [86]. In general, PBMT is a two-stage approach [74]. The system takes as input a pair of source and target sentences which are first aligned at the level of words. Then, this alignment of single words is extended to create phrase alignments. In this way, a phrase table containing aligned sequences of cohesive words in source and target language is built, including a probability value which signifies the likelihood of the particular phrase translation. This table represents the output of the first stage. In the second phase, the decoding process is performed: the phrase table and a language model of the target language are used to find the best translation of a given source sentence. Hence, simplifying phrases through paraphrasing is implicitly learned in PBMT, so that no syntactic information needs to be taken into account [86], as was the case in Zhu et al.'s SMT system [91].

An example of a text simplification system relying on PBMT is described in [21], where a PBMT model is extended to include phrasal deletion. The simplification model is trained on the Wikipedia-based corpus outlined in [22], using a modified version of Moses [52], an open source toolkit for SMT. Its key component is a PBMT model which decomposes the probability of a complex sentence c being transformed into a simple sentence s as the product of the individual phrase translations:

    p(s|c) = ∏_{i=1}^{n} p(s_i|c_i)    (3.11)
where each s_i is a phrase in the simple sentence (c_i is similarly defined over the complex sentence). To model deletion, Coster and Kauchak [21] relax the constraint that the simple phrase must be non-empty by allowing deletion rules of the form p(NULL|c_i) in the translation model. The phrasal probabilities p(s_i|c_i), denoting the likelihood that a complex phrase c_i is simplified to the corresponding phrase s_i, are calculated using an expectation maximization learned word alignment with the help of GIZA++ [62].
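As a toy illustration of such a phrase table - echoing example 3 in figure 3.15, with all probability values invented and the empty string standing in for a p(NULL|c_i) deletion rule:

    import math

    phrase_table = {
        'former': {'former': 0.9},
        'footballer': {'football player': 0.7, 'footballer': 0.2},
        'in the strike position': {'': 0.6, 'as a striker': 0.3},
    }

    def log_p_simple_given_complex(phrase_pairs):
        """log p(s|c) as the sum of per-phrase log probabilities, cf. (3.11)."""
        return sum(math.log(phrase_table[c][s]) for c, s in phrase_pairs)

    pairs = [('former', 'former'), ('footballer', 'football player'),
             ('in the strike position', '')]
    print(log_p_simple_given_complex(pairs))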
Instead of focusing on the deletion operation, Wubben et al. [86] draw on dissimilarity as the major factor in the simplification process. Here again, the Moses toolkit [52] is used to train a PBMT model on the PWKP dataset [91]. The decoding stage is then augmented with a re-ranking heuristic based on dissimilarity with regard to the input. The underlying idea is to output a fluent sentence that adequately preserves the meaning of the input, yet differs from it, as such deviations from the original are supposed to operationalize the desired simplification. Therefore, the goal is to select the output that is as different as possible from the source sentence so that it ideally contains multiple simplifications. This re-ranking is carried out by calculating the Levenshtein distance at word level on the top n candidate translations provided by Moses, with a small n to ensure that the output sentence is of high quality in terms of grammaticality and adequacy. However, it should be noted that PBMT can only perform a small set of simplification operations, such as lexical substitution, phrasal deletion and simple paraphrasing; it is not adapted for executing reordering or splitting operations [74].

Machine translation systems on the basis of synchronous or quasi-synchronous grammars, on the contrary, are able to handle the full range of reformulation operations. Those grammar-driven approaches are traceable to Dras [26], who refers to text simplification as an application for examining the formal properties of generalized synchronous grammars. The main idea is to use such grammars to create monolingual paraphrases. To this end, each sentence is represented in the form of a tree adjoining grammar (TAG), and a Synchronous-TAG formalism is applied to map between two TAGs. In a second step, integer programming is employed in order to generate a text that satisfies externally specified constraints (e. g. regarding its length, readability or in-house style guides) through minimal, purely syntactic paraphrasing. For this purpose, a binary variable is inserted for each paraphrase operation, indicating whether or not the operation is executed. The system's objective function is then optimized for the whole input text, not only for a single sentence in isolation [74], as is common in today's text simplification approaches.

Referring to Dras [26], Cohn and Lapata [20] and Woodsend and Lapata [85] present a tree-to-tree transduction method for sentence simplification based on a quasi-synchronous tree substitution grammar which allows for a rather loose alignment between source and target trees. Thus, by relaxing the isomorphism constraints of synchronous grammars, they are able to handle structural mismatches and complex rewriting operations. Both approaches operate on the level of individual sentences that are annotated with structural information in the form of constituency-based parse trees. Given a source tree as input, the simplification is carried out in a two-stage process: first, a simplification model on the basis of quasi-synchronous grammar rules is derived, defining the space of valid source and target tree pairs; in a second decoding step, the generated model is then used to find the best target tree that is permitted by the grammar.
Figure 3.16.: Examples of source and target parse trees [20]
In [20], the grammar rules forming the respective simplification model are extracted from a corpus of parallel complex-simple sentences which are aligned at the level of words. Initially, this word alignment is expanded to an alignment of constituents by finding pairs of n-grams where at least one word in an n-gram is aligned to a word in the other, but no word in either n-gram is aligned to a word outside its alleged counterpart (see figure 3.16).
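This consistency criterion - at least one alignment link inside the pair, none crossing its boundary - can be sketched as a brute-force enumeration; spans are inclusive word-index pairs and the implementation below is purely illustrative:

    def consistent_spans(src_len, tgt_len, alignment):
        """Yield (source span, target span) pairs with at least one alignment
        link inside and no link crossing the span boundary."""
        for i1 in range(src_len):
            for i2 in range(i1, src_len):
                for j1 in range(tgt_len):
                    for j2 in range(j1, tgt_len):
                        inside = any(i1 <= i <= i2 and j1 <= j <= j2
                                     for i, j in alignment)
                        crossing = any((i1 <= i <= i2) != (j1 <= j <= j2)
                                       for i, j in alignment)
                        if inside and not crossing:
                            yield (i1, i2), (j1, j2)

    # e.g. two words aligned monotonically:
    print(list(consistent_spans(2, 2, [(0, 0), (1, 1)])))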
Figure 3.17.: Examples of translation rules [20]
Aligned subtree pairs are then generalized by replacing child subtrees with variable nodes (see figure 3.17) in order to allow the reformulation of previously unseen structures. From the thus created pairs of tree fragments, grammar rules are induced, representing the dependencies between source and target subtrees (see figure 3.18).
Figure 3.18.: Grammar rules extracted from the sentence pair in figure 3.16. Each rule rewrites a pair of non-terminals into a pair of subtrees, shown in bracketed notation [20].
However, given a source tree, the space of possible target trees licensed by the grammar is usually large, with many of them being neither grammatical nor considerably simplified. Hence, besides extracting grammar rules as described before, the goal of the training algorithm is to find weights such that the reference target trees are assigned high scores, while the many other target trees that are in theory permissible according to the grammar are given lower scores. Therefore, each rule is allocated a score that is calculated on the basis of the large margin algorithm proposed in [81], which supports a wide range of loss functions ∆(y_i, y) (e. g. using the F1 measure, Hamming distance or edit distance between the model's prediction and the reference) quantifying the accuracy of the prediction y with respect to the true output value y_i. In the end, decoding amounts to finding the highest-scoring derivation over the space of possible sequences of rules that yield a target tree with no remaining variables for the given source tree. This problem is solved using a chart-based dynamic program extending the inference algorithm for weighted STSGs presented in [28].

The text simplification approach described in [85] bears resemblance to the one outlined above regarding the model generation procedure, yet shows some considerable differences with respect to the decoding stage. As indicated, the process of constructing a quasi-synchronous grammar for generating possible target translation trees for an input parse tree is similar. Taking a pair of trees as input, their leaf nodes are aligned based on lexical identity. Where more than one child node matches, direct parent nodes are aligned. Grammar rules are then created from aligned inner nodes if all the nodes in the target tree can be attributed to nodes in the source. Each grammar rule thus describes the transformations which are required to convert a source subtree into a target subtree. But instead of assigning weights to the rules using a large margin algorithm, integer linear programming (ILP) is employed in the decoding stage to select the most appropriate simplification from the space of possible rewrites licensed by the grammar, as this technique makes it possible to efficiently search through a large space of rules while at the same time incorporating constraints relating to e. g. the grammaticality of the output - without the added computational cost of integrating a language model. Hence, given a source sentence parse tree,
all applicable grammar rules are identified through ILP. If there is more than one matching rule, every alternative is generated and incorporated into the target tree. Next, the ILP model operates over this tree and selects the nodes which maximize the following objective function:

    max_x Σ_{i∈P} g_i·x_i + h_w + h_sy    (3.12)
where g_i is a rewrite penalty, whereas h_w and h_sy represent global variables guiding the ILP towards simpler language, based on the Flesch-Kincaid Grade Level (FKGL) index [47], which estimates the readability of a sentence as a combination of the average number of syllables per word and the average number of words per sentence (a small sketch of this metric is given at the end of this subsection). To sum up, both models deconstruct a given input sentence into component phrases, each of which is simplified through the application of a sequence of quasi-synchronous grammar rules according to the underlying simplification model.

Both Cohn and Lapata [20] and Woodsend and Lapata [85] have used a variety of parallel datasets for training their models. The former deployed a number of news corpora, namely the Ziff-Davis corpus, originating from a collection of news articles on computer products that has been created by matching sentences that occur in an article with those appearing in an abstract, CLwritten, a corpus sampled from written sources (British National Corpus and American News Text corpus), and finally CLspoken, which has been generated from manually transcribed broadcast news stories. Woodsend and Lapata [85], on the contrary, have made use of self-generated alignments between EW and SEW following the approach in [22], as well as alignments between edits in the revision history of SEW, which are supposed to be less noisy and therefore yield higher-quality grammar rules. The simplification models created in this manner cover a broad range of text simplification operations, both lexical and syntactic, including deletion and reordering of child constituents, insertion of punctuation and function words, as well as substitution of child constituents. The text simplification system presented in [85] is even able to split sentences containing embedded clauses, such as appositives, relative clauses, parenthetical content, coordinate or subordinate clauses; thus it may deal with the full spectrum of standard reformulation operations.
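As referenced above, the FKGL index itself is a simple function of three surface counts; the following sketch uses the standard published coefficients (syllable counting, the fiddly part, is omitted):

    def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
        """Flesch-Kincaid Grade Level [47]: higher values mean harder text."""
        return (0.39 * n_words / n_sentences
                + 11.8 * n_syllables / n_words
                - 15.59)

    # A 20-word single sentence with 30 syllables:
    print(round(flesch_kincaid_grade(20, 1, 30), 2))  # 9.91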
3.2.3. Hybrid Systems
Only recently, text simplification systems combining various techniques for handling different subtasks of the simplification process have been developed. In this way, the respective advantages of the individual approaches are best exploited. In keeping with this idea, Siddharthan and Mandya [75] describe a simplification framework comprising a small set of hand-crafted grammar rules for the transformation of purely syntactic constructs on the one hand, and a much larger set of automatically acquired rules for simplifying lexicalised components on the other hand. The underlying rationale is that the former rules are difficult to derive automatically from parallel corpora, since complex morphological changes and tense manipulations would have to be learned from specific instances seen in the course of the training procedure, while it is rather easy to encode these rules correctly by hand. Hence, based on the typed dependency parses of aligned SEW and EW sentences, 136 simplification rules have been harvested by hand. They cover the extraction of apposition and relative clauses, the splitting of subordinated
and coordinated clauses, voice conversion from passive to active, as well as the standardisation of quotations. Those hand-written transformation rules handling the structural aspects of sentences are then complemented with automatically acquired rules for simplifying lexical constructs. Due to their massive number, it is not feasible to specify rules of this kind by hand, while automatically assembling them is straightforward. In fact, they are gathered by comparing the dependency parses of a sentence pair and identifying the differences between them. In doing so, rules involving syntactic structures which require manual definition (i. e. relative clauses, apposition, coordination and subordination) are filtered out, so that only those performing solely lexical simplifications and simple paraphrasing are left.

Furthermore, Narayan and Gardent [61] present an unsupervised text simplification framework that is composed of three modules inspired by previous work on lexical and syntactic simplification as well as sentence compression, yet makes use of neither hand-crafted transformation rules nor a training corpus of complex-simple sentence pairs. Instead, they deploy non-aligned SEW and EW sentences to learn the probabilities for lexical substitutions, splitting points and optional phrases in a sentence. More precisely, to simplify the vocabulary of a given text, the context-aware lexical simplification method proposed in [10] is used, based on a comparison of simple and complex sentences in the PWKP dataset [91], following the intuition that a word from EW can be replaced with one from SEW if they share similar contexts. In a second step, deep semantics rather than phrase structure or dependency trees are applied to determine whether and where to split the input sentence. For this purpose, the probabilities of sequences of thematic role sets which are likely to co-occur in a simplified sentence are learned on the Discourse Representation Structure (DRS) [46] of SEW sentences. Narayan and Gardent [61] argue that such "semantic representations give a clear handle on events, on their associated role sets and on shared elements thereby facilitating both the identification of possible splitting points and the reconstruction of shared elements in the sentences resulting from a split." Hence, every event variable contained in the DRS of the source sentence is considered a possible splitting point. Consequently, for each split possibility between subsequences of events, a score is calculated on the basis of a function which favors not only splits involving common semantic patterns (in terms of sequences of thematic role sets), but also sub-sentences of approximately equal length. In the end, the split(s) resulting in the highest score are chosen for decomposing the input sentence. Finally, following Filippova and Strube [30], phrasal deletion is handled as an optimization problem that is solved through ILP. Once again operating on the DRS of the source sentence, the deletion module determines for each relation of the DRS graph whether or not the relation and its associated subgraphs are to be deleted. This is done by maximizing an objective function that favors both the use of simple words (i. e. those occurring frequently in SEW) over complex terms and obligatory dependencies over optional ones. The procedure described above is illustrated by means of an example sentence in figure 3.19.
Figure 3.19.: Successive simplification steps [61]
3.2.4. Text Simplification as a Preliminary Step in Assisting other NLP Applications
Occasionally, text simplification systems have been explicitly developed as a preprocessing tool for assisting a predetermined, subsequently applied NLP application, including for instance RE [60], sentence fusion [31] or SRL [82]. By removing both syntactic and lexical complexities - such as deeply nested structures, a high sentence length or the use of technical terms (in particular in the medical or biomedical domain) -, related pieces of information are not only brought closer together, but the modified constructs often also exhibit a certain predefined structure. That way, the performance (in terms of accuracy and coverage) of various information-seeking applications may be improved [48], as has been shown within the scope of the approaches listed in table 3.1.

• 2004 - information-seeking applications in general (notably IR, IE, Question Answering (QA) and text summarization): "Text Simplification for Information-Seeking Applications" (Klebanov et al., 2004) [48] - description of an algorithm to automatically construct a so-called Easy Access Sentence (EAS) for each verb of an input sentence by collecting its dependants using dependency tree structures; an EAS is required to contain exactly one finite verb, maintain grammaticality, preserve the original meaning and include as many named entities as possible

• 2008 - SRL (task of labeling the semantic arguments, or roles, of a verb contained in a sentence): "Sentence Simplification for Semantic Role Labeling" (Vickrey and Koller, 2008) [82] - description of an approach that for each verb in the input sentence produces a simple sentence in a particular canonical form relative to that verb by taking its phrase structure tree and applying hand-crafted transformation rules for syntactic simplification, combined with a machine learning approach for determining which rules to prefer in case of ambiguity, i. e. if there is more than one way to simplify a sentence under the given rule set

• 2008 - sentence fusion (task of creating a new sentence from a group of related sentences): "Sentence Fusion via Dependency Graph Compression" (Filippova and Strube, 2008) [31] - description of an ILP-based method that performs a semantically and syntactically informed phrase aggregation and pruning by extracting a new dependency tree from a graph of aligned trees using syntactic importance and word informativeness scores

• 2010 - biomedical IE (task of turning the unstructured information embedded in biomedical texts into structured data [45]): "Sentence Simplification Aids Protein-Protein Interaction Extraction" (Jonnalagadda and Gonzalez, 2010) [43] - description of an approach that uses hand-crafted transformation rules operating on constituency-based parse trees to split sentences (inspired by [71]), along with a method for replacing NPs without gene names with meaningful single-word placeholders, as well as a technique for automatically determining the grammatical correctness of the input sentence

• 2010 - RE (task of finding a relevant semantic relation between two given target entities in a sentence): "Entity-Focused Sentence Simplification for Relation Extraction" (Miwa et al., 2010) [60] - syntactic simplification of sentences by repeatedly applying hand-written transformation rules on the output of a deep parser to remove noisy information, i. e. information that is deemed unnecessary for RE; two categories of rules are distinguished: clause-selection rules (replacement of a sentence with a simpler sentence that still includes the two target entities) and entity-phrase rules (substitution of a phrase representing one target entity with a simple mention of that entity)

• 2010 - factual QG (task of automatically generating factual questions from linguistically complex sentences): "Extracting Simplified Statements for Factual Question Generation" (Heilman and Smith, 2010) [41] - description of a method for extracting multiple simple factual statements from a complex sentence by separating out specific adjunct modifiers and discourse connectives (e. g. non-restrictive appositives, non-restrictive relative clauses or participial modifiers of NPs), as well as by splitting conjunctions of clauses and VPs, using the linguistic phenomena of semantic entailment and presupposition as underlying rationale (source code available at: http://www.cs.cmu.edu/~ark/mheilman/qg-2010-workshop/)

Table 3.1.: Research on text simplification as a preprocessing tool for assisting subsequently applied NLP applications
3.2.5. Text Simplification Systems for Languages Other than English
Previously presented text simplification systems mainly concentrate on how to simplify texts written in English. However, lately, there have also been efforts to apply
these techniques to other languages. An overview of research papers describing such approaches is given below in table 3.2.

• 2004, Dutch: "Automatic Sentence Simplification for Subtitling in Dutch and English" (Daelemans et al.) [23] - development and comparison of a machine learning approach and a method based on hand-crafted deletion rules for sentence-length reduction, with the aim of automatically generating TV subtitles for hearing-impaired people

• 2008, Portuguese: "Towards Brazilian Portuguese Automatic Text Simplification Systems" (Aluísio et al.) [2] - study of linguistic phenomena rendering texts complex and manual definition of a set of syntactic operations for simplifying them

• 2009, Portuguese: "Facilita: Reading Assistance for Low-literacy Readers" (Watanabe et al.) [84] - development of a reading assistance tool for low-literacy users that applies hand-written transformation rules in order to automatically simplify texts by shortening their content and using a less complex linguistic structure

• 2011, Swedish: "Automatic Summarization As Means Of Simplifying Texts, An Evaluation For Swedish" (Smith and Jönsson) [76] - development of an extraction-based summarizer using a vector space model and a modified version of PageRank

• 2012, Spanish: "Text Simplification Tools for Spanish" (Bott et al.) [11] - approach for structurally simplifying Spanish texts with the help of hand-crafted rewrite rules using a dependency tree representation

• 2012, Basque: "Transforming Complex Sentences using Dependency Trees for Automatic Text Simplification in Basque" (Aranzabe et al.) [3] - algorithm for splitting sentences based on dependency trees and hand-crafted rules

• 2012, French: "Acquisition of Syntactic Simplification Rules for French" (Seretan) [68] - description of the manual acquisition of syntactic simplification rules and its complementation by a semi-automatic detection of structures requiring simplification

• 2012, Vietnamese: "Sentence-Splitting for Vietnamese-English Machine Translation" (Hung et al.) [42] - rule-based technique to split long Vietnamese sentences based on linguistic information, with the aim of improving the results of a subsequent machine translation process

• 2013, Korean: "Enhancing readability of web documents by text augmentation for deaf people" (Chung et al.) [18] - syntactic simplification approach aimed at deaf readers that first identifies and then relocates both subordinate and embedded clauses, while providing a graphical representation of the relations among them

• 2013, Italian: "Ernesta: A Sentence Simplification Tool for Children's Stories in Italian" (Barlacchi and Tonelli) [5] - first sentence simplification system for Italian, aimed at improving the comprehension of factual events in stories for children with low reading skills

• 2013, Bulgarian: "Text Modification for Bulgarian Sign Language Users" (Lozanova et al.) [56] - system based on hand-written simplification rules for automatically modifying Bulgarian texts, intended to facilitate comprehension by deaf people

• 2013, Danish: "Simple, readable sub-sentences" (Klerke and Søgaard) [50] - deletion and extraction of clauses based on a loss function for choosing good simplification candidates among randomly sampled sub-sentences of the input sentence

• 2014, French: "Syntactic Sentence Simplification for French" (Brouwers et al.) [13] - manual definition of syntactic simplification rules for French and the establishment of a typology for classifying them (lexicon, discourse, syntax)

Table 3.2.: Research on syntactic simplification dealing with languages other than English
3.2.6. Analysis of Strong and Weak Points of Hand-crafted and Data-driven Approaches
As the foregoing discussion of previous work in the area of text simplification has shown, two approaches with quite contrasting characteristics - namely, systems built upon a set of hand-crafted simplification rules on the one hand and those using machine learning techniques on the other hand - have been developed over the years, with both of them still being widely used in the literature. The fact that neither one has yet superseded the other is likely attributable to the particular benefits and deficiencies inherent in the two approaches, as illustrated in table 3.3 below.

Hand-crafted rules
Strong points:
• the only alternative for languages without corpora of simplified texts
• allow encoding a wide variety of potentially complex simplification operations
Weak points:
• require a complex rule engineering process
• limited in scope to syntactic simplification (as it is impossible to enumerate the huge number of lexical simplifications by hand)
• restricted to the rewrite operations that have been explicitly defined

Monolingual translation
Strong points:
• no manual effort required for specifying the simplification operations, as they are not explicitly modelled but automatically learned from examples of aligned source-target sentences
• adaptable to any language for which a parallel corpus is available, as no linguistic knowledge with regard to a particular language is involved
• potentially covering a broad linguistic scope (depending on the training corpus)
Weak points:
• require a parallel simplification corpus
• experience difficulties in handling complex syntactic constructs (e. g. morphological changes, voice conversion, tense manipulations), since such rules are difficult to learn from corpora

Table 3.3.: Strong and weak points of the two main text simplification approaches
3.3. Evaluating Text Simplification Systems To date, there is no consensus on how text simplification systems should be evaluated, neither with respect to the underlying simplification data resources nor the evaluation methodology used for measuring their performance [74].
3.3.1. Simplification Data Resources Research on automatic text simplification has been dominated by SEW in the last couple of years since the PWKP corpus - which has been compiled by Zhu et al. [91] in 2010 through aligning complex EW and simplified SEW sentences (see section 3.2.2.1) - has become the standard benchmark dataset for training and evaluating simplification systems [87]. Not only Zhu et al. [91], but also many follow-up papers, including [85, 86, 75] and [61], have studied the performance of their systems by comparing them with state-of-the-art approaches on the basis of a sample of 100 sentence pairs from PWKP.
It is only recently that the use of such SEW-based datasets in the context of assessing the output of automatic text simplification systems has come under fire. In [87], Xu et al. demonstrate by means of a detailed manual and statistical analysis that SEW is inappropriate for the evaluation of simplification approaches, as it is not only prone to sentence alignment errors but also contains many sentences that are only poorly simplified. In fact, by examining 200 randomly sampled sentences from PWKP by hand, they have found that in 50% of the sentence pairs the SEW sentence is either not simplified or not aligned to its dedicated EW source sentence (see figure 3.20).
Figure 3.20.: Example sentence pairs from PWKP [87]

Even with regard to the pairs where the SEW sentence indeed represents a "real" simplification of its allocated counterpart, its quality varies considerably; some sentences are simpler by only one word, while the rest of the sentence is still complex. These partial and non-simplifications are ascribed to the particular characteristics of Wikipedia in general and SEW in particular. First, SEW was created by volunteers without a well-defined objective in mind. Second, as an encyclopedia, Wikipedia is composed of many difficult sentences including a large number of complex technical terms, even in the SEW version. Finally, the SEW articles rarely constitute a complete rewrite of the respective original EW article. Hence, to overcome these issues, the authors propose a new corpus, 'Newsela' (which can be requested at: https://newsela.com/data/), made up of 1,130 news articles that have been re-written by professional editors in order to meet the readability requirements of children at multiple grade levels. For this purpose, each article has been reformulated four times, thus expressing five different levels of complexity in total (see figure 3.21). Due to a better congruence between simplified and complex articles and the availability of multiple simplified versions for each source sentence, automatically aligning Newsela texts is more straightforward and reliable, thus resulting in a higher accuracy of sentence alignment.
Figure 3.21.: Example of sentences written at multiple levels of text complexity from the Newsela dataset [87]

In fact, based on an extensive statistical analysis, Xu et al. revealed that the PWKP corpus contains significantly longer and thus allegedly more complex words than Newsela, while at the same time the supposedly simplified sentences are more compressed in the latter dataset. Furthermore, the vocabulary size of the simplified sentences is reduced by only 18% as compared to their dedicated complex counterparts in PWKP, whereas Newsela achieves a decrease of 50%. Beyond that, complex syntactic patterns (indicated e. g. by a high proportion of clause-related function words such as which or where, and identified by comparing syntax patterns that are more strongly associated with complex than with simple text using the log-odds-ratio technique) can be found more often in the SEW sentences of PWKP than in the simplified versions of Newsela. To sum up, the simplification in Newsela is more consistent and accurate than in SEW. Therefore, Xu et al. recommend abandoning PWKP as the standard benchmark dataset and using the Newsela corpus instead for evaluating the performance of automatic text simplification systems.
3.3.2. Evaluation Methods
After two decades of research in the area of automatic text simplification, the community still struggles with setting up a standard evaluation procedure. When claiming that "simplicity is intuitively obvious, yet hard to define", Shardlow [69] gets right to the heart of the problem. Numerous widely used measures for judging the difficulty of NL text are predicated on rather shallow textual features, like e. g. average sentence length or number of syllables. However, such metrics may not always serve as an accurate indicator of the complexity of the output. Regarding sentence length, for instance, a given source sentence may incorporate many difficult technical terms and anaphora. When simplifying these components by augmenting them with more explicative statements, the newly created version will inevitably be longer, though easier to understand [69]. Moreover, different target audiences require different forms of simplification, as what represents a simplification for one type of user may not be well-suited for another [87]. For example, for deaf or aphasic people, structurally
complex sentences pose the most difficulties, while dyslexics encounter considerable problems when reading infrequent or long words. Consequently, the former group first and foremost benefits from syntactic simplification, whereas the latter derives more advantage from lexical simplification. Second language learners, on the contrary, profit most from elaborating the source text with further explanations - without changing the original syntax or lexis beyond that -, since in this manner an improvement in comprehensibility is obtained without impeding language acquisition by depriving them of the natural forms of language [74]. Simplification systems aimed at being used as a preprocessing tool to enhance the performance of subsequently applied NLP applications have to fulfil a range of predominantly structural requirements. With respect to parsing, for example, long sentences are problematic due to their potentially high level of ambiguity. Hence, shortening sentences by dropping constituents or splitting clauses before forwarding them to a parser may help to enhance the quality of the output. Truncated sentences have proven useful in the area of machine translation as well, as sentences are easier to translate correctly in an automatic way when operating on smaller units of text [70].

Regardless of those findings, text simplification systems have not only commonly been developed with no particular recipient in mind, but have also been evaluated without consideration of the designated target reader population. Thus, the most widely recognised evaluation methodology draws on human judges to rate the output of a simplification system on a 5-point Likert scale [55] with respect to the following three criteria [86, 85, 75, 20, 61, 71]:
• simplicity: extent to which the output sentence is simpler than the original one
• fluency or grammaticality: extent to which the target sentence is proper, grammatical English
• adequacy or meaning preservation: extent to which the simplified sentence has the same meaning as the source sentence
Customarily, the evaluators are both fluent readers and native English speakers. As a result, this approach is not particularly well adapted to measuring the utility of a system for a specific target audience [87, 74]. Besides, these criteria do not reveal which of the three different subtasks that are generally involved in text simplification - namely deletion of constituents, sentence splitting and paraphrasing, which may encompass reordering, substitution and insertion operations - the system is able to handle properly [87]. Therefore, Xu et al. suggest taking each task into account separately, following not only [71] and [35], who have evaluated sentence splitting on the basis of precision and recall, but also [30] and [67], who have used the same metrics for assessing the quality of the deletion operation. In addition, [35] has also resorted to precision and recall to judge paraphrasing. Beyond that, authors of text simplification frameworks have made numerous further attempts at measuring the performance of their systems, with none of them being generally accepted. Though many researchers have adopted machine translation metrics such as Bleu [63] or Nist [25] scores (e. g.
[91, 21, 86, 85, 61, 78]), which measure word and word sequence overlap between the output of the system and some manual reference translations, these methods are often criticized by claiming that there are usually many adequate realisations of a sentence and that fluency judgements are more subtle in the monolingual case than in the bilingual machine
translation task [74]. Moreover, state-of-the-art approaches are frequently compared using readability scores such as the Flesch Reading Ease Test [33] or n-gram language model perplexity (e. g. in [91, 86, 85, 71]). However, such metrics are often based on surface features like sentence or word length, resulting in the problem described at the beginning of this section. Additionally, they do not estimate how readable, well-written or grammatical a text is, but rather indicate what reading age it is suitable for, based on the assumption that it is well-formed [71]. Therefore, Xu et al. propose to use those metrics only in combination with text quality metrics [87].

As has been demonstrated above, there is no standard evaluation methodology for measuring the performance of text simplification systems yet. Many different approaches have been taken, with all of them showing drawbacks in some form or another.
4. Text Summarization
Automatic text summarization represents another text-to-text rewriting task. In fact, it is closely related to the field of text simplification, yet pursues a slightly different objective, as it aims at producing a shorter version of the original document - which is not necessarily simpler, though - that preserves only the main information contained in the source, while also remaining both grammatical and coherent [51]. Such summaries are commonly bound by a word or sentence limit. Thus, within this margin, the challenge is to capture as much relevant content as possible [64]. Focusing on a scaled-down variant of this rather extensive task, Knight and Marcu [51] have addressed the simpler problem of sentence compression, an approach adopted by quite a number of follow-up papers in the area of text summarization (e. g. [88, 20, 64, 27, 34]). Instead of abstracting a text as a whole, sentence compression is about generating a grammatical summary of just a single sentence by identifying and removing the redundant and less relevant pieces of information contained in it [51]. Accordingly, by retaining only the most salient parts of the original sentence, a shortened version of the input is created. That way, space is cleared, making it possible to include more useful information in length-limited summaries. More formally, sentence compression is defined as follows [51]:
Given an input source sentence of words x = x_1, x_2, ..., x_n, a target compression y is formed by removing any subset of these words.
Hence, the deletion of words and phrases represents the most prevalent operation in sentence compression. Other rewriting operations, such as insertion, reordering or substitution - which are widely used in the area of text simplification - are usually not incorporated. By contrasting the output produced by a text simplification framework with that returned by a system based on sentence compression, the example in figure 4.1 illustrates the differences between these two approaches.
example sentence:
Google began in January 1996, as a research project by Larry Page, who was soon joined by Sergey Brin, when they were both PhD students at Stanford University in California.
• text simplification:
Google was started in early 1996, as a research project by Larry Page and Sergey Brin, when they were two PhD students at Stanford University in California, USA.
• text summarization:
Google began in January 1996, as a research project by Larry Page, who was soon joined by Sergey Brin, when they were both PhD students at Stanford University in California.
Figure 4.1.: Text simplification versus text summarization
Since the determination and subsequent elimination of less relevant information from an input sentence plays a key role in our text simplification approach, we provide a concise overview of some selected sentence compression techniques - ranging from machine learning approaches to rule-based pruning heuristics - in the remaining two sections of this chapter.
4.1. Statistics-based Sentence Compression
One of the first works on sentence compression is described in [51], proposing a probabilistic noisy-channel model that is predicated upon statistical machine translation approaches. This method is based on the assumption that for every long sentence t, there exists a short original sentence s out of which t has been formed by adding some noise in the form of optional phrases. Compression is then viewed as the task of identifying this hypothetical original short string s, which requires the following three problems to be solved [51]:
• source model: Every string s must be assigned a probability P(s), denoting the chance that s is generated as an "original short string".
• channel model: Every pair of strings <s, t> must be assigned a probability P(t|s), specifying the likelihood of arriving at the long string t when the short string s is expanded.
• decoder: Given a long string t, we search for the short string s that maximizes P(s|t), which is equivalent to searching for the string s that maximizes the result of P(s) · P(t|s).
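These three components can be caricatured in a few lines of Python; the scorers below are invented stand-ins, not Knight and Marcu's PCFG and bigram models:

    def decode(candidates, t, log_p_s, log_p_t_given_s):
        """Noisy-channel decoding: pick the short string s maximizing
        log P(s) + log P(t|s)."""
        return max(candidates, key=lambda s: log_p_s(s) + log_p_t_given_s(t, s))

    def log_p_s(s):                      # toy source model: prefer shorter strings
        return -0.5 * len(s.split())

    def log_p_t_given_s(t, s):           # toy channel model: penalize dropped words
        kept = set(s.split())
        return -sum(w not in kept for w in t.split())

    t = 'the mouse quickly ran away'
    print(decode(['the mouse ran away', 'the mouse', 'quickly ran'], t,
                 log_p_s, log_p_t_given_s))  # -> 'the mouse ran away'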
Aforementioned probabilities are calculated using a combination of a probabilistic context-free grammar score, computed over the grammar rules producing the tree of the sentence currently under consideration, and a word-based bigram score, computed over the leaves of this tree. A compression model is then trained on a set of 1,035 sentence pairs from the Ziff-Davis corpus, a collection of newspaper articles on computer products that have been aligned with their respective human-written abstracts, thus forming a parallel dataset of long source and compressed target sentences. Using the remaining 32 sentence pairs of this corpus to evaluate the performance of the resulting compression system, it is demonstrated that the achieved compression rate is on average about a fifth lower than in the human-compressed sentences. Beyond that, according to human judgements, the sentences returned by the framework not only exhibit a slightly lower grammaticality compared to the human-written target sentences, but the system also fails more frequently to select the most important words from the original sentences [51].

Building on the concept of regarding sentence compression as a reverse data transmission through a noisy channel, Zajic et al. [88] present a Hidden Markov Model (HMM)-based statistical noisy-channel approach called HMM Hedge. The underlying idea is to produce an adequate headline for a given document by selecting a subset of words - in order - from this text (with slight morphological variations being allowed), thereby generating multiple candidate sentences that capture relevant, non-redundant information. The goal is to find the most likely headline among the generated set of compressed sentences. Following the approach introduced in [51], HMM Hedge treats the observed data - i. e. the given text - as the result of distorting unobserved data - i. e. the headlines - by transferring them via a noisy channel which adds some extra words and phrases between the original headline terms. This process is represented by an HMM consisting of a start state S, an end state E, an H state for each word in the document, a corresponding G state for each H state, and a Gstart state that emits words which occur before the first headline word in the document. A headline then corresponds to a path through the HMM from the start state S to the end state E. Finally, for selecting the most likely headline out of this set of compressed sentences, the Viterbi algorithm described in [83] is used.
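A generic Viterbi sketch of the kind used for this selection (HMM Hedge's actual topology, with its H, G and Gstart states, is considerably more involved):

    def viterbi(observations, states, start_p, trans_p, emit_p):
        """Return the most probable state path for an observation sequence."""
        # trellis maps each state to (probability of best path, path itself)
        trellis = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
        for obs in observations[1:]:
            trellis = {s: max(((p * trans_p[q][s] * emit_p[s][obs], path + [s])
                               for q, (p, path) in trellis.items()),
                              key=lambda cand: cand[0])
                       for s in states}
        return max(trellis.values(), key=lambda cand: cand[0])[1]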
4.2. Rule-based Sentence Compression

Aside from the HMM-based statistical noisy-channel approach described above, Zajic et al. [88] outline a second approach to sentence compression which uses a "parse-and-trim" method that generates a headline for a given document by condensing its lead sentence according to a linguistically motivated algorithm operating on syntactic tree structures. By applying 15 hand-crafted syntactic compression rules - in a specified order -, grammatical constituents are iteratively removed from the parse tree of the original sentence. That way, less relevant content is successively eliminated from the input, until a certain predefined length threshold is met. This process is carried out in up to three consecutive stages as follows:
• stage 1: elimination of low-content units from the parse tree of the input sentence (see table 4.1)
Step 1 - Task: remove temporal expressions. Notes: using IdentiFinder [9].
Step 2 - Task: select a root S node in the parse tree of the sentence that could serve as the root of its compression. Notes: A node in a tree is considered a root S node if it is labelled S in the parse tree and has children that are labelled NP and VP, in that order.
Step 3 - Task: remove preposed adjuncts. Notes: i. e. constituents that precede the first NP (presumably representing the subject of the sentence) under the root S node.
Step 4 - Task: remove some determiners. Notes: i. e. leaf nodes that are assigned the POS tag 'DT'.
Table 4.1.: Trimmer algorithm - stage 1
• stage 2: removal of further linguistically peripheral material through successive deletions of constituents (see table 4.2)
Step 5 - Task: remove conjunctions. Notes: In the case of a conjunction with two children, the one supposedly carrying less important information is discarded: for "and", the second child is eliminated; for "but", the first child is abandoned.
Step 6 - Task: remove modal verbs. Notes: applies to VPs in which the head is a modal verb and the head of the child VP is a form of "have" or "be".
Step 7 - Task: remove the complementizer "that". Notes: removes the term "that" from the input when it occurs as a complementizer.
Step 8 - Task: apply the XP over XP rule. Notes: The variable XP may take two values: NP and VP. In constructions of the form [XP [XP . . . ] . . . ], the second subtree (". . . ") is removed.
Table 4.2.: Trimmer algorithm - stage 2
• stage 3: elimination of PPs and subordinate clauses (see table 4.3)
Step 9 - Task: remove PPs that do not contain NEs.
Step 10 - Task: remove all PPs under SBARs.
Step 11 - Task: remove SBARs.
Step 12 - Task: backtrack to the state before step 9.
Step 13 - Task: remove SBARs.
Step 14 - Task: remove PPs that do not contain NEs.
Step 15 - Task: remove all PPs.
Notes: In steps 9 to 11, the algorithm first attempts to achieve the desired sentence length by removing allegedly smaller constituents (PPs) before larger ones (SBARs); if this cannot be accomplished, the smaller constituents are restored (step 12). After that, the system first removes a larger component (step 13) and then resumes the deletion of smaller constituents (steps 14 and 15).
Table 4.3.: Trimmer algorithm - stage 3

Since the rules displayed in table 4.3 are prone to removing meaningful pieces of information, they are executed last, only when there are no other types of rules left to apply. It should be noted that multiple candidate compressions can be generated from a single source sentence by setting the length limit to be very small and storing the state of the outcome after each rule application as a compressed variant. The candidate compressions produced in this way can then be used as input for a multi-document summarization system (see [88] for more details).

Inspired by Zajic et al.'s Trimmer algorithm [88], which has been detailed in the previous paragraph, Perera and Kosseim [64] have implemented a variety of sentence compression approaches using syntax-based pruning heuristics. The first one is based solely on syntactic simplification operations that remove specific subtrees of an input sentence without considering their particular relevance to a specified query or topic. The rationale underlying this procedure is that particular syntactic structures per se carry no more than incidental information, so that discarding them is not supposed to deteriorate the content of the summary significantly. Thus, the following syntactic structures are eliminated:
• relative clauses: "It's over", said Tom Browning, an attorney for Newt Gingrich, who was not present at Thursday's hearing.
• adjective phrases: Mark Barton, the 44-year-old day trader at the center of Thursday's bloody rampage, was described by neighbors in the Atlanta suburb of Morrow as a quiet, churchgoing man who worked all day on his computer.
• adverbial phrases: So surely there will be a large number of people who only know us for Yojimbo.
• trailing conjoined VPs: The Southern Poverty Law Center has accumulated enough wealth in recent years to embark on a major construction project and to have assets totaling around $100 million.
• selected types of PPs (namely, PPs attached to NPs, to an entire clause, or to VPs that are positioned prior to the verb):
In the Public Records Office in London archivists are creating a catalog of all British public records regarding the Domesday Book of the 11th century.

Arguing that such an exclusively syntax-based pruning of constituents is susceptible to eliminating subtrees that de facto contain some piece of information which is relevant for the summary at hand, in a second step the foregoing framework is enriched with a relevancy-based score. Hence, in addition to the aforementioned syntax-driven compression rules, a measure is incorporated for determining the relevancy of a subtree which the heuristics consider a candidate for pruning. For example, in the case of query-based summarization, the cosine similarity between the TF*IDF values of the subtree that is to be eliminated subject to the compression rules and the specified query is calculated. If the resulting score is above a certain predefined threshold, the subtree under consideration is not discarded, as it appears to provide relevant content. (A sketch of such a relevancy filter is shown after table 4.4.)

By driving the pruning process based on the syntactic structure of the source sentence, the previously outlined techniques focus on maintaining its grammaticality. However, Perera and Kosseim [64] propose a third approach which concentrates on retaining important pieces of information from the input, rather than ensuring that the output is proper, grammatical English. The idea is to identify those parts of the input that express irrelevant information and remove the subtrees in which they are embedded. Accordingly, for each subtree (except for NPs, VPs and individual words) the TF*IDF-based cosine similarity with a predefined query or topic is calculated. Subtrees whose score is below a certain threshold are eliminated, whereas the others are kept.

Dunlavy et al. [27] present another sentence compression approach in the spirit of the works that have been delineated in the prior paragraphs. In order to be able to include a greater number of sentences in a summary - thus potentially gaining additional useful information -, they manually define a set of compression rules for eliminating multiple syntactic structures from an input sentence, based on the assumption that these components usually provide no more than secondary information. For the identification of the syntactic constituents that are to be discarded, they make use of shallow parsing techniques in the form of POS tagging. The resulting pruning heuristics applied within their sentence compression framework are listed in table 4.4.
sentences that begin with an imperative - Pruning pattern: The first word of the sentence is tagged as a verb in base form ('VB').
sentences that contain a personal pronoun at or near the start - Pruning pattern: A personal pronoun ('PRP') appears within the first eight words of the sentence and it is not preceded by a proper noun; AND the preceding sentence (in the original document) has not been selected for the summary (→ an issue exceeding the simple sentence compression problem).
gerund clauses - Pruning pattern: The gerund clause is at the start of the sentence or immediately follows a comma; AND it has the gerund ('VBG') as the lead word or as the second word following either a preposition ('IN') or one of the terms "while" or "during"; the end of the clause is defined by a comma or period.
restrictive relative clauses and appositives - Pruning pattern: The clause follows a comma; AND it begins with one of the following pronouns: "who", "when", "where", "which" or "in which"; the end of the clause is defined by a comma or period.
intra-sentential attributions - Pruning pattern: The clause contains a verb which can be found in a manually compiled list of about 50 verbs that are customarily used in attributions; AND it terminates with "that" (if it is located at the start of the sentence) OR it follows the last comma (if it is placed at the end of the sentence).
lead adverbs - Pruning pattern: The adverb ('RB') is located at the beginning of the sentence.
Table 4.4.: Syntax-based pruning heuristics applied in [27]
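Returning to the relevancy filter of Perera and Kosseim [64] referenced above, the following minimal sketch illustrates a TF*IDF-based cosine check that vetoes the pruning of a query-relevant subtree. The `idf` dictionary and the threshold value are hypothetical placeholders, not parameters taken from [64].

```python
import math
from collections import Counter

def tfidf_cosine(tokens_a, tokens_b, idf):
    """Cosine similarity between the TF*IDF vectors of two token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)

    def norm(v):
        return math.sqrt(sum((c * idf.get(w, 1.0)) ** 2 for w, c in v.items()))

    dot = sum(va[w] * vb[w] * idf.get(w, 1.0) ** 2 for w in va if w in vb)
    denom = norm(va) * norm(vb)
    return dot / denom if denom else 0.0

def may_prune(subtree_tokens, query_tokens, idf, threshold=0.1):
    # A subtree matched by a syntactic pruning rule is only actually removed
    # if its relevancy to the query stays below the (hypothetical) threshold.
    return tfidf_cosine(subtree_tokens, query_tokens, idf) < threshold
```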
Part III. Framework for Syntax-driven Rule-based Sentence Simplification
Taking up the ideas of previous attempts at both text simplification and summarization, this work addresses the issue of simplifying NL text by compressing its sentences one after another to their respective central information, while still preserving the contextual information contained in each source sentence. More precisely, our primary objective is to split a given input sentence into one or more core sentences comprising those parts of the original one that convey key information, and zero or more context sentences including those phrases or clauses that provide only secondary information. To achieve this, we identify components of a sentence that may be disembedded - without losing fundamental information -, and transform each of them into self-contained simpler context sentences. The conversions required when separating out incidental constituents of a sentence are performed using a set of hand-crafted simplification rules which are applied repeatedly until no further simplification is possible. That way, each sentence is maximally simplified under the given rule set.

Since we focus on enhancing the results of a successive IE step rather than ameliorating the readability of the source text for a human audience, our system is restricted to syntactic simplification and does not offer a treatment of lexical simplification. While the lexical complexity of a text undeniably plays an important role when targeting human readers, NLP applications such as IE systems profit most from a simpler syntactic structure of the source sentences (cf. [17, 70]). Therefore, we solely concentrate on simplifying difficult syntactic constructs. Indeed, syntactic and lexical simplification are disparate NLP tasks, requiring different resources, tools and techniques to perform and evaluate [71]. Nevertheless, our system architecture can easily be extended in the future with a component handling lexical simplification.

The simplification operations used to reduce the syntactic complexity of the input NL text are encoded by a set of hand-crafted transformation rules which incorporate, on the one hand, sentence splitting and deletion for condensing the source sentences to their particular core information. On the other hand, paraphrasing - in terms of reordering and insertion - is applied in order to generate context sentences out of the extracted constituents. Hence, the full set of rewriting operations associated with syntactic simplification is employed. These transformation rules have been defined manually in the course of a rule engineering process based on an extensive linguistic analysis of sentences from the English Wikipedia, considering the following feature set: POS tags, named entity (NE) tags and phrasal nodes in constituency-based parse trees (as provided by the Stanford Parser). In this manner, seven types of constructs which commonly supply only supplementary information have been identified:
• relative clauses
• appositives
• participial phrases
• adjective and adverb phrases
• prepositional phrases
• lead noun phrases
• intra-sentential attributions
The rules for separating out these components into stand-alone contextual sentences will be specified in sections 7.2.1 to 7.2.7. Moreover, a variety of conjoined clauses
are split into separate sentences. Besides, our text simplification framework also provides a treatment of sentences incorporating particular punctuation. The corresponding heuristics are detailed in sections 7.2.8 and 7.2.9. The next chapters are structured as follows: chapter 5 outlines the workflow of our text simplification system. Chapter 6 then summarizes the functions of the framework’s first component - the analysis module -, whereas the following chapter 7 details its second component - the transformation module -, including a precise specification of the simplification rules that are applied in the process of syntactically simplifying an input NL text.
5. Workflow

Based on the idea of [71], our text simplification framework comprises three consecutive stages: segmentation, analysis and transformation. Each of these tasks is implemented within a separate module, resulting in the workflow shown in the diagram in figure 5.1. The system takes as input a NL text which is first decomposed into single sentences by the segmentation module and then passed on to the analyzer, which converts them into a representation that the transformation module can work with. Following this, one sentence after another is forwarded to the transformation stage where the main work is performed (see algorithm 1). By applying a set of hand-crafted grammar rules, each sentence is syntactically simplified into one to n core sentence(s) and zero to m context sentences. When no further simplification is possible, the transformation module calls the analyzer for the next sentence to process. After all sentences of the input dataset have been compressed in this way, the result of the simplification process is output.

Algorithm 1 Syntax-based sentence simplification (simplified version)
Input: sentence s
 1: repeat
 2:     r ← next rule                              # null if no more rules
 3:     if r is applicable to s then
 4:         C, P ← apply r_extract to s            # identify the set of constituents C to extract from s, and their positions P in s
 5:         for all constituents c ∈ C do
 6:             context ← apply r_paraphrase to c  # produce a context sentence
 7:             contextSet ← add context           # add it to the core's set of associated context sentences
 8: until r is null
 9: core ← delete tokens in s at positions p ∈ P   # reduce the input to its core
10: return core and contextSet                     # output the core and its context sentences
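For illustration, the following is a minimal Python rendering of algorithm 1, under the simplifying assumption of a single pass over the rule set (the actual framework reapplies rules until no further simplification is possible). The rule interface (`matches`, `extract`, `paraphrase`) is illustrative, not the framework's actual API.

```python
def simplify(sentence, rules):
    """Single-pass sketch of algorithm 1 on a token list.

    Each rule is assumed to expose `matches(s)`, `extract(s)` (returning the
    constituents to remove plus their token positions) and `paraphrase(c)`
    (turning a constituent into a stand-alone context sentence).
    """
    context_set, positions = [], set()
    for rule in rules:                              # r <- next rule
        if rule.matches(sentence):                  # if r is applicable to s
            constituents, pos = rule.extract(sentence)
            positions.update(pos)
            for c in constituents:                  # for all constituents c in C
                context_set.append(rule.paraphrase(c))
    core = [tok for i, tok in enumerate(sentence)
            if i not in positions]                  # reduce the input to its core
    return core, context_set                        # the core and its context sentences
```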
Figure 5.1.: Workflow of our text simplification framework. The flowchart runs: Start → Input: NL text → Segmentation: separation into single sentences → Analysis: generation of three representations per sentence ((1) POS tagged variant, (2) NE tagged variant, (3) constituency-based parse tree) → Transformation: application of syntax-based simplification rules → Output: structurally simplified version of each input sentence, consisting of (1) 1 to n core sentence(s) and (2) 0 to m associated context sentences → Stop.
6. Preprocessing Modules: Segmentation and Analysis
Following previous work in the field of text simplification, we apply a sentence simplification approach, i. e. our simplification system handles only one sentence at a time, disregarding interactions across neighbouring sentences. Thus, our transformation rules operate on sentence level. Therefore, first of all, the input NL text has to be segmented into single sentences. This is the purpose of the first module of our framework, the segmentation component. The decomposed dataset is then passed on to the analysis stage, whose task is to take in the set of sentences and convert each of them into a representation that the subsequently applied transformation module can work with. Hence, based on the requirements of the simplification rules encoded in the transformer, the analysis component needs to generate three representations per input NL sentence: first, a version in which each word is part-of-speech tagged; second, a variant with NE tags; and finally, a phrasal node representation in the form of a constituency-based parse tree (see the example in figure 6.1). The specified output is produced with the help of a number of NLP tools provided by the Stanford NLP Group, namely the Stanford POS Tagger, the Stanford Named Entity Recognizer and the Stanford Parser. For more details concerning this matter, see section 2.2.
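Our framework relies on the Java implementations of the Stanford tools; purely as an illustration of producing the three representations programmatically, the following sketch uses stanza, the Stanford NLP Group's Python library, as a stand-in. Note that stanza emits BIO-style NER labels (e. g. 'S-PERSON') rather than the bare labels shown in figure 6.1, and that the constituency processor requires a recent stanza version.

```python
import stanza

# Models must be downloaded once beforehand: stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,ner,constituency")

doc = nlp("Obama was named a 2009 Nobel Prize laureate.")
sent = doc.sentences[0]

pos_tagged = " ".join(f"{w.text}_{w.xpos}" for w in sent.words)   # POS variant
ne_tagged = " ".join(f"{t.text}/{t.ner}" for t in sent.tokens)    # NE variant
parse_tree = sent.constituency                                    # parse tree

print(pos_tagged)
print(ne_tagged)
print(parse_tree)
```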
Figure 6.1.: Representations generated by the analysis module on a sample sentence
original sentence: Obama was named a 2009 Nobel Prize laureate. (6.1.1) Input sentence
(S (NP (NNP Obama)) (VP (VBD was) (VP (VBN named) (NP (DT a) (CD 2009) (NNP Nobel) (NNP Prize) (NN laureate)))) (. .)) (6.1.2) Parse tree representation
POS tagged representation: Obama_NNP was_VBD named_VBN a_DT 2009_CD Nobel_NNP Prize_NNP laureate_NN ._. (6.1.3)
NE tagged representation: Obama/PERSON was/O named/O a/O 2009/O Nobel/O Prize/O laureate/O ./O (6.1.4)
7. Key Module: Transformation

Performing the actual syntactic simplification, the transformation module represents the key component of our text simplification framework. Referring to previous attempts at syntax-based sentence compression [27, 88, 64], the idea is to simplify the original - presumably complex - input sentence by splitting conjoined clauses into separate sentences and by eliminating specific syntactic sub-structures, namely those containing only minor information. However, unlike recent approaches in the field of extractive sentence compression, we do not simply drop these constituents, which would result in a considerable loss of background information, but rather aim at preserving the full informational content of the original sentence. Thus, on the basis of syntax-driven heuristics, components which typically provide mere secondary information are identified and subsequently transformed into simpler stand-alone context sentences with the help of paraphrasing operations adopted from the text simplification area.

By analyzing the structure of hundreds of sample sentences from the English Wikipedia, we have determined constituents that commonly supply no more than negligible background information. These components comprise the following syntactic elements:
• non-restrictive relative clauses (e. g. "The city's top tourist attraction was the Notre Dame Cathedral, which welcomed 14 million visitors in 2013.")
• non-restrictive (e. g. "He plays basketball, a sport he participated in as a member of his high school's varsity team.") and restrictive appositive phrases (e. g. "He met with prominent foreign figures including former British Prime Minister Tony Blair.")
• participial phrases offset by commas (e. g. "The deal, titled Joint Comprehensive Plan of Action, saw the removal of sanctions in exchange for measures that would prevent Iran from producing nuclear weapons.")
• adjective and adverb phrases delimited by punctuation (e. g. "Overall, the economy expanded at a rate of 2.9 percent in 2010.")
• particular prepositional phrases (e. g. "In December 2008 and in 2012, Time magazine named Obama as its Person of the Year.")
• lead noun phrases (e. g. "Six weeks later, Alan Keyes accepted the Republican nomination to replace Ryan.")
• intra-sentential attributions (e. g. "He said that both movements seek to bring justice and equal rights to historically persecuted peoples.")
• parentheticals (e. g. "He signed the reauthorization of the State Children's Health Insurance Program (SCHIP).")
Besides, both conjoined clauses presenting specific features and sentences incorporating particular punctuation are split into separate sentences. After having thus identified the syntactic phenomena that generally require simplification, we have determined the characteristics of those constituents, using a number of syntactic features (constituency-based parse trees generated by the Stanford Parser, as well as POS tags provided by the Stanford POS Tagger) that have occasionally been enhanced with the semantic feature of NE tags produced by the Stanford Named Entity Recognizer. Based upon these properties, we have then specified a set of hand-crafted grammar rules for carrying out the syntactic simplification operations, which are applied one after another to the given input sentence. By disembedding linguistically peripheral material, a more concise core sentence is produced, augmented by a number of related self-contained contextual sentences. That way, we expect to facilitate a subsequent IE task.
7.1. Three-Stage Approach

The transformation module takes as input the three representations the analyzer has previously generated of the sentence that is currently being processed (i. e. its constituency-based parse tree, as well as a POS and NE tagged version). It then applies the simplification rules we have specified one after another to the source sentence, following a three-stage approach. First, clauses or phrases that are to be separated out, including their respective antecedent - where required -, have to be identified by pattern matching. If a match is found, a context sentence is constructed by either linking the extractable component to its antecedent or by inserting a complementary constituent that is required to make it a full sentence. Finally, the main sentence is reduced by dropping the clause or phrase, respectively, that has been transformed into a stand-alone context sentence. It should be noted that the order in which the simplification rules are applied is rather arbitrary, with the rule for the extraction of PPs which are not delimited by punctuation being the only exception. Such phrases have to be handled last, in order to make sure that only specific ones are disembedded from the input (for more information regarding this issue, see section 7.2.5).
7.2. Simplification Rules The following section describes the syntactic phenomena that are tackled by our text simplification framework, together with the simplification rules which are applied for separating them out into stand-alone context sentences. It is not supposed to provide a full specification of the extraction rules upon which our system operates, but rather a detailed discussion of the underlying principles.
7.2.1. Relative Clauses

A relative clause is one that is attached to its antecedent by a relative pronoun. There are two types of relative clauses, differing in the semantic relation between the clause and the phrase to which it refers, which may be either restrictive or non-restrictive. In the former case, the relative clause is strongly connected to its antecedent, providing information that identifies the noun it modifies (e. g. "Obama criticized leaders who refuse to step off."). Thus, it supplies essential information and therefore cannot be eliminated without affecting the meaning of the sentence. In contrast, non-restrictive relative clauses are parenthetic comments which usually describe, but do not further define, their antecedent (e. g. "Obama brought attention to the New York City Subway System, which was in a bad condition at the time."), and hence can be left out without disrupting the meaning or structure of the sentence [66]. In short, restrictive relative clauses are an integral part of the phrase to which they are linked and thus should not be separated out of the sentence. Consequently, only non-restrictive relative clauses are detached from the main clause and transformed into self-contained context sentences by our simplification model.

As non-restrictive relative clauses are usually set off by commas - unlike their restrictive counterparts - they can easily be distinguished from one another on a purely syntactic basis. In the course of the previously conducted rule engineering process, the constituency parse trees of hundreds of sentences containing non-restrictive relative clauses have been analyzed in order to identify a general pattern which typically signifies this type of clause. That way, we have determined that the antecedent of such a relative clause must be a NP which is separated from it - if at all - usually only by a PP, and which is succeeded by a comma. Thus, the following pattern commonly indicating a non-restrictive relative clause has been deduced: NP [PP]? , SBAR , whereby the subordinate clause has to be introduced by a relative pronoun. In this connection, the relative pronouns who, whom, which and where, as well as a combination of a preposition and one of the pronouns named above, are factored in by our text simplification framework. Two exemplary simplification rules treating non-restrictive relative clauses commencing with the pronouns who, whom or which (see figure 7.1.1) and where (see figure 7.1.2), respectively, are shown below.
Figure 7.1.: Example simplification rules for non-restrictive relative clauses

transformation rule for simplifying non-restrictive relative clauses introduced by the pronouns who(m) or which:
input sentence: NP [PP]? , (SBAR (WHNP ([WP|WDT] [who(m)|which])) S)
context sentence: NP + [PP]? + S + .
(7.1.1) Relative pronouns who, whom, which

transformation rule for simplifying non-restrictive relative clauses introduced by the pronoun where:
input sentence: , (SBAR (WHADVP (WRB [where])) S)
context sentence: 'There' + S + .
(7.1.2) Relative pronoun where
In order to identify whether the sentence which is currently under consideration includes a non-restrictive relative clause, we take its representation in the form of a constituency parse tree and traverse its nodes in search of the aforementioned pattern. If a corresponding clause is found, a context sentence containing additional information about the referred phrase is constructed by linking the relative clause (without the pronoun) to the phrase that has been identified as its antecedent. At the same time, the main sentence is reduced by dropping the relative clause that has been extracted (including the preceding comma). Examples 7.2 and 7.3 illustrate the procedure described.
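As a minimal sketch of the identification stage, the following searches an NLTK constituency tree for the NP [PP]? , SBAR pattern with a who(m)/which WHNP. The full rules of figure 7.1 (including the where case, preposition + pronoun combinations and the subsequent context sentence construction) are richer than this.

```python
from nltk.tree import Tree

REL_PRONOUNS = {"who", "whom", "which"}

def find_nonrestrictive_relative(tree):
    """Find the pattern NP [PP]? , SBAR(WHNP who(m)/which S) in a parse tree.

    Returns (antecedent_NP, embedded_S) or None; a sketch of the
    identification stage only.
    """
    for node in tree.subtrees():
        kids = [k for k in node if isinstance(k, Tree)]
        for i, kid in enumerate(kids):
            if kid.label() != "NP":
                continue
            j = i + 1
            if j < len(kids) and kids[j].label() == "PP":
                j += 1                       # optionally skip one intervening PP
            if (j + 1 < len(kids) and kids[j].label() == ","
                    and kids[j + 1].label() == "SBAR"):
                sbar = kids[j + 1]
                if (len(sbar) >= 2 and sbar[0].label() == "WHNP"
                        and sbar[1].label() == "S"
                        and sbar[0].leaves()[0].lower() in REL_PRONOUNS):
                    return kid, sbar[1]      # antecedent and relative clause body
    return None

tree = Tree.fromstring(
    "(S (NP (NNP Kenya)) (, ,) (SBAR (WHNP (WDT which)) "
    "(S (VP (VBZ is) (NP (DT the) (NN homeland))))))")
print(find_nonrestrictive_relative(tree))
```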
Figure 7.2.: Example of the simplification of a non-restrictive relative clause (’which’) original sentence: Obama was the first U.S. president ever to visit Kenya, which is the homeland of his father. (7.2.1) Input sentence
(NP Kenya) (, ,) (SBAR (WHNP (WDT which)) (S is the homeland of his father)) (7.2.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: Obama was the first U.S. president ever to visit Kenya. • context sentence: Kenya is the homeland of his father. (7.2.3) Output
Figure 7.3.: Example of the simplification of a non-restrictive relative clause (’where’) original sentence: Obama is a graduate of Harvard Law School, where he served as president of the Harvard Law Review. (7.3.1) Input sentence
(, ,) (SBAR (WHADVP (WRB where)) (S he served as president of the Harvard Law Review)) (7.3.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: Obama is a graduate of Harvard Law School. • context sentence: There he served as president of the Harvard Law Review. (7.3.3) Output
7.2.2. Appositive Phrases

An appositive is a noun phrase that further characterizes the phrase to which it refers. Just like relative clauses, appositions can be classified as restrictive and non-restrictive, respectively. Non-restrictive appositives are separate information units, marked by their segregation through punctuation [66]. Representing parenthetical information, they can be omitted without changing the meaning of the sentence. For resolving such an apposition, our simplification model uses a pattern which matches a NP followed by a comma that is again followed by a NP, with each of the two NPs potentially succeeded by a PP: NP [PP]? , NP [PP]? [,|EOS] . Figure 7.4 gives an example.
Figure 7.4.: Example of the simplification of a non-restrictive appositive phrase original sentence: He returned to Kenya for a visit to his father’s birthplace , a village near Kisumu in rural western Kenya. (7.4.1) Input sentence
(NP his father's birthplace) (, ,) (NP a village near Kisumu in rural western Kenya) (7.4.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: He returned to Kenya for a visit to his father’s birthplace. • context sentence: His father’s birthplace was a village near Kisumu in rural western Kenya. (7.4.3) Output
In order to avoid inadvertently mistaking coordinated noun phrases for appositives, the following heuristic is applied: From the phrase that is deemed an appositive by matching the pattern above, we scan ahead, looking one after the other at its sibling nodes in the parse tree. If a conjunction and or or is encountered, the analysis of the appositive is rejected [71]. That way, we avoid wrong analyses like: "Obama has talked about using alcohol, [appos marijuana], and cocaine." At the same time, this rule unfortunately leads to the erroneous denial of conjoined appositives incorporating a conjunction (see example 7.5). A starting point for future improvements regarding this problem might be to take into account the POS tags of the words representing the NPs which match the pattern above: if and only if one of them - i. e. either the one before or the one after the comma - incorporates a proper noun as indicated by the tags 'NNP' or 'NNPS', the analyzed phrase is assumed to act as an apposition. A sketch of the rejection heuristic is shown below.
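The following is a minimal sketch of the coordination-rejection scan, assuming an NLTK tree; the function and parameter names are illustrative, not the framework's actual API.

```python
from nltk.tree import Tree

def looks_coordinated(parent, np_index):
    """Scan the siblings following a candidate appositive NP for 'and'/'or'.

    `parent` is the node whose children matched the appositive pattern and
    `np_index` the position of the second NP; if a coordinating conjunction
    is found, the appositive analysis is rejected (heuristic adopted
    from [71]).
    """
    for sibling in list(parent)[np_index + 1:]:
        if (isinstance(sibling, Tree) and sibling.label() == "CC"
                and sibling.leaves()[0].lower() in {"and", "or"}):
            return True     # coordinated NP, not an apposition
    return False
```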
Figure 7.5.: Example of the simplification of a non-restrictive conjoined appositive phrase
original sentence: Obama returned to Honolulu to live with his maternal grandparents, Madelyn and Stanley Dunham. (7.5.1) Input sentence
(NP (PRP$ his) (JJ maternal) (NNS grandparents)) (, ,) (NP (NP (NNP Madelyn)) (CC and) (NP (NNP Stanley) (NNP Dunham))) (7.5.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: Obama returned to Honolulu to live with his maternal grandparents, Madelyn and Stanley Dunham. (7.5.3) Output
An appositive is converted into a contextual sentence by taking the phrase including a proper noun - if any, otherwise simply the one before the comma - as its first component and appending the appropriate form of the verb "to be" followed by the remaining NP. The second type of appositives, restrictive apposition, does not contain punctuation [66]. These constructs are simplified as well, provided they match a particular sequence of NE and POS tags. To be exact, we first make use of the NE tagged representation of the given input sentence, searching for words marked with either a "PERSON" or "ORGANIZATION" tag. If such entities have been detected, we next check the POS tags of the respective preceding words. If we encounter a sequence of nouns or proper nouns, potentially with some preceding adjectives, determiners, numbers and/or possessive pronouns, we assume that this prefix string plays the role of a restrictive appositive phrase. In this case, the constituents ahead of the proper nouns are detached and transformed into an isolated context sentence by linking them via an auxiliary verb (one of is, are, was, were) to the identified person or organization entity, respectively. For an example, see figure 7.6.
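As a minimal sketch of this NE/POS sequence check, the following locates a restrictive appositive prefix before a person or organization entity; the tag sets and the helper name are illustrative simplifications of the rule described above.

```python
NOMINAL_PREFIX = {"NN", "NNS", "NNP", "NNPS", "JJ", "DT", "CD", "PRP$"}

def find_restrictive_appositive(tokens, pos, ner):
    """Locate a 'Republican nominee John McCain'-style restrictive appositive.

    `tokens`, `pos` and `ner` are parallel lists; returns the spans
    (appositive_start, appositive_end) and (entity_start, entity_end),
    or None.
    """
    i = 0
    while i < len(tokens):
        if ner[i] in {"PERSON", "ORGANIZATION"}:
            j = i                                 # extend the entity rightwards
            while j + 1 < len(tokens) and ner[j + 1] == ner[i]:
                j += 1
            k = i                                 # scan the preceding words
            while k > 0 and ner[k - 1] == "O" and pos[k - 1] in NOMINAL_PREFIX:
                k -= 1
            if k < i:
                return (k, i), (i, j + 1)
            i = j
        i += 1
    return None

tokens = "Obama defeated Republican nominee John McCain".split()
pos = ["NNP", "VBD", "JJ", "NN", "NNP", "NNP"]
ner = ["PERSON", "O", "O", "O", "PERSON", "PERSON"]
print(find_restrictive_appositive(tokens, pos, ner))   # ((2, 4), (4, 6))
```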
Figure 7.6.: Example of the simplification of a restrictive appositive phrase original sentence: Obama defeated Republican nominee John McCain in the general election. (7.6.1) Input sentence
tagged sentence: • NE tags: Obama/PERSON defeated/O Republican/O nominee/O John/PERSON McCain/PERSON in/O the/O general/O election/O ./O • POS tags: Obama_NNP defeated_VBD Republican_JJ nominee_NN John_NNP McCain_NNP in_IN the_DT general_JJ election_NN ._. (7.6.2) NE and POS tags
result of the transformation stage: • core sentence: Obama defeated John McCain in the general election. • context sentence: John McCain was a Republican nominee. (7.6.3) Output
7.2.3. Participial Phrases

A participial phrase is made up of a present or past participle, together with an introductory adverbial connector (e. g. after, when, although), the object(s) of the participle, or modifiers which can occur in verb phrases [1]. In order to ensure that the participial phrase is not an essential part of the sentence, only those that are set off by commas are extracted and converted into stand-alone context sentences. Corresponding transformations are shown in the examples below (see figures 7.7 to 7.9). Whether or not to eliminate participles operating as pre- or postmodifying adjectives without being separated through punctuation is a challenging question, as in many cases participial modifiers of this kind resolve referential ambiguities (comparable to the restrictive versions of relative clauses and appositives) and thus cannot be left out without compromising the meaning of the sentence (cf. "President Bush and Congress agreed on the joint resolution authorizing the Iraq War."). However, at times, participial phrases that are not set off by commas provide no more than some extractable piece of background information, as e. g. in "Google Inc. is an American multinational technology company specializing in Internet-related services and products." or "Google acquired a 15.7% stake in Arris Group valued at $300 million." Accordingly, specifying rules for dropping such describing participial phrases is left to future work. Participials with an introductory adverbial connector are handled in the context of prepositional phrases (see section 7.2.5) and conjoined clauses (see section 7.2.8), respectively.
Figure 7.7.: Example of the simplification of a past participle in postmodification
original sentence: His mother, born in Wichita, Kansas, was of mostly English ancestry. (7.7.1) Input sentence
(NP (NP His mother) (, ,) (VP (VBN born) (PP in Wichita, Kansas)) (, ,)) (7.7.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: His mother was of mostly English ancestry. • context sentence: His mother was born in Wichita, Kansas. (7.7.3) Output
Figure 7.8.: Example of the simplification of a past participle in premodification original sentence: Assigned for three months as Obama’s adviser at the firm, Robinson joined him at several group social functions. (7.8.1) Input sentence
(S (S (VP (VBN Assigned) (PP for three months as Obama's adviser at the firm))) (, ,) ...) (7.8.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: Robinson joined him at several group social functions. • context sentence: This was when being assigned for three months as Obama’s adviser at the firm. (7.8.3) Output
Figure 7.9.: Example of the simplification of a present participle in postmodification
original sentence: Obama was reelected president in 2012, defeating Republican nominee Mitt Romney. (7.9.1) Input sentence
(S ... (, ,) (S (VP (VBG defeating) (NP Republican nominee Mitt Romney)))) (7.9.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: Obama was reelected president in 2012. • context sentence: This was when defeating Republican nominee Mitt Romney. (7.9.3) Output
7.2.4. Adjective and Adverb Phrases

An adjective phrase (ADJP) is one whose head is an adjective, optionally complemented by a number of dependent elements. It further characterizes the NP it modifies [12, 66]. Similar to participial phrases, only those ADJPs that are set off by commas are detached and transformed into contextual sentences with the help of the simplification rule depicted in figure 7.10. Figure 7.11 provides a corresponding example. By contrast, ADJPs that are not separated by punctuation customarily represent an integral part of the phrase to which they refer (cf. "It is the subdivision responsible for providing emergency services." or "The council plays a largely passive role in the city government."). As a result, such phrases are not extracted from the input sentence.

transformation rule for simplifying offset adjective phrases:
input sentence: NP [PP]? , ADJP [,|EOS]
context sentence: NP + [PP]? + aux + ADJP + .
Figure 7.10.: Simplification rule for adjective phrases
Figure 7.11.: Example of the simplification of an adjective phrase original sentence: The company announced plans to install thousands of solar panels to provide up to 1.6 megawatts of electricity , enough to satisfy approximately 30% of the campus’ energy needs. (7.11.1) Input sentence
(NP (NP up to 1.6 megawatts) (PP of electricity)) (, ,) (ADJP enough . . . needs) (. .) (7.11.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: The company announced plans to install thousands of solar panels to provide up to 1.6 megawatts of electricity. • context sentence: Up to 1.6 megawatts of electricity were enough to satisfy approximately 30% of the campus’ energy needs. (7.11.3) Output
Aside from that, an adverb phrase (ADVP) consists of an adverb as its head, together with an optional pre- or postmodifying complement [12, 66]. To guarantee that a phrase introduced by an adverb is not a fundamental constituent of the currently treated sentence, such phrases are separated out only if they are offset by commas - as is the case with ADJPs. This is done by applying the respective transformation rule (see figure 7.12), resulting in simplifications as shown in the examples in figure 7.13 and figure 7.14.
transformation rule for simplifying offset adverb phrases:
input sentence: , ADVP [,|EOS] or (S ADVP , ...)
context sentence: 'This' + aux + ADVP + .
Figure 7.12.: Simplification rule for adverb phrases
Figure 7.13.: Example of the simplification of an adverb phrase original sentence: Most French rulers since the Middle Ages made a point of leaving their mark on a city that, contrary to many other of the world’s capitals, has never been destroyed by catastrophe or war. (7.13.1) Input sentence
(, ,) (ADVP contrary to many other of the world's capitals) (, ,) (7.13.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: Most French rulers since the Middle Ages made a point of leaving their mark on a city that has never been destroyed by catastrophe or war. • context sentence: This was contrary to many other of the world’s capitals. (7.13.3) Output
Figure 7.14.: Example of the simplification of an adverb phrase
original sentence: Later in 1981, he transferred as a junior to Columbia College. (7.14.1) Input sentence
(S (ADVP Later in 1981) (, ,) ...) (7.14.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: He transferred as a junior to Columbia College. • context sentence: This was later in 1981. (7.14.3) Output
7.2.5. Prepositional Phrases

A prepositional phrase is composed of a preposition and a complement which is characteristically a noun phrase (e. g. on the table, in terms of money), a nominal wh-clause (e. g. from what he said) or a nominal -ing clause (e. g. by signing a peace treaty). PPs may function as a postmodifier in a noun phrase, an adverbial phrase or a complement of a verb or an adjective [66]. Depending on their particular syntactic function, they contribute more or less fundamental information to the sentence of which they are part. Among the different types of PPs mentioned above, those that play the role of adverbial phrases offset by commas are by far the easiest to handle, as they may generally be pruned without losing vital information. Assuming that they match one of the constituency parse tree patterns that have been identified for this type of phrase (most notably the ones displayed in figure 7.15), the constituents of the PP under consideration are separated out of the given input sentence and transformed into simpler detached ones with the help of the corresponding rule. An example is shown below in figure 7.16.
Figure 7.15.: Most important simplification rules for offset prepositional phrases

example transformation rule for simplifying offset prepositional phrases:
input sentence: , [ADVP]? PP [NP|PP]? [,|EOS]
context sentence: 'This' + aux + [ADVP]? + PP + [NP|PP]? + .
(7.15.1) Parse tree pattern 1

example transformation rule for simplifying offset prepositional phrases:
input sentence: (S ... PP [NP]? , ...)
context sentence: 'This' + aux + PP + [NP]? + .
(7.15.2) Parse tree pattern 2
Figure 7.16.: Example of the simplification of offset prepositional phrases original sentence: On December 10, 2013, Cuban President Raúl Castro, in a significant public moment, shook hands with Obama. (7.16.1) Input sentence
(S (PP On December 10, 2013) (, ,) ...) (7.16.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: Cuban President Raúl Castro shook hands with Obama. • context sentence: This was on December 10, 2013. • context sentence: This was in a significant public moment. (7.16.3) Output
In fact, things get much more complicated in the case of PPs used as postmodifying adverbials or NPs without segregation through punctuation, since in this context they may either provide information that identifies the phrase to which they refer (e. g. "Obama's election as the first black president of the Harvard Law Review gained national media attention.") or merely describe, but not further define, their antecedent (e. g. "Paris has an extensive road network with more than 2,000 km of highways and motorways."). Distinguishing between such defining and describing PPs is extremely demanding. As a first approach to address this issue, we extract only a subgroup of such PPs which feature specific properties. By examining several hundreds of sample sentences from various Wikipedia articles, we have discovered that in many cases PPs constituting the last component of a sentence, in particular when either relating to a location, person or organization (as indicated by their respective NE tags), or representing a date (as signified by the POS tag 'CD'), may be removed without corrupting the meaning of the source sentence (cf. "Obama formally announced his candidacy in January 2003." or "Obama delivered the keynote address at the 2004 Democratic National Convention.").

Hence, we take the current version of the core sentence - after having applied all other simplification rules defined within this framework - and look at its last PP, if any. Provided that it either contains a word which is NE tagged with 'LOCATION', 'PERSON' or 'ORGANIZATION', or ends with a number (i. e. POS tag 'CD') or a proper noun (i. e. POS tag 'NNP' or 'NNPS'), we separate it out into a stand-alone context sentence. This process is recursively repeated until a PP lacking the aforementioned characteristics is encountered. An example is provided in figure 7.17; a sketch of this procedure is shown below.

In addition, it must be noted that phrases starting with the preposition "of" (cf. "Obama served on the boards of directors of the Woods Fund of Chicago." or "His mother spent most of the next two decades in Indonesia."), as well as those acting as a complement in a participial phrase (e. g. "His remarks were made to a group of Marines preparing for deployment to Afghanistan.") and sentences including an adjective or adverb in comparative or superlative form (like in "Google announced the setting up of its largest campus outside the United States.") are generally not eliminated, even though they fulfil the previously named requirements. The reason for this is that they usually serve as an integral component of the phrase to which they refer and therefore cannot be removed without resulting in sentences that are too curt or even acquire a different meaning.
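The following token-level sketch illustrates the recursive trailing-PP extraction, under the simplifying assumptions that the input excludes the final period and that the trailing PP spans from the sentence's last preposition ('IN') to its end; the real system walks the constituency parse instead.

```python
TRAILING_NE = {"LOCATION", "PERSON", "ORGANIZATION"}
TRAILING_POS = {"CD", "NNP", "NNPS"}

def extract_trailing_pps(tokens, pos, ner):
    """Recursively split off extractable trailing PPs as context sentences.

    `tokens`/`pos`/`ner` are parallel lists for the current core sentence;
    the auxiliary is fixed to 'was' for brevity.
    """
    contexts = []
    while True:
        starts = [i for i, p in enumerate(pos) if p == "IN"]
        if not starts or starts[-1] == 0:
            break
        start = starts[-1]
        if tokens[start].lower() == "of":        # 'of'-phrases are never split off
            break
        extractable = (TRAILING_NE & set(ner[start:])) or pos[-1] in TRAILING_POS
        if not extractable:
            break
        contexts.append("This was " + " ".join(tokens[start:]) + ".")
        tokens, pos, ner = tokens[:start], pos[:start], ner[:start]
    return tokens, contexts

tokens = "He met with his Irish cousins in Moneygall in May 2011".split()
pos = ["PRP", "VBD", "IN", "PRP$", "JJ", "NNS", "IN", "NNP", "IN", "NNP", "CD"]
ner = ["O"] * 7 + ["LOCATION", "O", "O", "O"]
core, ctx = extract_trailing_pps(tokens, pos, ner)
print(" ".join(core))   # He met with his Irish cousins
print(ctx)              # ['This was in May 2011.', 'This was in Moneygall.']
```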
Figure 7.17.: Example of the simplification of prepositional phrases without segregation through punctuation original sentence: He met with his Irish cousins in Moneygall in May 2011. (7.17.1) Input sentence
tagged sentence: • NE tags: He/O met/O with/O his/O Irish/O cousins/O in/O Moneygall/LOCATION in/O May/O 2011/O ./O • POS tags: He_PRP met_VBD with_IN his_PRP$ Irish_JJ cousins_NNS in_IN Moneygall_NNP in_IN May_NNP 2011_CD ._. (7.17.2) NE and POS tags
result of the transformation stage: • core sentence: He met with his Irish cousins. • context sentence: This was in May 2011. • context sentence: This was in Moneygall. (7.17.3) Output
Nevertheless, when applying the algorithm described above, we unfortunately lose essential information in a considerable number of sentences (see figure 7.18 for some examples), while occasionally phrases that admittedly contain mere extractable background information, but do not enclose one of the required POS or NE tags, are ignored (cf. "Obama defeated John McCain in the general election."). Thus, in order to avoid separating out a PP which represents an indispensable component of the input sentence, further research needs to be conducted on refining the rules that handle the extraction of PPs without segregation through punctuation.
Figure 7.18.: Example of the simplification of prepositional phrases without segregation through punctuation resulting in malformed core sentences
negative example 1: • original sentence: Obama also has roots in Ireland. • core sentence: Obama also has roots. • context sentence: This is in Ireland. (7.18.1)
negative example 2: • original sentence: He visited Barack in Hawaii only once. • core sentence: He visited Barack only once. • context sentence: This was in Hawaii. (7.18.2)
negative example 3: • original sentence: Obama’s advisers called for a halt to petroleum exploration in the Arctic. • core sentence: Obama’s advisers called for a halt to petroleum exploration. • context sentence: This was in the Arctic. (7.18.3)
Furthermore, resolving PPs acting as complements of verbs or adjectives is of similar complexity, as they are often closely connected to their antecedent [66] and therefore cannot be left out without corrupting the meaning of the sentence (cf. "Obama would become the first President to have been born in Hawaii.", "This took place in July 2011." or "Radio France is headquartered in Paris' 16th arrondissement."). However, in many cases such verb or adjective phrase modifiers contribute no more than some form of additional background information which can be eliminated, resulting in a sentence that is still both meaningful and grammatical (e. g. "Obama won with 70 percent of the vote.", "Obama's parents divorced in March 1964." or "Obama and Joe Biden were formally nominated by former President Bill Clinton."). Beyond that, there are many cases where it is arguable whether or not to extract PPs complementing verbs or adjectives, as illustrated in figure 7.19. Here, both source sentences are shortened to a grammatical, but terse core sentence. Hence, it might be preferable to preserve at least one of the PPs that have been included in the source.
Figure 7.19.: Example of the simplification of prepositional phrases acting as complements of verbs or adjectives example 1: • original sentence: After high school, Obama moved to Los Angeles in 1979 to attend Occidental College. • core sentence: Obama moved. • context sentence: This was to Los Angeles. • context sentence: This was after high school. • context sentence: This was to attend Occidental College. • context sentence: This was in 1979. (7.19.1)
example 2: • original sentence: Obama resigned from the Illinois Senate in November 2004 following his election to the U.S. Senate. • core sentence: Obama resigned. • context sentence: This was from the Illinois Senate. • context sentence: This was in November 2004. • context sentence: This was following his election to the U.S. Senate. (7.19.2)
Finally, PPs starting with the preposition "to" have been analyzed in more detail, as they incorporate a very important type of extractable constituent, namely phrases describing intentions (cf. "Obama commissioned a poll to assess his prospects in a 2004 U.S. Senate race."). As before, based on the constituency parse trees of numerous sample sentences containing this kind of phrase, we have searched for patterns that can typically be separated out from the core, resulting in compressed, though still informative, grammatical sentences. One of them is shown in the example in figure 7.20, which illustrates the way such sentences are simplified.
Figure 7.20.: Example of the simplification of a prepositional phrase containing the preposition "to" original sentence: Obama sent 275 troops to provide support and security for U.S. personnel and the U.S. Embassy in Baghdad. (7.20.1) Input sentence
(S (NP Obama) (VP (VBD sent) (NP 275 troops) (S (VP (TO to) (VP (VB provide) (NP support . . . Baghdad)))))) (7.20.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: Obama sent 275 troops. • context sentence: This was to provide support and security for U.S. personnel and the U.S. Embassy in Baghdad. (7.20.3) Output
To sum up, the category of PPs is definitely the most complicated one of all the structures that are treated within this simplification framework, often causing the output core sentences to be either too terse or too verbose. Accordingly, the foregoing transformation rules need to be reviewed in the future. It might be worth having a closer look at further selected prepositions in order to develop more fine-grained rules on the level of specific prepositions. Another suggestion would be to incorporate additional features, particularly semantic ones like semantic distance vectors measuring the semantic relatedness between components, for deciding whether or not to extract a particular PP.
7.2.6. Lead Noun Phrases

Occasionally, sentences may start with an inserted noun phrase, which in the majority of cases indicates a temporal expression. Such a phrase generally represents background information that can be eliminated from the main sentence without losing key information. This is achieved by applying the simplification rule displayed in figure 7.21. Figure 7.22 illustrates an example.

transformation rule for simplifying sentences that start with an inserted noun phrase:
input sentence: (S [ADVP]? NP , ...)
context sentence: 'This' + aux + [ADVP]? + NP + .
Figure 7.21.: Simplification rule for lead noun phrases
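As a minimal sketch, the rule of figure 7.21 can be applied to an NLTK constituency tree as follows; the auxiliary is fixed to 'was' here for brevity, whereas the framework chooses an appropriate form of 'to be'.

```python
from nltk.tree import Tree

def split_lead_np(sent_tree):
    """Apply the lead noun phrase rule of figure 7.21 to a parsed sentence.

    Returns (context_sentence, remaining_children) or None.
    """
    kids = [k for k in sent_tree if isinstance(k, Tree)]
    i = 1 if kids and kids[0].label() == "ADVP" else 0
    if (i + 1 < len(kids) and kids[i].label() == "NP"
            and kids[i + 1].label() == ","):
        lead = " ".join(w for k in kids[: i + 1] for w in k.leaves())
        context = "This was " + lead[0].lower() + lead[1:] + "."
        return context, kids[i + 2:]     # drop the lead phrase and its comma
    return None

tree = Tree.fromstring(
    "(S (NP (CD Six) (NNS days) (RB later)) (, ,) (NP (NNP NATO)) "
    "(VP (VBD took) (NP (NN leadership))) (. .))")
print(split_lead_np(tree)[0])   # This was six days later.
```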
Figure 7.22.: Example of the simplification of a lead noun phrase original sentence: Six days later, NATO took over leadership of the effort. (7.22.1) Input sentence
(S (NP Six days later) (, ,) (NP NATO) (VP took over leadership of the effort) (. .)) (7.22.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: NATO took over leadership of the effort. • context sentence: This was six days later. (7.22.3) Output
7.2.7. Intra-Sentential Attributions

A further sentence structure that has been considered for removal is the so-called intra-sentential attribution, which may be expressed in various ways, including in particular the following ones:
1. Michelle Obama said that he had successfully quit smoking.
2. Michelle Obama said he had successfully quit smoking.
3. Michelle Obama said: "He had successfully quit smoking."
4. Michelle Obama said that he had successfully quit smoking, and that he doesn't regret it.
5. He had successfully quit smoking, said Michelle Obama.
As [27] has already ascertained, identifying such attributions is in general not an easy task. Therefore, we focus on the most prominent structures, namely those upon which examples 1 and 4 are built. That is, an attribution must occur at the start of a sentence and terminate with the pronoun "that" - without any preceding punctuation, but potentially connected with a second main clause via a conjunction and another "that". To detect intra-sentential attributions of the aforementioned type, we draw on the syntax parse tree of the respective input sentence. This representation is traversed in search of the following pattern: a NP that is succeeded by a VP which is made up of a verb complemented by a subordinate clause starting with "that". Such a structure and its associated transformation rule are shown in figure 7.23. Hence, provided we come across a matching pattern in the parse tree of the sentence which is currently handled, the components forming the attribution are eliminated from the main sentence and transformed into a stand-alone context sentence with the help of the corresponding simplification rule. This procedure is illustrated by the example depicted in figure 7.24.
transformation rule for simplifying intra-sentential attributions:
input sentence: (S (NP ...) (VP [VBD|VBZ|VBP] (SBAR (IN [that]) ...)) ...)
context sentence: 'This' + aux + 'what' + NP + VP + .
Figure 7.23.: Example simplification rule for intra-sentential attributions
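A minimal sketch of matching this pattern on an NLTK tree follows; only the identification step is shown, and the context sentence ("This was what NP VP.") would then be built from the returned parts. The function name is illustrative, not the framework's actual API.

```python
from nltk.tree import Tree

ATTR_VERB_TAGS = {"VBD", "VBZ", "VBP"}

def match_attribution(s_tree):
    """Match the pattern of figure 7.23: NP + attribution verb + SBAR('that' S).

    Returns (attribution_NP, verb, embedded_S) or None.
    """
    kids = [k for k in s_tree if isinstance(k, Tree)]
    for np, vp in zip(kids, kids[1:]):
        if np.label() != "NP" or vp.label() != "VP":
            continue
        if vp[0].label() not in ATTR_VERB_TAGS:
            continue
        sbar = next((k for k in vp
                     if isinstance(k, Tree) and k.label() == "SBAR"), None)
        if (sbar is not None and len(sbar) >= 2
                and sbar[0].leaves() == ["that"] and sbar[1].label() == "S"):
            return np, vp[0], sbar[1]
    return None

tree = Tree.fromstring(
    "(S (NP (NNP Michelle) (NNP Obama)) (VP (VBD said) (SBAR (IN that) "
    "(S (NP (PRP he)) (VP (VBD had) (VP (ADVP (RB successfully)) "
    "(VBN quit) (NP (NN smoking))))))) (. .))")
np, verb, embedded = match_attribution(tree)
print(" ".join(embedded.leaves()))   # he had successfully quit smoking
```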
Figure 7.24.: Example of the simplification of an intra-sentential attribution original sentence: Michelle Obama said that he had successfully quit smoking. (7.24.1) Input sentence
(S (NP Michelle Obama) (VP (VBD said) (SBAR (IN that) (S he had successfully quit smoking)))) (7.24.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: He had successfully quit smoking. • context sentence: This was what Michelle Obama said. (7.24.3) Output
Premodifying PPs, commonly indicating the location or time of the attributing event, pose an important issue when separating out intra-sentential attributions (cf. "In June 2012, Obama said that the bond between the United States and Israel is "unbreakable"."). In order to preserve the original meaning of the input sentence, the components of such a preposed phrase need to be included in the context sentence that specifies the attribution rather than transferred into a separate one. For clarification, see the example provided in figure 7.25. Here, the PP "in February 2013" refers to the intra-sentential attribution "Obama said that" instead of the statement enclosed in the core sentence ("The U.S. military would reduce the troop level."). Thus, if it were extracted into a stand-alone contextual sentence, one would assume that it was the reduction of the troop level that took place in February 2013, rather than Obama's declaration.

Figure 7.25.: Example of the simplification of an intra-sentential attribution with a premodifying PP
original sentence: In February 2013, Obama said that the U.S. military would reduce the troop level in Afghanistan by February 2014. (7.25.1) Input sentence
(S (PP In February 2013) (, ,) (NP Obama) (VP (VBD said) (SBAR (IN that) (S the U.S. military . . . February 2014)))) (7.25.2) Parse tree representation of the relevant part
result of the transformation stage: • core sentence: The U.S. military would reduce the troop level. • context sentence: This was what Obama said in February 2013. • context sentence: This was by February 2014. • context sentence: This was in Afghanistan. (7.25.3) Output
The treatment of intra-sentential attributions that occur in reported speech without the subordinate clause being introduced by the pronoun "that" is left to future work, since their determination is a rather difficult task. Moreover, we currently also ignore direct speech, as it often causes parsing errors.
7.2.8. Conjoined Clauses Conjoined clauses are either independent clauses, i. e. they can stand by themselves as complete sentences, or dependent ones - due to missing some element that would make them full sentences - which are combined by coordinating or subordinating conjunctions [66]. Our model simplifies both coordinated and subordinated clauses, considering not only infix conjunctions, but also those in prefix position. Subordinated clauses with a prefix conjunction match the following general sentence pattern: CC clause1 , clause2 . Subordinating conjunctions that are handled encompass after, before, since as, while, when, if, so, though, although and because. The example in figure 7.26 illustrates how to transform sentences of this type. In contrast, coordinated and subordinated clauses with infix conjunctions correspond to the subsequent pattern: clause1 [, ]? CC clause2 . Our simplification
model deals with the coordinating conjunctions and, or, but, as well as the subordinating ones that have already been examined in prefix position. Examples are shown below in figure 7.27 and figure 7.28. Regarding coordinating conjunctions, conjoined clauses are split only if both clauses represent complete sentences. In particular, coordinated verb phrases, as e. g. in "Obama spoke of a five-year freeze in domestic spending, eliminating tax breaks for oil companies and reversing tax cuts for the wealthiest Americans, banning congressional earmarks, and reducing healthcare costs.", are not broken up. The reason for this is to avoid fragmenting the input sentence too much [71], which could result in loose chunks lacking their logical connection. Following [71], the list of conjunctions was determined by manually analyzing sentences containing conjunctions in a number of text simplification corpora (namely the Written News Compression Corpus [20], the Broadcast News Compression Corpus [20], and the Guardian Newspaper corpus [71]), as well as multiple Wikipedia articles, and selecting only the prevalent ones where the two conjoined clauses could be separated without corrupting the meaning of the original sentence. Besides, it should be noted that on the level of semantics, there is a major difference between coordination and subordination of clauses: the information contained in a subordinate clause is often placed in the background with respect to the superordinate clause [66]. Therefore, subordinate clauses are transformed into context sentences, while each of the coordinated clauses represents a core sentence with its own set of referring contextual sentences (see figure 7.28 for an example).
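The following is a hedged sketch - not the exact thesis implementation - of the infix splitting just described, again on NLTK trees with Penn Treebank labels; the SBAR/PP handling and the string-level removal of the subordinate clause are simplifications, and detokenization is omitted:

from nltk import Tree

COORD = {"and", "or", "but"}
SUBORD = {"after", "before", "since", "as", "while", "when",
          "if", "so", "though", "although", "because"}

def split_conjoined(tree):
    """Return (kind, token string) pairs for the split sentences."""
    kids = list(tree)
    # infix coordination: clause1 [,]? CC clause2, split only if both
    # conjuncts are complete sentences (label S)
    for i, kid in enumerate(kids):
        if kid.label() == "CC" and kid[0].lower() in COORD:
            left = [k for k in kids[:i] if k.label() == "S"]
            right = [k for k in kids[i + 1:] if k.label() == "S"]
            if left and right:
                return [("core", " ".join(left[0].leaves())),
                        ("core", kid[0].capitalize() + " "
                                 + " ".join(right[0].leaves()))]
    # subordination: the clause becomes a context sentence, the remainder
    # of the input the core sentence (parsers may attach 'before S' as PP)
    for sbar in tree.subtrees(lambda t: t.label() in ("SBAR", "PP")):
        conj = sbar[0]
        if conj.label() == "IN" and conj[0].lower() in SUBORD:
            sub = " ".join(sbar.leaves())
            text = " ".join(tree.leaves())
            core = " ".join(text.replace(sub, "").split()).strip(" ,")
            return [("core", core), ("context", "This was " + sub)]
    return []

t = Tree.fromstring(
    "(S (NP (PRP He)) (VP (VBD was) (NP (DT a) (NN community) (NN organizer))"
    " (PP (IN before) (S (VP (VBG earning) (NP (PRP$ his) (NN law) (NN degree)))))))")
print(split_conjoined(t))
# -> [('core', 'He was a community organizer'),
#     ('context', 'This was before earning his law degree')]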
Figure 7.26.: Example of the simplification of a subordinated clause with both prefix and infix conjunction
original sentence: Although 75% of the Jewish population in France survived the Holocaust during World War II, half the city’s Jewish population perished in Nazi concentration camps, while some others fled abroad.
(7.26.1) Input sentence
(S (SBAR (IN Although) (S 75% of the Jewish population . . . World War II)) (, ,) . . .)
(7.26.2) Parse tree representation of the relevant part
result of the transformation stage:
• core sentence: Half the city’s Jewish population perished in Nazi concentration camps.
• context sentence: This was while some others fled abroad.
• context sentence: This was although 75% of the Jewish population in France survived the Holocaust during World War II.
(7.26.3) Output
Figure 7.27.: Example of the simplification of a subordinated clause with infix conjunction
original sentence: He was a community organizer before earning his law degree.
(7.27.1) Input sentence
(S (NP He) (VP (VBD was) (NP a community organizer) (PP (IN before) (S earning his law degree))))
(7.27.2) Parse tree representation of the relevant part
result of the transformation stage:
• core sentence: He was a community organizer.
• context sentence: This was before earning his law degree.
(7.27.3) Output
Figure 7.28.: Example of the simplification of a coordinated clause
original sentence: On May 9, 2012, shortly after the official launch of his campaign for reelection as president, Obama said his views had evolved, and he publicly affirmed his personal support for the legalization of same-sex marriage, becoming the first sitting U.S. president to do so.
(7.28.1) Input sentence
(S (S Obama said . . . evolved) (, ,) (CC and) (S he publicly affirmed . . . marriage))
(7.28.2) Parse tree representation of the relevant part
result of the transformation stage:
• core sentence: Obama said his views had evolved.
• context sentence: This was on May 9, 2012.
• context sentence: This was shortly after the official launch of his campaign for reelection as president.
• core sentence: And he publicly affirmed his personal support for the legalization of same-sex marriage.
• context sentence: This was when becoming the first sitting U.S. president to do so.
(7.28.3) Output
7.2.9. Punctuation
Finally, a number of rules have been developed in order to resolve sentences incorporating particular punctuation, namely colons (":"), semicolons (";") and parentheticals ("(. . . )"). With regard to the former two constructs, each of the two sentences that are conjoined by a colon or semicolon generally represents fundamental information. Thus, they are both converted into separate core sentences - each with their own set of associated context sentences. The examples in figure 7.29 and figure 7.30 illustrate this procedure.
Figure 7.29.: Example of the simplification of sentences conjoined by a colon
original sentence: Only 33 percent of principal-residence Parisians own their habitation (against 47 percent for the entire Île-de-France): the major part of the city’s population is a rent-paying one.
(7.29.1) Input sentence
(S (S Only 33 percent . . . Île-de-France) (: :) (S the major part . . . rent-paying one))
(7.29.2) Parse tree representation of the relevant part
result of the transformation stage:
• core sentence: Only 33 percent of principal-residence Parisians own their habitation.
• core sentence: The major part of the city’s population is a rent-paying one.
(7.29.3) Output
Figure 7.30.: Example of the simplification of sentences conjoined by a semicolon
original sentence: The Paris Region had a GDP of $624 billion in 2012; it is the banking and financial centre of France, containing the headquarters of 29 of the 31 companies in France ranked in the 2015 Fortune Global 500.
(7.30.1) Input sentence
(S (S The Paris Region . . . in 2012) (: ;) (S it is the banking . . . Global 500))
(7.30.2) Parse tree representation of the relevant part
result of the transformation stage:
• core sentence: The Paris Region had a GDP of $624 billion.
• context sentence: This was in 2012.
• core sentence: It is the banking and financial centre of France.
• context sentence: This is when containing the headquarters of 29 of the 31 companies in France ranked in the 2015 Fortune Global 500.
(7.30.3) Output
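A minimal sketch of this rule - under the assumption that parse trees are read with NLTK and carry Penn Treebank labels, where both ";" and ":" receive the POS tag ":" - might look as follows:

from nltk import Tree

def split_on_semicolon(tree):
    """If the top level has the shape S -> S [;|:] S (cf. the trees above),
    turn both conjuncts into core sentences."""
    kids = list(tree)
    labels = [k.label() for k in kids]
    if labels[:3] == ["S", ":", "S"]:
        def as_sentence(node):
            text = " ".join(node.leaves())
            return text[0].upper() + text[1:] + "."
        return [as_sentence(kids[0]), as_sentence(kids[2])]
    return None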
In contrast, the information enclosed in brackets is usually placed in the background with respect to the phrase or clause to which it is attached. However, such parenthetical constituents frequently either pose difficulties for the parser, resulting in erroneous parse trees and thus in derived sentences that are incorrect in content and/or grammar, or require further semantic knowledge on how to be interpreted (e. g. a parenthetical structure such as in "Nelson Rolihlahla Mandela (18 July 1918 - 5 December 2013)" would require an element for the identification of temporal expressions and subsequent generation of appropriately formatted output [27]). Therefore, we have decided not to try to convert the information contained within brackets into a contextual sentence, but to simply remove it from the core (see figure 7.29). Consequently, we lose some information of the original sentence; however, the loss is usually rather small, since parentheticals mostly only present further explanations or represent abbreviations. Hence, dropping these constituents is generally acceptable (see also section 9.2.2).
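A minimal, string-level sketch of this deletion step (the system itself operates on the parse tree; non-nested brackets are assumed):

import re

def strip_parentheticals(sentence: str) -> str:
    """Delete bracketed material from the core sentence."""
    return re.sub(r"\s*\([^()]*\)", "", sentence)

print(strip_parentheticals(
    "Mandela began work on a Bachelor of Arts (BA) degree "
    "at the University of Fort Hare."))
# -> Mandela began work on a Bachelor of Arts degree at the University of Fort Hare.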
Part IV. Evaluation
As detailed in section 3.3, faced with a wide spectrum of difficulties ranging from the specification of common features for judging one text to be simpler than another to the acquisition of an appropriate dataset upon which the performance of state-of-the-art approaches may be compared, the text simplification community has not yet come to a general agreement on how to evaluate simplification systems. Therefore, we adopt the following approach: Based on the specific goals of our framework, we define explicit criteria for assessing the quality of the resulting output. We then compile a set of metrics which adequately capture the system’s performance with respect to the determined attributes. To demonstrate improvements over previous attempts at text simplification, we also compare the outcome of our framework with that of a baseline system. Beyond that, we assemble our own test set for carrying out the evaluation procedure, since we regard none of the corpora that have been commonly used in prior work as suitable for our purposes. The next chapters are structured as follows: chapter 8 delineates the experimental setup that has been used for evaluating the performance of our text simplification framework. The results of this procedure are then presented in chapter 9.
8. Experimental Setup
The objective of our text simplification approach is to split a given sentence into one or more core sentences comprising exactly those parts of the original sentence that convey central information, and zero or more associated context sentences consisting of those phrases of the input that provide only secondary information. Thus, the compressed core sentences and their dedicated context sentences are supposed to jointly express the complete content of the source sentences, though in a simplified structural form. The idea is that through breaking up an input sentence into several smaller units of text - without losing any information contained in the source -, the performance of a successively applied IE system operating on this restructured version may be improved. Based upon these goals, we have defined the following criteria for assessing the quality of the output of our text simplification framework:
1. The output core sentences shall be compressed to the main piece of information contained in the source, i. e. all phrases expressing only peripheral information are to be separated out of the input sentence.
2. No datum that has been included in the original sentence shall be lost, but rather transformed into an appendant context sentence, with each of them ideally representing precisely one specific piece of information.
3. The original meaning of the input sentence is to be preserved.
4. The resulting sentences shall be proper, grammatical English.
5. The output sentences should present a simplified syntactic structure.
To adequately judge the performance of our simplification system with regard to these issues, we have set up an evaluation procedure which will be detailed below.
8.1. Measures for Capturing the System’s Performance
For assessing the quality of the sentences generated by our text simplification framework, we conduct a thorough evaluation, taking into account not only a number of automatic measures that are suitable to capture items one, two and five of the foregoing list (in chapter 8), but also a detailed manual investigation of the output, which is required to properly estimate the latter three aspects as well as some facets of the second one.
8.1.1. Automatic Evaluation Metrics
Referring to [41], in particular, as well as [21, 86, 61, 91] and [35], we have computed some basic statistics for the output sentences, thus getting a rough picture of the overall performance of our text simplification framework. The measures we have factored into our investigations encompass the following ones:
• number of sentences processed, giving information about the quantity of input sentences which have been discarded by the system due to their specific syntactic structure that is considered too complex to be handled properly
• number of output sentences per input, indicating the quantity of resulting sentences that have been generated from the given source sentence, hereby offering clues to the amount of components that have been extracted from the source and transformed into separate sentences, thus allegedly leading to a more condensed, structurally simplified output
• number of unchanged input sentences, specifying the amount of output sentences with no edits, meaning that they have not been simplified
• input word coverage, signifying the percentage of words from the source sentence that are included in at least one of the output sentences, thereby shedding light on whether constituents that were originally contained in the input are dropped instead of being converted into a context sentence, thus likely causing a loss of information
• average sentence length, revealing the average number of tokens per output sentence, thus roughly reflecting their syntactic complexity
These measures are not to be inspected in isolation, though, as this might lead to deceptive conclusions. The scores on "number of output sentences per input" and "number of unchanged input sentences", for example, heavily depend on the complexity of the source sentence. That is to say, if the input already presents a rather short sentence with a plain structure, then no simplification is required. In fact, splitting the source sentence into a large amount of output sentences might result in incomprehensible, incoherent fragments of information. Consequently, a greater number of output sentences does not imply a higher-quality result in every case. Therefore, these measures have to be handled with due care. However, in combination with the remaining analyses that have been conducted and in particular by comparing the results with our manually compiled gold reference (see section 8.2), they offer valuable clues regarding the general performance of the text simplification system under consideration.
For assessing the quality of the compressed core sentences in more detail, some further automatic measures comparing the system’s output to our reference simplification have been computed. These include not only the compression ratio [67], which indicates the fraction of words from the input which is still contained in the condensed main sentences, but also precision, recall and F1 measure [30, 67, 21, 35, 20, 43, 72] calculated over unigrams. In this regard, precision is calculated as the number of tokens shared between the output of our simplification system and the gold standard variant, divided by the total number of tokens in the system’s output; recall divides the same count by the total number of tokens in the human-generated compression; the F1 score is the harmonic mean of the two. According to [19], there is a strong relation between the F1 measure and human ratings of compression quality.
In addition, in keeping with [61, 86, 85, 91, 21], we determined how close the system’s output is to the manually gathered gold standard by computing the Bleu score [63], which is commonly used to evaluate the quality of machine translation output. This metric assesses the system’s result by counting n-gram matches with the reference and has indeed been shown to correlate well with human performance judgements [21]. For the calculation of Bleu, we have made use of Phrasal [36] (available at http://nlp.stanford.edu/phrasal/), a statistical phrase-based machine translation system provided by the Stanford NLP Group, as evaluation tool.
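To make these metrics concrete, the following is a minimal sketch - not the evaluation code used for this thesis - of the compression ratio and the unigram precision, recall and F1 computation; for illustration, Bleu is computed here with NLTK’s implementation instead of Phrasal:

from collections import Counter
from nltk.translate.bleu_score import sentence_bleu

def tokens(s):
    return s.lower().split()

def compression_ratio(source, compressed):
    # fraction of the input words still contained in the core sentences
    return len(tokens(compressed)) / len(tokens(source))

def unigram_prf(system, gold):
    sys_c, gold_c = Counter(tokens(system)), Counter(tokens(gold))
    tp = sum((sys_c & gold_c).values())      # tokens shared with the reference
    precision = tp / sum(sys_c.values())
    recall = tp / sum(gold_c.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def bleu(system, gold):
    # maximum n-gram order of 4, i.e. uniform weights over 1- to 4-grams
    return sentence_bleu([tokens(gold)], tokens(system),
                         weights=(0.25, 0.25, 0.25, 0.25))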
8.1.2. Manual Analysis
To gain deeper insight into the performance of our simplification framework, we have analyzed in depth which sentence structures the system is able to handle properly and which ones, in contrast, cause difficulties and can therefore be considered sticking points. This was done on a large scale by comparing the framework’s output sentence-by-sentence with a manually simplified version. Roughly following the widely recognised evaluation methodology of eliciting human ratings on the results that have been generated by a text simplification system [86, 85, 75, 20, 61, 71], each bundle of associated output sentences has been classified into one of five categories, depending on the divergences of the result provided by the framework from the gold reference solution according to the following three criteria:
• simplicity: extent to which the output sentence is simpler than the original one
• fluency/grammaticality: extent to which a sentence is proper, grammatical English
• adequacy/meaning preservation: extent to which the sentence has the same meaning as the source sentence
On the basis of these fairly general criteria, more fine-grained classification guidelines have been drawn up for assigning each output produced by our sentence simplification system to its respective category (see table 8.1).

Category 1:
• perfectly resolved output according to the gold reference, or presenting only very slight derivations affecting neither the grammaticality nor the simplicity nor the meaning of the result (e. g. minor punctuation errors)

Category 2:
• slightly ungrammatical (regarding in particular tense or number of an inserted auxiliary verb)
• insignificant information loss
• extraction of slightly too much information out of the core sentence(s)
• presenting at most a minimal change in meaning
• some further simplification is recommended (though sometimes arguable!), but in general (very) good simplification of the source sentence

Category 3:
• requiring some further extraction of components out of the core sentence(s)
• significant information loss (mostly due to dropping information that has been contained within brackets)
• slightly ungrammatical with a (minor) negative impact on the comprehensibility of the output
• disembedding of too many constituents, resulting in a slight ungrammaticality or change in meaning

Category 4:
• in general, acceptable simplification quality, but single sentences of the output show a considerable change in meaning or ungrammaticality
• a lot of potential for further extractions

Category 5:
• ungrammatical to such an extent that the meaning of the output is barely understandable
• no simplification, though a lot of contextual information is contained in the input

Table 8.1.: Classification guidelines
In a second step, we have then closely examined the discrepancies between the output of our framework and the gold standard, searching for patterns in the form of syntactic structures that commonly lead to malformed output sentences. Next, we have tried to figure out the causes of such negative results. Is it, for instance,
to be attributed to an inaccurate tree structure provided by the parser? Is it rather due to our heuristics? Or are there any other reasons for the divergences between the framework’s output and the desired one? These findings may serve as a good starting point for future improvements of our text simplification system.
8.2. Evaluation Dataset
As expounded in section 3.3.1, in the past few years, work on sentence simplification has commonly been evaluated on a parallel corpus of sentences from EW and their allocated SEW counterparts [91], thus operating in a domain-independent manner. Since we aim at using text simplification as a preprocessing step for improving the performance of subsequently applied open IE tasks, we build on the idea of adopting Wikipedia as a domain-independent test collection for evaluating the performance of our text simplification framework. However, as our method is not based on machine learning techniques, but rather uses a set of hand-crafted transformation rules, a parallel simplification corpus is not required. What we need instead - and what is still lacking - is a dataset consisting of a sequence of complex source sentences, with each of them being assigned not only one to n compressed target sentences that convey the main pieces of information from the input, but also a further zero to m associated sentences expressing additional secondary information of the original sentence in the form of contextual content. Therefore, we have decided to assemble our own test set, thereby circumventing most of the limitations of previously compiled Wikipedia-based parallel simplification corpora, which have been pointed out recently in [87] (see section 3.3.1). Hence, in setting up our evaluation set, we have taken the following approach: We have carefully selected a set of high-quality Wikipedia articles, spanning a broad range of subjects. In order to ascertain that our simplification framework is able to operate correctly on a large number of structurally varying sentences, we have chosen relatively long articles, incorporating a wide variety of diversely composed sentences, as indicated by their POS tags. Meeting all the above-mentioned requirements, the articles on "Baseball", "Google" and "Nelson Mandela" became the articles of choice for the construction of the evaluation set. With each of them containing about 400 to 500 sentences, our test collection adds up to more than 1,300 sentences in total. By manually simplifying each sentence of the test set, a gold standard representing the intended optimal solution has been compiled. With the help of this reference, the output of our text simplification framework has been examined - also in comparison with a baseline system (see section 8.3) -, both by means of automatic evaluation measures and a manual analysis, as has been outlined in the prior section.
8.3. Baseline
We compare our text simplification framework with the Simplified Factual Statement Extractor described in [41] (source code available at: http://www.cs.cmu.edu/~ark/mheilman/qg-2010-workshop/), which draws on a linguistically motivated method for
extracting multiple simple - both syntactically and semantically correct - factual statements from complex input sentences. The rationale underlying this approach is that the task of separating out simplified sentences from a complex source sentence may be reduced to the task of generating a subset of sentences that one would assume to be true after having read the original version. Thus, the objective is to create a set of textual entailments, predicated on the linguistic phenomena of semantic entailment ("A semantically entails B if and only if for every situation in which A is true, B is also true.") and presupposition ("A presupposes B if and only if: (a) if A is true, then B is true, and (b) if A is false, then B is true."). Based on these concepts, a number of simplification rules operating on phrase structure trees have been manually defined for eliminating both discourse connectives and adjunct modifiers from clauses, VPs and NPs, as well as for splitting conjoined clauses and VPs that are either entailed or presupposed. To be exact, the following syntactic constructions are removed from the input and transformed into a separate simplified sentence:
• non-restrictive appositives (e. g. "Jefferson, the third U.S. President, . . . ")
• non-restrictive relative clauses (e. g. "Jefferson, who was the third U.S. President, . . . ")
• parentheticals (e. g. "Jefferson (1743-1826) . . . ")
• participial modifiers of NPs (e. g. "Jefferson, being the third U.S. President . . . ")
• VP modifiers offset by commas (e. g. "Jefferson studied many subjects, like the artist Da Vinci.")
• modifiers that precede the subject (e. g. "Being the third U.S. President, Jefferson . . . ")
• temporal subordinate clauses (e. g. "Before Jefferson was the third U.S. President . . . ")
In line with the aforementioned motivation, these conversions aim at preserving the truth conditions of the original sentence, while at the same time producing more concise output sentences (see examples in figure 8.1). Accordingly, the approach proposed in [41] is comparable to our framework in that it deploys a rule-based text simplification algorithm to identify and separate out information that is embedded in various predetermined syntactic constructions into stand-alone simplified sentences. Our system may be viewed as an expansion, as we extract multiple further syntactic constituents - notably PPs, restrictive appositive phrases, intra-sentential attributions and a wider array of subordinate clauses -, while explicitly differentiating between indispensable core information and secondary context information included in the original sentence, putting emphasis on not disrupting the cohesive relations between the extracted components and the remaining core sentence to which they refer. The results of the comparison between the performance of the two text simplification systems by means of the previously described measures will be detailed in the next section.
Figure 8.1.: Example output sentences produced by the baseline algorithm
example 1: Born in Honolulu, Hawaii, Obama is a graduate of Columbia University and Harvard Law School, where he served as president of the Harvard Law Review.
• Obama is a graduate of Columbia University and Harvard Law School.
• Obama is born in Honolulu.
• He served as president of the Harvard Law Review at Columbia University and Harvard Law School.
• Honolulu is Hawaii.
(8.1.1)
example 2: He ordered the closing of the Guantanamo Bay detention camp, but Congress prevented the closure by refusing to appropriate the required funds and preventing moving any Guantanamo detainee into the U.S. or to other countries.
• He ordered the closing of the Guantanamo Bay detention camp.
• Congress prevented the closure by refusing to appropriate the required funds.
• Congress prevented the closure by preventing moving any Guantanamo detainee into the U. S. or to other countries.
(8.1.2)
example 3: Sonia Sotomayor, nominated by Obama on May 26, 2009, to replace retiring Associate Justice David Souter, was confirmed on August 6, 2009, becoming the first Hispanic Supreme Court Justice.
• Sonia Sotomayor was confirmed on August 6, 2009.
• Sonia Sotomayor was nominated by Obama on May 26 to replace retiring Associate Justice David Souter.
• Sonia Sotomayor became the first Hispanic Supreme Court Justice.
(8.1.3)
9. Evaluation Results and Discussion
We evaluated the performance of our text simplification framework on the Wikipedia-based corpus described in section 8.2, using both a number of selected automatic evaluation metrics and a detailed manual analysis. The results of this procedure will be discussed below.
9.1. Results of the Automatic Evaluation
As pointed out in section 8.1.1, we have estimated the quality of the output sentences produced by our text simplification framework on the basis of multiple automatic measures. In the course of this process, we have calculated not only some basic statistics on a couple of shallow features of the system’s outcome, but also the compression ratio, recall, precision and F1 measure determining the compression quality of our approach, as well as the Bleu score indicating the extent to which the framework’s output differs from the desired solution that has been specified in the gold reference standard. We will present the outcome of these measurements in the next sections.
9.1.1. Shallow Features
As displayed in tables 9.1 to 9.5, the results of our first unit of experiments, which investigate surface features of the output sentences, show that our text simplification framework consistently outperforms the baseline system. Discarding only 3.1% of the sentences from the test set, our framework processes slightly more of the input than the baseline, which abandons 3.6% of the source sentences (see table 9.1). In both cases, such rejections originate from the sentences’ specific syntactic structures that are a priori assumed to be too complex for producing accurate results. While the baseline simply drops every input sentence containing more than 40 tokens, our approach uses a more sophisticated heuristic, as will be elucidated in section 9.2.2. However, accounting for no more than about 3% of the input, the elimination rate is negligibly small in both systems. Thus, they almost reach a full coverage of the source.
                              "Baseball"   "Google"   "Mandela"      Σ
Gold Reference                    445         382         512        1339
Simplification Framework          420         377         501        1298 (96.9%)
Baseline                          424         374         493        1291 (96.4%)
Table 9.1.: Number of sentences processed

Furthermore, our simplification framework generates more output sentences per input than the baseline (see table 9.2). With an average of 2.66 resulting sentences - as compared to 1.69 -, our approach creates on average one additional output sentence for each source, thereby approaching the gold reference’s mean score of 3.10 sentences per input considerably more closely.
                              "Baseball"   "Google"   "Mandela"      Σ
Gold Reference                  ∅ 2.78      ∅ 2.75      ∅ 3.76      ∅ 3.10
Simplification Framework        ∅ 2.32      ∅ 2.41      ∅ 3.25      ∅ 2.66
Baseline                        ∅ 1.46      ∅ 1.55      ∅ 2.06      ∅ 1.69
Table 9.2.: Number of output sentences per input

Nevertheless, neither our system nor the baseline separates out phrasal constituents without cause. That is to say, sentences that already present a rather simple structure (e. g. "Google will continue to be the umbrella company for Alphabet’s Internet interests." or "A run is scored when a player advances around the bases and returns to home plate.") are generally not simplified. Accordingly, 16.0% (our framework) and 17.7% (baseline) of the input sentences, respectively, are left unchanged (see table 9.3). However, these figures are twice as high as the gold standard’s value of 8.0%, signifying that there is still potential for further extractions, both in our simplification framework and in the baseline system. Indeed, this supposition is supported by the findings expounded in section 9.1.2.
                              "Baseball"   "Google"   "Mandela"      Σ
Gold Reference                     48          38          21         107 (8.0%)
Simplification Framework           93          77          44         214 (16.0%)
Baseline                          109          84          44         237 (17.7%)
Table 9.3.: Number of unchanged input sentences

Beyond that, our simplification framework provides a better coverage of the information contained in the source, since 96.7% of the input words appear in at least one of the output sentences, compared to only 83.3% for the baseline (see table 9.4).
While in our system this loss of tokens from the source is mostly due to the removal of irrelevant function words (such as "which", "who" or "where", which are omitted when transforming relative clauses into stand-alone sentences; see example 9.1.1) and parenthetical constituents, the higher loss rate of the baseline system can be traced back to its ignoring a variety of subordinate clauses that often carry meaningful information (see examples 9.1.2 and 9.1.3).
                              "Baseball"   "Google"   "Mandela"      ∅
Gold Reference                   98.0%       98.0%       97.8%      97.9%
Simplification Framework         95.5%       97.1%       97.6%      96.7%
Baseline                         85.1%       87.2%       77.5%      83.3%
Table 9.4.: Input word coverage (without punctuation)
Finally, with an average sentence length of 9.92 (our framework) and 12.11 (baseline) words (not including punctuation), the output from both systems is substantially shorter than the input sentences, which are 24.04 words long on average. The sentences produced by our simplification framework are notably shorter, though, thereby approximating the reference’s mark of 9.09 tokens per output sentence (see table 9.5).
                              "Baseball"   "Google"   "Mandela"      ∅
Input                          ∅ 24.34     ∅ 21.04     ∅ 26.73     ∅ 24.04
Gold Reference                  ∅ 9.97      ∅ 8.72      ∅ 8.58      ∅ 9.09
Simplification Framework       ∅ 10.95      ∅ 9.41      ∅ 9.39      ∅ 9.92
Baseline                       ∅ 13.74     ∅ 12.19     ∅ 10.39     ∅ 12.11
Table 9.5.: Average sentence length (without punctuation)
To sum up: on average, our text simplification framework splits the input into a greater number of output sentences, which are generally shorter and present a higher word coverage of the source than the outcome produced by the baseline. In doing so, our system comes closer to the optimum scores set by the manually compiled gold standard with respect to every feature that has been examined. Thus, all in all, our simplification framework generates results of higher quality than the baseline system.
Figure 9.1.: Example output sentences presenting a reduced word coverage
example 1: original sentence: The farms, which were developed by NextEra Energy Resources, will reduce fossil fuel use in the region and return profits.
• core sentence: The farms will reduce fossil fuel use in the region and return profits.
• context sentence: The farms were developed by NextEra Energy Resources.
(9.1.1) Our framework
example 2: original sentence: Later that same year, Google purchased GrandCentral for $50 million.
• core sentence: Google purchased GrandCentral for $50 million.
(9.1.2) Baseline
example 3: original sentence: Although initially committed to non-violent protest, in association with the SACP he co-founded the militant Umkhonto we Sizwe (MK) in 1961, leading a sabotage campaign against the apartheid government.
• core sentence: He co-founded the militant Umkhonto we Sizwe in 1961.
• context sentence: He led a sabotage campaign against the apartheid government.
(9.1.3) Baseline
9.1.2. Compression Quality
Next, we have analyzed the quality of the compressed core sentences in isolation. As table 9.6 shows, our simplification framework compresses the input sentences by about a third of their original lengths, while the gold reference reaches a compression ratio of approximately 50% on average. Thus, in comparison, the percentage of words dropped is somewhat higher in the optimal solution. This implies that there is still some scope for incorporating further rules in our system in order to separate out additional components of the core sentences. With a compression ratio of 59%, the baseline slightly outperforms our approach in this case.
                              "Baseball"   "Google"   "Mandela"      ∅
Gold Reference                   0.58        0.51        0.44        0.51
Simplification Framework         0.69        0.62        0.56        0.62
Baseline                         0.66        0.65        0.45        0.59
Table 9.6.: Compression ratio
These findings are indeed confirmed by the second group of automatic evaluation measures assessing the compression quality of our text simplification framework: recall, precision and F1 measure (see table 9.7). With a recall rate of 96% on average, the framework performs extremely well on the former, meaning that practically every word that is supposed to be included in the compressed core sentences actually is part of them. Besides, the system’s precision averages out at 0.82, which signifies that the resulting core sentences contain about a fifth of false positives, i. e. words or phrases that should not be contained since they represent contextual rather than fundamental information according to the reference simplification. Hence, the F1 score - the harmonic mean of precision and recall - reaches a value of 0.88, indicating a high accuracy of the compressed core sentences.
"Baseball" "Google"
F1
tp Precision = tp+f p tp Recall = tp+f n measure = 2∗precision∗recall precision+recall
0.83 0.97 0.90
0.82 0.95 0.88
"Mandela"
∅
0.80 0.96 0.87
0.82 0.96 0.88
Table 9.7.: Precision, recall and F1 score obtained by our simplification framework
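For the averaged figures in table 9.7, the reported F1 value follows directly from the two other rows: F1 = 2 ∗ 0.82 ∗ 0.96 / (0.82 + 0.96) = 1.5744 / 1.78 ≈ 0.88.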
The corresponding baseline scores are displayed in table 9.8. With an average recall rate of 88% and a precision of 70%, resulting in an F1 score of 0.78, our simplification framework consistently surpasses the baseline system by around 10%.
                                                      "Baseball"   "Google"   "Mandela"     ∅
Precision = tp / (tp + fp)                               0.75        0.69        0.66      0.70
Recall = tp / (tp + fn)                                  0.92        0.80        0.92      0.88
F1 = 2 ∗ precision ∗ recall / (precision + recall)       0.83        0.74        0.77      0.78
Table 9.8.: Precision, recall and F1 score obtained by the baseline system
9.1.3. Closeness of the System’s Output to the Reference Corpus
In a final step, using the Bleu metric, we have computed the overlap of n-grams between the machine-simplified sentences and our human-generated gold reference (see table 9.9). We set the maximum n-gram order to 4, as in this case the best correlation with monolingual human judgements is obtained, according to [63]. That way, we reach a very high Bleu score of 0.81 for our framework, drawing near the upper bound of 1, whereas the baseline averages out at only 0.48, thus scoring significantly lower than our system. Hence, in comparison to the baseline, the outcome produced by our framework is considerably closer to the gold reference simplification.
                              "Baseball"   "Google"   "Mandela"      ∅
Gold Reference                   1.00        1.00        1.00        1.00
Simplification Framework         0.80        0.80        0.82        0.81
Baseline                         0.53        0.47        0.43        0.48
Table 9.9.: Bleu scores (using a maximum n-gram order of 4)
9.2. Results of the Manual Evaluation
Subsequent to the previously described automatic evaluation procedure, we conducted an in-depth manual analysis of the outcome provided by our text simplification system, thereby highlighting the aspects of grammaticality, adequacy and simplicity. The results of this survey will be presented below.
9.2.1. Classification of the Output
Based on the guidelines listed in table 8.1 - which take into consideration the three aforementioned features of simplicity, grammaticality and meaning preservation -, we have classified each bundle of core and associated context sentences that has been returned by our text simplification framework, depending on their divergences from the gold reference simplification. To be exact, we have assigned the resulting sentences to one of the five predetermined categories, ranging from class 1 - incorporating outcome that is identical or quasi-identical to the desired optimal solution,
thus representing a very good simplification quality - to class 5, which covers seriously distorted output sentences that present great discrepancies from the gold standard. The distribution arising from such a categorization of the framework’s outcome is shown in tables 9.10 to 9.12. These figures demonstrate that our text simplification system returns positive results (class 1 and 2) in about 60% of the cases that have been examined, while around a further fifth of the output is labelled as "neutral" (class 3), hence signifying sentences that have been reasonably simplified, yet that leave some room for improvements. Consequently, only about 20% of the outcome is to be regarded as negative (class 4 and 5).

1 (++)   158   38%
2 (+)     88   21%   (classes 1-2: 59%)
3 (◦)     86   20%
4 (−)     54   13%
5 (−−)    34    8%   (classes 4-5: 21%)
Table 9.10.: "Baseball" (Σ 420 sentences)

1 (++)   145   38%
2 (+)     89   24%   (classes 1-2: 62%)
3 (◦)     64   17%
4 (−)     51   14%
5 (−−)    28    7%   (classes 4-5: 21%)
Table 9.11.: "Google" (Σ 377 sentences)

1 (++)   172   34%
2 (+)    124   25%   (classes 1-2: 59%)
3 (◦)     91   18%
4 (−)     75   15%
5 (−−)    39    8%   (classes 4-5: 23%)
Table 9.12.: "Mandela" (Σ 501 sentences)
Having a closer look at the system’s output, it becomes apparent that for the most part, sentences showing a clear, straightforward structure are resolved properly throughout the dataset. More specifically, sentences with a single phrase or clause that is to be separated out are commonly simplified in accordance with the manually annotated version (see examples in figure 9.2). In general, simple combinations of two of the constructs described in section 7.2 are processed correctly as well (see examples in figure 9.3). Beyond that, a considerable number of sentences that incorporate an arbitrary mixture of such constituents are also perfectly simplified by our framework (see examples in figure 9.4).
Figure 9.2.: Examples of perfectly resolved sentences incorporating only a single constituent to extract
prepositional phrase:
original sentence: In December 2013, Alexa listed google.com as the most visited website in the world.
• core sentence: Alexa listed google.com as the most visited website in the world.
• context sentence: This was in December 2013.
(9.2.1) Example 1

conjoined clauses:
original sentence: A game comprises nine innings, and the team with the greater number of runs at the end of the game wins.
• core sentence: A game comprises nine innings.
• core sentence: And the team with the greater number of runs at the end of the game wins.
(9.2.2) Example 2

relative clause:
original sentence: Players on the batting team take turns hitting against the pitcher of the fielding team, which tries to prevent runs by getting hitters out in any of several ways.
• core sentence: Players on the batting team take turns hitting against the pitcher of the fielding team.
• context sentence: The fielding team tries to prevent runs by getting hitters out in any of several ways.
(9.2.3) Example 3
Figure 9.3.: Examples of perfectly resolved sentences incorporating a simple combination of two components to extract
appositive and prepositional phrase:
original sentence: He rejected the offer and later criticized Vinod Khosla, one of Excite’s venture capitalists, after he negotiated Brin and Page down to $750,000.
• core sentence: He rejected the offer and later criticized Vinod Khosla.
• context sentence: This was after he negotiated Brin and Page down to $750,000.
• context sentence: Vinod Khosla was one of Excite’s venture capitalists.
(9.3.1) Example 1

prepositional phrases:
original sentence: The satellite was launched from Vandenberg Air Force Base on September 6, 2008.
• core sentence: The satellite was launched.
• context sentence: This was on September 6, 2008.
• context sentence: This was from Vandenberg Air Force Base.
(9.3.2) Example 2
prepositional phrase and conjoined clauses:
original sentence: Before 2004, Schmidt made $250,000 per year, and Page and Brin each received an annual salary of $150,000.
• core sentence: Schmidt made $250,000 per year.
• context sentence: This was before 2004.
• core sentence: And Page and Brin each received an annual salary of $150,000.
(9.3.3) Example 3

apposition and intra-sentential attribution:
original sentence: In a talk at Stanford University, Marissa Mayer, Google’s Vice President of Search Products and User Experience until July 2012, showed that half of all new product launches in the second half of 2005 had originated from the Innovation Time Off.
• core sentence: Half of all new product launches in the second half of 2005 had originated from the Innovation Time Off.
• context sentence: This was what Marissa Mayer showed in a talk at Stanford University.
• context sentence: Marissa Mayer was Google’s Vice President of Search Products and User Experience until July 2012.
(9.3.4) Example 4
appositive and prepositional phrase:
original sentence: In 1845, Alexander Cartwright, a member of New York City’s Knickerbocker Club, led the codification of the so-called Knickerbocker Rules.
• core sentence: Alexander Cartwright led the codification of the so-called Knickerbocker Rules.
• context sentence: This was in 1845.
• context sentence: Alexander Cartwright was a member of New York City’s Knickerbocker Club.
(9.3.5) Example 5

Figure 9.4.: Examples of perfectly resolved sentences incorporating a more complex structure
appositive and prepositional phrases:
original sentence: His father, Gadla Henry Mphakanyiswa, was a local chief and councillor to the monarch; he had been appointed to the position in 1915, after his predecessor was accused of corruption by a governing white magistrate.
• core sentence: His father was a local chief and councillor to the monarch.
• context sentence: Gadla Henry Mphakanyiswa was his father.
• core sentence: He had been appointed to the position.
• context sentence: This was in 1915.
• context sentence: This was after his predecessor was accused of corruption by a governing white magistrate.
(9.4.1) Example 1
appositive and participial and prepositional phrase:
original sentence: Intending to gain skills needed to become a privy councillor for the Thembu royal house, Mandela began his secondary education at Clarkebury Methodist High School Engcobo, a Western-style institution that was the largest school for black Africans in Thembuland.
• core sentence: Mandela began his secondary education.
• context sentence: This was at Clarkebury Methodist High School Engcobo.
• context sentence: Clarkebury Methodist High School Engcobo was a Western-style institution that was the largest school for black Africans in Thembuland.
• context sentence: This was when intending to gain skills needed to become a privy councillor for the Thembu royal house.
(9.4.2) Example 2

participial and prepositional phrases:
original sentence: Earning a small wage, Mandela rented a room in the house of the Xhoma family in the Alexandra township; despite being rife with poverty, crime and pollution, Alexandra always remained "a treasured place" for him.
• core sentence: Mandela rented a room in the house of the Xhoma family.
• context sentence: This was when earning a small wage.
• context sentence: This was in the Alexandra township.
• core sentence: Alexandra always remained "a treasured place" for him.
• context sentence: This was despite being rife with poverty, crime and pollution.
(9.4.3) Example 3
prepositional and participial phrases and conjoined clauses:
original sentence: In early 1947, his three years of articles ended at Witkin, Sidelsky and Eidelman, and he decided to become a full-time student, subsisting on loans from the Bantu Welfare Trust.
• core sentence: His three years of articles ended.
• context sentence: This was in early 1947.
• context sentence: This was at Witkin, Sidelsky and Eidelman.
• core sentence: And he decided to become a full-time student.
• context sentence: This was when subsisting on loans from the Bantu Welfare Trust.
(9.4.4) Example 4

intra-sentential attribution and prepositional phrase and relative clause:
original sentence: In July 2013, it was reported that Google had hosted a fundraising event for Oklahoma Senator Jim Inhofe, who has called climate change a "hoax".
• core sentence: Google had hosted a fundraising event.
• context sentence: This was for Jim Inhofe.
• context sentence: This was what was reported in July 2013.
• context sentence: Jim Inhofe has called climate change a "hoax".
• context sentence: Jim Inhofe was Oklahoma Senator.
(9.4.5) Example 5
intra-sentential attribution and appositive and prepositional phrase:
original sentence: In a post on Google’s blog, Google Chief Executive and cofounder Larry Page revealed that the acquisition was a strategic move to strengthen Google’s patent portfolio.
• core sentence: The acquisition was a strategic move.
• context sentence: This was to strengthen Google’s patent portfolio.
• context sentence: This was what Larry Page revealed in a post on Google’s blog.
• context sentence: Larry Page was Google Chief Executive and cofounder.
(9.4.6) Example 6
In contrast, for approximately a fifth of the sentences from the test set, the outcome has been classified as a "negative example". Generally, such rejected output can be traced back either to parsing errors (causing around one third of the errors observed in the simplified sentences) - which themselves mostly occur due to a particularly complex syntactic structure of the input sentence - or to heuristics that do not perform well on the concrete example (representing the source of error in about two thirds of all cases). Through a more in-depth analysis of the problematic sentences, a number of error-prone features have been identified. These primarily include complex coordinated NPs as well as long appositions containing VPs, since constructs of this type are often parsed incorrectly and therefore cannot be resolved properly by our system. Moreover, there are some syntactic structures requiring a revision of the specified heuristics. This most notably applies to PPs which are not set off by commas. Beyond that, a more sophisticated heuristic for discriminating coordinated NPs from appositives is required. The examples presented in figure 9.5 illustrate negative output incorporating components of the aforementioned types.
Figure 9.5.: Examples of negative outcome
error-prone features:
• complex coordinated NPs
• long appositions containing VPs
• PPs without segregation through punctuation
• discrimination of NPs from appositives
(9.5.1) Legend
complex coordinated noun phrases:
original sentence: Recreational amenities are scattered throughout the campus and include a workout room with weights and rowing machines, locker rooms, washers and dryers, a massage room, assorted video games, table football, a baby grand piano, a billiard table, and ping pong.
• core sentence: Recreational amenities are scattered throughout the campus and include a workout room with weights and rowing machines.
• context sentence: Locker rooms were rowing machines.
• context sentence: Washers and dryers were locker rooms.
• context sentence: A massage room was washers and dryers.
• context sentence: Assorted video games were a massage room.
• context sentence: Table football was assorted video games.
• context sentence: A baby grand piano was table football.
• context sentence: A baby grand piano was a billiard table, and ping pong.
(9.5.2) Example 1
(9.5.3) Parse tree of example 1
complex coordinated noun phrases:
original sentence: The service is Google’s "email redesigned", with realtime editing, the ability to embed audio, video, and other media, and extensions that further enhance the communication experience.
• core sentence: The service is Google’s "email redesigned", the ability to embed audio, video, and other media, and extensions that further enhance the communication experience.
• context sentence: This is with realtime editing.
(9.5.4) Example 2
(9.5.5) Parse tree of example 2
complex coordinated noun phrases:
original sentence: Of a population of 40 million, around 23 million lacked electricity or adequate sanitation, 12 million lacked clean water supplies, with 2 million children not in school and a third of the population illiterate.
• core sentence: Around 23 million lacked electricity or adequate sanitation lacked clean water supplies.
• context sentence: This was of a population of 40 million.
• context sentence: This was with 2 million children not in school and a third of the population illiterate.
• context sentence: Adequate sanitation was 12 million.
(9.5.6) Example 3
(9.5.7) Parse tree of example 3
long apposition containing a verb phrase:
original sentence: The complex became known as the Googleplex, a play on the word googolplex, the number one followed by a googol zeroes.
• core sentence: The complex became known as the Googleplex, the number one followed by a googol zeroes.
• context sentence: The Googleplex was a play on the word googolplex.
(9.5.8) Example 4
(9.5.9) Parse tree of example 4
prepositional phrase and a complex apposition:
original sentence: In 1994, free healthcare was introduced for children under six and pregnant women, a provision extended to all those using primary level public sector health care services in 1996.
• core sentence: Free healthcare was introduced for children under six and pregnant women, a provision extended to all those using primary level public sector health care services.
• context sentence: This was in 1994.
• context sentence: This was in 1996.
(9.5.10) Example 5
(9.5.11) Parse tree of example 5
prepositional phrases:
original sentence: Deciding on the need for a youth wing to mass mobilise Africans in opposition to their subjugation, Mandela was among a delegation that approached ANC President Alfred Bitini Xuma on the subject at his home in Sophiatown; the African National Congress Youth League (ANCYL) was founded on Easter Sunday 1944 in the Bantu Men’s Social Centre in Eloff Street, with Lembede as President and Mandela as a member of the executive committee.
• core sentence: Mandela was among a delegation that approached Alfred Bitini Xuma on the subject at his home.
• context sentence: This was when deciding on the need for a youth wing to mass mobilise Africans in opposition to their subjugation.
• context sentence: Alfred Bitini Xuma was ANC President.
• context sentence: This was in Sophiatown.
• core sentence: The African National Congress Youth League was founded on Easter Sunday 1944 in the Bantu Men’s Social Centre in Eloff Street as President and Mandela as a member of the executive committee.
• context sentence: This was with Lembede.
(9.5.12) Example 6
prepositional phrases:
original sentence: In 2005, he founded the Nelson Mandela Legacy Trust, to speak before the Brookings Institution and the NAACP on the need for economic assistance to Africa.
• core sentence: He founded the Nelson Mandela Legacy Trust, to speak before the Brookings Institution and the NAACP on the need for economic assistance.
• context sentence: This was in 2005.
• context sentence: This was to Africa.
(9.5.13) Example 7

coordinated noun phrases versus appositives:
original sentence: As of September 2013, Google’s East Coast office is located at 76 Ninth Ave, New York City, New York.
• core sentence: Google’s East Coast office is located.
• context sentence: This is as of September 2013.
• context sentence: 76 Ninth Ave is New York City, New York.
• context sentence: This is at 76 Ninth Ave.
(9.5.14) Example 8

coordinated noun phrases versus appositives:
original sentence: The couple had two daughters, both named Makaziwe Mandela.
• core sentence: The couple had two daughters, both named Makaziwe Mandela.
(9.5.15) Example 9
On the basis of these findings on error-prone syntactic structures, our simplification rules are to be improved in the future.
9.2.2. Elimination of Sentences Providing a Particular Syntactic Structure from the Result Set
By carefully examining the outcome of thousands of input sentences from our Wikipedia test set, we have noticed that there are some further syntactic structures which almost always lead to malformed output. In order to avoid producing distorted results predicated on these features, we apply an elaborate heuristic for passing over input sentences incorporating these specific structural components. As mentioned in section 7.2.9, bracketed constituents are removed a priori, since the parser frequently fails to correctly attach them to their respective antecedents (for an example, see the parse tree displayed in figure 9.7.2). Hence, instead of transforming these components into stand-alone context sentences, we simply delete them from the main sentence, which inevitably results in a slight loss of information (see the example in figure 9.7.1). However, this is usually acceptable, as parenthetical constituents generally describe only some incidental piece of information. Moreover, we skip sentences containing a colon or semicolon that does not separate two full sentences, i. e. sentences that do not present the syntactic structure depicted in figure 9.6, but rather comprise an NP representing an enumeration of components or an explicative statement in terms of keywords as the second component. In those cases, too, the parser regularly fails to derive the proper referring phrase (for an example, see the parse tree depicted in figure 9.7.3). Therefore, sentences comprising such a syntactic structure are ignored by our simplification system as well.
(S (S . . .) ([;|:]) (S . . .))
Figure 9.6.: Syntax parse tree of a (semi)colon separating two full sentences
Finally, outcome including a core sentence that starts with a participle (as denoted by the POS tag ’VBN’ or ’VBG’, respectively) is removed from the result set, since this has proven a strong indicator of an inaccurately split input (e. g. "Replacing Mbeki as Deputy President, Mandela and the Executive supported the candidacy of Jacob Zuma." being proposed as a simplified core sentence).
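A hedged sketch of these filtering heuristics - assuming NLTK trees with Penn Treebank labels, where both ";" and ":" receive the POS tag ":" (cf. figure 9.6) - might look as follows:

from nltk import Tree

def colon_separates_full_sentences(tree):
    """Accept a (semi)colon only when it sits between two S nodes."""
    labels = [k.label() for k in tree]
    if ":" not in labels:
        return True                  # no (semi)colon, nothing to check
    return labels[:3] == ["S", ":", "S"]

def starts_with_participle(core_tree):
    """Cores opening with VBN/VBG indicate an inaccurately split input."""
    first_tag = core_tree.pos()[0][1]
    return first_tag in ("VBN", "VBG")

def keep(input_tree, core_trees):
    """Skip the input, or drop its output from the result set, when one
    of the error indicators fires."""
    return (colon_separates_full_sentences(input_tree)
            and not any(starts_with_participle(c) for c in core_trees))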
Figure 9.7.: Examples of sentence structures that are eliminated
bracketed constituents:
original sentence: Mandela began work on a Bachelor of Arts (BA) degree at the University of Fort Hare.
• core sentence: Mandela began work on a Bachelor of Arts degree at the University of Fort Hare.
(9.7.1) Example 1
(9.7.2) Parse tree of example 1
(9.7.3) Parse tree of an example sentence containing a colon
That way, no more than about 3% of the input sentences from the underlying test set are disregarded by our text simplification framework (see table 9.1), corresponding to a total of 41 sentences. Thus, only a negligibly small number of the input sentences is rejected by our system, while plenty of distorted output is prevented. In contrast, the baseline deploys a rather crude heuristic for eliminating specific input, simply skipping every input sentence that contains more than 40 tokens. However, we do not consider sentence length a reliable attribute for avoiding malformed outcome, as there are many long sentences in the test set which our framework is able to resolve perfectly (see examples in figure 9.8).
Figure 9.8.: Examples of long sentences that are perfectly simplified

(9.8.1) example 1:
original sentence: Although dismantling press censorship, speaking out in favour of freedom of the press, and befriending many journalists, Mandela was critical of much of the country’s media, noting that it was overwhelmingly owned and run by middle-class whites and believing that it focused too much on scaremongering around crime.
• core sentence: Mandela was critical of much of the country’s media.
• context sentence: This was when noting that it was overwhelmingly owned and run by middle-class whites and believing that it focused too much on scaremongering around crime.
• context sentence: This was although dismantling press censorship, speaking out in favour of freedom of the press, and befriending many journalists.

(9.8.2) example 2:
original sentence: In September 1953, Andrew Kunene read out Mandela’s "No Easy Walk to Freedom" speech at a Transvaal ANC meeting; the title was taken from a quote by Indian independence leader Jawaharlal Nehru, a seminal influence on Mandela’s thought.
• core sentence: Andrew Kunene read out Mandela’s "No Easy Walk to Freedom" speech.
• context sentence: This was in September 1953.
• context sentence: This was at a Transvaal ANC meeting.
• core sentence: The title was taken from a quote.
• context sentence: This was by Jawaharlal Nehru.
• context sentence: Jawaharlal Nehru was a seminal influence on Mandela’s thought.
• context sentence: Jawaharlal Nehru was an Indian independence leader.
9.2.3. Informal Comparison to the Baseline
When comparing the output sentences provided by our text simplification framework with those returned by the baseline system, some considerable differences can be discerned. It is particularly noticeable that the latter tends to drop a larger number of constituents from the input - mainly subordinate clauses - instead of transforming them into separate sentences (see examples in figure 9.9).
Figure 9.9.: Examples of sentences that have been simplified by the baseline system
(9.9.1) Example 1: dropping of a participial phrase and a bracketed constituent
original sentence: Influenced by Marxism, he secretly joined the South African Communist Party (SACP).
• core sentence: He secretly joined the South African Communist Party.
corresponding result of our text simplification framework:
• core sentence: He secretly joined the South African Communist Party.
• context sentence: This was when being influenced by Marxism.
(9.9.2) Example 2: dropping of an adverb and a prepositional phrase
original sentence: Originally, Google ran under Stanford University’s website, with the domains google.stanford.edu and z.stanford.edu.
• core sentence: Google ran under Stanford University’s website.
corresponding result of our text simplification framework:
• core sentence: Google ran.
• context sentence: This was under Stanford University’s website.
• context sentence: This was with the domains google.stanford.edu and z.stanford.edu.
• context sentence: This was originally.
(9.9.3) Example 3: dropping of a participial phrase
original sentence: Found guilty, he was sentenced to five years’ imprisonment; as he left the courtroom, supporters sang Nkosi Sikelel iAfrika.
• core sentence: He was sentenced to five years’ imprisonment.
• context sentence: Supporters sang Nkosi Sikelel iAfrika.
• context sentence: He left the courtroom.
corresponding result of our text simplification framework:
• core sentence: He was sentenced to five years’ imprisonment.
• context sentence: This was when being found guilty.
• core sentence: Supporters sang Nkosi Sikelel iAfrika.
• context sentence: This was as he left the courtroom.
Beyond that, the baseline sets no great value upon preserving the coherence of the individual split sentences that have been generated from the input. This can be attributed to its goal of producing concise output maintaining the truth conditions of the original sentence, which can then - independently of each other - be readily transformed into questions (see the example in figure 9.10). However, this approach often leads to shortened, syntactically simplified, yet rather loose sentences that lack any correlation (see the example in figure 9.11). In contrast, in order to preserve the information content of the input as accurately as possible, our simplification framework aims at retaining cohesive relations between a core and its associated context sentences (see the examples in figure 9.11 and figure 9.9.3).

positive baseline simplification example:
original sentence: That year, he began his autobiography, which was smuggled to London, but remained unpublished at the time; prison authorities discovered several pages, and his study privileges were stopped for four years.
• core sentence: He began his autobiography.
• context sentence: Prison authorities discovered several pages.
• context sentence: His study privileges were stopped for four years.
• context sentence: His autobiography was smuggled to London.
• context sentence: His autobiography remained unpublished at the time.
corresponding result of our text simplification framework:
• core sentence: He began his autobiography, but remained unpublished at the time.
• context sentence: This was that year.
• context sentence: His autobiography was smuggled to London.
• core sentence: Prison authorities discovered several pages.
• core sentence: And his study privileges were stopped for four years.
Figure 9.10.: Concise output produced by the baseline system

In the future, it might be worthwhile to contrast the quality of the simplifications of our framework with that of the baseline system more precisely on the basis of an established procedure. Hence, it is recommended to follow the widely recognised evaluation methodology used in [86, 85, 75, 20, 61, 71], which draws on a small-scale comparison of the outcome of multiple simplification systems by human judges who rate the output sentences on a 5-point Likert scale [55] with respect to the three criteria of simplicity, fluency and adequacy. That way, both systems may be carefully examined in contrast, allowing a more well-founded statement about their performance in relation to each other.
example of loose simplification outcome generated by the baseline:
original sentence: He rejected the offer and later criticized Vinod Khosla, one of Excite’s venture capitalists, after he negotiated Brin and Page down to $750,000.
• core sentence: He rejected the offer.
• context sentence: He later criticized Vinod Khosla.
• context sentence: He negotiated Brin and Page down to $750,000.
• context sentence: Vinod Khosla was one of Excite’s venture capitalists.
corresponding result of our text simplification framework:
• core sentence: He rejected the offer and later criticized Vinod Khosla.
• context sentence: This was after he negotiated Brin and Page down to $750,000.
• context sentence: Vinod Khosla was one of Excite’s venture capitalists.

Figure 9.11.: Incoherent output produced by the baseline system
Part V. Context Classification
In order to increase the value of our text simplification framework for subsequent IE tasks, we went one step further. The idea was to allocate each context sentence to one of a number of predefined categories specifying the type of content it describes. That way, the correlations between a particular context sentence and the core sentence to which it is attached should be easier to capture for an IE system that is processing the output of our text simplification framework. Hence, first of all, a taxonomy of context classes had to be designed. Using the characterization of relations between parts of a text that is applied in RST [58] as an orientation, we created a classification consisting of nine categories, with most of them divided into a set of subordinate classes. On the basis of this taxonomy, we have then manually assigned each context sentence contained in our previously gathered gold reference (see section 8.2) to its corresponding class label, resulting in a corpus of almost 3,000 annotated sentences. Finally, a supervised classifier has been trained on this dataset, thus building a model for automatically tagging new contextual sentences with their respective category. The next chapters are structured as follows: chapter 10 first gives an overview of the concept of RST. Hereupon, we present the taxonomy of context classes which has been designed on the basis of this approach. Chapter 11 then reports on the results of training a classifier on the corpus we have compiled using this classification of context types.
10. Taxonomy of Context Classes

Pursuing the objective of automatically assigning each context sentence that has been separated out of an input sentence by our text simplification system to a category specifying its content, first of all a taxonomy of context classes is required. Therefore, based upon the set of discourse relations defined within the framework of RST, we have devised a classification of context types, which will be detailed in the next sections.
10.1. Rhetorical Structure Theory as Basis

RST is a theory of text organization that was created in the 1980s by a group of researchers interested in NL generation. To date, it is a widely used framework for structuring discourse in NL generation [71], as well as in other fields of computational linguistics, where it is often employed to plan coherent text and to parse textual structures [79].

RST characterizes the structure of a text in terms of relations that hold between its parts (so-called 'spans'). Based on the rationale that naturally occurring texts commonly present a hierarchical, connected arrangement of their components, in which every part of the text has a specific role - a function to play - relative to the other ones, it offers a way of reasoning about textual coherence [79]. In other words: assuming that a coherent text should not exhibit any gaps, a text is regarded as fully connected if every span has a particular purpose and is linked with the rest of the text by means of some relation.

In RST, two main types of such relations are distinguished: nucleus-satellite and multinuclear ones. The former reflect hypotactic syntactic structures, as the nuclei are supposed to represent the most important units of a text, while their associated satellites provide secondary information contributing to their respective nucleus [57]. Multinuclear relations, in contrast, mirror a paratactic syntax where no span is considered more central than the other. Corresponding examples are depicted in figure 10.3. The first relation, 'concession' (see figure 10.1), represents a nucleus-satellite relationship where the nucleus ("we shouldn’t embrace every popular issue that comes along") is considered more important than the information described in its satellite ("tempting as it may be"). The second one, 'contrast' (see figure 10.2), however, forms a multinuclear relation, joining together two units that are accorded equal relevance [79].
Figure 10.1.: 'Concession' relation
Figure 10.2.: 'Contrast' relation
Figure 10.3.: Examples of a nucleus-satellite and a multinuclear relation [79]
RST has been designed to enable the analysis of texts by making available a set of rhetorical relations to annotate a given text [57]. Hence, when examining a text, the analyst (or ’observer’) allocates - on the basis of plausibility judgements - a specific role to each text span, thereby constructing a rhetorical structure tree with one top-level relation that encompasses the remaining relations at lower levels, thus including every part of the analyzed text in one connected whole [79]. An example of such an analysis is illustrated in figure 10.4. By definition, the arrow points away from the satellite, while its head is directed towards the nucleus to which it refers [79].
Figure 10.4.: A rhetorical structure tree [57]

As mentioned above, RST relations identify relationships that can hold between two units of a text. They are specified by means of the following four fields [71]:
1. constraints on the nucleus (N)
2. constraints on the satellite (S)
3. constraints on the combination of nucleus and satellite (N + S)
4. the effect achieved on the reader, and the writer's intentions, respectively (W)
For example, the 'evidence' relation is defined as follows [57]:
1. constraints on N: The reader might not believe the nucleus to a degree satisfactory to the writer.
2. constraints on S: The reader believes the satellite or will find it credible.
3. constraints on both: The reader's comprehending of the satellite increases the reader's belief of the nucleus.
4. W's intentions: The reader's belief of the nucleus is increased.

These relation definitions are based solely on functional and semantic criteria, not on morphological or syntactic patterns, since no reliable or unambiguous linguistic signals for any of the relations were discovered [79].

By now, multiple different lists of rhetorical relations exist. The original set, the so-called 'Classical RST', was defined in 1988 in [58] and consists of 24 relations in total (see table 10.1 and table 10.2). More recent work has added definitions for 'list', 'means', 'preparation', 'unconditional' and 'unless'. Besides, the 'restatement' relation has been split up into a nuclear and a multinuclear variant. Accordingly, the set of relations has meanwhile grown to 30 in total. In fact, the authors still encourage analysts to further modify and extend the existing collection of definitions where the current one is inadequate [79].

Relation Name | Nucleus | Satellite
Antithesis | ideas favored by the author | ideas disfavored by the author
Background | text whose understanding is being facilitated | text for facilitating understanding
Circumstance | text expressing the events or ideas occurring in the interpretive context | an interpretive context of situation or time
Concession | situation affirmed by author | situation which is apparently inconsistent but also affirmed by author
Condition | action or situation whose occurrence results from the occurrence of the conditioning situation | conditioning situation
Elaboration | basic information | additional information
Enablement | an action | information intended to aid the reader in performing an action
Evaluation | a situation | an evaluative comment about the situation
Evidence | a claim | information intended to increase the reader's belief in the claim
Interpretation | a situation | an interpretation
Justify | text | information supporting the writer's right to express the text
Motivation | an action | information intended to increase the reader's desire to perform the action
Non-volitional Cause | a situation | another situation which causes that one, but not by anyone's deliberate action
Non-volitional Result | a situation | another situation which is caused by that one, but not by anyone's deliberate action
Otherwise (anti conditional) | action or situation whose occurrence results from the lack of occurrence of the conditioning situation | conditioning situation
Purpose | an intended situation | the intent behind the situation
Restatement | a situation | a reexpression of the situation
Solutionhood | a situation or method supporting full or partial satisfaction of the need | a question, request, problem, or other expressed need
Summary | text | a short summary of the text
Volitional Cause | a situation | another situation which causes that one, by someone's deliberate action
Volitional Result | a situation | another situation which is caused by that one, by someone's deliberate action

Table 10.1.: Nucleus-satellite relations [57]

Relation Name | Nucleus | Satellite
Contrast | one alternate | the other alternate
Joint | (unconstrained) | (unconstrained)
Sequence | an item | a next item

Table 10.2.: Multinuclear relations [57]
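For illustration only, such a four-field relation definition might be captured in a small data structure. The class below is a hypothetical sketch (it is not part of any RST tool or of our framework) instantiating the 'evidence' relation defined above.

```java
/** Illustrative container for the four fields of an RST relation definition. */
public class RstRelation {
    final String name;
    final String constraintsOnNucleus;    // constraints on N
    final String constraintsOnSatellite;  // constraints on S
    final String constraintsOnBoth;       // constraints on N + S
    final String effect;                  // W's intentions

    RstRelation(String name, String onN, String onS, String onBoth, String effect) {
        this.name = name;
        this.constraintsOnNucleus = onN;
        this.constraintsOnSatellite = onS;
        this.constraintsOnBoth = onBoth;
        this.effect = effect;
    }

    /** The 'evidence' relation as defined in [57]. */
    static final RstRelation EVIDENCE = new RstRelation(
        "evidence",
        "The reader might not believe the nucleus to a degree satisfactory to the writer.",
        "The reader believes the satellite or will find it credible.",
        "The reader's comprehending of the satellite increases the reader's belief of the nucleus.",
        "The reader's belief of the nucleus is increased.");
}
```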
10.2. Proposed Taxonomy

Guided by the aforementioned characterization of relations between units of a text that are defined within the framework of RST, we have developed a taxonomy of context classes for assigning each context sentence to one of a number of predetermined categories which denote the type of content it describes. That way, the relationship between a context sentence and the core sentence to which it refers is to be captured adequately.
In building up a comprehensive classification of context labels that may be used to properly specify, for every context sentence in the dataset, the functional or semantic link to the core sentence to which it is attached, we drew upon the original set of relation definitions, the Classical RST, comprising the 24 relationship types depicted in tables 10.1 and 10.2. As this categorization turned out not to be fully suitable for our purposes, though, we modified it according to our specific needs, resulting in the taxonomy of context classes displayed below (see figure 10.5).
• scope: time, location, degree, extent
• motivation: cause, purpose, condition
• result
• mode
• statement: attribution, elaboration, explanation, characterization, evidence
• paraphrasing: definition, exemplification, synonym, naming, abbreviation
• antithesis: concession, contrast
• source/target: origin, recipient
• cohesion: conjunction addition, conjunction opposition, conjunction consequence, conjunction exemplification, conjunction extent

Figure 10.5.: Taxonomy of context classes
10.2.1. Context Classes

On taking a closer look at our taxonomy, it becomes apparent that we have considerably restructured the original set of RST relation definitions by grouping together closely linked relationship types under a common superclass, which leads to a two-tier hierarchical order. At the same time, we have discarded some selected definitions appertaining to the Classical RST collection, since we did not consider them suitable for our purposes. On the other hand, whenever a more fine-grained view seemed reasonable, a variety of extra classes have been added. In the next sections, our classification of context types will be detailed by justifying these adaptations and contrasting the resulting taxonomy with the Classical RST relation definitions.

10.2.1.1. Type 'Scope'

Encompassing the subtypes of 'time', 'location', 'degree' and 'extent', the context class of 'scope' describes the range of the event or idea expressed in the core sentence, thereby answering one of the following questions:
• When? ('time')
• Where? ('location')
• At which rate/intensity? ('degree')
• Comprising which entities? ('extent')

Thus, 'scope' - in particular its temporal and local subcategories - resembles the 'circumstance' relation in RST. However, we decided to break it down into the aforementioned subclasses in order to obtain a more fine-grained classification which allows for a more equal distribution of the context types. Regarding our Wikipedia-based test set, for example, just over a third of the context sentences contained in it belong to the 'scope' category. Hence, without further subdividing this class, a rather coarse result would be attained at this point.

10.2.1.2. Type 'Motivation'

'Motivation' refers to context sentences disclosing the incentive of the situation or action presented in the associated core sentence. This class is separated into the three subtypes listed below:
• 'cause': indicating the reason for what is stated in the core sentence
• 'purpose': signifying the intentions or objectives that are pursued with the idea or event depicted in the main sentence
• 'condition': denoting the prerequisites for the situation displayed in the core sentence to occur

Accordingly, in the 'motivation' category of our taxonomy we have brought together several RST relation definitions, namely 'volitional' and 'non-volitional cause', 'purpose', 'condition' and 'otherwise'. The former two classes are combined into a single category, as are the latter two, since we consider partitioning them not beneficial in our case: it would lead to a rather fragmented result, thereby impeding the interpretation of the outcome.
10.2.1.3. Type 'Result'

The 'result' class deployed in our taxonomy of context types - specifying the consequences of the action or situation reported in the core sentence - merges the RST relations of 'volitional' and 'non-volitional result'. Here again, we joined both categories together in order to avoid fragmenting the classification too much.

10.2.1.4. Type 'Mode'

A context class 'mode' has been introduced in our taxonomy. It indicates the manner in which the event or situation expounded in the main sentence is handled, or the means applied for its treatment. In the Classical RST, a comparable relation is not included. However, as it has proven useful in annotating our test set, we have added this category to our collection of context types.

10.2.1.5. Type 'Statement'

Under the category of 'statement', a variety of different types of declarations are subsumed:
• 'attribution': the entity to which the statement enunciated in the core sentence is ascribed
• 'elaboration': a statement that provides additional information about an entity presented in the core sentence
• 'explanation': a statement made to clarify the idea or event expressed in the core sentence and make it understandable
• 'characterization': a statement that describes the individual quality of a person or thing
• 'evidence': a statement intended to support the claim made in the core sentence

While the classes of 'elaboration' and 'evidence' have a direct counterpart in the RST relation definitions, 'explanation' and 'characterization' roughly reflect the categories of 'interpretation' and 'evaluation', respectively. The 'attribution' type, however, has been incorporated into our taxonomy of context classes to fill the gaps we have experienced when trying to allocate each sentence from the Wikipedia corpus to an appropriate context class.

10.2.1.6. Type 'Paraphrasing'

The context type of 'paraphrasing', which we have included in our taxonomy, roughly mirrors the RST relation of 'restatement'. However, we deemed a more finely graduated partitioning of this class expedient, as otherwise the assignment process would result in too coarse-grained an outcome, thereby obstructing a gain in knowledge. Thus, the 'paraphrasing' category has been divided into the following five subclasses:
• 'definition': representing a concise statement that describes an entity contained in the sentence to which it refers, thereby making it definite, distinct or clear
• 'exemplification': indicating a declaration illustrating a situation delineated in its appendant sentence
• 'synonym': specifying a word or expression that has the same meaning as a particular term mentioned in an associated sentence
• 'naming': providing the name of a person, place or thing that has been brought up in a related sentence
• 'abbreviation': presenting a shortened form of a word or phrase included in a sentence to which it is attached
10.2.1.7. Type 'Antithesis'

We have adopted the 'concession' and 'contrast' relations from the set of definitions provided in Classical RST, grouping them under the category of 'antithesis'. In accordance with their corresponding RST specifications, they designate either an alternative to the situation cited in the appendant core sentence (in the case of 'contrast') or a situation which is admittedly inconsistent with the one presented in its respective main sentence, but still holds true (in the case of 'concession').
10.2.1.8. Type 'Source/Target'

The 'source/target' category constitutes another class we found missing from the original set of RST relation definitions when annotating the sentences from our test set with their respective contextual labels. Therefore, we have added the following two classes to our taxonomy:
• 'origin': the author or source from which the event or idea expressed in the core sentence derives
• 'recipient': the addressee, beneficiary or target of the event or idea presented in the core sentence
10.2.1.9. Type 'Cohesion'

With 'conjunction addition', 'conjunction opposition', 'conjunction consequence', 'conjunction exemplification' and 'conjunction extent', the context class of 'cohesion' incorporates a variety of conjunctive-cohesive relations. These are not part of the Classical RST definitions, yet they have proven to be a valuable extension of our classification on various occasions.
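To make the two-tier structure of figure 10.5 concrete, the hypothetical sketch below models every subclass together with its supertype. This mapping is also what later allows the annotated corpus to be used twice, once with the fine-grained subclass labels and once with the coarser superclass labels (cf. chapter 11); the enum is illustrative and not part of our implementation.

```java
/** Sketch of the two-tier taxonomy of context classes (see figure 10.5):
 *  every subclass knows the superclass it belongs to. */
public enum ContextClass {
    TIME("scope"), LOCATION("scope"), DEGREE("scope"), EXTENT("scope"),
    CAUSE("motivation"), PURPOSE("motivation"), CONDITION("motivation"),
    RESULT("result"),
    MODE("mode"),
    ATTRIBUTION("statement"), ELABORATION("statement"), EXPLANATION("statement"),
    CHARACTERIZATION("statement"), EVIDENCE("statement"),
    DEFINITION("paraphrasing"), EXEMPLIFICATION("paraphrasing"),
    SYNONYM("paraphrasing"), NAMING("paraphrasing"), ABBREVIATION("paraphrasing"),
    CONCESSION("antithesis"), CONTRAST("antithesis"),
    ORIGIN("source/target"), RECIPIENT("source/target"),
    CONJUNCTION_ADDITION("cohesion"), CONJUNCTION_OPPOSITION("cohesion"),
    CONJUNCTION_CONSEQUENCE("cohesion"), CONJUNCTION_EXEMPLIFICATION("cohesion"),
    CONJUNCTION_EXTENT("cohesion");

    private final String superClass;

    ContextClass(String superClass) { this.superClass = superClass; }

    /** The coarser supertype used in the second training condition. */
    public String superClass() { return superClass; }
}
```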
10.2.1.10. Discarded Rhetorical Structure Theory Relation Definitions
When comparing the context classes of which our taxonomy is composed (see figure 10.5) with the relation definitions forming the Classical RST (see tables 10.1 and 10.2), it becomes obvious that we have abandoned quite a number of RST relationships, encompassing 'solutionhood', 'background', 'enablement', 'motivation', 'justify', 'antithesis', 'summary', 'sequence' and 'joint'. These types were discarded since we are convinced that factoring them into our classification would add no value, but rather result in an overly fine-grained and thus fragmented outcome that raises difficulties with regard to its interpretation.
10.2.2. Application to our Wikipedia-based Test Set

On the basis of the taxonomy of context classes described above, we have manually assigned an adequate class label to each context sentence contained in the previously compiled gold standard (cf. section 8.2), resulting in a corpus of almost 3,000 annotated sentences. Some examples taken from this dataset are displayed below (see figure 10.6). Beyond that, table 10.3 reveals the distribution of the individual context classes throughout our corpus.
Figure 10.6.: Examples of annotated context sentences

(10.6.1) example 1:
core sentence: Mandela had become a Class A prisoner.
• context sentence: This was by 1975. (time)
• context sentence: This was allowing greater numbers of visits and letters. (result)
• core sentence: He corresponded with anti-apartheid activists. (elaboration)
nested core sentence:
core sentence: He corresponded with anti-apartheid activists.
• context sentence: This was like Mangosuthu Buthelezi and Desmond Tutu. (exemplification)
(10.6.2) example 2:
core sentence: The BCM called for militant action.
• context sentence: This was when seeing the ANC as ineffectual. (cause)
• core sentence: But many BCM activists were imprisoned. (contrast)
nested core sentence:
core sentence: But many BCM activists were imprisoned.
• context sentence: This was on Robben Island. (location)
• context sentence: This was following the Soweto uprising of 1976. (time)

(10.6.3) example 3:
core sentence: Google and Fox Interactive Media of News Corporation entered into a $900 million agreement.
• context sentence: This was to provide search and advertising on MySpace. (purpose)
• context sentence: This was increasing its advertising reach even further. (result)
• context sentence: MySpace was a then-popular social networking site. (definition)

(10.6.4) example 4:
core sentence: "Baseball is the one closest in evolutionary descent to the older individual sports".
• context sentence: This is what Michael Mandelbaum argues when contrasting the game with both football and basketball. (attribution)
• context sentence: Michael Mandelbaum is a scholar. (characterization)

(10.6.5) example 5:
core sentence: A given stadium may acquire a reputation as a pitcher’s park or a hitter’s park.
• context sentence: This is if one or the other discipline notably benefits from its unique mix of elements. (condition)

(10.6.6) example 6:
core sentence: The earliest known reference to baseball is in a 1744 British publication.
• context sentence: This is by John Newbery. (origin)
• context sentence: A Little Pretty Pocket-Book is a 1744 British publication. (naming)

(10.6.7) example 7:
core sentence: Integration proceeded slowly.
• core sentence: Only six of the 16 major league teams had a black player on the roster. (evidence)
nested core sentence:
core sentence: Only six of the 16 major league teams had a black player on the roster.
• context sentence: This was by 1953. (time)

(10.6.8) example 8:
core sentence: Fair territory between home plate and the outfield boundary is baseball’s field of play.
• context sentence: This is though significant events can take place in foul territory, as well. (concession)
Context Class | "Baseball" | "Google" | "Mandela" | Σ (by absolute numbers) | Σ (in percentage terms)
Scope | 267 | 253 | 458 | 978 | 34.17%
  Time | 162 | 177 | 280 | 619 | 21.63%
  Location | 65 | 51 | 137 | 253 | 8.84%
  Degree | 23 | 9 | 22 | 54 | 1.89%
  Extent | 17 | 16 | 19 | 52 | 1.82%
Motivation | 81 | 59 | 150 | 290 | 10.13%
  Cause | 30 | 20 | 92 | 142 | 4.96%
  Purpose | 20 | 37 | 50 | 107 | 3.74%
  Condition | 31 | 2 | 8 | 41 | 1.43%
Result | 29 | 23 | 103 | 155 | 5.42%
Mode | 30 | 39 | 81 | 150 | 5.24%
Statement | 210 | 171 | 396 | 777 | 27.15%
  Attribution | 13 | 46 | 42 | 101 | 3.53%
  Elaboration | 134 | 61 | 196 | 391 | 13.66%
  Explanation | 28 | 24 | 36 | 88 | 3.07%
  Characterization | 28 | 37 | 89 | 154 | 5.38%
  Evidence | 7 | 3 | 33 | 43 | 1.50%
Paraphrasing | 95 | 85 | 93 | 273 | 9.54%
  Definition | 25 | 45 | 10 | 80 | 2.80%
  Exemplification | 29 | 13 | 24 | 66 | 2.31%
  Synonym | 9 | 1 | 4 | 14 | 0.49%
  Naming | 21 | 14 | 35 | 70 | 2.45%
  Abbreviation | 11 | 12 | 20 | 33 | 1.15%
Source/Target | 10 | 20 | 34 | 64 | 2.24%
  Origin | 8 | 8 | 24 | 40 | 1.40%
  Recipient | 2 | 12 | 10 | 24 | 0.84%
Antithesis | 37 | 12 | 90 | 139 | 4.86%
  Concession | 12 | 4 | 54 | 70 | 2.45%
  Contrast | 25 | 8 | 36 | 69 | 2.41%
Cohesion | 25 | 4 | 7 | 36 | 1.26%
  Conjunction addition | 5 | 2 | 0 | 7 | 0.24%
  Conjunction opposition | 7 | 1 | 6 | 14 | 0.49%
  Conjunction consequence | 6 | 1 | 1 | 8 | 0.28%
  Conjunction exemplification | 3 | 0 | 0 | 3 | 0.10%
  Conjunction extent | 4 | 0 | 0 | 4 | 0.14%
Σ | 784 | 666 | 1412 | 2862 | 100%

Table 10.3.: Distribution of context classes
11. Training a Classifier for Automatically Annotating Context Sentences

After having built the taxonomy of context classes that has been elucidated in the prior chapter, we have trained a supervised classifier for automatically tagging previously unseen contextual sentences with their respective categories. For this purpose, Lucid Science (source code available at: https://gitlab.com/textminingprojectgroup/lucidscience), a program identifying discourse elements in scientific texts, has been deployed. Using a Convolutional Neural Network that operates on the features of lemmas, POS tags, supersense tags and word vectors, it is able to learn and predict the semantic or functional category to which an input sentence belongs.
Figure 11.1.: Evaluation of neural networks [16]

The Lucid Science classifier has been trained on our Wikipedia-based corpus of almost 3,000 manually annotated sentences (cf. section 10.2.2) - once using the whole range of context classes in the form of the fine-grained subcategories (apart from 'cohesion'), and once using just their coarser supertypes as class labels. After that, based on the metrics of accuracy, F1 score and confusion matrix, the modelling, generalization and confidence performance of the classifier with respect to our dataset has been evaluated (see figure 11.1). The results of these computations are depicted below (see figure 11.2 for the subclasses, and figure 11.3 for the superclasses).
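All three metrics named above can be read off a confusion matrix. The sketch below shows one conventional way of deriving accuracy and a macro-averaged F1 score from such a matrix; whether Lucid Science averages the per-class F1 scores in exactly this way is an assumption.

```java
/** Sketch: accuracy and macro-averaged F1 from a confusion matrix m, where
 *  m[gold][predicted] counts sentences with gold label 'gold' that the
 *  classifier assigned to 'predicted'. */
public class ClassifierMetrics {

    public static double accuracy(int[][] m) {
        double correct = 0, total = 0;
        for (int g = 0; g < m.length; g++) {
            for (int p = 0; p < m[g].length; p++) {
                total += m[g][p];
                if (g == p) correct += m[g][p];
            }
        }
        return correct / total;
    }

    public static double macroF1(int[][] m) {
        int k = m.length;
        double sum = 0;
        for (int c = 0; c < k; c++) {
            double tp = m[c][c], fp = 0, fn = 0;
            for (int i = 0; i < k; i++) {
                if (i != c) {
                    fp += m[i][c];   // predicted c, but gold label was i
                    fn += m[c][i];   // gold label c, but predicted i
                }
            }
            double precision = (tp + fp == 0) ? 0 : tp / (tp + fp);
            double recall = (tp + fn == 0) ? 0 : tp / (tp + fn);
            sum += (precision + recall == 0)
                 ? 0 : 2 * precision * recall / (precision + recall);
        }
        return sum / k;   // unweighted mean over the k classes
    }
}
```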
Figure 11.2.: Evaluation results when using the subclass labels
(11.2.1) Modelling performance
(11.2.2) Generalization performance
(11.2.3) Confidence performance
Figure 11.3.: Evaluation results when using the superclass labels
(11.3.1) Modelling performance
(11.3.2) Generalization performance
(11.3.3) Confidence performance
Table 11.1 sums up the evaluation results by presenting the average modelling, generalization and confidence scores in terms of both accuracy and the F1 measure that are achieved by the classifier when using the more general superclass tags and the more specific subclass labels, respectively, for training its classification model.
 | Modelling Score (accuracy / F1) | Generalization Score (accuracy / F1) | Confidence Score (accuracy / F1)
subclass labels | 0.41 / 0.35 | 0.34 / 0.27 | 0.54 / 0.35
superclass labels | 0.55 / 0.56 | 0.46 / 0.47 | 0.83 / 0.60
Table 11.1.: Average scores achieved by the classifier

The table reveals that the classifier trained on the smaller and more generic set of superclass labels consistently outperforms the one using the full range of categories from our taxonomy of context classes for learning its classification model. However, with a generalization accuracy of only 0.34 (subclass-trained model) and 0.46 (superclass-trained model), both models yield rather moderate results indicating a fairly poor reliability of the classifier, even though the superclass variant - which is able to classify almost half of the given input sentences correctly on average - surpasses its subclass counterpart to a considerable degree.
Part VI. Conclusion
In the following chapters, this thesis will be concluded with a summary of the results achieved by our text simplification approach. First of all, in chapter 12, the contributions of our work will be outlined. Hereupon, we discuss the performance of our framework - including potential scope for improvement - in chapter 13. Finally, some directions for future work will be pointed out in chapter 14.
12. Contributions

In this work, we have presented a framework which performs syntax-driven rule-based sentence simplification by disembedding those structural components of a sentence that customarily supply mere secondary information and transforming them into stand-alone context sentences, while at the same time reducing the original sentence to one or more core sentences comprising only those constituents of the input that convey central information.

Following previous attempts at syntax-based sentence compression, our approach is based on the assumption that specific structural constituents of a sentence commonly provide no more than some incidental piece of information and thus may be separated out without losing its key message. However, instead of simply discarding these components - which would inevitably result in a considerable loss of information -, we make use of syntactic text simplification operations to transform them into self-contained associated context sentences. While previous work in the area of sentence compression concentrates solely on identifying and dropping peripheral syntactic constructs in order to clear space for including more relevant content in length-limited summaries, syntactic text simplification approaches generally focus on generating structurally simplified output without explicitly differentiating between output sentences expressing important information and those disclosing less meaningful content. Hence, to the best of our knowledge, we are the first to merge these two approaches by simplifying the syntax of a sentence through splitting it up into several smaller ones, with the aim of producing one core sentence (or more, if appropriate) containing only the gist of the input, and a set of accompanying context sentences that supply additional background information about the fact presented in the main sentence to which they refer. That way, the original information of the input sentence is preserved to its full extent, in the form of multiple affiliated, though structurally simplified sentences which are subdivided into a concise core statement and related context sentences expressing appertaining secondary information. The thus created compact output sentences are likely to be easier to process for subsequently applied IE systems. However, the examination of whether a preceding syntactic simplification indeed eases the problem of accessing factual information in NL text is subject to future work (see chapter 14).

More specifically, the contributions of this thesis are:

• the definition of a set of syntax-based transformation rules for simplifying NL text: Building upon prior approaches in the area of both sentence compression and syntactic text simplification which deploy a set of hand-crafted grammar rules to transform input sentences, we have conducted an extensive linguistic analysis of sentences from the English Wikipedia. On the basis of this study, we have adopted (and modified, where necessary) those heuristics presented in previous works that seemed promising for our purposes. Beyond that, we have expanded the existing rule set by a variety of heuristics handling formerly disregarded syntactic constructs, notably restrictive appositive phrases, lead noun phrases, as well as some selected punctuation. Moreover, we strike a new path with regard to the treatment of prepositional phrases. In this connection, we describe how sentences can be split into core and associated context sentences using the set of linguistically motivated transformation rules we have specified.

• their implementation within a sentence-level text simplification framework: Predicated on our set of syntax-based grammar rules, we have implemented a text simplification system which is coded in Java (source code available at: https://gitlab.com/cnik/sentence-simplification). In our implementation, we make use of the Stanford Parser to first detect sentence boundaries, and then tokenize and produce a parse tree representation of each sentence. In addition, the Stanford POS Tagger and the Stanford Named Entity Recognizer are applied for creating further representations which are exploited by some of the rules we have defined. By means of these sentence representations, the structural constituents that our algorithm operates on are identified and manipulated according to the specified rules, using the Stanford Parser's API which allows, among other things, for inserting and deleting nodes in the parse tree of a sentence, as well as for changing its labels (see the sketch after this list).

• the construction of a domain-independent text simplification corpus based on Wikipedia articles: As none of the corpora that have been previously applied in the evaluation of text simplification or sentence compression approaches was deemed adequate for assessing the performance of our framework, we have compiled our own test set, consisting of 1,300 sentences from three carefully selected English Wikipedia articles that have been manually simplified into core and related context sentences. In fact, this dataset is the first to comprise parallel complex-simple sentence pairs with each simplified version being split into core and accompanying context sentences, thus clearly distinguishing between fundamental and incidental information.

• the evaluation of the performance of the specified system on the basis of this dataset: Using this Wikipedia-based corpus of 1,300 aligned complex source and simplified target sentences, we have conducted a thorough evaluation for assessing the quality of the output produced by our sentence simplification approach, taking into account not only a number of automatic measures, but also conducting a large-scale manual analysis of the output sentences which have been generated by our system.

• the specification of a taxonomy of context classes for training a classifier that automatically assigns the extracted context sentences to their respective contextual categories: On the basis of the set of discourse relations defined within the framework of RST, we have devised a taxonomy of context classes with the aim of assigning each context sentence that is separated out of an input sentence an adequate class label signifying the type of content it describes. With the help of this classification, we have then extended our previously gathered dataset by allocating each context sentence to one of the predetermined categories contained in our taxonomy, resulting in a set of approximately 3,000 manually annotated context sentences. Finally, this dataset has been used for training a supervised classifier to attribute a previously unseen contextual sentence to its respective context class.
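The kind of parse-tree manipulation referred to in the implementation bullet above can be illustrated with a minimal, hypothetical sketch. It deletes parenthetical (PRN) nodes, mirroring the treatment of bracketed constituents described in section 9.2.2; it is not the framework's actual code, and it only uses basic operations of the Stanford Tree API.

```java
import edu.stanford.nlp.trees.Tree;

/** Sketch of a destructive tree edit: remove every parenthetical (PRN)
 *  constituent from a parse tree, as done for bracketed components. */
public class TreeManipulationSketch {

    public static void deleteParentheticals(Tree node) {
        // Iterate right to left so that removals do not shift pending indices.
        for (int i = node.numChildren() - 1; i >= 0; i--) {
            if ("PRN".equals(node.getChild(i).value())) {
                node.removeChild(i);
            } else {
                deleteParentheticals(node.getChild(i));
            }
        }
    }

    public static void main(String[] args) {
        Tree t = Tree.valueOf("(ROOT (S (NP (NNP Mandela)) (VP (VBD began)"
            + " (NP (NP (DT a) (NN degree))"
            + " (PRN (-LRB- -LRB-) (NP (NN BA)) (-RRB- -RRB-))))))");
        deleteParentheticals(t);
        t.pennPrint();   // tree without the bracketed "(BA)" constituent
    }
}
```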
13. Summary of the Results and Scope for Improvement

We now summarize the outcome of our work by making assertions about the capacity of our text simplification system and demonstrating - on the basis of the results obtained in the evaluation procedure - to what extent it indeed complies with these assertions, thereby discussing a number of problems we have encountered in the course of assessing the framework's performance.
• Assertion 1: Using a set of linguistically motivated syntax-driven transformation rules, structural components of a sentence that provide only some peripheral piece of information can be captured and separated out of the source, resulting in one to n core sentences that are compressed to the key information of the original sentence.

The aforementioned statement is strongly supported by the findings of the assessment of the quality of the compressed core sentences, which have been examined irrespective of their associated context sentences (see section 9.1.2). This survey has demonstrated that only about 20% of the words included in the resulting core sentences represent dispensable components in the form of false positives, according to our human-gathered gold reference. As the manual analysis presented in section 9.2.1 has revealed, a good portion of these unessential constituents are to be attributed to PPs without segregation through punctuation, which our heuristics frequently fail to capture as constructs that are to be extracted. Indeed, removing such grammatical constituents based solely on syntactic information has proven to be particularly error-prone, i. e. PPs that do not contain necessary information are kept in the core sentences on a regular basis (see the example illustrated in figure 13.1), while mandatory ones are discarded (see the example depicted in figure 13.2). Thus, in large part, the 4% of false negatives which are on average contained in the output core sentences can be traced back to an incorrect treatment of PPs as well. Consequently, our rules handling the extraction of PPs need to be enriched with further information, e. g. by incorporating semantic knowledge in the form of semantic distance vectors. Beyond that, it has been determined that only about 16% of the input sentences are returned without any change. Accordingly, almost every source sentence is simplified in one way or another. However, this rate is twice as high as the gold reference's mean score, hence providing a further indicator that occasionally there is still room for extracting further syntactic constructs from the original sentences.
outcome produced by our text simplification framework:
original sentence: The Author’s Guild filed a class action suit in a New York City federal court against Google in 2005 over this service.
core sentence: The Author’s Guild filed a class action suit in a New York City federal court against Google in 2005 over this service.
favoured solution:
• core sentence: The Author’s Guild filed a class action suit against Google.
• context sentence: This was in a New York City federal court.
• context sentence: This was in 2005.
• context sentence: This was over this service.

Figure 13.1.: Example of dispensable PPs that are mistakenly kept

outcome produced by our text simplification framework:
original sentence: Mandela’s second wife also came from the Transkei area.
• core sentence: Mandela’s second wife also came.
• context sentence: This was from the Transkei area.
favoured solution:
core sentence: Mandela’s second wife also came from the Transkei area.

Figure 13.2.: Example of an erroneous extraction of a mandatory PP

• Assertion 2: By means of a set of heuristics operating on the syntactic structure of a sentence, every constituent of the source expressing mere secondary information can be transformed into a self-contained context sentence.

This declaration is substantiated by a variety of automatic evaluation measures that have been calculated to estimate the performance of our text simplification framework, encompassing in particular input word coverage, number of output sentences per input, as well as the precision rate (see sections 9.1.1 and 9.1.2). In fact, our system reaches an input word coverage of about 97%, signifying that almost every constituent that is part of the source sentence is enclosed in the output as well. Thus, virtually no component of the input is dropped; each is either maintained in the core sentence or transformed into a dedicated context sentence.
In combination with a fairly high precision of more than 80%, the quasi-full input word coverage suggests that most of the words and phrases that are supposed to be extracted from the core due to representing peripheral information are indeed eliminated from the input. In addition, an average score of 2.66 output sentences per input indicates that approximately 3 grammatical constituents are disembedded from each source sentence and transformed into separate context sentences, each disclosing precisely one specific piece of background information.
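For clarity, the input word coverage measure referred to above can be sketched as follows; the exact tokenization and normalization used in our evaluation are not reproduced here (lower-casing and whitespace splitting are assumptions).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch: input word coverage as the fraction of input tokens that
 *  reappear somewhere in the generated output sentences. */
public class InputWordCoverage {

    public static double coverage(String input, List<String> outputSentences) {
        Set<String> outputTokens = new HashSet<>();
        for (String sentence : outputSentences) {
            outputTokens.addAll(Arrays.asList(sentence.toLowerCase().split("\\s+")));
        }
        String[] inputTokens = input.toLowerCase().split("\\s+");
        int covered = 0;
        for (String token : inputTokens) {
            if (outputTokens.contains(token)) covered++;
        }
        return (double) covered / inputTokens.length;
    }
}
```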
• Assertion 3: When splitting an input sentence by dint of applying a predefined set of syntax-based simplification rules, the resulting core and its associated context sentences jointly reflect the original meaning of the source.

• Assertion 4: With the help of syntactic transformation rules, a complex input sentence can - where appropriate - be broken down into several smaller output sentences, with each of them constituting a proper, grammatical English sentence.

• Assertion 5: Syntax-based grammar rules can be deployed for producing output sentences that present a simplified syntactic structure.

The three assertions described above are rather difficult to assess automatically. Admittedly, average sentence length may serve as a rough indicator of the structural complexity of a sentence, based on the assumption that the longer a sentence, the more potential there is for presenting a complicated syntax. The number of output sentences per input, too, may indicate the level of syntactic intricacy of the outcome, since the more target sentences are generated from a source, the more embedded structures are separated out, resulting in more concise, syntactically simplified sentences. Besides, a high input word coverage may signal a low information loss, as very few - if any - components of the input are abandoned. Hence, with an average sentence length of approximately 10 words (compared to 24 in the input), 2.66 output sentences per input and an input word coverage of 97% (see section 9.1.1), our simplification framework seems to produce a structurally simplified output which captures the entire content of the source. However, these measures must be handled with due care, especially when interpreted in isolation.

Thus, the features of adequacy, grammaticality and simplicity have been examined primarily on the basis of a comprehensive manual investigation (see section 9.2.1). This analysis has disclosed that about 60% of the input sentences are positively resolved with respect to the three criteria mentioned above (according to the classification guidelines listed in table 8.1), while around a fifth of the output is labelled as neutral, denoting a reasonable simplification result that still leaves some room for improvement. Only about 20% of the resulting sentences are to be regarded as negative ones. However, these three aspects have not yet been explored independently of each other. Therefore, it is recommended to prospectively follow the widely recognised evaluation methodology that draws on a small-scale evaluation by human judges rating the output sentences on a 5-point Likert scale with respect to the aforementioned three criteria, with each of them being determined in isolation.
That way, a more well-founded statement on the quality of the output regarding the individual features is facilitated. In addition, a rather high Bleu score of 0.81 on average implies that on the whole the sentences returned by our text simplification framework deviate only slightly from our manually gathered gold standard representing the desired solution, suggesting that in general our system produces output of very high quality.
14. Future Work

As noted before, our text simplification approach aims to improve the performance of a subsequently applied IE system. Concerning this matter, we have demonstrated so far that the output sentences produced by our framework are customarily more succinct than the input text, to the effect that they largely present a simplified syntactic structure, while still preserving the meaning of the original sentence. Hence, with the input split into smaller units that jointly reflect the information it contains, there is less scope for ambiguities when operating on the modified data. Therefore, we assume that sentences which have been simplified by means of our framework in a prior step are likely to be easier to process for IE systems. However, whether or not our text simplification system may assist in enhancing the quality of the outcome returned by an IE framework needs to be verified by future investigations. For this purpose, we envisage comparing the output generated by an IE system with and without a preceding simplification of the input text performed by our framework.

Moreover, the output produced by our text simplification system is currently returned in the form of a sequence of affiliated natural language sentences. In this connection, we aim at specifying a lightweight representation of text to further facilitate the processing and interpretation of the resulting sentences.

Beyond that, as discussed in the previous chapter, there is still scope for improving the quality of the output sentences which are generated by our framework. To achieve a better coverage of dispensable syntactic constituents holding an incidental piece of information, it is necessary to revise some of the heuristics we have specified in this work. These include in particular the rules for handling PPs, since in their current version they frequently eliminate requisite ones, while others that are well disposable are maintained in the compressed core sentences (see the examples displayed in figures 13.1 and 13.2). Furthermore, a rule for extracting selected participial phrases that are not set off by commas should be incorporated (e. g. "Google Inc. is an American multinational technology company specializing in Internet-related services and products."). Here again - as in the case of PPs - a heuristic which is solely syntax-driven is rather prone to making mistakes. Thus, a more sophisticated transformation rule involving further features is required. Moreover, the simplification of direct speech may be addressed by separating out the entity to which a statement is attributed (e. g. "Eric Schmidt said, ’It’s called capitalism’."). In addition, a range of further conjunctions - apart from the most prevalent ones which are taken into account at present - are to be included in our examinations so that a greater number of conjoined clauses may be decomposed (e. g. "The update allows users to ask the search engine a question in natural language rather than entering keywords into the search box."). Besides, as has been pointed out in section 9.2.2, there are several specific syntactic structures which commonly pose difficulties for our text simplification system, encompassing in particular complex coordinated NPs as well as long appositive phrases that enclose a VP. Hence, these constructs need to be carefully reinvestigated.
Finally, a more advanced heuristic for differentiating between appositions and coordinated NPs is to be devised (see section 9.2.2). To sum up, for improving the performance of our text simplification framework, a variety of our hand-crafted transformation rules will have to be refined in the future. It would be well worth an attempt to leave the path of operating purely on syntactic information and to enrich our heuristics with additional features, notably in the form of semantic knowledge.
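As an illustration of the kind of semantic feature envisaged here, the relatedness of a governing verb and the head noun of a PP could be estimated from pre-trained word vectors. The sketch below only computes the underlying cosine similarity; the vectors themselves, and any threshold for deciding whether a PP is dispensable, are assumptions about a possible future design rather than part of the current framework.

```java
/** Sketch: cosine similarity between two word vectors, a possible building
 *  block for semantically informed PP extraction rules. */
public class SemanticDistance {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```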
Bibliography [1] Abraham, R. G. Field Independence-Dependence and the Teaching of Grammar. Teachers of English to Speakers of Other Languages, Inc. (TESOL), 1985. [2] Aluísio, S. M., Specia, L., Pardo, T. A., Maziero, E. G., and Fortes, R. P. Towards brazilian portuguese automatic text simplification systems. In Proceedings of the Eighth ACM Symposium on Document Engineering (New York, NY, USA, 2008), DocEng ’08, ACM, pp. 240–248. [3] Aranzabe, M. J., de Ilarraza, A. D., and Gonzalez-Dios, I. Transforming complex sentences using dependency trees for automatic text simplification in basque. Procesamiento del Lenguaje Natural 50 (2013), 61–68. [4] Baldwin, B. Cogniac: High precision coreference with limited knowledge and linguistic resources. In Proceedings of a Workshop on Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts (Stroudsburg, PA, USA, 1997), ANARESOLUTION ’97, Association for Computational Linguistics, pp. 38–45. [5] Barlacchi, G., and Tonelli, S. Ernesta: A sentence simplification tool for children’s stories in italian. In Proceedings of the 14th International Conference on Computational Linguistics and Intelligent Text Processing - Volume 2 (Berlin, Heidelberg, 2013), CICLing’13, Springer-Verlag, pp. 476–487. [6] Belder, J. D., Deschacht, K., and Moens, M.-F. Lexical simplification. In Proceedings of Itec2010 : 1st International Conference on Interdisciplinary Research on Technology, Education and Communication, Kortrijk, Belgium, 25-27 May 2010 (2010). [7] Belder, J. D., and Moens, M.-F. Text simplification for children. In Proceedings of the SIGIR Workshop on Accessible Search Systems, Geneva, 23 July 2010 (2010), ACM, pp. 19–26. [8] Bernth, A. Easyenglish: A tool for improving document quality. In ANLP (1997), pp. 159–165. [9] Bikel, D. M., Schwartz, R., and Weischedel, R. M. An algorithm that learns what’s in a name. Machine Learning 34, 1-3 (1999), 211–231. [10] Biran, O., Brody, S., and Elhadad, N. Putting it simply: A contextaware approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2 (Stroudsburg, PA, USA, 2011), HLT ’11, Association for Computational Linguistics, pp. 496–501. [11] Bott, S., Saggion, H., and Mille, S. Text simplification tools for spanish. In Proceedings of the Eighth International Conference on Language Resources
Bibliography and Evaluation (LREC’12) (Istanbul, Turkey, 2012), European Language Resources Association (ELRA). [12] Brinton, L. J. The Structure of Modern English: A linguistic introduction. John Benjamins B.V., Amsterdam, The Netherlands, 2000. [13] Brouwers, L., Bernhard, D., Ligozat, A.-L., and François, T. Syntactic sentence simplification for french. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) at EACL 2014 (2014), pp. 47–56. [14] Canning, Y. Syntactic simplification of text. Ph.D. thesis (2002). [15] Carroll, J., Minnen, G., Pearce, D., Canning, Y., Devlin, S., and Tait, J. Simplifying english text for language-impaired readers. In Proceedings of the 9th Conference of the European Chapter of the ACL (1999), EACL’99, pp. 269–270. [16] Cetto, M., Ivanova, D., and Tanveer, M. M. Lucid science: Identification of the main discourse elements in scientific texts, 2016. [17] Chandrasekar, R., Doran, C., and Srinivas, B. Motivations and methods for text simplification. In Proceedings of the 16th Conference on Computational Linguistics - Volume 2 (Stroudsburg, PA, USA, 1996), COLING ’96, Association for Computational Linguistics, pp. 1041–1044. [18] Chung, J.-W., Min, H.-J., Kim, J., and Park, J. C. Enhancing readability of web documents by text augmentation for deaf people. In Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics (New York, NY, USA, 2013), WIMS ’13, ACM, pp. 30:1–30:10. [19] Clarke, J., and Lapata, M. Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (Stroudsburg, PA, USA, 2006), ACL-44, Association for Computational Linguistics, pp. 377–384. [20] Cohn, T., and Lapata, M. Sentence compression as tree transduction. J. Artif. Int. Res. 34, 1 (2009), 637–674. [21] Coster, W., and Kauchak, D. Learning to simplify sentences using wikipedia. In Proceedings of the Workshop on Monolingual Text-To-Text Generation (Stroudsburg, PA, USA, 2011), MTTG ’11, Association for Computational Linguistics, pp. 1–9. [22] Coster, W., and Kauchak, D. Simple english wikipedia: A new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers Volume 2 (Stroudsburg, PA, USA, 2011), HLT ’11, Association for Computational Linguistics, pp. 665–669. [23] Daelemans, W., Höthker, A., and Sang, E. F. T. K. Automatic sentence simplification for subtitling in dutch and english. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC
2004, May 26-28, 2004, Lisbon, Portugal (2004), European Language Resources Association.

[24] Devlin, S., and Tait, J. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic Databases (1998), 161–173.

[25] Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research (San Francisco, CA, USA, 2002), HLT ’02, Morgan Kaufmann Publishers Inc., pp. 138–145.

[26] Dras, M. Tree adjoining grammar and the reluctant paraphrasing of text. Ph.D. thesis (1999).

[27] Dunlavy, D. M., Conroy, J. M., Schlesinger, J. D., Goodman, S. A., Okurowski, M. E., O’Leary, D. P., and van Halteren, H. Performance of a three-stage system for multi-document summarization. In Proceedings of the Document Understanding Conference (2003).

[28] Eisner, J. Learning non-isomorphic tree mappings for machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 2 (Stroudsburg, PA, USA, 2003), ACL ’03, Association for Computational Linguistics, pp. 205–208.

[29] Elhadad, N. Comprehending technical texts: Predicting and defining unfamiliar terms. In AMIA Annual Symposium Proceedings (2006), AMIA’06, pp. 239–243.

[30] Filippova, K., and Strube, M. Dependency tree based sentence compression. In Proceedings of the Fifth International Natural Language Generation Conference (Stroudsburg, PA, USA, 2008), INLG ’08, Association for Computational Linguistics, pp. 25–32.

[31] Filippova, K., and Strube, M. Sentence fusion via dependency graph compression. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (Stroudsburg, PA, USA, 2008), EMNLP ’08, Association for Computational Linguistics, pp. 177–185.

[32] Finkel, J. R., Grenager, T., and Manning, C. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (Stroudsburg, PA, USA, 2005), ACL ’05, Association for Computational Linguistics, pp. 363–370.

[33] Flesch, R. A new readability yardstick. Journal of Applied Psychology 32, 3 (1948), 221–233.

[34] Gagnon, M., and Da Sylva, L. Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI 2006, Québec City, Québec, Canada, June 7-9, 2006. Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, ch. Text Compression by Syntactic Pruning, pp. 312–323.

[35] Gasperin, C., Maziero, E. G., Specia, L., Pardo, T. A., and Aluísio, S. M. Natural language processing for social inclusion: A text simplification
architecture for different literacy levels. In XXXVI Seminário Integrado de Software e Hardware (Bento Gonçalves, Brazil, 2009), SEMISH, pp. 387–401.

[36] Green, S., Cer, D., and Manning, C. Phrasal: A toolkit for new directions in statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation (Baltimore, Maryland, USA, 2014), Association for Computational Linguistics, pp. 114–121.

[37] Grodzinsky, Y. Language deficits and the theory of syntax. Brain and Language 27 (1986), 135–159.

[38] Grosz, B. J., and Sidner, C. L. Attention, intentions, and the structure of discourse. Computational Linguistics 12, 3 (1986), 175–204.

[39] Grosz, B. J., Weinstein, S., and Joshi, A. K. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics 21, 2 (1995), 203–225.

[40] Halliday, M. A., and Hasan, R. Cohesion in English. Longman Group Ltd., London, U.K., 1976.

[41] Heilman, M., and Smith, N. A. Extracting simplified statements for factual question generation. In Proceedings of QG2010: The Third Workshop on Question Generation (2010), pp. 11–20.

[42] Hung, B. T., Minh, N. L., and Shimazu, A. Sentence splitting for Vietnamese-English machine translation. In Proceedings of the 2012 Fourth International Conference on Knowledge and Systems Engineering (Washington, DC, USA, 2012), KSE ’12, IEEE Computer Society, pp. 156–160.

[43] Jonnalagadda, S., and Gonzalez, G. Sentence simplification aids protein-protein interaction extraction. CoRR abs/1001.4273 (2010).

[44] Candido Jr., A., Maziero, E., Gasperin, C., Pardo, T. A. S., Specia, L., and Aluísio, S. M. Supporting the adaptation of texts for poor literacy readers: A text simplification editor for Brazilian Portuguese. In Proceedings of the Fourth Workshop on Innovative Use of NLP for Building Educational Applications (Stroudsburg, PA, USA, 2009), EdAppsNLP ’09, Association for Computational Linguistics, pp. 34–42.

[45] Jurafsky, D., and Martin, J. H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd ed. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2009.

[46] Kamp, H. A Theory of Truth and Semantic Representation. Blackwell Publishers Ltd, 2008, pp. 189–222.

[47] Kincaid, J. P., Fishburne, R. P., Rogers, R. L., and Chissom, B. S. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Tech. rep., 1975.

[48] Klebanov, B. B., Knight, K., and Marcu, D. On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2004, Agia Napa,
Cyprus, October 25-29, 2004. Proceedings, Part I. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004, ch. Text Simplification for Information-Seeking Applications, pp. 735–747.

[49] Klein, D., and Manning, C. D. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1 (Stroudsburg, PA, USA, 2003), ACL ’03, Association for Computational Linguistics, pp. 423–430.

[50] Klerke, S., and Søgaard, A. Simple, readable sub-sentences. In Proceedings of the ACL Student Research Workshop (2013), Association for Computational Linguistics, pp. 142–149.

[51] Knight, K., and Marcu, D. Statistics-based summarization - step one: Sentence compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, July 30 - August 3, 2000, Austin, Texas, USA (2000), pp. 703–710.

[52] Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., and Herbst, E. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (Stroudsburg, PA, USA, 2007), ACL ’07, Association for Computational Linguistics, pp. 177–180.

[53] Kučera, H., and Francis, W. N. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI, USA, 1967.

[54] Lappin, S., and Leass, H. J. An algorithm for pronominal anaphora resolution. Computational Linguistics 20, 4 (1994), 535–561.

[55] Likert, R. A technique for the measurement of attitudes. Archives of Psychology 22, 140 (1932), 1–55.

[56] Lozanova, S., Stoyanova, I., Leseva, S., Koeva, S., and Savtchev, B. Text modification for Bulgarian Sign Language users. In Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations (2013), Association for Computational Linguistics, pp. 39–48.

[57] Mann, W. C., and Taboada, M. RST website. http://www.sfu.ca/rst/, 2005. [Online; last accessed 14-March-2016].

[58] Mann, W. C., and Thompson, S. A. Rhetorical structure theory: Toward a functional theory of text organization. Text 8, 3 (1988), 243–281.

[59] Miller, G. A. WordNet: A lexical database for English. Communications of the ACM 38, 11 (1995), 39–41.

[60] Miwa, M., Sætre, R., Miyao, Y., and Tsujii, J. Entity-focused sentence simplification for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (Stroudsburg, PA, USA, 2010), COLING ’10, Association for Computational Linguistics, pp. 788–796.

[61] Narayan, S., and Gardent, C. Unsupervised sentence simplification using deep semantics. CoRR abs/1507.08452 (2015).
[62] Och, F. J., and Ney, H. Improved statistical alignment models. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (Stroudsburg, PA, USA, 2000), ACL ’00, Association for Computational Linguistics, pp. 440–447.

[63] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (Stroudsburg, PA, USA, 2002), ACL ’02, Association for Computational Linguistics, pp. 311–318.

[64] Perera, P., and Kosseim, L. Evaluating syntactic sentence compression for text summarisation. In Natural Language Processing and Information Systems - 18th International Conference on Applications of Natural Language to Information Systems, NLDB 2013, Salford, UK, June 19-21, 2013. Proceedings (2013), pp. 126–139.

[65] Quinlan, P. The Oxford Psycholinguistic Database. Oxford University Press, 1992.

[66] Quirk, R., Greenbaum, S., Leech, G., and Svartvik, J. A Comprehensive Grammar of the English Language. Longman, London, 1985.

[67] Riezler, S., King, T. H., Crouch, R., and Zaenen, A. Statistical sentence condensation using ambiguity packing and stochastic disambiguation methods for lexical-functional grammar. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (Stroudsburg, PA, USA, 2003), NAACL ’03, Association for Computational Linguistics, pp. 118–125.

[68] Seretan, V. Acquisition of syntactic simplification rules for French. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012 (2012), pp. 4019–4026.

[69] Shardlow, M. A survey of automated text simplification. International Journal of Advanced Computer Science and Applications, Special Issue on Natural Language Processing (2014), 58–70.

[70] Siddharthan, A. An architecture for a text simplification system. In Proceedings of the Language Engineering Conference (LEC’02) (Hyderabad, India, 2002), IEEE Computer Society, pp. 64–71.

[71] Siddharthan, A. Syntactic simplification and text cohesion. Technical Report 597 (2004).

[72] Siddharthan, A. Complex lexico-syntactic reformulation of sentences using typed dependency representations. In Proceedings of the 6th International Natural Language Generation Conference (Stroudsburg, PA, USA, 2010), INLG ’10, Association for Computational Linguistics, pp. 125–133.

[73] Siddharthan, A. Text simplification using typed dependencies: A comparison of the robustness of different generation strategies. In Proceedings of the 13th European Workshop on Natural Language Generation (Stroudsburg, PA, USA, 2011), ENLG ’11, Association for Computational Linguistics, pp. 2–11.

[74] Siddharthan, A. A survey of research on text simplification. ITL - International Journal of Applied Linguistics 165, 2 (2014), 259–298.
[75] Siddharthan, A., and Mandya, A. Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, April 26-30, 2014, Gothenburg, Sweden (2014), pp. 722–731.

[76] Smith, C., and Jönsson, A. Automatic summarization as means of simplifying texts, an evaluation for Swedish. In Proceedings of the 18th Nordic Conference of Computational Linguistics (NoDaLiDa-2010) (2011).

[77] Smith, D. A., and Eisner, J. Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proceedings of the Workshop on Statistical Machine Translation (Stroudsburg, PA, USA, 2006), StatMT ’06, Association for Computational Linguistics, pp. 23–30.

[78] Specia, L. Computational Processing of the Portuguese Language: 9th International Conference, PROPOR 2010, Porto Alegre, RS, Brazil, April 27-30, 2010. Proceedings. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, ch. Translating from Complex to Simplified Sentences, pp. 30–39.

[79] Taboada, M., and Mann, W. C. Rhetorical structure theory: Looking back and moving ahead. Discourse Studies 8, 3 (2006), 423–459.

[80] Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (Stroudsburg, PA, USA, 2003), NAACL ’03, Association for Computational Linguistics, pp. 173–180.

[81] Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research 6 (2005), 1453–1484.

[82] Vickrey, D., and Koller, D. Sentence simplification for semantic role labeling. In ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA (2008), pp. 344–352.

[83] Viterbi, A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13, 2 (1967), 260–269.

[84] Watanabe, W. M., Junior, A. C., Uzêda, V. R., de Mattos Fortes, R. P., Pardo, T. A. S., and Aluísio, S. M. Facilita: Reading assistance for low-literacy readers. In Proceedings of the 27th ACM International Conference on Design of Communication (New York, NY, USA, 2009), SIGDOC ’09, ACM, pp. 29–36.

[85] Woodsend, K., and Lapata, M. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (Stroudsburg, PA, USA, 2011), EMNLP ’11, Association for Computational Linguistics, pp. 409–420.
[86] Wubben, S., van den Bosch, A., and Krahmer, E. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (Stroudsburg, PA, USA, 2012), ACL ’12, Association for Computational Linguistics, pp. 1015–1024.

[87] Xu, W., Callison-Burch, C., and Napoles, C. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics 3 (2015), 283–297.

[88] Zajic, D., Dorr, B. J., Lin, J., and Schwartz, R. Multi-candidate reduction: Sentence compression as a tool for document summarization tasks. Information Processing and Management 43, 6 (2007), 1549–1570.

[89] Zeng-Treitler, Q., Goryachev, S., Kim, H., Keselman, A., and Rosendale, D. Making texts in electronic health records comprehensible to consumers: A prototype translator. In AMIA 2007, American Medical Informatics Association Annual Symposium, Chicago, IL, USA, November 10-14, 2007 (2007), AMIA.

[90] Zeng-Treitler, Q., Goryachev, S., Tse, T., Keselman, A., and Boxwala, A. A. Estimating consumer familiarity with health terminology: A context-based approach. JAMIA 15, 3 (2008), 349–356.

[91] Zhu, Z., Bernhard, D., and Gurevych, I. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (Stroudsburg, PA, USA, 2010), COLING ’10, Association for Computational Linguistics, pp. 1353–1361.
Statutory Declaration (Eidesstattliche Erklärung)

I hereby declare that I have written this thesis independently and have used no sources or aids other than those indicated. This work has not previously been submitted in the same or a similar form to any other examination authority.

Passau, March 30, 2016
Christina Niklaus