LEARNING PARAPHRASES FROM TEXT

by

Rahul Bhagat

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2009

Copyright 2009

Rahul Bhagat

Dedication

To My Parents...


Acknowledgements

My dissertation work has benefitted greatly from the help, support, and advice of my colleagues, friends, and family. I am indebted to my advisor, Ed Hovy, for his valuable guidance and support throughout my time at the University of Southern California. Ed convinced me to get a PhD, back when I had no intentions of getting one. Throughout my research, he helped me maintain my focus and provided honest advice and criticism. Many thanks go also to the other members of my committee, especially Patrick Pantel. Patrick taught me the basics of writing good papers and has consistently guided me in my research. The other members of my committee—Jerry Hobbs, Kevin Knight, Dennis McLeod, and Daniel O'Leary—have also provided valuable feedback. I am also grateful to Deepak Ravichandran. Deepak taught me the value of out-of-the-box thinking and has provided constant guidance and help in my research. I was fortunate to have a wonderful officemate in Donghui Feng who was always willing to discuss research ideas with me. I have also benefitted a lot from discussions with Dekang Lin, Marius Pasca, and Ellen Riloff. All these discussions have helped me improve various aspects of this dissertation for which I am thankful.


I want to thank my colleagues William Chang, Dirk Hovy, Jon May, Sujith Ravi, and Jason Riesa for helping me in doing important (and boring) annotations. I have also enjoyed interacting with my other colleagues at USC and elsewhere including but not limited to Jafar Adibi, Jose Luis Ambite, Erika Barragan-Nunez, Gully Burns, Congxing Cai, David Chiang, Yao-Yi Chiang, Tim Chklovski, Bonaventura Coppola, Hal Daume III, Steve Deneefe, Teresa Dey, Mike Fleischman, Victoria Fossum, Alex Fraser, Sudeep Gandhe, Paul Groth, Tommy Ingulfsen, Ulf Hermjakob, Gunjan Kakani, Soo-min Kim, Zori Kozareva, Kathy Kurinsky, Namhee Kwon, Brent Lance, Jina Lee, Sean Lee, Kristina Lerman, Chin-Yew Lin, Shou-de Lin, Stacy Marsella, Chirag Merchant, Dragos Munteanu, Tom Murray, Anish Nair, Oana Nicolov, Doug Oard, Feng Pan, Siddharth Patwardhan, Marco Pennacchiotti, Fernando Pereira, Andrew Philpot, David Pynadath, Delip Rao, Nishit Rathod, Marta Recasens-Potau, Joe Resinger, Tom Russ, Mei Sei, Mark Shirley, Partha Talukdar, David Traum, Benjamin Van Durme, Ashish Vaswani, Jens-Soenke Voeckler, and Vishnu Vyas. I also want to thank the “Venice beach happy hour crowd” for the fun times. I am blessed to have been surrounded by some great friends from my early days as a student in the US. I want to thank Amar Athavale, Arushi Bhargava, Vineet Bhargava, Siddharth Bhavsar, Nirav Desai, Nitin Dhavale, Deepa Jain, Amit Joshi, Neha Kansal, Kiran Meduri, Mingle Mehta, Prasanth Nittala, Aditya Pandharpurkar, Jigish Patel, Parikshit Pol, Sagar Shah, Rahul Srivastava, and Jigesh Vora.


I am eternally indebted to my parents and my late grandparents for their unconditional affection, understanding, and support in all my endeavors. Their backing has been a source of great strength for me. I want to thank my brother, Sarang, whose affection, friendship, and advice I have always valued greatly. I want to thank Neha for her love, understanding, and for always believing in me.


Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract

Chapter 1: Introduction
  1.1 Motivation
  1.2 Goal and Approach
  1.3 Major Contributions
  1.4 Thesis Outline

Chapter 2: Related Work
  2.1 Introduction
  2.2 Linguistic Theories of Paraphrases
  2.3 Learning Paraphrases
    2.3.1 Learning Paraphrases using Multiple Translations
    2.3.2 Learning Paraphrases using Parallel Bilingual Corpora
    2.3.3 Learning Paraphrases using Comparable Corpora
    2.3.4 Learning Paraphrases using Monolingual Corpora
  2.4 Learning Selectional Preferences and Directionality
    2.4.1 Learning Selectional Preferences
    2.4.2 Learning Directionality
  2.5 Applications
    2.5.1 Paraphrases for Learning Extraction Patterns
    2.5.2 Paraphrases for Learning Patterns for Open-Domain Relation Extraction
    2.5.3 Paraphrases for Learning Patterns for Domain-Specific Information Extraction

Chapter 3: Paraphrases
  3.1 Introduction
  3.2 Paraphrasing Phenomenon Explained
    3.2.1 Lexical Perspective
    3.2.2 Structural Perspective
  3.3 Analysis of Paraphrases
    3.3.1 Distribution of Lexical Changes
    3.3.2 Human Judgement of Lexical Changes
    3.3.3 Sentence-level Distribution of Structural Changes
    3.3.4 Results
  3.4 Conclusion

Chapter 4: Inferential Selectional Preferences
  4.1 Introduction
  4.2 Selectional Preference Models
    4.2.1 Relational Selectional Preferences
      4.2.1.1 Joint Relational Model (JRM)
      4.2.1.2 Independent Relational Model (IRM)
    4.2.2 Inferential Selectional Preferences
      4.2.2.1 Joint Inferential Model (JIM)
      4.2.2.2 Independent Inferential Model (IIM)
    4.2.3 Filtering Inferences
  4.3 Experimental Methodology
    4.3.1 Quasi-paraphrase rules
    4.3.2 Semantic Classes
    4.3.3 Evaluation Criteria
  4.4 Experimental Results
    4.4.1 Experimental Setup
      4.4.1.1 Model Implementation
      4.4.1.2 Gold Standard Construction
      4.4.1.3 Baselines
    4.4.2 Filtering Quality
      4.4.2.1 Performance and Error Analysis
  4.5 Conclusion

Chapter 5: Learning Directionality
  5.1 Introduction
  5.2 Learning Directionality of Quasi-paraphrase Rules
    5.2.1 Underlying Assumption
    5.2.2 Selectional Preferences
      5.2.2.1 Joint Relational Model (JRM)
      5.2.2.2 Independent Relational Model (IRM)
    5.2.3 Plausibility and Directionality Model
  5.3 Experimental Setup
    5.3.1 Quasi-paraphrase Rules
    5.3.2 Semantic Classes
    5.3.3 Implementation
    5.3.4 Gold Standard Construction
    5.3.5 Baselines
  5.4 Experimental Results
    5.4.1 Evaluation Criterion
    5.4.2 Result Summary
    5.4.3 Performance and Error Analysis
  5.5 Conclusion

Chapter 6: Learning Semantic Classes
  6.1 Introduction
  6.2 Incorporating Constraints
  6.3 Word Similarity and Algorithm
    6.3.1 Word Similarity
    6.3.2 Algorithm
  6.4 Experiments
    6.4.1 Evaluation Criterion
    6.4.2 Data and Methodology
  6.5 Results and Discussion
    6.5.1 Results
    6.5.2 Discussion
  6.6 Conclusion

Chapter 7: Paraphrases for Learning Surface Patterns
  7.1 Introduction
  7.2 Acquiring Paraphrases
    7.2.1 Distributional Similarity
    7.2.2 Paraphrase Generation Model
    7.2.3 Locality Sensitive Hashing
  7.3 Learning Surface Patterns
    7.3.1 Surface Patterns Model
  7.4 Experimental Methodology
    7.4.1 Paraphrases
    7.4.2 Surface Patterns
    7.4.3 Relation Extraction
  7.5 Experimental Results
    7.5.1 Baselines
    7.5.2 Evaluation Criteria
      7.5.2.1 Paraphrases
      7.5.2.2 Surface Patterns
      7.5.2.3 Relation Extraction
    7.5.3 Gold Standard
      7.5.3.1 Paraphrases
      7.5.3.2 Surface Patterns
      7.5.3.3 Relation Extraction
    7.5.4 Result Summary
    7.5.5 Discussion and Error Analysis
  7.6 Conclusion

Chapter 8: Paraphrases for Domain-Specific Information Extraction
  8.1 Introduction
  8.2 Learning Broad-Coverage Paraphrase Patterns
    8.2.1 Learning Surface-Level Paraphrase Patterns
    8.2.2 Learning Lexico-Syntactic Paraphrase Patterns by Conversion
    8.2.3 Learning Lexico-Syntactic Paraphrase Patterns Directly
  8.3 Experimental Methodology
    8.3.1 Paraphrase Patterns
    8.3.2 Domains
    8.3.3 Evaluation
  8.4 Result 1 and Discussion
    8.4.1 Comparison of Broad-Coverage Paraphrase Patterns
    8.4.2 Discussion and Error Analysis
  8.5 Result 2 and Discussion
    8.5.1 Comparison of Broad-Coverage and Domain-Specific Patterns
    8.5.2 Discussion and Error Analysis
  8.6 Conclusion

Chapter 9: Conclusion and Future Work
  9.1 Contributions
  9.2 Future Work
    9.2.1 Inferential Selectional Preferences using Tailored Classes
    9.2.2 Knowledge Acquisition
    9.2.3 Paraphrases for Machine Translation
  9.3 Conclusion

Bibliography

Appendix: Example Paraphrases
  1 Introduction
  2 List of Paraphrases

List Of Tables

3.1 Distribution of lexical changes in MTC paraphrase set
3.2 Distribution of lexical changes in MSR paraphrase set
3.3 Human judgement of lexical changes
3.4 Distribution of structural changes in MTC paraphrase set
3.5 Distribution of structural changes in MSR paraphrase set
4.1 Confusion matrix
4.2 Filtering quality of best performing systems according to the evaluation criteria defined in Section 4.3.3 on the TEST set
4.3 Confusion matrix for ISP.IIM.∨ — best accuracy
4.4 Confusion matrix for ISP.JIM — best 90%-Specificity
5.1 Summary of results on the test set
6.1 Classes from S2500 set
6.2 Example clusters with the corresponding largest intersecting VerbNet classes
7.1 Quality of paraphrases
7.2 Example paraphrases
7.3 Quality of extraction patterns
7.4 Example extraction patterns
7.5 Quality of instances
7.6 Example instances
8.1 List of pattern templates

List Of Figures

4.1 ROC curves for our systems on TEST
4.2 ISP.IIM.∨ (best system's) performance variation over different values of the τ threshold
5.1 Confusion matrix for the best performing system, IRM using CBC with α = 0.15 and β = 3
5.2 Accuracy variation for IRM with different values of α and β
5.3 Accuracy variation in predicting correct versus incorrect quasi-paraphrase rules for different values of α
5.4 Accuracy variation in predicting directionality of correct quasi-paraphrase rules for different values of β
6.1 Example classes
6.2 Example constraints
6.3 HMRF model
6.4 EM algorithm
6.5 Learning curve for S2500 set
6.6 Learning curve for S250 set
8.1 Paraphrase patterns in terrorism domain
8.2 Paraphrase patterns in disease-outbreaks domain
8.3 Paraphrase patterns in corporate-acquisitions domain
8.4 Paraphrase-based vs. traditional IE systems in terrorism domain
8.5 Paraphrase-based vs. traditional IE systems in disease-outbreaks domain
8.6 Paraphrase-based vs. traditional IE systems in corporate-acquisitions domain

Abstract

Paraphrases are textual expressions that convey the same meaning using different surface forms. Capturing the variability of language, they play an important role in many natural language applications including question answering, machine translation, and multi-document summarization. In linguistics, paraphrases are characterized by approximate conceptual equivalence. Since no automated semantic interpretation systems available today can identify conceptual equivalence, paraphrases are difficult to acquire without human effort.

The aim of this thesis is to develop methods for automatically acquiring and filtering phrase-level paraphrases using a monolingual corpus. Noting that the real world uses far more quasi-paraphrases than logically equivalent ones, we first present a general typology of quasi-paraphrases together with their relative frequencies; to our knowledge, this is the first such typology. We then present a method for automatically learning the contexts in which quasi-paraphrases obtained from a corpus are mutually replaceable. For this purpose, we use Relational Selectional Preferences (RSPs) that specify the selectional preferences of the syntactic arguments of phrases (usually verbs or verb phrases). From the RSPs of individual phrases, we learn Inferential Selectional Preferences (ISPs), which specify the selectional preferences of a pair of quasi-paraphrases. We then apply the learned ISPs to the task of filtering incorrect inferences. We achieve an accuracy of 59% for this task, which is a statistically significant improvement over several baselines.

Knowing that quasi-paraphrases are often inexact because they contain semantic implications which can be directional, we present an algorithm called LEDIR to learn the directionality of quasi-paraphrases using the (syntactic argument based) RSPs for phrases. Learning directionality allows us to differentiate the strong (bidirectional) from the weak (unidirectional) paraphrases. We show that the directionality of the quasi-paraphrases can be learned with 48% accuracy. This is again a significant improvement over several baselines.

In learning the context and directionality of quasi-paraphrases, we have encountered the need for semantic concepts: both RSPs and ISPs are defined in terms of semantic concepts. For learning these semantic concepts from text, we use a semi-supervised clustering algorithm, HMRF-KMeans. We show that compared to the commonly used unsupervised clustering approach, this algorithm performs much better. Applying the semi-supervised clustering algorithm to the task of discovering verb classes, we obtain precision scores of 54% and 37% and corresponding recall scores of 53% and 38% for our two test sets. These are large improvements over the baseline.


We next investigate the task of learning surface paraphrases, i.e., paraphrases that do not require the use of a syntactic interpretation. Since one would need a very large corpus to find enough surface variations, we start with a very large but unprocessed corpus of 150GB (25 billion words) obtained from Google News. We rely only on distributional similarity to learn paraphrases from this corpus. To scale paraphrase acquisition to this large corpus, we apply only simple POS tagging and randomized algorithms. We build a paraphrase resource containing more than 2.5 million phrases; in the resource, 71% of the quasi-paraphrases are correct.

Having learned the surface paraphrases, we investigate their utility for the task of relation extraction. We show that these paraphrases can be used to learn surface patterns for relation extraction. The extraction patterns obtained by using the paraphrases are not only more precise (more than 80% precision for both our test relations), but also have higher relative recall compared to a state-of-the-art baseline. This method also delivers more extraction patterns than the baseline. Applying the learned extraction patterns to the task of extracting relation instances from a test corpus, our system loses some relative recall compared to the baseline, but achieves much higher precision (more than 85% precision for both our test relations).

Finally, we use paraphrases to learn patterns for domain-specific information extraction (IE). Since the paraphrases are learned from a large broad-coverage corpus, our patterns are domain-independent, making the task of moving to new domains very easy. We empirically show that patterns learned using (broad-coverage corpus based) paraphrases are comparable in performance to several state-of-the-art domain-specific IE engines.

Thus, in this thesis we define quasi-paraphrases, present methods to learn them from a corpus, and show that quasi-paraphrases are useful for information extraction.


Chapter 1

Introduction

1.1

Motivation

Variability is a common phenomenon in language: The meaning conveyed by a sentence or phrase can be expressed in several different ways. For example, the pair of sentences (1) and (2): The shuttle Discovery landed in Florida on Saturday.

(1)

The shuttle Discovery touched down in Florida on Saturday.

(2)

or the phrases (3) and (4): X landed in Y

(3)

X touched down in Y

(4)


respectively express the same meaning. Such semantically equivalent sentences and phrases are called paraphrases. Formally, Webster's dictionary defines a "paraphrase" as "a restatement of a text, passage, or work giving the meaning in another form". WordNet (Fellbaum, 1998) defines a "paraphrase" as "rewording for the purpose of clarification". In linguistics, De Beaugrande and Dressler (1981) define paraphrases as "Approximate conceptual equivalence among outwardly different material". In general, paraphrases are simply expressions that communicate the same meaning using different words. The ability to paraphrase gives language the flexibility of articulation; it is hard to imagine a language without this facility: it would become monotonous.

Since paraphrasing is such a common phenomenon, every Natural Language Processing (NLP) application has to deal with paraphrases. But automatically capturing the semantic phenomenon of "approximate conceptual equivalence" that defines paraphrases is hard. Hence, most state-of-the-art systems choose not to deal with paraphrases explicitly. However, several researchers have attempted to harness the power of paraphrases to improve various NLP applications and have shown promising results. For example, in question answering, paraphrases have been used to find multiple patterns that pinpoint the same answer (Ravichandran & Hovy, 2002); in statistical machine translation, they have been used to find translations for unseen source language phrases (Callison-Burch et al., 2006); in multi-document summarization, they have been used to identify phrases from different sentences that express the same information (Barzilay et al., 1999); in information retrieval, they have been used for query expansion (Anick & Tipirneni, 1999).

With such a wide range of applications, the knowledge of what constitutes paraphrases and how they can be learned automatically is important. This is the motivation for the work we undertake in this thesis. The thesis presents a set of methods to automatically learn paraphrases from text.

1.2

Goal and Approach

This thesis aims to answer the following main question: How can we learn paraphrases from a monolingual corpus?

Finding a paraphrase requires one to ensure preservation of core meaning (semantics). Since there are no adequate semantic interpretation systems available today, paraphrase acquisition techniques use some other mechanism as a kind of pivot to indirectly (help) ensure semantic equivalence. Each pivot mechanism selects phrases with like meaning in a different characteristic way. One method uses a bilingual dictionary or translation table as the pivot mechanism: all source language words or phrases that translate to a given foreign word/phrase are deemed to be paraphrases of one another (Bannard & Callison-Burch, 2005). Another popular method uses the instantiations of syntactic arguments of paths in syntax trees (context) as pivots for learning paraphrases: syntactic paths that have overlapping contexts are considered paraphrases of each other (Lin & Pantel, 2001). Of the two methods, since it needs no resource other than a large monolingual corpus, the second method is currently easier to use with available data. Hence, in this thesis, we use the basic principle behind this method, the so-called distributional hypothesis (Harris, 1954), to learn phrase-level paraphrases. The distributional hypothesis states that "Two words that appear in similar contexts have similar meanings". We extend this idea to address several issues involved in learning paraphrases.

This thesis takes the view that perfect synonymy is hard to achieve. With the exception of paraphrases that are obtained by direct syntactic transformations, like active voice to passive voice, phrase or sentence variations are likely to have slightly different shades of meaning. In fact, a large number of paraphrases used in the real world are quasi-paraphrases. For example, consider the sentence pair (5) and (6): US journalist Daniel Pearl was killed by Al-Qaeda.

(5)

US journalist Daniel Pearl was beheaded by Al-Qaeda. (6) Sentences (5) and (6) are not equivalent in the logical sense: “killed” and “beheaded” are not synonymous. However, for all practical purposes, they can be considered paraphrases or quasi-paraphrases.


Another observation is that paraphrases or quasi-paraphrases are not mutually replaceable in all contexts, i.e., a pair of expressions may be considered paraphrases in certain contexts but not in others. For example, consider the phrases: X was killed by Y

(7)

X was beheaded by Y

(8)

These phrases are not paraphrases in all contexts, but can be considered quasi-paraphrases when X is a "person" and Y is a "terrorist" or "terrorist organization" as in (7) and (8). However, given the sentence:

The bill was killed by the Senate Health Care Strategies Committee.

it cannot be plausibly paraphrased as:

The bill was beheaded by the Senate Health Care Strategies Committee.

This thesis deals with quasi-paraphrases, of which perfectly synonymous phrases are a small subset. In the context of this thesis, the term "paraphrases" (even without the prefix "quasi") means "quasi-paraphrases". We define quasi-paraphrases in detail in Chapter 3.
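To make the notion of context-dependent replaceability concrete, the sketch below (in Python) checks a candidate inference against the argument classes for which a quasi-paraphrase rule is assumed to hold. It is only an illustration: the tiny class lexicon, the rule format, and the class names are invented here; the actual ISP models are defined in Chapter 4.

# Minimal sketch: filtering a quasi-paraphrase rule by the semantic
# classes of its arguments. The class lexicon and the rule format are
# illustrative only; the real ISP models appear in Chapter 4.

# Hypothetical mapping from argument fillers to coarse semantic classes.
SEMANTIC_CLASS = {
    "Daniel Pearl": "person",
    "Al-Qaeda": "terrorist organization",
    "bill": "legislation",
    "Senate Health Care Strategies Committee": "organization",
}

# A quasi-paraphrase rule annotated with the argument classes in which it holds.
RULE = {
    "lhs": "X was killed by Y",
    "rhs": "X was beheaded by Y",
    "x_classes": {"person"},
    "y_classes": {"terrorist", "terrorist organization"},
}

def inference_is_plausible(rule, x_filler, y_filler):
    """Accept the inference only if both argument fillers belong to
    classes for which the rule is known to preserve meaning."""
    x_ok = SEMANTIC_CLASS.get(x_filler) in rule["x_classes"]
    y_ok = SEMANTIC_CLASS.get(y_filler) in rule["y_classes"]
    return x_ok and y_ok

print(inference_is_plausible(RULE, "Daniel Pearl", "Al-Qaeda"))  # True
print(inference_is_plausible(RULE, "bill",
                             "Senate Health Care Strategies Committee"))  # False

The rule is accepted for the (person, terrorist organization) instance but rejected for the (legislation, organization) instance, mirroring the "bill" example above.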


An important consideration when learning paraphrases is their granularity. As is clear from their definition, paraphrases occur at different levels in text, i.e., at the discourse level, sentence level, and phrase level. In this thesis, we focus only on phrase-level paraphrases [1], i.e., paraphrases of the form (3) and (4) or (7) and (8).

To undertake the work in this thesis given the above background, we first define quasi-paraphrases. We then learn the contexts in which a pair of quasi-paraphrases are mutually replaceable. To do this, we present a set of methods called Inferential Selectional Preferences (ISPs) (Pantel et al., 2007). We then develop an algorithm called LEDIR (Bhagat et al., 2007) that uses the contextual information of phrases to learn the directionality of quasi-paraphrases. We use this information to differentiate strong from weak paraphrases. In accomplishing these two tasks, we find that semantic classes play an important role: the nature of the semantic classes affects the effectiveness of both ISP and LEDIR. Hence, we suggest using a state-of-the-art semi-supervised algorithm to obtain semantic classes and show that it finds better semantic classes to suit the user preference than those obtained by a commonly used unsupervised algorithm. Next, we work on obtaining surface paraphrases. We show that we can obtain high-precision surface paraphrases using distributional similarity and show that we can use them to learn high-precision patterns for relation extraction (Bhagat & Ravichandran, 2008). Finally, we show that paraphrases can be used to learn patterns for domain-specific information extraction without relying on a domain-specific corpus (Bhagat et al., 2009).

[1] The word "phrase" here only means a sequence of words, i.e., n-grams.

1.3

Major Contributions

The major contributions of this thesis are:

• We define a typology of quasi-paraphrases along with their relative frequencies.

• We develop a method for learning the contexts in which a pair of quasi-paraphrases are mutually replaceable.

• We present a method for learning the directionality of quasi-paraphrases to distinguish the strong from the weak paraphrases.

• We show that high-quality surface-level paraphrases can be learned from a large monolingual corpus using distributional similarity. These paraphrases are then shown to be useful for large-scale relation extraction.

• We show that broad-coverage paraphrases can be used to learn patterns for domain-specific information extraction without the help of a domain-specific corpus.


1.4

Thesis Outline

The remainder of the thesis is organized as follows:

• Chapter 2 discusses the previous work that has been done in the field of paraphrase learning and in related areas.

• Chapter 3 defines quasi-paraphrases in detail.

• Chapter 4 discusses the Inferential Selectional Preferences that define the contexts in which quasi-paraphrases are mutually replaceable and shows that we can use this information to filter out incorrect inferences.

• Chapter 5 presents our algorithm for learning the directionality of quasi-paraphrases.

• Chapter 6 shows the effectiveness of using an off-the-shelf state-of-the-art semi-supervised algorithm for learning semantic classes and suggests it as a method for tailoring the semantic classes to suit the user's preferences.

• Chapter 7 shows that we can use distributional similarity to learn surface-level paraphrases from a large corpus and further shows that we can use these surface paraphrases to learn high-precision surface patterns for relation extraction.

• Chapter 8 presents our work on using paraphrases learned from a large broad-coverage corpus for the task of domain-specific information extraction.

• Chapter 9 presents the concluding remarks and outlines several directions for future work.


Chapter 2

Related Work

2.1

Introduction

In this chapter we present and discuss the previous work done on paraphrases and on learning them from text. We broadly classify the previous work into four main categories:

1. Linguistic Theories of Paraphrases
2. Learning Paraphrases
3. Learning Selectional Preferences and Directionality
4. Applications

We discuss all these categories in detail below, and point out how the previous approaches differ from ours.


2.2

Linguistic Theories of Paraphrases

Over the years, the phenomenon of paraphrasing has generated significant interest among linguists. One influential theory that defines paraphrases is found in Transformational Grammar (Chomsky, 1957; Harris, 1981). Transformational Grammar decomposes complex sentences into simple sentences and defines operations on these sentences (transformations). These transformations are meaning preserving and thus generate paraphrases. For example, (1) below is a transformational rule that converts a sentence from active to passive form as in (2). N1 t V N2 ←→ N2 t be Ven by N1

(1)

He saw the man. = The man was seen by him.

(2)

Harris (1981) lists a set of 20 transformational rules for English. He states that the set of these transformational rules is not exhaustive, but is sufficient for practical purposes. The main drawback of transformation-rule-based paraphrasing is that it treats paraphrasing as a purely syntactic phenomenon and ignores the lexical nature of paraphrases. However, a large number of paraphrases are lexical in nature (for example, those generated by replacing a word in a sentence by its synonyms), and many transformational rules need to be constrained by the lexicon. For example, sentences containing verbs like lack become odd when converted to passive: "The team lacks talent." vs. "Talent is lacked by the team.".
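To illustrate how such a transformation can be mechanized, the following sketch applies the active-to-passive rule in (1) to hand-built (subject, verb, object) triples. The small participle lexicon and the exclusion list that blocks verbs like lack are invented for illustration; this is only a toy rendering of the rule, not Harris's formulation.

# Minimal sketch of the active-to-passive transformation in rule (1):
# N1 t V N2  <-->  N2 t be Ven by N1, applied to hand-built triples.
# The participle lexicon and the exclusion list are illustrative only.

PAST_PARTICIPLE = {"saw": "seen", "bought": "bought", "lacks": "lacked"}

# Verbs for which the transformation yields odd sentences and must be
# blocked by a lexical constraint (e.g., "lack").
NON_PASSIVIZABLE = {"lacks"}

def passivize(subject, verb, obj):
    """Return the passive paraphrase of 'subject verb obj', or None if
    the verb resists passivization."""
    if verb in NON_PASSIVIZABLE or verb not in PAST_PARTICIPLE:
        return None
    return f"{obj} was {PAST_PARTICIPLE[verb]} by {subject}"

print(passivize("Google", "bought", "YouTube"))  # YouTube was bought by Google
print(passivize("The team", "lacks", "talent"))  # None: blocked by the constraint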


A different perspective on paraphrases is provided by the Meaning Text Theory (MTT) (Mel'cuk, 1996). MTT outlines a seven-stratum (level) model of natural language structure, with strata ranging from the surface-phonetic to the semantic representation level. Central to MTT are lexical functions (LFs), which express the well-established (institutionalized) relations between lexical units in a language. For example, the LF Magn(X) — "to a high degree", "intense" — is the "intensifier" for X as in (3) and (4). Magn(to laugh) = heartily

(3)

Magn(patience) = infinite

(4)

MTT defines 64 LFs that operate at its deep-syntactic level. Using these, Mel'cuk (to appear) defined a set of 67 lexical-paraphrasing rules (again at the deep-syntactic level). To explain the structural changes taking place while using the lexical-paraphrasing rules, MTT defines 38 restructuring-paraphrasing rules (at the deep-syntactic level). While MTT gives a detailed account of paraphrasing, its utility is marred by several factors. Firstly, many LFs are vague and underspecified, making it hard to model them. Secondly, the complexity of the model due to its seven levels and the definition of paraphrases at the theory-specific deep-syntactic level has made it difficult for the vast majority of NLP researchers, who work outside the context of MTT, to even use this definition in any meaningful way. Thirdly, the Explanatory Combinatorial Dictionary, which is the resource that would contain detailed descriptions of LFs and the list of LF values for a large set of words, is unavailable for most languages.

In contrast to the above-mentioned theories, Honeck (1971) takes a very high-level view of paraphrases. He divides paraphrases into: transformational, where the surface structure of the base phrase or sentence is changed but the content words are unchanged; lexical, where the surface structure of the base phrase or sentence remains the same but synonyms are substituted for lexical items; and formalexic, where the surface structure as well as the content words of the base phrase or sentence are changed. For example, given the base sentence (5), sentences (6), (7), and (8) are its transformational, lexical, and formalexic paraphrases respectively. The fight evoked the emotions that perplexed the boy that wept.

(5)

The boy that wept was perplexed by the emotions that the fight evoked. (6) The struggle elicited the feelings that puzzled the lad that cried.

(7)

The lad that cried was puzzled by the feelings the struggle elicited.

(8)

While the Honeck (1971) theory explains a vast majority of paraphrases, it is too general to be modeled or even to be used as a general guideline for distinguishing paraphrases from non-paraphrases. Also, it limits paraphrasing to the notion of synonymy, which goes against the broad notion of paraphrases — paraphrases are not just synonyms — as put forth by many linguists (De Beaugrande & Dressler, 1981; Clark, 1992; Mel'cuk, to appear).

Barzilay (2003) also takes a generic, high-level view of paraphrases and classifies them as: atomic, i.e., paraphrases between small non-decomposable lexical units such as words and small phrases; and compositional, i.e., paraphrases between constructs that can be decomposed into smaller units, such as sentences and complex phrases. The atomic paraphrases are further classified based on the lengths of the two phrases participating in the paraphrase relation, i.e., [1:1], [1:2], [2:2], and others. The compositional paraphrases are further classified based on the basic high-level changes in the sentence, i.e., deletions, permutations, noun phrase transformation, active-passive transformation, and lexical changes. While the Barzilay (2003) theory is generic enough to explain almost all possible paraphrases, the operations defined by it are too general for practical applications. Without further elaboration, these operations can neither be modeled nor be used as guidelines to distinguish paraphrases from non-paraphrases.

Thus overall, the various theories of paraphrases to date either define paraphrasing as an operation in a deep linguistic framework or take a very abstract view of paraphrases. The former are hard to understand and can be used only if one is working with the specified deep linguistic representations, while the latter are easy to understand, but hard to use for any practical purposes. In this thesis, we take the middle ground: We explain paraphrasing using a set of commonly known phenomena. These phenomena are general enough so that a majority of (quasi-)paraphrases can be explained using them, and are specific enough so that they can be used to distinguish quasi-paraphrases from non-paraphrases. This makes the definition of quasi-paraphrases (sufficiently) objective and should make it easy for NLP researchers to understand and use the definition to study paraphrasing empirically.

2.3

Learning Paraphrases

Most recent work in paraphrase acquisition is based on automatic learning. In this section, we discuss the major approaches to learning paraphrases from text corpora.

2.3.1 Learning Paraphrases using Multiple Translations

The idea that multiple translations of the same foreign language texts can be used to learn paraphrases was first presented by Barzilay and McKeown (2001). The intuition behind this approach is that different translators are likely to use different words to translate the same foreign language sentence. These translations are equivalent and thus their corresponding sub-parts are also equivalent, i.e., paraphrases. Barzilay and McKeown (2001) obtained multiple translations of five classic novels. They then aligned the equivalent sentences from the different translations by using the Gale and Church (1991) method. From the aligned sentences, they learned phrase-level paraphrases by using a co-training algorithm that uses the contexts of phrases as features. For example, given the aligned sentences (9) and (10), they learned two pairs of paraphrases: ("burst into tears", "cried") and ("comfort", "console").

Emma burst into tears and he tried to comfort her, saying things to make her smile. (9)

Emma cried, and he tried to console her, adorning his words with puns. (10)

Pang et al. (2003) also used multiple translations for learning paraphrases. They used 11 translations each of around 900 Chinese sentences as their training data. They parsed the multiple translations of the same sentence using a syntactic parser and then merged the matching parts of the different parse trees into a single forest. The forest was then linearized in the form of a word lattice. The alternate paths in the word lattice were treated as paraphrases.

While the multiple-translations-based method often produces good paraphrases, it is limited by the availability of data. The corpus used by Barzilay and McKeown (2001) has only around 0.5 million words on each side, while the corpus used by Pang et al. (2003) has a little over 3 million words on each side. These corpora are very small by current standards in NLP.

2.3.2 Learning Paraphrases using Parallel Bilingual Corpora

Bannard and Callison-Burch (2005) were the first to use bilingual parallel corpora for learning paraphrases. The intuition behind this approach is that foreign language words or phrases can be used as pivots to learn paraphrases for the source language: all source language words or phrases that align with the same foreign language word or phrase are paraphrases. Bannard and Callison-Burch (2005) used an English-German parallel corpus to learn paraphrases for English. They also experimented with using multiple parallel corpora to improve the quality of the paraphrases. Zhou et al. (2006) employed a similar approach to learn paraphrases using an English-Chinese bilingual corpus. Callison-Burch (2008) built on this work and developed a method to restrict the type of paraphrases learned by limiting the paraphrases to be of the same syntactic category. These approaches are also limited by the availability of bilingual parallel corpora, which, though larger than multiple-translations data, still amount to only around 315 million words on the English side (Callison-Burch, 2007).
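The pivot intuition can be made concrete with the paraphrase probability used by Bannard and Callison-Burch (2005), which sums over foreign pivot phrases f: p(e2 | e1) ≈ Σ_f p(e2 | f) · p(f | e1). The sketch below computes this score from a toy phrase table; the table entries and probabilities are invented for illustration.

# Minimal sketch of pivot-based paraphrase scoring over a toy bilingual
# phrase table. The phrases and probabilities are invented.
from collections import defaultdict

# p(foreign | english) and p(english | foreign) from a toy phrase table.
P_F_GIVEN_E = {"under control": {"unter kontrolle": 0.7, "im griff": 0.3}}
P_E_GIVEN_F = {
    "unter kontrolle": {"under control": 0.6, "in check": 0.4},
    "im griff": {"under control": 0.5, "in hand": 0.5},
}

def paraphrase_probs(e1):
    """Score candidate paraphrases of e1 by pivoting through every
    foreign phrase f:  p(e2|e1) ~= sum_f p(e2|f) * p(f|e1)."""
    scores = defaultdict(float)
    for f, p_f in P_F_GIVEN_E.get(e1, {}).items():
        for e2, p_e2 in P_E_GIVEN_F.get(f, {}).items():
            if e2 != e1:
                scores[e2] += p_e2 * p_f
    return dict(scores)

print(paraphrase_probs("under control"))  # {'in check': 0.28, 'in hand': 0.15}

Summing over all pivots rewards candidates that are reachable through many different foreign phrases.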


2.3.3 Learning Paraphrases using Comparable Corpora

Shinyama et al. (2002) were the first to use comparable corpora to learn paraphrases. The intuition behind this approach is that the same event is reported by various newspapers, and each of these newspapers uses a different set of words to express this event. These comparable articles can be used to mine paraphrases. Shinyama et al. (2002) presented an algorithm that automatically finds news articles that report the same event and then aligns the similar sentences in these news articles by using the named entities as anchors. The named entities are then further used as anchors to learn phrase-level paraphrases. Barzilay and Lee (2003) used comparable news articles to obtain sentence-level paraphrases. Dolan et al. (2004) worked with a much larger scale of data from the web and extracted sentence-level paraphrases using edit distance between sentence pairs and some heuristics. Quirk et al. (2004) used these sentence-level paraphrases to learn phrase-level paraphrases by aligning the corresponding sentence parts using machine translation techniques. These approaches rely on corpora of comparable sentences, the largest of which has about 60 million words (Callison-Burch, 2007). This is quite small by current standards in NLP.
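A minimal sketch of the named-entity anchoring idea follows: sentences from comparable articles that mention the same entities are paired, and the text between the shared anchors is treated as a candidate paraphrase. The example sentences and the crude capitalized-word stand-in for a named-entity tagger are invented for illustration.

# Minimal sketch: pairing comparable sentences by shared named entities
# and extracting the text between the anchors as candidate paraphrases.
# The sentences and the crude capitalized-word "tagger" are toys.
import re

def entities(sentence):
    """Very crude stand-in for a named-entity tagger: maximal runs of
    capitalized words."""
    return set(re.findall(r"(?:[A-Z][a-z]+ )*[A-Z][a-z]+", sentence))

def between_anchors(sentence, e1, e2):
    """Return the text between two anchor entities, if both occur."""
    m = re.search(re.escape(e1) + r"(.*?)" + re.escape(e2), sentence)
    return m.group(1).strip() if m else None

s1 = "Google acquired YouTube in October."
s2 = "Google bought YouTube for 1.65 billion dollars."

if {"Google", "YouTube"} <= (entities(s1) & entities(s2)):
    print(between_anchors(s1, "Google", "YouTube"))  # acquired
    print(between_anchors(s2, "Google", "YouTube"))  # bought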


2.3.4 Learning Paraphrases using Monolingual Corpora

Lin and Pantel (2001) were the first to use a single monolingual resource for obtaining quasi-paraphrase rules. The intuition behind their approach is that, like words, paths in a syntax tree that appear in similar contexts should have similar meanings. Based on this intuition, they parsed a medium-sized corpus using a dependency parser and collected paths in the syntax trees that satisfy some pre-defined heuristics. They then collected contexts for each of the paths in that corpus and found similarities between the paths based on their contexts. Two paths whose similarity exceeds some pre-defined threshold form a quasi-paraphrase rule. Szpektor et al. (2004) presented another approach for learning quasi-paraphrase rules. They have a pre-defined set of verbs which they call pivots. For each of these pivots, they obtain sentences from the web that contain that pivot, parse the sentences using a dependency parser, and find anchors, i.e., words that are in specific syntactic relations with the pivot. They then rank the anchors using some heuristics to choose reliable anchors. The reliable anchors for each pivot are then used to find other phrases (these phrases are syntactic paths in a dependency tree) from the web that might entail the pivot.

In this thesis, we use the quasi-paraphrase rules obtained from the DIRT algorithm (Lin & Pantel, 2001) to demonstrate the effectiveness of ISPs in pointing out the contexts in which they are mutually replaceable, and to demonstrate the effectiveness of LEDIR to learn their directionalities. However, both the Lin and Pantel (2001) and Szpektor et al. (2004) approaches use syntactic parsers for learning syntactic quasi-paraphrases. This limits their applicability to clean, and relatively small, data sets [1]. Also, for applications like information extraction, researchers have found it hard to make use of full-parse-tree-based quasi-paraphrases for learning useful patterns [2]. To address these issues, in this thesis, we present methods to learn paraphrases using simple and fast part-of-speech tagging and shallow parsing. Our algorithms use a single monolingual corpus and easily scale to very large corpora. We were able to scale our part-of-speech-based algorithm to a corpus of 25 billion words, which is at least one order of magnitude larger than the corpora that have been used for learning paraphrases in the past. Also, we show the effectiveness of our quasi-paraphrases in learning patterns for information extraction.
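To illustrate the distributional-similarity computation that underlies both DIRT and our surface-level method, in a much simplified form (real systems use dependency paths or very large corpora, and association weights such as pointwise mutual information rather than raw counts), the sketch below scores two phrase patterns by the cosine similarity of their argument-filler distributions. The filler counts are invented for illustration.

# Minimal sketch: scoring two phrase patterns as quasi-paraphrases by
# the cosine similarity of their argument-filler distributions. The
# counts stand in for statistics gathered from a large corpus.
import math
from collections import Counter

FILLERS = {
    "X acquired Y": Counter({("Google", "YouTube"): 12, ("Oracle", "Sun"): 7,
                             ("police", "evidence"): 1}),
    "X bought Y": Counter({("Google", "YouTube"): 9, ("Oracle", "Sun"): 5,
                           ("Mary", "a car"): 4}),
}

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

sim = cosine(FILLERS["X acquired Y"], FILLERS["X bought Y"])
print(f"similarity = {sim:.2f}")  # high filler overlap -> likely quasi-paraphrases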

2.4

Learning Selectional Preferences and Directionality

In this section, we present a brief overview of the work relating to the two paraphrasing issues that we address in this thesis: learning selectional preferences and learning directionality.

[1] With the availability of large clusters, scaling parsers to large data sets has become easier, especially in big companies like Google, Yahoo, and Microsoft. However, scaling is still a problem for a majority of the research groups that work on Natural Language Processing.
[2] Personal communication.


2.4.1 Learning Selectional Preferences

Selectional Preference (SP) as a foundation for computational semantics is one of the earliest topics in AI and NLP, and has its roots in Katz and Fodor (1963) and Wilks (1975). Overviews of NLP research on this theme are Wilks and Fass (1992), which includes the influential theory of Preference Semantics by Wilks (1975), and more recently Light and Greiff (2002). Light and Greiff (2002) define SPs as the preference of a predicate for the semantic class membership of its argument and vice versa. Much previous work has focused on learning SPs for simple structures. Resnik (1996), the seminal paper on this topic, introduced a statistical model for learning SPs for predicates using an unsupervised method. The focus of our work in this thesis, however, is to learn SPs for quasi-paraphrases. Learning SPs often relies on an underlying set of semantic classes, as in both Resnik (1996) and our approach. Semantic classes can be specified manually or derived automatically. Manually constructed collections of semantic classes include hierarchies like WordNet (Fellbaum, 1998), Levin verb classes (Levin, 1993), and FrameNet (Baker et al., 1998). Automatic derivation of semantic classes can take a variety of approaches, but often uses corpus methods and the Distributional Hypothesis (Harris, 1954) to automatically cluster similar entities into classes, e.g., CBC (Pantel & Lin, 2002). In our work, we experiment with two sets of semantic classes, one from WordNet and one from CBC.


Zanzotto et al. (2006) recently explored a different interplay between SPs and inferences. Rather than examine the role of SPs in inferences, they use SPs of a particular type to derive inferences. For instance, the preference of win for the subject player, a nominalization of play, is used to derive that “win ⇒ play”. Our work can be viewed as complementary to the work on extracting quasi-paraphrase rules, since we seek to refine when a given quasi-paraphrase rule applies, filtering out incorrect inferences.

2.4.2 Learning Directionality

There have been a few approaches to learning the directionality of restricted sets of semantic relations, mostly between verbs. Chklovski and Pantel (2004) used lexico-syntactic patterns over the Web to detect certain types of symmetric and asymmetric relations between verbs. They manually examined and obtained lexico-syntactic patterns that help identify the types of relations they considered and used these lexico-syntactic patterns over the Web to detect these relations among a set of candidate verb pairs. Zanzotto et al. (2006) explored a selectional preference-based approach to learn asymmetric inference rules between verbs. They used the selectional preferences of a single verb, i.e., the semantic types of a verb's arguments, to infer an asymmetric inference between the verb and the verb form of its argument type. Torisawa (2006) presented a method to acquire inference rules with temporal constraints between verbs. They used co-occurrences between verbs in Japanese coordinated sentences and co-occurrences between verbs and nouns to learn the verb-verb inference rules. All these approaches, however, are limited only to verbs, and to specific types of verb relations.

In principle, the work that is the most similar to ours is by Geffet and Dagan (2005). Geffet and Dagan proposed an extension to the distributional hypothesis to discover entailment relations between words. They model the context of a word using its syntactic features, and compare the contexts of two words for strict inclusion to infer lexical entailment. Their method, however, is limited to lexical entailment, and they show its effectiveness for nouns. The method that we present in this thesis for learning directionality deals with quasi-paraphrase rules between binary relations and includes quasi-paraphrase rules between verbal relations, non-verbal relations, and multi-word relations. Our definition of context and our methodology for computing context similarity and overlap also differ substantially from those used in any of the previous approaches.
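As a rough illustration of the inclusion intuition behind directionality (in the spirit of Geffet and Dagan's inclusion idea and of LEDIR, but not their actual models; ours is described in Chapter 5), the sketch below infers a direction from the overlap of the semantic classes observed with each phrase: if the classes seen with p are largely contained in those seen with q, then p plausibly implies q but not the reverse. The class sets and the threshold are invented.

# Minimal sketch: deciding directionality from argument-class overlap.
# If the classes observed with p are (mostly) contained in those observed
# with q, infer p => q. The class sets and the 0.8 threshold are invented.

CLASSES_SEEN = {
    "X was beheaded by Y": {"person"},
    "X was killed by Y": {"person", "animal", "legislation", "proposal"},
}

def direction(p, q, threshold=0.8):
    cp, cq = CLASSES_SEEN[p], CLASSES_SEEN[q]
    p_in_q = len(cp & cq) / len(cp)  # fraction of p's classes covered by q
    q_in_p = len(cp & cq) / len(cq)  # fraction of q's classes covered by p
    if p_in_q >= threshold and q_in_p >= threshold:
        return f"{p} <=> {q}"        # bidirectional: a strong paraphrase
    if p_in_q >= threshold:
        return f"{p} => {q}"
    if q_in_p >= threshold:
        return f"{q} => {p}"
    return "no reliable direction"

print(direction("X was beheaded by Y", "X was killed by Y"))
# X was beheaded by Y => X was killed by Y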

2.5

Applications

In this section, we present a brief overview of the past work in pattern-based information extraction and the use of paraphrases for learning extraction patterns.


2.5.1 Paraphrases for Learning Extraction Patterns

Paraphrases have emerged as a useful tool for learning Information Extraction (IE) patterns. Sekine (2006) and Romano et al. (2006) used syntactic paraphrases to learn patterns for relation extraction. The Sekine (2006) approach clusters patterns based on the entities they extract and on the keywords inside the patterns; patterns in one cluster are assumed to have similar meanings (paraphrases). Romano et al. (2006), on the other hand, use the entailment templates from Szpektor et al. (2004) to learn patterns using some seeds. While procedurally different, both methods depend heavily on the performance of the syntax parser and require complex syntax tree matching to extract the relation instances. For general information extraction, i.e., when only a single entity or role needs to be extracted from text, Szpektor and Dagan (2008) used distributional similarity between paths in dependency trees to learn entailing quasi-paraphrases. This approach also learns syntactic quasi-paraphrases, which involves parsing the text with a dependency parser.

Our method for learning patterns for relation extraction (Bhagat & Ravichandran, 2008), on the other hand, uses surface paraphrases. Also, for general information extraction, our first method learns surface paraphrases, making it easily scalable to large corpora and giving us the flexibility to apply different levels of post-processing. Our second method learns lexico-syntactic paraphrases using shallow parsing, which is also several times faster than full parsing. The fact that our methods only rely on simple and fast language processing techniques makes them both robust and scalable.

2.5.2 Paraphrases for Learning Patterns for Open-Domain Relation Extraction

One task related to the work we do in this thesis is relation extraction. Its aim is to extract instances of a given relation. For example, given a relation like "acquisition", relation extraction aims to extract the ⟨ACQUIRER⟩ and the ⟨ACQUIREE⟩. Hearst (1992), the pioneering paper in the field, used a small number of hand-selected patterns to extract instances of the hyponymy relation. Berland and Charniak (1999) used a similar method for extracting instances of the meronymy relation. Ravichandran and Hovy (2002) used seed instances of a relation to automatically obtain surface patterns by querying the web. Romano et al. (2006) and Sekine (2006) used syntactic paraphrases to obtain patterns for extracting relations.

In this thesis, we use surface paraphrases as a method for learning surface patterns for relation extraction. Our method starts with a few seed patterns for a given relation and obtains other surface patterns automatically by finding their paraphrases. To find these paraphrases, we use our surface paraphrase learning algorithm, which uses distributional similarity over a large corpus to learn them. This approach helps our method avoid the problem of obtaining overly general patterns that arises with the Ravichandran and Hovy (2002) approach. Also, the use of surface patterns in our method avoids dependence on a parser and on syntactic matching, the two factors that play an important role in the performance of the Romano et al. (2006) and Sekine (2006) approaches. This also makes the extraction process scalable.
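A minimal sketch of the seed-expansion step follows: a few seed surface patterns for a relation are expanded with entries from a paraphrase table and the expanded pattern set is matched against text to extract instances. The paraphrase table, the <X>/<Y> slot notation, and the example sentences are invented for illustration; the actual models and experiments are described in Chapter 7.

# Minimal sketch: expanding seed surface patterns for an "acquisition"
# relation with a paraphrase table, then matching them against text.
# The paraphrase table and the example sentences are toys.
import re

PARAPHRASE_TABLE = {
    "<X> acquired <Y>": ["<X> bought <Y>", "<X> completed its purchase of <Y>"],
}

SEEDS = ["<X> acquired <Y>"]

def expand(seeds):
    """Add the known paraphrases of each seed pattern."""
    patterns = set(seeds)
    for seed in seeds:
        patterns.update(PARAPHRASE_TABLE.get(seed, []))
    return patterns

def to_regex(pattern):
    """Turn a slot pattern into a regex with capitalized-word slots."""
    return re.compile(pattern.replace("<X>", r"(?P<X>[A-Z]\w+)")
                             .replace("<Y>", r"(?P<Y>[A-Z]\w+)"))

sentences = ["Google bought YouTube last year.",
             "Oracle completed its purchase of Sun."]

for pattern in sorted(expand(SEEDS)):
    rx = to_regex(pattern)
    for s in sentences:
        m = rx.search(s)
        if m:
            print(pattern, "->", (m.group("X"), m.group("Y")))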

2.5.3 Paraphrases for Learning Patterns for Domain-Specific Information Extraction

While the pattern-based approach to domain-specific information extraction has been popular since the early 1990s, the initial work focused on manual creation of patterns (Hobbs et al., 1993; Jacobs et al., 1993). Focus, however, quickly shifted to learning patterns automatically from domain-specific corpora. One line of work focused on using annotated training corpora to learn these patterns (Riloff, 1993; Kim & Moldovan, 1993; Freitag, 1998a; Califf & Mooney, 2003). However, given the need for large amounts of tedious manual annotation for these methods, weakly supervised approaches, which need very little annotation, are now becoming popular (Riloff, 1996; Patwardhan & Riloff, 2007). These and other similar approaches to domain-specific IE use domain-specific corpora for learning patterns. Their dependence on domain-specific corpora is a hindrance to the easy portability of these methods to new domains. The pattern-learning method we present in this thesis, however, differs from the previous approaches in that it does not need a domain-specific corpus for learning patterns: it learns patterns from a general broad-coverage corpus. Also, our method collects patterns from a broad-coverage corpus only once; patterns can then be generated for any (new) domain by using just a few seed patterns.


Chapter 3

Paraphrases

3.1

Introduction

Sentences or phrases that convey the same meaning using different surface words are called paraphrases. For example, the sentences (1) and (2) below are called paraphrases. The school said that their buses seat 40 students each.

(1)

The school said that their buses accommodate 40 students each. (2) While the general interpretation of the term paraphrases is quite narrow (along the lines mentioned above), in linguistic literature, paraphrases are most often characterized by an approximate equivalence of meaning across sentences or phrases. De Beaugrande and Dressler (1981) define paraphrases as “Approximate conceptual equivalence among outwardly different material”. Hirst (2003) defines paraphrases as “Talk(ing)

28

about the same situation in a different way". He argues that paraphrases are not necessarily synonymous: there are pragmatic differences between paraphrases, i.e., differences of evaluation, connotation, viewpoint, etc. According to Mel'cuk (to appear), "For two sentences to be considered paraphrases, they need not be fully synonymous: it is sufficient for them to be quasi-synonymous, that is, to be mutually replaceable salva significatione at least in some contexts". He further adds that approximate paraphrases include implications (not in the logical, but the everyday sense). Taking an extreme view, Clark (1992) rejects the idea of absolute synonymy by saying "Every two forms (in language) contrast in meaning". Overall, there is a large body of work in the linguistic literature which argues that paraphrases are not restricted to synonymy.

In this thesis, we take a broad view of paraphrases along the lines mentioned above. To avoid the conflict between the notion of strict paraphrases as understood in logic and the broad notion in linguistics, we use the term quasi-paraphrases to refer to the paraphrases that we deal with in this thesis. In the context of this thesis, the term "paraphrases" (even without the prefix "quasi") means "quasi-paraphrases". We define quasi-paraphrases as "Sentences or phrases that convey approximately the same meaning using different surface words". We ignore the fine-grained distinctions of meaning between sentences and phrases introduced due to the speaker's evaluation of the situation, the connotation of the terms used, change of modality, etc. For example, consider sentences (3) and (4) below.


The school said that their buses seat 40 students each. (3)
The school said that their buses cram 40 students each. (4)

Here, the words "seat" and "cram" are not synonymous: they carry different evaluations by the speaker of the same situation. We nevertheless consider sentences (3) and (4) to be (quasi-)paraphrases. Similarly, consider sentences (5) and (6) below.

The school said that their buses seat 40 students each. (5)
The school is saying that their buses might accommodate 40 students each. (6)

Here, "said" and "is saying" have different tenses. Also, "might accommodate" and "seat" are not synonymous: "might accommodate" contains the modal verb "might". We nevertheless consider sentences (5) and (6) to be quasi-paraphrases. While approximate equivalence is hard to characterize, except as the intuition of a native speaker, we will do our best in this thesis to make it as objective as possible.

3.2 Paraphrasing Phenomenon Explained

In this section, we define and describe the phenomenon of quasi-paraphrases. The phenomenon is analyzed from two perspectives: lexical and structural. The lexical perspective deals with the kinds of lexical changes that can take place in a sentence or a


phrase, resulting in the generation of its paraphrases. These lexical changes are accompanied by changes in the structure of the original sentence or phrase, which are characterized from the structural perspective. We discuss both perspectives in detail below.

3.2.1 Lexical Perspective

The lexical perspective presents the various lexical changes that can take place in a sentence or a phrase while retaining its approximate meaning (semantics).

1. Synonym substitution: Replacing a word or a phrase by a synonymous word or phrase, in the appropriate context, results in a paraphrase of the original sentence or phrase. This category covers near-synonymy, that is, it allows for changes in evaluation, connotation, etc., of words or phrases between paraphrases. This category also covers the special case of genitives, where the clitic “’s” is replaced by other genitive indicators like “of”, “of the”, etc. Accompanying structural changes: Substitution. Example: Google bought YouTube. ⇔ Google acquired YouTube. Mary is slim. ⇔ Mary is skinny.


2. Actor/Action substitution: Replacing the name of an action by a word or phrase denoting the person doing the action (actor) and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words. Accompanying structural changes: Substitution, Addition/Deletion. Example: I dislike rash drivers. ⇔ I dislike rash driving. 3. Manipulator/Device substitution: Replacing the name of a device by a word or phrase denoting the person using the device (manipulator) and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words. Accompanying structural changes: Substitution, Addition/Deletion. Example: The pilot took off despite the stormy weather. ⇔ The plane took off despite the stormy weather. 4. General/Specific substitution: Replacing a word or a phrase by a more general or more specific word or phrase, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the


addition/deletion of appropriate function words. Hypernym substitution is a part of this category. This often generates a quasi-paraphrase. Accompanying structural changes: Substitution, Addition/Deletion. Example: I dislike rash drivers. ⇔ I dislike rash motorists. John is flying in this weekend. ⇔ John is flying in this Saturday. 5. Metaphor substitution: Replacing a noun by its standard metaphorical use and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words. Accompanying structural changes: Substitution, Addition/Deletion. Example: I had to drive through fog to get here. ⇔ I had to drive through a wall of fog to get here. Immigrants have used this network to send cash. ⇔ Immigrants have used this network to send stashes of cash. 6. Part/Whole substitution: Replacing a part by its corresponding whole and vice versa, in the appropriate context, results in a paraphrase of the original sentence


or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words. Accompanying structural changes: Substitution, Addition/Deletion. Example: American airplanes pounded the Taliban defences. ⇔ American airforce pounded the Taliban defences. 7. Verb/"Semantic-role noun" substitution: Replacing a verb by a noun corresponding to the agent of the action or the patient of the action or the instrument used for the action or the medium used for the action, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: John teaches Mary. ⇔ John is Mary's teacher. John teaches Mary. ⇔ Mary has John as her teacher. John teaches Mary. ⇔ Mary is John's student. John teaches Mary. ⇔ John has Mary as his student.


John was batting. ⇔ John was wielding the bat. John tiled his bathroom floor. ⇔ John installed tiles on his bathroom floor. 8. Antonym substitution: Replacing a word or phrase by its antonym accompanied by a negation or by negating some other word, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words. Accompanying structural changes: Substitution, Addition/Deletion. Example: John ate. ⇔ John did not starve. I lost interest in the endeavor. ⇔ I developed disinterest in the endeavor. 9. Pronoun/“Referenced noun” substitution: Replacing a pronoun by the noun it refers to results in a paraphrase of the original sentence or phrase. Accompanying structural changes: Substitution. Example: John likes Mary, because she is pretty. ⇔ John likes Mary, because Mary is pretty. 10. Verb/Noun conversion: Replacing a verb by its corresponding nominalized noun form and vice versa, in the appropriate context, results in a paraphrase of


the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: The police interrogated the suspects. ⇔ The police subjected the suspects to an interrogation. The virus spread over a period of two weeks. ⇔ Two weeks saw a spreading of the virus. 11. Verb/Adjective conversion: Replacing a verb by the corresponding adjective form and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: John loves Mary. ⇔ Mary is loveable to John.


12. Verb/Adverb conversion: Replacing a verb by its corresponding adverb form and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: John boasted about his work. ⇔ John spoke boastfully about his work. 13. Noun/Adjective conversion: Replacing a noun by its corresponding adjective form and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: I'll fly by the end of June. ⇔ I'll fly late June. 14. Converse substitution: Replacing a word or a phrase with its converse and inverting the relationship between the constituents of a sentence or phrase, in the appropriate context, results in a paraphrase of the original sentence or phrase,


presenting the situation from the converse perspective. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: Google bought YouTube. ⇔ YouTube was sold to Google. 15. “Verb and preposition denoting location”/“Noun denoting location” substitution: Replacing a verb and a preposition denoting location by a noun denoting the location and vice versa, in the appropriate context, results in a paraphrase of the original sentence or phrase. This substitution may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: The finalists are playing in the Giants stadium. ⇔ The Giants stadium is the playground for the finalists.


16. Change of voice: Changing a verb from its active to its passive form and vice versa results in a paraphrase of the original sentence or phrase. This change may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. This often generates one of the most strictly meaning-preserving paraphrases. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: John loves Mary. ⇔ Mary is loved by John. This building stores the excess items. ⇔ Excess items are stored in this building. 17. Change of tense: Changing the tense of a verb, in the appropriate context, results in a paraphrase of the original sentence or phrase. This change may be accompanied by the addition/deletion of appropriate function words. This often generates a quasi-paraphrase. Accompanying structural changes: Substitution, Addition/Deletion. Example: John loved Mary. ⇔ John loves Mary.


18. Change of aspect: Changing the aspect of a verb, in the appropriate context, results in a paraphrase of the original sentence or phrase. This change may be accompanied by the addition/deletion of appropriate function words. Accompanying structural changes: Substitution, Addition/Deletion. Example: John is flying in today. ⇔ John flies in today. 19. Change of modality: Addition/deletion of a modal or substitution of one modal by another, in the appropriate context, results in a paraphrase of the original sentence or phrase. This change may be accompanied by the addition/deletion of appropriate function words. This often generates a quasi-paraphrase. Accompanying structural changes: Substitution, Addition/Deletion. Example: Google must buy YouTube. ⇔ Google bought YouTube. The government wants to boost the economy. ⇔ The government hopes to boost the economy. 20. Change of person: Changing the grammatical person of a referenced object results in a paraphrase of the original sentence or phrase. This change may be accompanied by the addition/deletion of appropriate function words. This often generates one of the most strictly meaning-preserving paraphrases.


Accompanying structural changes: Substitution, Addition/Deletion. Example: John said "I like football". ⇔ John said that he liked football. 21. Repetition/Ellipsis: Ellipsis or elliptical construction results in a paraphrase of the original sentence or phrase. Accompanying structural changes: Addition/Deletion. Example: John can run fast and Mary can run fast, too. ⇔ John can run fast and Mary can, too. John can eat three apples and Mary can eat two apples. ⇔ John can eat three apples and Mary can eat two. 22. Semantic implication: Replacing a word or a phrase denoting an action, event, etc. by a word or phrase denoting its possible future effect, in the appropriate context, results in a paraphrase of the original sentence or phrase. This may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. This often generates a quasi-paraphrase. Accompanying structural changes: Substitution, Addition/Deletion, Permutation.


Example: Google is in talks to buy YouTube. ⇔ Google bought YouTube. The Marines are fighting the terrorists. ⇔ The Marines are eliminating the terrorists. 23. Approximate numerical equivalences: Replacing a numerical expression (a word or a phrase denoting a number) by an approximately equivalent numerical expression, in the appropriate context, results in a paraphrase of the original sentence or phrase. This often generates a quasi-paraphrase. Accompanying structural changes: Substitution. Example: At least 23 US soldiers were killed in Iraq last month. ⇔ At least 26 US soldiers were killed in Iraq last month. Disneyland is over 30 miles from my place. ⇔ Disneyland is around 32 miles from my place. 24. Function word variations: Changing the function words in a sentence or a phrase without affecting its semantics, in the appropriate context, results in a paraphrase of the original sentence or phrase. This can involve replacing a light


verb by another light verb, replacing a light verb by copula, replacing a preposition by another preposition, replacing a determiner by another determiner, replacing a determiner by a preposition and vice versa, and addition/removal of a preposition and/or a determiner. Accompanying structural changes: Substitution, Addition/Deletion, Permutation. Example: Results of the competition have been declared. ⇔ Results for the competition have been declared. John showed a nice demo. ⇔ John’s demo was nice. 25. External knowledge: Replacing a word or a phrase by another word or phrase based on extra-linguistic (world) knowledge, in the appropriate context, results in a paraphrase of the original sentence or phrase. This may be accompanied by the addition/deletion of appropriate function words and sentence restructuring. This often generates a quasi-paraphrase, although in some cases preserving meaning exactly. Accompanying structural changes: Substitution, Addition/Deletion, Permutation.


Example: We must work hard to win this election. ⇔ The Democrats must work hard to win this election.

3.2.2 Structural Perspective

The structural perspective presents the various structural changes that take place in a sentence or phrase, at the surface level, as a result of the above-mentioned lexical changes. These structural changes are crucial for the preservation of meaning across sentences or phrases.

1. Substitution: Replacing a word or phrase in the original sentence or phrase by another word or phrase in its paraphrase is called substitution. Example: Excess items are stored in this building. ⇔ This building stores the excess items. 2. Addition/Deletion: Addition of an extra word in the paraphrase, such that it does not have a corresponding word or phrase in the original sentence or phrase, is called addition. Removal of such an extra word or phrase is called deletion. In paraphrasing, addition of a word to a sentence or phrase is equivalent to deletion of that word from its paraphrase.


Example: Excess items are stored in this building. ⇔ This building stores the excess items. 3. Permutation: Changing the order of words or phrases in the paraphrase, such that the corresponding words or phrases in the original sentence or phrase have a different relative order, is called permutation. Example: Excess items are stored in this building. ⇔ This building stores the excess items.

3.3 Analysis of Paraphrases

In Section 3.2, we presented the lexical and structural perspectives which together explain quasi-paraphrases. In this section, we seek to validate the scope and accuracy of the changes from each of the perspectives. To validate the list of suggested changes from the lexical perspective, we present an analysis based on two criteria:

1. Distribution of lexical changes: What is the distribution of each of these lexical changes in a random paraphrase corpus?

2. Human judgement of lexical changes: If one applies each of the lexical changes naively to applicable sentences, how often does each of these changes generate acceptable quasi-paraphrases?


To estimate the prevalence of the structural changes, we show their sentence-level distributions for paraphrases.

3.3.1 Distribution of Lexical Changes

In this section, we explain the procedure we used to measure the distribution of the changes that define paraphrases from the lexical perspective:

1. We downloaded paraphrases from portions of two publicly available data sets containing sentence-level paraphrases: the Multiple-Translations Corpus (MTC) (Huang et al., 2002) and the Microsoft Research (MSR) paraphrase corpus (Dolan et al., 2004). These data sets contain pairs of sentences that are deemed to be paraphrases of each other by annotators. The paraphrase pairs come with their equivalent parts manually aligned (Cohn et al., 2008), thus also giving phrase-level paraphrases.

2. We selected 30 sentence-level paraphrase pairs from each of these corpora at random and extracted the corresponding aligned and unaligned phrases. We assume that any unaligned phrase is paired with a null phrase. This resulted in 210 phrase pairs for the MTC corpus and 145 phrase pairs for the MSR corpus.


3. We labeled each of the phrase pairs with the appropriate lexical changes from Section 3.2.1. If any phrase pair could not be labeled by a lexical change from Section 3.2.1, we labeled it as unknown. 4. We finally calculated the distribution of each label (lexical change), over all the labels, for each corpus.

3.3.2 Human Judgement of Lexical Changes

In this section, we explain the procedure we used to obtain the human judgement of the changes that define paraphrases from the lexical perspective:

1. We randomly selected two words or phrases, from publicly available resources (depending on the lexical change), for each of the lexical changes from Section 3.2.1 (except "external knowledge").

2. For each selected word or phrase, we obtained five random sentences from the Gigaword corpus. These sentences were manually checked to make sure that they contain the intended sense of the word or phrase. This gave us a total of 10 sentences for each phenomenon. For the phenomenon of external knowledge, we randomly sampled a total of 10 sentence pairs from the MTC and MSR corpora, such that the pairs were paraphrases based on external knowledge.


3. For each sentence (except the sentences for the "external knowledge" category), we applied the corresponding lexical changes to the word or phrase selected in step 1. The lexical changes were applied in a naive way: the word or phrase from step 1 was replaced by the corresponding word or phrase depending on the applicable lexical change; the words in the new sentence were allowed to be reordered (permuted) if needed; only function words (and no content words) were allowed to be added to the new sentence if needed.

4. We gave the sentence pairs to two annotators and asked them to annotate them as paraphrases and non-paraphrases.

5. We calculated the precision percentage for each lexical change as the average of the precision scores obtained from the two annotations. We also calculated the kappa statistic (Siegel & Castellan Jr., 1988) to measure the inter-annotator agreement.
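To make steps 4 and 5 concrete, the following is a minimal sketch of how the per-change precision and the two-annotator kappa statistic can be computed; the judgement lists and function names are illustrative assumptions, not the actual tools used for this analysis.

```python
from collections import Counter

def precision_pct(judgements):
    """Percentage of generated sentence pairs that an annotator marked as paraphrases."""
    return 100.0 * sum(judgements) / len(judgements)

def kappa(ann1, ann2):
    """Two-annotator kappa: observed agreement corrected for chance agreement."""
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum((c1[label] / n) * (c2[label] / n) for label in set(c1) | set(c2))
    return (observed - expected) / (1 - expected)

# Hypothetical judgements (1 = paraphrase, 0 = not) for the 10 sentences of one lexical change.
ann1 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
ann2 = [1, 1, 0, 0, 1, 1, 0, 1, 1, 1]

change_precision = (precision_pct(ann1) + precision_pct(ann2)) / 2   # step 5
print(change_precision, kappa(ann1, ann2))
```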

3.3.3 Sentence-level Distribution of Structural Changes

In this section, we explain the procedure we used to measure the sentence-level distributions of the changes from the structural perspective that define paraphrases:

1. We obtained the 60 sentence-level paraphrase pairs (30 from each corpus) used in Section 3.3.1.


2. We labeled each of the selected paraphrase pairs with the appropriate structural changes from Section 3.2.2 (a paraphrase pair can have multiple structural changes; since these are the only possible string operations, we do not use the label "unknown" here). 3. We finally calculated the distribution of each structural change, over the selected sentence pairs, for each corpus.

3.3.4 Results

This section shows the results of the analysis of the lexical and structural changes of paraphrases carried out using the methodologies described in Sections 3.3.1, 3.3.2, and 3.3.3. The corpus for calculating precision from Section 3.3.2 was annotated by two annotators. A kappa score of κ = 0.66 was obtained on the annotation task. The percentage distribution and precision figures were calculated as follows:

Distribution of a lexical change =
(# of correct phrase-level quasi-paraphrase pairs containing the lexical change / # of correct phrase-level quasi-paraphrase pairs in the corpus) × 100

Human judgement of a lexical change =
(# of correct sentence-level quasi-paraphrase pairs containing the lexical change / # of sentence pairs containing the lexical change in the corpus) × 100

Distribution of a structural change =
(# of correct sentence-level quasi-paraphrase pairs containing the structural change / # of correct sentence-level quasi-paraphrase pairs in the corpus) × 100
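As a minimal illustration of these ratios, the sketch below computes the distribution of a change over a set of labelled pairs; the label names and the toy pairs are assumptions made purely for the example, not the actual annotation format used in this analysis.

```python
def distribution_pct(labelled_pairs, change):
    """Percentage of quasi-paraphrase pairs whose label set contains the given change."""
    hits = sum(1 for labels in labelled_pairs if change in labels)
    return 100.0 * hits / len(labelled_pairs)

# Each pair is represented by the set of lexical (or structural) changes it was labelled with.
phrase_pairs = [{"synonym substitution"},
                {"function word variations"},
                {"synonym substitution", "change of tense"},
                {"external knowledge"}]

print(distribution_pct(phrase_pairs, "synonym substitution"))  # 50.0
```

The human judgement figure is computed analogously, except that the denominator is the number of sentence pairs generated for that particular lexical change.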

Tables 3.1 and 3.2 show the percentage distribution of lexical changes in the MTC and MSR corpora respectively; Table 3.3 shows the precision (human judgement) of lexical changes on the sentence-level paraphrase test corpus we created; Tables 3.4 and 3.5 show the percentage distribution of structural changes in the MTC and MSR corpora respectively.

3.4 Conclusion

A definition of what phenomena constitute paraphrases and what do not has been a problem in the past. While some people have used a very narrow interpretation of paraphrases — paraphrases are exactly logically synonymous — others have taken broader perspectives which consider even semantic implications to be paraphrases.


 #    Category                                                                              % Distribution
 1.   Synonym substitution                                                                  36.67%
 2.   Actor/Action substitution                                                              0.00%
 3.   Manipulator/Device substitution                                                        0.00%
 4.   General/Specific substitution                                                          4.29%
 5.   Metaphor substitution                                                                  0.00%
 6.   Singulative/Collective substitution                                                    0.00%
 7.   Verb/"Semantic-role noun" substitution                                                 0.48%
 8.   Antonym substitution                                                                   0.00%
 9.   Pronoun/"Referenced noun" substitution                                                 0.95%
 10.  Verb/Noun conversion                                                                   1.90%
 11.  Verb/Adjective conversion                                                              0.48%
 12.  Verb/Adverb conversion                                                                 0.00%
 13.  Noun/Adjective conversion                                                              0.00%
 14.  Converse substitution                                                                  0.48%
 15.  "Verb and preposition denoting location"/"Noun denoting location" substitution         0.00%
 16.  Change of voice                                                                        1.43%
 17.  Change of tense                                                                        3.81%
 18.  Change of aspect                                                                       0.95%
 19.  Change of modality                                                                     0.95%
 20.  Change of person                                                                       0.00%
 21.  Repetition/Ellipsis                                                                    3.81%
 22.  Semantic implication                                                                   0.95%
 23.  Approximate numerical equivalences                                                     0.00%
 24.  Function word variations                                                              36.67%
 25.  External knowledge                                                                     6.19%
 26.  Unknown                                                                                0.00%
      Total                                                                                100.00%

Table 3.1: Distribution of lexical changes in MTC paraphrase set


 #    Category                                                                              % Distribution
 1.   Synonym substitution                                                                  18.62%
 2.   Actor/Action substitution                                                              0.00%
 3.   Manipulator/Device substitution                                                        0.00%
 4.   General/Specific substitution                                                          2.76%
 5.   Metaphor substitution                                                                  0.69%
 6.   Singulative/Collective substitution                                                    0.00%
 7.   Verb/"Semantic-role noun" substitution                                                 0.00%
 8.   Antonym substitution                                                                   0.00%
 9.   Pronoun/"Referenced noun" substitution                                                 0.69%
 10.  Verb/Noun conversion                                                                   2.76%
 11.  Verb/Adjective conversion                                                              0.00%
 12.  Verb/Adverb conversion                                                                 0.00%
 13.  Noun/Adjective conversion                                                              0.00%
 14.  Converse substitution                                                                  0.00%
 15.  "Verb and preposition denoting location"/"Noun denoting location" substitution         0.00%
 16.  Change of voice                                                                        0.69%
 17.  Change of tense                                                                        1.38%
 18.  Change of aspect                                                                       0.00%
 19.  Change of modality                                                                     0.00%
 20.  Change of person                                                                       0.69%
 21.  Repetition/Ellipsis                                                                    4.14%
 22.  Semantic implication                                                                   4.14%
 23.  Approximate numerical equivalences                                                     2.07%
 24.  Function word variations                                                              29.66%
 25.  External knowledge                                                                    31.72%
 26.  Unknown                                                                                0.00%
      Total                                                                                100.00%

Table 3.2: Distribution of lexical changes in MSR paraphrase set


 #    Category                                                                              % Accuracy
 1.   Synonym substitution                                                                  95.00%
 2.   Actor/Action substitution                                                             75.00%
 3.   Manipulator/Device substitution                                                       30.00%
 4.   General/Specific substitution                                                         80.00%
 5.   Metaphor substitution                                                                 60.00%
 6.   Singulative/Collective substitution                                                   65.00%
 7.   Verb/"Semantic-role noun" substitution                                                60.00%
 8.   Antonym substitution                                                                  65.00%
 9.   Pronoun/"Referenced noun" substitution                                                70.00%
 10.  Verb/Noun conversion                                                                 100.00%
 11.  Verb/Adjective conversion                                                             55.00%
 12.  Verb/Adverb conversion                                                                65.00%
 13.  Noun/Adjective conversion                                                             80.00%
 14.  Converse substitution                                                                 75.00%
 15.  "Verb and preposition denoting location"/"Noun denoting location" substitution        65.00%
 16.  Change of voice                                                                       85.00%
 17.  Change of tense                                                                       70.00%
 18.  Change of aspect                                                                      95.00%
 19.  Change of modality                                                                    80.00%
 20.  Change of person                                                                      80.00%
 21.  Repetition/Ellipsis                                                                  100.00%
 22.  Semantic implication                                                                  70.00%
 23.  Approximate numerical equivalences                                                    95.00%
 24.  Function word variations                                                              85.00%
 25.  External knowledge                                                                    95.00%

Table 3.3: Human Judgement of lexical changes

 #    Category            % Distribution
 1.   Substitution        93.33%
 2.   Addition/Deletion   90.00%
 3.   Permutation         60.00%

Table 3.4: Distribution of structural changes in MTC paraphrase set


 #    Category            % Distribution
 1.   Substitution        96.67%
 2.   Addition/Deletion   96.67%
 3.   Permutation         40.00%

Table 3.5: Distribution of structural changes in MSR paraphrase set

To the best of our knowledge, outside of specific language interpretation frameworks (like Meaning Text Theory (Mel'cuk, 1996)), no one has created a general, exhaustive list of changes that define paraphrases. In this chapter we have created such a list. We have also tried to empirically quantify the distribution and accuracy of the list. It is notable that certain types of changes dominate while some other changes are very rare. However, it is also observed that the dominating changes vary based on the type of paraphrase corpus used, thus indicating the variety exhibited by paraphrases. Based on the large variety of possible changes that can generate paraphrases, it seems likely that the kinds of paraphrases that are deemed useful would depend on the application at hand. This might motivate the creation of application-specific lists of the kinds of allowable paraphrases and the development of automatic methods to distinguish the different kinds of paraphrases.


Chapter 4

Inferential Selectional Preferences

4.1 Introduction

Semantic inference is a key component for advanced natural language understanding. Several important applications are already relying heavily on inference, including question answering (Moldovan et al., 2003; Harabagiu & Hickl, 2006), information extraction (Romano et al., 2006), and textual entailment (Szpektor et al., 2004). In response, several researchers have created resources for enabling semantic inference. Among manual resources used for this task are WordNet (Fellbaum, 1998) and Cyc (Lenat, 1995). Although important and useful, these resources primarily contain prescriptive paraphrase rules such as “X acquired Y” ⇔ “X bought Y”. In practical NLP applications, however, quasi-paraphrase rules such as “X is charged by Y” ⇔ “Y announced the arrest of X” are very useful. This, along with the difficulty and


labor-intensiveness of generating exhaustive lists of rules, has led researchers to focus on automatic methods for building inference resources such as quasi-paraphrase rule collections (Lin & Pantel, 2001; Szpektor et al., 2004) and quasi-paraphrase collections (Barzilay & McKeown, 2001). Using these resources in applications has been hindered by the large number of incorrect inferences they generate, either because of altogether incorrect rules or because of blind application of plausible rules without considering the context of the relations or the senses of the words. For example, consider the following sentence:

Terry Nichols was charged by federal prosecutors for murder and conspiracy in the Oklahoma City bombing.

and a quasi-paraphrase rule such as:

X is charged by Y ⇔ Y announced the arrest of X    (1)

Using this rule, we can infer that "federal prosecutors announced the arrest of Terry Nichols". However, given the sentence:

Fraud was suspected when accounts were charged by CCM telemarketers without obtaining consumer authorization.

the plausible quasi-paraphrase rule (1) would incorrectly infer that "CCM telemarketers announced the arrest of accounts".


This example depicts a major obstacle to the effective use of automatically learned quasi-paraphrase rules. What is missing is knowledge about the admissible argument values for which the phrases are synonymous, i.e., for which a quasi-paraphrase rule holds; we call this knowledge Inferential Selectional Preferences. For example, quasi-paraphrase rule (1) should only be applied if X is a Person and Y is a Law Enforcement Agent or a Law Enforcement Agency. This knowledge does not guarantee that the quasi-paraphrase rule will hold, but, as we show here, goes a long way toward filtering out erroneous applications of rules. In this chapter, we introduce ISP, a collection of methods for learning inferential selectional preferences and filtering out incorrect inferences. The presented algorithms apply to any collection of quasi-paraphrase rules between binary semantic relations, such as example (1). ISP derives inferential selectional preferences by aggregating statistics of quasi-paraphrase rule instantiations over a large corpus of text. Within ISP, we explore different probabilistic models of selectional preference to accept or reject specific inferences. We present empirical evidence to support the following main contribution:

Claim: Inferential selectional preferences can be automatically learned and used for effectively filtering out incorrect inferences.


4.2 Selectional Preference Models

The aim of this chapter is to learn inferential selectional preferences for filtering quasi-paraphrase rules.

Let pi ⇔ pj be a quasi-paraphrase rule where p is a binary semantic relation between two entities x and y. Let ⟨x, p, y⟩ be an instance of relation p.

Formal task definition: Given a quasi-paraphrase rule pi ⇔ pj and the instance ⟨x, pi, y⟩, our task is to determine whether ⟨x, pj, y⟩ is valid.

Consider the example in Section 4.1, where we have the quasi-paraphrase rule "X is charged by Y" ⇔ "Y announced the arrest of X". Our task is to automatically determine that federal prosecutors announced the arrest of Terry Nichols (i.e., ⟨Terry Nichols, pj, federal prosecutors⟩) is valid but that CCM telemarketers announced the arrest of accounts is invalid.

Because the semantic relations p are binary, the selectional preferences on their two arguments may be considered either jointly or independently. For example, the relation p = "X is charged by Y" could have joint SPs:

⟨Person, Law Enforcement Agent⟩
⟨Person, Law Enforcement Agency⟩    (2)
⟨Bank Account, Organization⟩

or independent SPs:


⟨Person, *⟩
⟨*, Organization⟩    (3)
⟨*, Law Enforcement Agent⟩

This distinction between joint and independent selectional preferences constitutes the difference between the two models we present in this section. The remainder of this section describes the ISP approach. In Section 4.2.1, we describe methods for automatically determining the semantic contexts of each single relation's selectional preferences. Section 4.2.2 uses these for developing our inferential selectional preference models. Finally, we present inference filtering algorithms in Section 4.2.3.

4.2.1 Relational Selectional Preferences

Resnik (1996) defined the selectional preferences of a predicate as the semantic classes of the words that appear as its arguments. Similarly, we define the relational selectional preferences of a binary semantic relation pi as the semantic classes C(x) of the words that can be instantiated for x and the semantic classes C(y) of the words that can be instantiated for y. The semantic classes C(x) and C(y) can be obtained from a conceptual taxonomy as proposed in Resnik (1996), such as WordNet, or from the classes extracted by a word clustering algorithm such as CBC (Pantel & Lin, 2002). For example, given the


relation "X is charged by Y", its relational selectional preferences from WordNet could be {social group, organism, state} for X and {authority, state, section} for Y. Below, we propose joint and independent models, based on a corpus analysis, for automatically determining relational selectional preferences.

4.2.1.1 Joint Relational Model (JRM)

Our joint model uses a corpus analysis to learn SPs for binary semantic relations by considering their arguments jointly, as in example (2). Given a large corpus of English text, we first find the occurrences of each semantic relation p. For each instance ⟨x, p, y⟩, we retrieve the sets C(x) and C(y) of the semantic classes that x and y belong to and accumulate the frequencies of the triples ⟨c(x), p, c(y)⟩, where c(x) ∈ C(x) and c(y) ∈ C(y). Each triple ⟨c(x), p, c(y)⟩ is a candidate selectional preference for p. Candidates can be incorrect when: a) they were generated from the incorrect sense of a polysemous word; or b) p does not hold for the other words in the semantic class. Intuitively, we have more confidence in a particular candidate if its semantic classes are closely associated given the relation p. Pointwise mutual information (Cover &


Thomas, 1991) is a commonly used metric for measuring this association strength between two events e1 and e2 :

pmi(e1; e2) = log [ P(e1, e2) / (P(e1) P(e2)) ]    (4.1)

We define our ranking function as the strength of association between two semantic classes, c(x) and c(y), given the relation p:

pmi(c(x)|p; c(y)|p) = log [ P(c(x), c(y)|p) / (P(c(x)|p) P(c(y)|p)) ]    (4.2)

Let |c(x), p, c(y)| denote the frequency of observing the instance ⟨c(x), p, c(y)⟩. We estimate the probabilities of Equation 4.2 using maximum likelihood estimates over our corpus:

P(c(x)|p) = |c(x), p, *| / |*, p, *|
P(c(y)|p) = |*, p, c(y)| / |*, p, *|    (4.3)
P(c(x), c(y)|p) = |c(x), p, c(y)| / |*, p, *|

Similarly to Resnik (1996), we estimate the above frequencies using:

|c(x), p, *| = Σ_{w ∈ c(x)} |w, p, *| / |C(w)|
|*, p, c(y)| = Σ_{w ∈ c(y)} |*, p, w| / |C(w)|    (4.4)
|c(x), p, c(y)| = Σ_{w1 ∈ c(x), w2 ∈ c(y)} |w1, p, w2| / (|C(w1)| × |C(w2)|)


where |x, p, y| denotes the frequency of observing the instance ⟨x, p, y⟩ and |C(w)| denotes the number of classes to which word w belongs. Dividing by |C(w)| distributes w's mass equally over all of its senses c(w).
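To make Equations 4.2–4.4 concrete, the sketch below accumulates the class-level counts from word-level instance counts and scores each candidate SP. The data layout, the class lookup, and the toy inputs are assumptions made for illustration only, not the system's actual implementation.

```python
import math
from collections import defaultdict

def jrm_scores(instance_counts, classes_of):
    """
    instance_counts: dict mapping word-level instances (x, p, y) to corpus frequencies |x, p, y|.
    classes_of: dict mapping a word to the set of semantic classes C(w) it belongs to.
    Returns pmi(c(x)|p; c(y)|p) for every candidate triple <c(x), p, c(y)> (Equation 4.2).
    """
    joint = defaultdict(float)   # |c(x), p, c(y)|
    left = defaultdict(float)    # |c(x), p, *|
    right = defaultdict(float)   # |*, p, c(y)|
    total = defaultdict(float)   # |*, p, *|

    for (x, p, y), freq in instance_counts.items():
        cx, cy = classes_of.get(x, set()), classes_of.get(y, set())
        if not cx or not cy:
            continue
        total[p] += freq
        for ci in cx:                                   # Equation 4.4: split w's mass over its senses
            left[(ci, p)] += freq / len(cx)
        for cj in cy:
            right[(p, cj)] += freq / len(cy)
        for ci in cx:
            for cj in cy:
                joint[(ci, p, cj)] += freq / (len(cx) * len(cy))

    scores = {}
    for (ci, p, cj), f in joint.items():
        p_joint = f / total[p]                          # Equation 4.3 (maximum likelihood estimates)
        p_cx = left[(ci, p)] / total[p]
        p_cy = right[(p, cj)] / total[p]
        scores[(ci, p, cj)] = math.log(p_joint / (p_cx * p_cy))   # Equation 4.2
    return scores

# Toy example with invented classes and counts.
counts = {("Nichols", "X is charged by Y", "prosecutors"): 3,
          ("account", "X is charged by Y", "telemarketers"): 2}
classes = {"Nichols": {"Person"}, "prosecutors": {"Law Enforcement Agent"},
           "account": {"Bank Account"}, "telemarketers": {"Organization"}}
print(jrm_scores(counts, classes))
```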

4.2.1.2 Independent Relational Model (IRM)

Because of sparse data, our joint model can miss some correct selectional preference pairs. For example, given the relation "Y announced the arrest of X", we may find occurrences in our corpus of the particular class "Money Handler" for X and "Lawyer" for Y; however, we may never see both of these classes co-occurring, even though they would form a valid relational selectional preference. To alleviate this problem, we propose a second model that is less strict, considering the arguments of the binary semantic relations independently, as in example (3). Similarly to the JRM, we extract each instance ⟨x, p, y⟩ of each semantic relation p and retrieve the set of semantic classes C(x) and C(y) that x and y belong to, accumulating the frequencies of the triples ⟨c(x), p, *⟩ and ⟨*, p, c(y)⟩, where c(x) ∈ C(x) and c(y) ∈ C(y).


All tuples ⟨c(x), p, *⟩ and ⟨*, p, c(y)⟩ are candidate selectional preferences for p. We rank candidates by the probability of the semantic class given the relation p, according to Equation 4.3.

4.2.2 Inferential Selectional Preferences

Whereas in Section 4.2.1 we learned selectional preferences for the arguments of a relation p, in this section we learn selectional preferences for the arguments of a quasi-paraphrase rule pi ⇔ pj.

4.2.2.1 Joint Inferential Model (JIM)

Given a quasi-paraphrase rule pi ⇔ pj, our joint model defines the set of inferential SPs as the intersection of the relational SPs for pi and pj, as defined in the Joint Relational Model (JRM). For example, suppose relation pi = "X is charged by Y" gives the following SP scores under the JRM:

⟨Person, pi, Law Enforcement Agent⟩ = 1.45
⟨Person, pi, Law Enforcement Agency⟩ = 1.21
⟨Bank Account, pi, Organization⟩ = 0.97

and that pj = "Y announced the arrest of X" gives the following SP scores under the JRM:


⟨Law Enforcement Agent, pj, Person⟩ = 2.01
⟨Reporter, pj, Person⟩ = 1.98
⟨Law Enforcement Agency, pj, Person⟩ = 1.61

The intersection of the two sets of SPs forms the candidate inferential SPs for the rule pi ⇔ pj:

⟨Law Enforcement Agent, Person⟩
⟨Law Enforcement Agency, Person⟩

We rank the candidate inferential SPs according to three ways to combine their relational SP scores, using the minimum, maximum, and average of the SPs. For example, for ⟨Law Enforcement Agent, Person⟩, the respective scores would be 1.45, 2.01, and 1.73. These different ranking strategies produced nearly identical results in our experiments, as discussed in Section 4.4.
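A small sketch of this intersection-and-ranking step, reusing the illustrative scores above; the dictionary layout and the assumption that the two rules' argument slots appear in opposite surface order are made purely for this example.

```python
def joint_inferential_sps(sp_i, sp_j, combine="average"):
    """Intersect the relational SPs of p_i and p_j and combine their scores."""
    combiners = {"minimum": min, "maximum": max, "average": lambda a, b: (a + b) / 2}
    f = combiners[combine]
    result = {}
    for (cx, cy), score_i in sp_i.items():
        # For this rule pair, p_j realizes the same arguments in the opposite surface order.
        if (cy, cx) in sp_j:
            result[(cy, cx)] = f(score_i, sp_j[(cy, cx)])
    return result

sp_i = {("Person", "Law Enforcement Agent"): 1.45,     # "X is charged by Y"
        ("Person", "Law Enforcement Agency"): 1.21,
        ("Bank Account", "Organization"): 0.97}
sp_j = {("Law Enforcement Agent", "Person"): 2.01,     # "Y announced the arrest of X"
        ("Reporter", "Person"): 1.98,
        ("Law Enforcement Agency", "Person"): 1.61}

print(joint_inferential_sps(sp_i, sp_j))
# {('Law Enforcement Agent', 'Person'): 1.73, ('Law Enforcement Agency', 'Person'): 1.41}
```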

4.2.2.2 Independent Inferential Model (IIM)

Our independent model is the same as the joint model above, except that it computes candidate inferential SPs using the Independent Relational Model (IRM) instead of the JRM. Consider the same example relations pi and pj from the joint model and suppose that the IRM gives the following relational SP scores for pi:


⟨Law Enforcement Agent, pi, *⟩ = 0.63
⟨*, pi, Person⟩ = 0.27
⟨*, pi, Organization⟩ = 0.21

and the following relational SP scores for pj:

⟨*, pj, Person⟩ = 0.57
⟨Law Enforcement Agent, pj, *⟩ = 0.32
⟨Reporter, pj, *⟩ = 0.26

The intersection of the two sets of SPs forms the candidate inferential SPs for the inference pi ⇔ pj:

⟨Law Enforcement Agent, *⟩
⟨*, Person⟩

We use the same minimum, maximum, and average ranking strategies as in the JIM.

4.2.3 Filtering Inferences

Given a quasi-paraphrase rule pi ⇔ pj and the instance ⟨x, pi, y⟩, the system's task is to determine whether ⟨x, pj, y⟩ is valid. Let C(w) be the set of semantic classes c(w) to which word w belongs. Below we present three filtering algorithms, which range from the least to the most permissive:


• ISP.JIM accepts the inference ⟨x, pj, y⟩ if the inferential SP ⟨c(x), pj, c(y)⟩ was admitted by the Joint Inferential Model for some c(x) ∈ C(x) and c(y) ∈ C(y).

• ISP.IIM.∧ accepts the inference ⟨x, pj, y⟩ if the inferential SPs ⟨c(x), pj, *⟩ AND ⟨*, pj, c(y)⟩ were admitted by the Independent Inferential Model for some c(x) ∈ C(x) and c(y) ∈ C(y).

• ISP.IIM.∨ accepts the inference ⟨x, pj, y⟩ if the inferential SP ⟨c(x), pj, *⟩ OR ⟨*, pj, c(y)⟩ was admitted by the Independent Inferential Model for some c(x) ∈ C(x) and c(y) ∈ C(y).

Since both the JIM and the IIM use a ranking score in their inferential SPs, each filtering algorithm can be tuned to be more or less strict by setting an acceptance threshold on the ranking scores or by selecting only the top τ percent highest-ranking SPs. In our experiments, reported in Section 4.4, we tested each model using various values of τ.
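A sketch of the three filters, assuming the SPs admitted by the two inferential models are available as plain Python sets (the container layout and names are illustrative, not the actual implementation):

```python
def isp_jim(x, p_j, y, classes_of, admitted_joint):
    """Accept <x, p_j, y> if some pair of the arguments' classes is an admitted joint inferential SP."""
    return any((cx, p_j, cy) in admitted_joint
               for cx in classes_of.get(x, set())
               for cy in classes_of.get(y, set()))

def isp_iim_and(x, p_j, y, classes_of, admitted_x_slot, admitted_y_slot):
    """Accept only if both argument slots are licensed by the independent inferential model."""
    x_ok = any((cx, p_j) in admitted_x_slot for cx in classes_of.get(x, set()))
    y_ok = any((p_j, cy) in admitted_y_slot for cy in classes_of.get(y, set()))
    return x_ok and y_ok

def isp_iim_or(x, p_j, y, classes_of, admitted_x_slot, admitted_y_slot):
    """Accept if either argument slot is licensed by the independent inferential model."""
    x_ok = any((cx, p_j) in admitted_x_slot for cx in classes_of.get(x, set()))
    y_ok = any((p_j, cy) in admitted_y_slot for cy in classes_of.get(y, set()))
    return x_ok or y_ok
```

In this sketch, the admitted sets would contain only the top-τ percent of SPs by ranking score, which is how the strictness of each filter is tuned.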

4.3 Experimental Methodology

This section describes the methodology for testing our claim that inferential selectional preferences can be learned to filter incorrect inferences. Given a collection of quasi-paraphrase rules of the form pi ⇔ pj, our task is to determine whether a particular instance ⟨x, pj, y⟩ holds given that ⟨x, pi, y⟩ holds. In the next sections, we describe our collection of quasi-paraphrase rules, the semantic


classes used for forming selectional preferences, and evaluation criteria for measuring the filtering quality.

4.3.1 Quasi-paraphrase Rules

Our models for learning inferential selectional preferences can be applied to any collection of quasi-paraphrase rules between binary semantic relations. In our work, we focus on the quasi-paraphrase rules contained in the DIRT resource (Lin & Pantel, 2001). DIRT consists of over 12 million rules which were extracted from a 1GB newspaper corpus (San Jose Mercury, Wall Street Journal, and AP Newswire from the TREC-9 collection). For example, here are DIRT's top 3 quasi-paraphrase rules for "X solves Y": "Y is solved by X", "X resolves Y", "X finds a solution to Y".

4.3.2 Semantic Classes

The choice of semantic classes is of great importance for selectional preferences. One important aspect is the granularity of the classes. Too general a class will provide no discriminatory power, while too fine-grained a class will offer little generalization and apply in only extremely few cases. The absence of an attested high-quality set of semantic classes for this task makes discovering preferences difficult. Since many of the criteria for developing such a set are not even known, we decided to experiment with two very different sets of semantic


classes, in the hope that in addition to learning semantic preferences, we might also uncover some clues for the eventual decisions about what makes good semantic classes in general. Our first set of semantic classes was directly extracted from the output of the CBC clustering algorithm (Pantel & Lin, 2002). We applied CBC to the TREC-9 and TREC-2002 (Aquaint) newswire collections consisting of over 600 million words. CBC generated 1628 noun concepts and these were used as our semantic classes for SPs. Secondly, we extracted semantic classes from WordNet 2.1 (Fellbaum, 1998). In the absence of any externally motivated distinguishing features (for example, the Basic Level categories from Prototype Theory, developed by Eleanor Rosch (1978)), we used the simple but effective method of manually truncating the noun synset hierarchy and considering all synsets below each cut point as part of the semantic class at that node. To select the cut points, we inspected several different hierarchy levels and found the synsets at a depth of 4 to form the most natural semantic classes. Since the noun hierarchy in WordNet has an average depth of 12, our truncation created a set of concepts considerably coarser-grained than WordNet itself. The cut produced 1287 semantic classes, a number similar to the number of classes in CBC. To properly test WordNet as a source of semantic classes for our selectional preferences, we would need to experiment with different extraction algorithms.
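One possible way to realize such a truncation with NLTK's WordNet interface is sketched below; the use of min_depth as the cut criterion and the handling of multiple inheritance are simplifying assumptions, not necessarily the exact procedure followed here.

```python
from nltk.corpus import wordnet as wn

def cut_point_synsets(cut_depth=4):
    """Noun synsets at the cut depth; everything below a cut point collapses into that class."""
    return {s for s in wn.all_synsets('n') if s.min_depth() == cut_depth}

def semantic_classes(word, cut_points):
    """Map a word to the cut-point synsets dominating any of its noun senses."""
    classes = set()
    for sense in wn.synsets(word, pos='n'):
        ancestors = {sense} | set(sense.closure(lambda s: s.hypernyms()))
        classes |= ancestors & cut_points
    return classes

# Example usage (requires the NLTK WordNet data to be installed):
# cuts = cut_point_synsets(4)
# print(semantic_classes('prosecutor', cuts))
```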


4.3.3 Evaluation Criteria

The goal of the filtering task is to minimize false positives (incorrectly accepted inferences) and false negatives (incorrectly rejected inferences). A standard methodology for evaluating such tasks is to compare system filtering results with a gold standard using a confusion matrix. A confusion matrix, such as Table 4.1, captures the filtering performance on both correct and incorrect inferences:

                GOLD STANDARD
                1      0
 SYSTEM   1     A      B
          0     C      D

Table 4.1: Confusion matrix

where A represents the number of correct instances correctly identified by the system, D represents the number of incorrect instances correctly identified by the system, B represents the number of false positives, and C represents the number of false negatives. To compare systems, three key measures are used to summarize confusion matrices:

• Sensitivity, defined as A / (A + C), captures a filter's probability of accepting correct inferences;

• Specificity, defined as D / (B + D), captures a filter's probability of rejecting incorrect inferences;

• Accuracy, defined as (A + D) / (A + B + C + D), captures the probability of a filter being correct.
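These three measures follow directly from the counts in Table 4.1; a minimal helper is sketched below (the example counts are the ones reported later in Table 4.3):

```python
def filter_metrics(a, b, c, d):
    """a, b, c, d as in Table 4.1: rows are the system's decision (1/0), columns the gold standard (1/0)."""
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    accuracy = (a + d) / (a + b + c + d)
    return sensitivity, specificity, accuracy

# Counts from Table 4.3 (ISP.IIM.v, best accuracy): a=184, b=139, c=63, d=114.
print(filter_metrics(184, 139, 63, 114))   # approximately (0.745, 0.451, 0.596)
```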


4.4 Experimental Results

In this section, we provide empirical evidence to support the main claim of this chapter. Given a collection of DIRT quasi-paraphrase rules of the form pi ⇔ pj, our experiments, using the methodology of Section 4.3, evaluate the capability of our ISP models for determining whether ⟨x, pj, y⟩ holds given that ⟨x, pi, y⟩ holds.

4.4.1 Experimental Setup

4.4.1.1 Model Implementation

For each filtering algorithm in Section 4.2.3, ISP.JIM, ISP.IIM.∧, and ISP.IIM.∨, we trained their probabilistic models using corpus statistics extracted from the 1999 AP newswire collection (part of the TREC-2002 Aquaint collection) consisting of approximately 31 million words. We used the Minipar parser (Lin, 1994) to match DIRT patterns in the text. This permits exact matches since DIRT quasi-paraphrase rules are built from Minipar parse trees. For each system, we experimented with the different ways of combining relational SP scores: minimum, maximum, and average (see Section 4.2.2). Also, we experimented with various values for the τ parameter described in Section 4.2.3.


4.4.1.2 Gold Standard Construction

In order to compute the confusion matrices described in Section 4.3.3, we must first construct a representative set of inferences and manually annotate them as correct or incorrect. We randomly selected 100 quasi-paraphrase rules of the form pi ⇔ pj from DIRT. For each pattern pi, we then extracted its instances from the Aquaint 1999 AP newswire collection (approximately 22 million words), and randomly selected 10 distinct instances, resulting in a total of 1000 instances. For each instance of pi, applying DIRT's quasi-paraphrase rule would assert the instance ⟨x, pj, y⟩. Our evaluation tests how well our models can filter these so that only correct inferences are made. To form the gold standard, two human judges were asked to tag each instance ⟨x, pj, y⟩ as correct or incorrect. For example, given a randomly selected quasi-paraphrase rule "X is charged by Y" ⇔ "Y announced the arrest of X" and the instance "Terry Nichols was charged by federal prosecutors", the judges must determine if the instance ⟨federal prosecutors, Y announced the arrest of X, Terry Nichols⟩ is correct. The judges were asked to consider the following two criteria for their decision:

• ⟨x, pj, y⟩ is a semantically meaningful instance;

• The inference or the paraphrase relation pi ⇔ pj holds for this instance.


Judges found that annotation decisions can range from trivial to difficult. The differences were often in instances for which one of the judges failed to see the right context under which the inference could hold. To minimize disagreements, the judges went through an extensive round of training. To that end, the 1000 instances ⟨x, pj, y⟩ were split into DEV and TEST sets, 500 in each. The two judges trained themselves by annotating DEV together. The TEST set was then annotated separately to verify the inter-annotator agreement and to verify whether the task is well-defined. The kappa statistic (Siegel & Castellan Jr., 1988) was κ = 0.72. For the 70 disagreements between the judges, a third judge acted as an adjudicator.

4.4.1.3 Baselines

We compare our ISP algorithms to the following baselines:

• B0: Rejects all inferences;
• B1: Accepts all inferences;
• Rand: Randomly accepts or rejects inferences.

One alternative to our approach is to admit instances using literal search queries on the Web. We investigated this technique but discarded it due to subtle yet critical issues


with pattern canonicalization that resulted in rejecting nearly all inferences. However, we are investigating other ways of using Web corpora for this task.

4.4.2 Filtering Quality

 System                Ranking Strategy   τ (%)   Sensitivity (95% Conf)   Specificity (95% Conf)   Accuracy (95% Conf)
 B0                    —                  —       0.00 ± 0.00              1.00 ± 0.00              0.50 ± 0.04
 B1                    —                  —       1.00 ± 0.00              0.00 ± 0.00              0.49 ± 0.04
 Rand                  —                  —       0.50 ± 0.06              0.47 ± 0.07              0.50 ± 0.04
 ISP.JIM (CBC)         maximum            100     0.17 ± 0.04              0.88 ± 0.04              0.53 ± 0.04
 ISP.IIM.∧ (CBC)       maximum            100     0.24 ± 0.05              0.84 ± 0.04              0.54 ± 0.04
 ISP.IIM.∨ (CBC)       maximum            90      0.73 ± 0.05              0.45 ± 0.06              0.59 ± 0.04
 ISP.JIM (WordNet)     minimum            40      0.20 ± 0.06              0.75 ± 0.06              0.47 ± 0.04
 ISP.IIM.∧ (WordNet)   minimum            10      0.33 ± 0.07              0.77 ± 0.06              0.55 ± 0.04
 ISP.IIM.∨ (WordNet)   minimum            20      0.87 ± 0.04              0.17 ± 0.05              0.51 ± 0.05

Table 4.2: Filtering quality of best performing systems according to the evaluation criteria defined in Section 4.3.3 on the TEST set (ranking strategy and τ selected on the DEV set).

For each ISP algorithm and parameter combination, we constructed a confusion matrix on the development set and computed the system sensitivity, specificity, and accuracy as described in Section 4.3.3. This resulted in 180 experiments on the development set. For each ISP algorithm and semantic class source, we selected the best parameter combinations according to the following criteria:

• Accuracy: This system has the best overall ability to correctly accept and reject inferences.

• 90%-Specificity: Several formal semantics and textual entailment researchers have commented that quasi-paraphrase rule collections like DIRT are difficult


to use due to low precision. Many have asked for filtered versions that remove incorrect inferences even at the cost of removing correct inferences. In response, we show results for the system achieving the best sensitivity while maintaining at least 90% specificity on the DEV set. We evaluated the selected systems on the TEST set. Table 4.2 summarizes the quality of the systems selected according to the Accuracy criterion. The best performing system, ISP.IIM.∨, performed statistically significantly better than all three baselines. The best system according to the 90%-Specificity criterion was ISP.JIM, which coincidentally has the highest accuracy for that model, as shown in Table 4.2. This result is very promising for researchers who require highly accurate quasi-paraphrase rules, since they can use ISP.JIM and expect to recall 17% of the correct inferences while accepting false positives only 12% of the time.

4.4.2.1 Performance and Error Analysis

Tables 4.3 and 4.4 present the full confusion matrices for the most accurate and the most highly specific systems, with both systems selected on the DEV set.

                GOLD STANDARD
                1      0
 SYSTEM   1     184    139
          0     63     114

Table 4.3: Confusion matrix for ISP.IIM.∨ — best accuracy


                GOLD STANDARD
                1      0
 SYSTEM   1     42     28
          0     205    225

Table 4.4: Confusion matrix for ISP.JIM — best 90%-Specificity

The most accurate system was ISP.IIM.∨, which is the most permissive of the algorithms. This suggests that a larger corpus for learning SPs may be needed to support stronger performance of the more restrictive methods. The system in Table 4.4, selected for maximizing sensitivity while maintaining high specificity, was 70% correct in predicting correct inferences. Figure 4.1 illustrates the ROC curve for all our systems and parameter combinations on the TEST set. ROC curves plot the true positive rate against the false positive rate. The near-diagonal line plots the three baseline systems. Several trends can be observed from this figure. First, systems using the semantic classes from WordNet tend to perform less well than systems using CBC classes. As discussed in Section 4.3.2, we used a very simplistic extraction of semantic classes from WordNet. The results in Figure 4.1 serve as a lower bound on what could be achieved with a better extraction from WordNet. Upon inspection of instances that WordNet got incorrect but CBC got correct, it seemed that CBC had a much higher lexical coverage than WordNet. For example, several of the instances contained proper names as either


Figure 4.1: ROC curves for our systems on TEST


the X or Y argument (WordNet has poor proper name coverage). When an argument is not covered by any class, the inference is rejected. Figure 4.1 also illustrates how our three different ISP algorithms behave. The strictest filters, ISP.JIM and ISP.IIM.∧, have the poorest overall performance but, as expected, have a generally very low rate of false positives. ISP.IIM.∨, which is a much more permissive filter because it does not require both arguments of a relation to match, has generally many more false positives but has an overall better performance. We did not include in Figure 4.1 an analysis of the minimum, maximum, and average ranking strategies presented in Section 4.2.2 since they generally produced nearly identical results.

Figure 4.2: ISP.IIM.∨ (Best system’s) performance variation over different values of the τ threshold


For the most accurate system, ISP.IIM.∨, we explored the impact of the cutoff threshold τ on the sensitivity, specificity, and accuracy, as shown in Figure 4.2. Rather than stepping the values by 10% as we did on the DEV set, here we stepped the threshold value by 2% on the TEST set. The more permissive values of τ increase sensitivity at the expense of specificity. Interestingly, the overall accuracy remained fairly constant across the entire range of τ, staying within 0.05 of the maximum of 0.62 achieved at τ = 30%. Finally, we manually inspected several incorrect inferences that were missed by our filters. A common source of errors was the many incorrect "antonymy" quasi-paraphrase rules generated by DIRT, such as "X is rejected in Y" ⇔ "X is accepted in Y". This recognized problem in DIRT occurs because of the distributional hypothesis assumption used to form the quasi-paraphrase rules. Our ISP algorithms suffer from a similar quandary since, typically, antonymous relations take the same sets of arguments for X (and Y). For these cases, ISP algorithms learn many selectional preferences that accept the same types of entities as those that made DIRT learn the quasi-paraphrase rule in the first place; hence ISP will not filter out many incorrect inferences.

4.5 Conclusion

We presented algorithms for learning what we call inferential selectional preferences, and presented evidence that learning selectional preferences can be useful in filtering


out incorrect inferences. This work constitutes a step towards better understanding of the interaction of selectional preferences and inferences, bridging these two aspects of semantics.


Chapter 5

Learning Directionality

5.1 Introduction

As discussed in the previous chapter, manually built resources like WordNet (Fellbaum, 1998) and Cyc (Lenat, 1995) have been around for years, but for coverage and domain-adaptability reasons many recent approaches have focused on automatic acquisition of quasi-paraphrases (Barzilay & McKeown, 2001) and quasi-paraphrase rules (Lin & Pantel, 2001; Szpektor et al., 2004). The downside of these approaches is that they often result in incorrect quasi-paraphrase rules or in quasi-paraphrase rules that are underspecified in directionality (i.e., asymmetric but wrongly considered symmetric). For example, consider a quasi-paraphrase rule from DIRT (Lin & Pantel, 2001):

X eats Y ⇔ X likes Y    (1)


All rules in DIRT are considered symmetric. Here, however, one is most likely to infer that "X eats Y" ⇒ "X likes Y", because if someone eats something, he most probably likes it, but if he likes something he might not necessarily be able to eat it. So, for example, given the sentence "I eat spicy food", one is most likely to infer that "I like spicy food". On the other hand, given the sentence "I like rollerblading", one cannot infer that "I eat rollerblading". In this chapter, we propose an algorithm called LEDIR (pronounced "leader") for LEarning Directionality of Inference Rules. Our algorithm filters incorrect quasi-paraphrase rules and identifies the directionality of the correct ones. Our algorithm works with any resource that produces quasi-paraphrase rules of the form shown in example (1). We use both the distributional hypothesis and selectional preferences as the basis for our algorithm. We provide empirical evidence to validate the following main contribution:

Claim: Relational selectional preferences can be used to automatically determine the plausibility and directionality of a quasi-paraphrase rule.

5.2

Learning Directionality of Quasi-paraphrase Rules

The aim of this chapter is to filter out incorrect quasi-paraphrase rules and to identify the directionality of the correct ones.

Let pi ⇔ pj be a quasi-paraphrase rule where each p is a binary semantic relation between two entities x and y, and let ⟨x, p, y⟩ be an instance of relation p.

Formal problem definition: Given the quasi-paraphrase rule pi ⇔ pj, we want to conclude which one of the following is more appropriate:

1. pi ⇔ pj
2. pi ⇒ pj
3. pi ⇐ pj
4. No plausible inference

Consider example (1) from Section 5.1. There, it is most plausible to conclude "X eats Y" ⇒ "X likes Y". Our algorithm LEDIR uses selectional preferences along the lines of Chapter 4 to determine the plausibility and directionality of quasi-paraphrase rules.

5.2.1 Underlying Assumption

Many approaches to modeling lexical semantics have relied on the distributional hypothesis (Harris, 1954), which states that words that appear in the same contexts tend to have similar meanings. The idea is that context is a good indicator of a word's meaning. Lin and Pantel (2001) proposed an extension to the distributional hypothesis and applied it to paths in dependency trees: if two paths tend to occur in similar contexts, the meanings of the paths are hypothesized to be similar. In this chapter, we assume and propose a further extension to the distributional hypothesis, which we call the Directionality Hypothesis.

Directionality Hypothesis: If two binary semantic relations tend to occur in similar contexts and the first one occurs in significantly more contexts than the second, then the second most likely implies the first and not vice versa.

The intuition here is that of generality: the more general a relation, the more types (and the greater the number) of contexts in which it is likely to appear. Consider example (1) from Section 5.1. There are many more things that someone might like than things that someone might eat. Hence, by applying the directionality hypothesis, one can infer that "X eats Y" ⇒ "X likes Y". The key to applying the distributional hypothesis to the problem at hand is to model the contexts appropriately and to introduce a measure for calculating context similarity. Concepts in semantic space, due to their abstractive power, are much richer for reasoning about inferences than simple surface words. Hence, we model the context of a relation p of the form ⟨x, p, y⟩ by using the semantic classes C(x) and C(y) of words that can be instantiated for x and y respectively. To measure the context similarity of two relations, we calculate the overlap coefficient (Manning & Schütze, 1999) between their contexts.


5.2.2 Selectional Preferences

The selectional preferences of a predicate are the set of semantic classes that its arguments can belong to (Wilks, 1975). Resnik (1996) gave an information-theoretic formulation of the idea. In Chapter 4, we extended this idea to non-verbal relations by defining the relational selectional preferences (RSPs) of a binary relation p as the sets of semantic classes C(x) and C(y) of words that can occur in positions x and y respectively. The sets of semantic classes C(x) and C(y) can be obtained either from a manually created taxonomy like WordNet, as proposed in the previous approaches above, or from automatically generated classes output by a word clustering algorithm, as proposed in Chapter 4. In this chapter, we deployed both the Joint Relational Model (JRM) and the Independent Relational Model (IRM) proposed in Chapter 4 to obtain the selectional preferences for a relation p.

5.2.2.1

Joint Relational Model (JRM)

The JRM uses a large corpus to learn the selectional preferences of a binary semantic relation by considering its arguments jointly. Given a relation p and a large corpus of English text, we first find all occurrences of relation p in the corpus. For every instance ⟨x, p, y⟩ in the corpus, we obtain the sets C(x) and C(y) of the semantic classes that x and y belong to. We then accumulate the frequencies of the triples ⟨c(x), p, c(y)⟩ by assuming that every c(x) ∈ C(x) can co-occur with every c(y) ∈ C(y) and vice versa. Every triple ⟨c(x), p, c(y)⟩ obtained in this manner is a candidate selectional preference for p. As in Chapter 4, we rank these candidates using pointwise mutual information (Cover & Thomas, 1991). The ranking function is defined as the strength of association between two semantic classes, c(x) and c(y), given the relation p:

pmi(c(x)|p; c(y)|p) = log [ P(c(x), c(y)|p) / ( P(c(x)|p) P(c(y)|p) ) ]    (5.1)

Let |c(x), p, c(y)| denote the frequency of observing the instance ⟨c(x), p, c(y)⟩. We estimate the probabilities of Equation 5.1 using maximum likelihood estimates over our corpus:

P(c(x)|p) = |c(x), p, ∗| / |∗, p, ∗|    P(c(y)|p) = |∗, p, c(y)| / |∗, p, ∗|    P(c(x), c(y)|p) = |c(x), p, c(y)| / |∗, p, ∗|    (5.2)

We estimate the above frequencies using:

|c(x), p, ∗| = Σ_{w ∈ c(x)} |w, p, ∗| / |C(w)|

|∗, p, c(y)| = Σ_{w ∈ c(y)} |∗, p, w| / |C(w)|

|c(x), p, c(y)| = Σ_{w1 ∈ c(x), w2 ∈ c(y)} |w1, p, w2| / ( |C(w1)| × |C(w2)| )    (5.3)

where |x, p, y| denotes the frequency of observing the instance ⟨x, p, y⟩ and |C(w)| denotes the number of classes to which word w belongs. Dividing by |C(w)| distributes w's mass equally among all of its senses C(w).
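To make the JRM computation concrete, the following sketch shows one way the counts of Equations 5.2 and 5.3 and the pmi ranking of Equation 5.1 could be implemented. It assumes the instance pairs for a relation p and a word-to-class mapping are already available; all names are illustrative rather than part of the system described here.

from collections import defaultdict
from itertools import product
import math

def jrm_rsps(instances, classes_of):
    """Rank candidate RSPs <c(x), p, c(y)> for one relation p
    using the counts of Eqs. 5.2-5.3 and the pmi score of Eq. 5.1.

    instances  : list of (x, y) argument pairs observed with relation p
    classes_of : dict mapping a word to the set of semantic classes it belongs to
    """
    joint = defaultdict(float)   # |c(x), p, c(y)|
    left = defaultdict(float)    # |c(x), p, *|
    right = defaultdict(float)   # |*, p, c(y)|
    total = 0.0                  # |*, p, *|

    for x, y in instances:
        cxs, cys = classes_of.get(x, set()), classes_of.get(y, set())
        if not cxs or not cys:
            continue
        total += 1.0
        for cx in cxs:                      # distribute x's mass over its classes
            left[cx] += 1.0 / len(cxs)
        for cy in cys:
            right[cy] += 1.0 / len(cys)
        for cx, cy in product(cxs, cys):    # assume every c(x) co-occurs with every c(y)
            joint[(cx, cy)] += 1.0 / (len(cxs) * len(cys))

    scored = {}
    for (cx, cy), freq in joint.items():
        p_joint = freq / total
        p_cx, p_cy = left[cx] / total, right[cy] / total
        scored[(cx, cy)] = math.log(p_joint / (p_cx * p_cy))   # Equation 5.1
    return sorted(scored.items(), key=lambda kv: -kv[1])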

5.2.2.2

Independent Relational Model (IRM)

Due to sparse data, the JRM is likely to miss some pairs of valid relational selectional preferences. Hence we use the IRM, which models the arguments of a binary semantic relation independently. As with the JRM, we find all instances of the form ⟨x, p, y⟩ for a relation p. We then find the sets C(x) and C(y) of the semantic classes that x and y belong to and accumulate the frequencies of the triples ⟨c(x), p, ∗⟩ and ⟨∗, p, c(y)⟩, where c(x) ∈ C(x) and c(y) ∈ C(y). All the tuples ⟨c(x), p, ∗⟩ and ⟨∗, p, c(y)⟩ are the independent candidate RSPs for a relation p, and we rank them according to Equation 5.3. Once we have the independently learned RSPs, we need to convert them into a joint representation for use by the inference plausibility and directionality model. To do this, we take the Cartesian product of the sets ⟨C(x), p, ∗⟩ and ⟨∗, p, C(y)⟩ for a relation p. The Cartesian product between two sets A and B is given by:

A × B = {(a, b) : a ∈ A, b ∈ B}    (5.4)

Similarly we obtain:

⟨C(x), p, ∗⟩ × ⟨∗, p, C(y)⟩ = { ⟨c(x), p, c(y)⟩ : ⟨c(x), p, ∗⟩ ∈ ⟨C(x), p, ∗⟩ ∧ ⟨∗, p, c(y)⟩ ∈ ⟨∗, p, C(y)⟩ }    (5.5)

The Cartesian product in Equation 5.5 gives the joint representation of the RSPs of the relation p learned using the IRM. In the joint representation, the IRM RSPs have the form ⟨c(x), p, c(y)⟩, which is the same form as the JRM RSPs.
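A minimal sketch of the conversion in Equation 5.5, assuming the independently learned RSPs are available as two collections of class labels; the helper name is illustrative.

from itertools import product

def irm_joint(left_rsps, right_rsps):
    """Combine independently learned RSPs into the joint form of Equation 5.5:
    every <c(x), p, *> is paired with every <*, p, c(y)>."""
    return {(cx, cy) for cx, cy in product(left_rsps, right_rsps)}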

5.2.3 Plausibility and Directionality Model

Our model for determining the plausibility and directionality of quasi-paraphrase rules is based on the intuition that, for an inference to hold between two semantic relations, there must be sufficient overlap between their contexts, and that the directionality of the quasi-paraphrase rule depends on a quantitative comparison of those contexts. Here we model the context of a relation by the selectional preferences of that relation. We determine the plausibility of a quasi-paraphrase rule based on the overlap coefficient (Manning & Schütze, 1999) between the selectional preferences of the two relations. We determine the directionality based on the difference in the number of selectional preferences of the relations when the inference between them seems plausible.


Given a candidate quasi-paraphrase rule pi ⇔ pj, we first obtain the RSPs ⟨C(x), pi, C(y)⟩ for pi and ⟨C(x), pj, C(y)⟩ for pj. We then calculate the overlap coefficient between their respective RSPs. The overlap coefficient is one of many distributional similarity measures used to calculate the similarity between two sets A and B:

sim(A, B) = |A ∩ B| / min(|A|, |B|)    (5.6)

The overlap coefficient between the selectional preferences of pi and pj is calculated as:

sim(pi, pj) = |⟨C(x), pi, C(y)⟩ ∩ ⟨C(x), pj, C(y)⟩| / min(|C(x), pi, C(y)|, |C(x), pj, C(y)|)    (5.7)

If sim(pi, pj) is above a certain empirically determined threshold α (≤ 1), we conclude that the quasi-paraphrase rule is plausible, i.e.:

If sim(pi, pj) ≥ α, we conclude the quasi-paraphrase rule is plausible;
else, we conclude the quasi-paraphrase rule is not plausible.

For a plausible quasi-paraphrase rule, we then compute the ratio between the number of selectional preferences |C(x), pi, C(y)| for pi and |C(x), pj, C(y)| for pj and compare it against an empirically determined threshold β (≥ 1) to determine the directionality of the quasi-paraphrase rule. The algorithm is:


If |C(x), pi, C(y)| / |C(x), pj, C(y)| ≥ β, we conclude pi ⇐ pj;
else if |C(x), pi, C(y)| / |C(x), pj, C(y)| ≤ 1/β, we conclude pi ⇒ pj;
else, we conclude pi ⇔ pj.
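Putting the plausibility test of Equation 5.7 and the directionality test above together, a minimal sketch of the decision procedure might look as follows; the RSPs are assumed to be available as sets, and the function name is illustrative.

def ledir_decision(rsps_i, rsps_j, alpha, beta):
    """Plausibility and directionality decision for a rule pi <=> pj, given the
    RSP sets of the two relations (the overlap test of Eq. 5.7 followed by the
    ratio test above). alpha (<= 1) and beta (>= 1) are tuned on held-out data."""
    overlap = len(rsps_i & rsps_j) / min(len(rsps_i), len(rsps_j))
    if overlap < alpha:
        return "no plausible inference"
    ratio = len(rsps_i) / len(rsps_j)
    if ratio >= beta:
        return "pi <= pj"      # pi has many more RSPs, so pi is the more general relation
    if ratio <= 1.0 / beta:
        return "pi => pj"      # pj is the more general relation
    return "pi <=> pj"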

5.3

Experimental Setup

In this section, we describe our experimental setup to validate our claim that LEDIR can be used to determine the plausibility and directionality of a quasi-paraphrase rule. Given a quasi-paraphrase rule of the form pi ⇔ pj, we want to use automatically learned relational selectional preferences to determine whether the rule is valid and, if it is, to determine its directionality.

5.3.1 Quasi-paraphrase Rules

LEDIR can work with any set of binary semantic quasi-paraphrase rules. For the purpose of this chapter, we chose the quasi-paraphrase rules from the DIRT resource (Lin & Pantel, 2001). DIRT consists of 12 million rules extracted from 1GB of newspaper text (AP Newswire, San Jose Mercury and Wall Street Journal). For example, "X eats Y" ⇔ "X likes Y" is a quasi-paraphrase rule from DIRT.


5.3.2 Semantic Classes

An appropriate choice of semantic classes is crucial for learning relational selectional preferences. The ideal set should contain semantic classes that strike the right balance between abstraction and discrimination, two important characteristics that are often in conflict. A very general class has limited discriminative power, while a very specific class has limited abstractive power. Finding the right balance here is a separate research problem of its own. Since an ideal set of universally acceptable semantic classes is unavailable, we followed the approach from Chapter 4 and used two sets of semantic classes. This approach gave us the advantage of being able to experiment with sets of classes that vary a lot in the way they are generated, while keeping the granularity comparable by obtaining approximately the same number of classes. The first set of semantic classes was obtained by running the CBC clustering algorithm (Pantel & Lin, 2002) on the TREC-9 and TREC-2002 newswire collections consisting of over 600 million words. This resulted in 1628 clusters, each representing a semantic class. The second set of semantic classes was obtained by using WordNet 2.1 (Fellbaum, 1998). We obtained a cut in the WordNet noun hierarchy by manual inspection and used all the synsets below a cut point as the semantic class at that node. Our inspection showed that the synsets at depth four formed the most natural semantic classes. A cut at depth four resulted in a set of 1287 semantic classes, a set that is much coarser grained than WordNet, which has an average depth of 12. This depth seems to give a reasonable abstraction while maintaining good discriminative power. It would, however, be interesting to experiment with more sophisticated algorithms for extracting semantic classes from WordNet and to see their effect on the relational selectional preferences, something we do not address in this chapter.

5.3.3 Implementation

We implemented LEDIR with both the JRM and IRM models, using quasi-paraphrase rules from DIRT and semantic classes from both CBC and WordNet. We parsed the 1999 AP newswire collection, consisting of 31 million words, with Minipar (Lin, 1994) and used it to obtain the probability statistics for the models (as described in Section 5.2.2). We performed both system-wide evaluations and intrinsic evaluations with different values of the α and β parameters. Section 5.4 presents these results and our error analysis.

5.3.4 Gold Standard Construction In order to evaluate the performance of the different systems, we compare their outputs against a manually annotated gold standard. To create this gold standard, we randomly


sampled 160 quasi-paraphrase rules of the form pi ⇔ pj from DIRT. We discarded three rules since they contained nominalizations. For every quasi-paraphrase rule of the form pi ⇔ pj, the annotation guideline asked the annotators (in this work we used two annotators) to choose the most appropriate of the four options:

1. pi ⇔ pj
2. pi ⇒ pj
3. pi ⇐ pj
4. No plausible inference

To help the annotators with their decisions, they were provided with 10 randomly chosen instances for each quasi-paraphrase rule. These instances, extracted from DIRT, provided the annotators with context in which the inference could hold. For example, for the quasi-paraphrase rule X eats Y ⇔ X likes Y, an example instance would be "I eat spicy food" ⇔ "I like spicy food". The annotation guideline, however, gave the annotators the freedom to think of examples other than the ones provided to make their decisions.


The annotators found that while some decisions were quite easy to make, the more complex ones often involved a choice between bi-directionality and one of the directions. To minimize disagreements and to get a better understanding of the task, the annotators trained themselves by annotating several samples together. We divided the set of 157 quasi-paraphrase rules into a development set of 57 quasi-paraphrase rules and a blind test set of 100 quasi-paraphrase rules. Our two annotators annotated the development set together to train themselves. The blind test set was then annotated individually to test whether the task is well defined. We used the kappa statistic (Siegal & Castellan Jr., 1988) to calculate the inter-annotator agreement, resulting in κ = 0.63. The annotators then looked at the disagreements together to build the final gold standard. All this resulted in a final gold standard of 100 annotated DIRT rules.

5.3.5 Baselines

To get an objective assessment of the quality of the results obtained by using our models, we compared the output of our systems against three baselines:

• B-random: Randomly assigns one of the four possible tags to each candidate quasi-paraphrase rule.

• B-frequent: Assigns the most frequently occurring tag in the gold standard to each candidate quasi-paraphrase rule.


• B-DIRT: Assumes each quasi-paraphrase rule is bi-directional and assigns the bi-directional tag to each candidate quasi-paraphrase rule.

5.4

Experimental Results

In this section, we provide empirical evidence to validate our claim that the plausibility and directionality of a quasi-paraphrase rule can be determined using LEDIR.

5.4.1 Evaluation Criterion We want to measure the effectiveness of LEDIR for the task of determining the validity and directionality of a set of quasi-paraphrase rules. We follow the standard approach of reporting system accuracy by comparing system outputs on a test set with a manually created gold standard. Using the gold standard described in Section 5.3.4, we measure the accuracy of our systems using the following formula:

Accuracy = ( |correctly tagged quasi-paraphrase rules| / |input quasi-paraphrase rules| ) × 100    (5.8)

5.4.2 Result Summary

We ran all our algorithms with different parameter combinations on the development set (the 57 DIRT rules described in Section 5.3.4). This resulted in a total of 420 experiments on the development set. Based on these experiments, we used the accuracy statistic to obtain the best parameter combination for each of our four systems. We then used these parameter values to obtain the corresponding percentage accuracies on the test set for each of the four systems. Table 5.1 summarizes the results obtained on the test set for the three baselines and for each of the four systems using the best parameter combinations obtained as described above.

Model         α      β    Accuracy (%)
B-random      —      —    25
B-frequent    —      —    34
B-DIRT        —      —    25
JRM (CBC)     0.15   2    38
JRM (WN)      0.55   2    38
IRM (CBC)     0.15   3    48
IRM (WN)      0.45   2    43

Table 5.1: Summary of results on the test set

The overall best performing system uses the IRM algorithm with RSPs from CBC. Its performance is significantly better than all three baselines according to Student's paired t-test (Manning & Schütze, 1999) at p < 0.05. However, its improvement over the other LEDIR implementations (JRM, and IRM with WordNet) is not statistically significant.

5.4.3 Performance and Error Analysis

The best performing system selected using the development set is the IRM system using CBC with the parameters α = 0.15 and β = 3. In general, the results obtained on the test set show that the IRM tends to perform better than the JRM. This observation points to the sparseness of the data available for learning RSPs for the more restrictive JRM, which is the reason we introduced the IRM in the first place. A much larger corpus would be needed to obtain good enough coverage for the JRM.

Figure 5.1 shows the confusion matrix for the overall best performing system as selected using the development set (results are taken from the test set):

                         GOLD STANDARD
                   ⇔      ⇒      ⇐      NO
  SYSTEM   ⇔      16      1      3      7
           ⇒       0      3      1      3
           ⇐       7      4     22     15
           NO      2      3      4      9

Figure 5.1: Confusion matrix for the best performing system, IRM using CBC with α = 0.15 and β = 3

The confusion matrix indicates that the system does a very good job of identifying the directionality of the correct quasi-paraphrase rules, but takes a big performance hit from its inability to identify the incorrect quasi-paraphrase rules accurately. We analyze this observation in more detail below.

Figure 5.2 plots the variation in accuracy of the IRM with different RSPs and different values of α and β. The figure shows a very interesting trend: for all values of β, the IRM systems using CBC tend to reach their peak in the range 0.15 ≤ α ≤ 0.25, whereas the IRM systems using WordNet (WN) tend to reach their peak in the range 0.4 ≤ α ≤ 0.6. This variation indicates the kind of impact that the selection of semantic classes can have on the overall performance of the system. This is not hard evidence, but it does suggest that finding the right set of semantic classes could be one big step towards improving system accuracy.

Figure 5.2: Accuracy variation for IRM with different values of α and β

Two other factors that have a big impact on the performance of our systems are the values of the system parameters α and β, which decide the plausibility and directionality of a quasi-paraphrase rule, respectively. To study their effect more carefully, we examined the two parameters independently. Figure 5.3 shows the variation in accuracy on the task of predicting correct versus incorrect quasi-paraphrase rules for the different systems as the value of α varies. To obtain this graph, we classified the quasi-paraphrase rules in the test set only as correct or incorrect, without further classification based on directionality.

Figure 5.3: Accuracy variation in predicting correct versus incorrect quasi-paraphrase rules for different values of α

All four of our systems obtained accuracy scores in the range of 68–70%, showing good performance on the task of determining plausibility. This, however, is only a small improvement over the baseline score of 66% obtained by assuming every inference to be plausible (as will be shown below, our system has most impact not on determining plausibility but on determining directionality). Manual inspection of some system errors showed that the most common errors were due to the well-known "problem of antonymy" when applying the distributional hypothesis. In DIRT, one can learn rules like "X loves Y" ⇔ "X hates Y". Since the plausibility of quasi-paraphrase rules is determined by applying the distributional hypothesis, and antonym paths tend to take the same set of classes for X and Y, our models find it difficult to filter out the incorrect quasi-paraphrase rules that DIRT ends up learning for this very same reason. To improve our system, one avenue of research is to focus specifically on filtering incorrect quasi-paraphrase rules involving antonyms (perhaps using methods similar to Lin et al. (2003)).

Figure 5.4: Accuracy variation in predicting directionality of correct quasi-paraphrase rules for different values of β

Figure 5.4 shows the variation in accuracy on the task of predicting the directionality of the correct quasi-paraphrase rules for the different systems as the value of β varies. To obtain this graph, we separated the correct quasi-paraphrase rules from the incorrect ones and ran all the systems on only the correct ones, predicting only the directionality of each rule for different values of β. Too low a value of β means that the algorithms tend to predict most things as unidirectional, and too high a value means that the algorithms tend to predict everything as bidirectional. It is clear from the figure that all the systems reach their peak performance in the range 2 ≤ β ≤ 4, which agrees with our intuition that the best system accuracy should be obtained in a medium range. It is also seen that the best accuracy for each of the models goes up compared to the corresponding values obtained in the general framework. The best performing system, IRM using CBC RSPs, reaches a peak accuracy of 63.64%, a much higher score than its accuracy of 48% under the general framework and also a significant improvement over the baseline score of 48.48% for this task. A paired t-test shows that the difference is statistically significant at p < 0.05. The baseline score for this task is obtained by assigning the most frequently occurring direction to all the correct quasi-paraphrase rules. This paints a very encouraging picture of the algorithm's ability to identify directionality much more accurately if it can be provided with a cleaner set of quasi-paraphrase rules.

5.5

Conclusion

Semantic inferences are fundamental to understanding natural language and are an integral part of many natural language applications such as question answering, summarization, and textual entailment. Given the availability of large amounts of text and the increase in computational power, learning them automatically from large text corpora has become increasingly feasible and popular. We introduced the Directionality Hypothesis, which states that the contexts of relations can be used to determine the directionality of the quasi-paraphrase rules containing them. Our experiments provide empirical evidence that the Directionality Hypothesis with RSPs can indeed be used to filter incorrect quasi-paraphrase rules and to find the directionality of correct ones. We believe that this result is one step in the direction of solving the basic problem of semantic inference. Ultimately, our goal is to improve the performance of NLP applications with better inferencing capabilities. Several recent data points, such as (Harabagiu & Hickl, 2006) and others discussed in Chapter 4, give promise that quasi-paraphrase rules refined for directionality may indeed improve question answering, textual entailment, and multi-document summarization accuracies. It is our hope that methods such as the one proposed in this chapter may one day be used to harness the richness of automatically created quasi-paraphrase rule resources within large-scale NLP applications.


Chapter 6

Learning Semantic Classes

6.1

Introduction

With the recent shift in Natural Language Processing (NLP) towards data driven techniques, the demand for annotated data to help different areas in NLP has increased greatly. One such area, lexical semantics, depends heavily on manually compiled semantic resources like WordNet (Fellbaum, 1998) and VerbNet (Kipper, 2005). People turn to these resources when they need information like the semantic classes of words, relations between words etc. Creating a semantic resource however is both time consuming and expensive. Thus, inadequate recall is often a problem with the existing semantic resources, especially when dealing with specialized domains or when they are used for specific tasks. For example, when we used the classes from WordNet to learn the Relational Selectional Preferences (RSPs) for relations in Chapters 4 and


5, we found that there were many instantiations of our relations for which we could not find classes in WordNet. Such recall problems often result in a drop in system performance. To get around this problem, a common fall-back is some automatic method like unsupervised word clustering. Unsupervised clustering algorithms group words based on some similarity measure and attempt to automatically induce semantic classes. For example, given the words impress, mesmerize, cover, and mask, the algorithms attempt to discover the grouping shown in Figure 6.1. Lin and Pantel (2002) used an unsupervised clustering algorithm, Clustering By Committee (CBC), to cluster nouns; Schulte im Walde and Brew (2002) used the KMeans algorithm (McQueen, 1967) to cluster German verbs. While these methods produce good clusters (corresponding to semantic classes), they often produce classes that are different from the kind of semantic classes that a user wants. For example, in Chapters 4 and 5, our results indicated that the nature of the semantic classes produced by CBC was quite different from those obtained from WordNet. There may be times, however, when a user wants the learned semantic classes to resemble some pre-existing classes. For example, in verb clustering, we might want the verb clusters to be along the lines of the classes in VerbNet. Unsupervised algorithms do not have ways of incorporating such preferences or human knowledge.


Figure 6.1: Example classes ({impress, mesmerize} and {cover, mask})

A potential middle ground lies in semi-supervised clustering. On the one hand, semi-supervised clustering algorithms give us the ability to automatically cluster new elements, thus overcoming the recall problem; on the other, they allow us to incorporate external knowledge into the learning process so that we can tailor the clusters. Given these advantages, there has of late been a surge in the machine learning community in developing semi-supervised clustering algorithms. Basu et al. (2002) presented a semi-supervised clustering algorithm that accepts supervision in the form of labeled points; Basu et al. (2004), Wagstaff et al. (2001), and Xing et al. (2003) presented algorithms that accept supervision in the form of constraints. In this chapter, we use semi-supervised clustering as the framework for learning semantic classes. To do this, we incorporate external knowledge by formulating it as constraints and then employ the recently developed HMRF-KMeans algorithm (Basu et al., 2004), which provides a way to add these constraints to the KMeans algorithm. We show that this framework allows us to incorporate the knowledge available in semantic resources into the clustering process very easily, thus reproducing the human-generated classes more faithfully. We provide empirical evidence to validate the following main claim:

Claim: Semi-supervised clustering can be used to learn good semantic classes by using readily available semantic knowledge.

6.2

Incorporating Constraints

As mentioned before, in this work we use the HMRF-KMeans algorithm proposed by Basu et al. (2004) to incorporate minimal supervision into word clustering in the form of constraints. HMRF-KMeans allows the incorporation of two types of constraints: must-link and cannot-link. The must-link constraint specifies that a pair of elements must be in the same class, and the cannot-link constraint specifies that the pair may not be in the same class. For example, from Figure 6.1, we can come up with the must-link and cannot-link constraints in Figure 6.2.

Must-link: impress ⇔ mesmerize; cover ⇔ mask
Cannot-link: impress × cover; impress × mask; mesmerize × cover; mesmerize × mask

Figure 6.2: Example constraints


To incorporate these constraints into clustering, Basu et al. (2004) used Hidden Markov Random Fields (HMRFs). An HMRF model consists of the following components (Figure 6.3):

1. Y = {y1, y2, ..., yN} is a set of hidden random variables, which in the clustering framework corresponds to the set of cluster labels for the N data points to be clustered. Each cluster label yi ∈ Y takes values from {1...K}, where K is the total number of clusters.

2. X = {x1, x2, ..., xN} is a set of observed random variables, which in the clustering framework corresponds to the set of the N data points to be clustered. Each random variable xi ∈ X is assumed to be generated from a conditional probability distribution P(xi|yi) determined by the corresponding hidden variable yi ∈ Y. The random variables X are conditionally independent given the hidden variables Y, i.e.,

P(X|Y) = ∏_{xi ∈ X} P(xi|yi)    (6.1)

Using this model, the task of finding the best possible clustering boils down to finding the maximum a posteriori (MAP) configuration of the HMRF, i.e., to maximizing the posterior probability P(Y|X).

Figure 6.3: HMRF model (observed data X: words x1, x2, x3; hidden field Y: cluster labels y1, y2, y3, with must-link and cannot-link constraints between the hidden labels)

The overall posterior probability of a label sequence Y can be calculated using Bayes' rule as:

P(Y|X) = P(Y) P(X|Y) / P(X)    (6.2)

Assuming P(X) to be constant, we get:

P(Y|X) ≃ P(Y) P(X|Y)    (6.3)

Thus from Equation 6.3, maximizing P (Y |X) is equivalent to maximizing the product of P (Y ) and P (X|Y ). Basu et al. (2004) show that in the current semi-supervised framework, where the must-link and cannot-link constraints are known:


• The prior P(Y) in Equation 6.3 can be calculated from the given must-link and cannot-link constraints, and the distance between the points to be clustered.

• The likelihood P(X|Y) in Equation 6.3 can be calculated by using the cluster centroids, and the distance between the points to be clustered.

In the following section, we present the method for measuring the distance between points and the algorithm for clustering.

6.3

Word Similarity and Algorithm

To use the model in Section 6.2 for clustering, we need a way to measure similarity (distance) between words and an algorithm to find the best posterior configuration of cluster labels. This section briefly describes the similarity (distance) measure we use and the algorithm for finding the MAP configuration of cluster labels.

6.3.1 Word Similarity Following Lin (1998), we represent each word x by a feature vector. Each feature in the feature vector corresponds to a context in which the word occurs. For example, consider the word “impress” in the phrase “impressed by her beauty”. Here, “by her beauty” would be a feature of “impress”. Each feature f has with it an associated score, which measures the strength of the association of the feature f with the word x. We


use a commonly used measure pointwise mutual information (PMI) (Cover & Thomas, 1991) to measure this strength of association:

pmi(x; f) = log [ P(x, f) / ( P(x) P(f) ) ]    (6.4)

The probabilities in Equation 6.4 are calculated by using the maximum likelihood estimate over our corpus. Let |x, f| be the number of times feature f occurs with x; |x, ∗| = Σ_f |x, f| be the total frequency of all the features of x; |∗, f| = Σ_x |x, f| be the total number of times f occurs with any word in the entire corpus; and |∗, ∗| = Σ_x Σ_f |x, f| be the total number of all features of all the words that occur in the corpus. Then:

P(x, f) = |x, f| / |∗, ∗|    P(x) = |x, ∗| / |∗, ∗|    P(f) = |∗, f| / |∗, ∗|    (6.5)

To compute the similarity between two words xi and xj in our corpus, we use cosine similarity, which is a commonly used distortion measure in Natural Language Processing. Following Basu et al. (2004), we used the parameterized form of cosine similarity:

D_cosA(xi, xj) = 1 − (xi^T · A · xj) / ( ‖xi‖_A ‖xj‖_A )    (6.6)

Here A is a diagonal matrix and ‖x‖_A = √(x^T · A · x) is the weighted L2 norm.
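For illustration, a minimal sketch of the parameterized cosine distance of Equation 6.6, assuming dense NumPy vectors and the diagonal of A given as a vector; the names are illustrative.

import numpy as np

def weighted_cosine_distance(xi, xj, a):
    """Parameterized cosine distance of Equation 6.6. xi, xj are feature
    vectors (e.g., PMI weights) and a is the diagonal of the weight matrix A."""
    num = float(np.sum(xi * a * xj))            # xi^T A xj for a diagonal A
    norm_i = np.sqrt(np.sum(xi * a * xi))       # ||xi||_A
    norm_j = np.sqrt(np.sum(xj * a * xj))
    return 1.0 - num / (norm_i * norm_j)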


6.3.2 Algorithm

1. Initialize K cluster centroids U = {u1, u2, ..., uK}.
2. Repeat until convergence:
   a. E-step: Given centroids U = {u1, u2, ..., uK}, re-assign cluster labels Y = {y1, y2, ..., yN} to points X = {x1, x2, ..., xN} to minimize the objective function.
   b. M-step (A): Given cluster labels Y = {y1, y2, ..., yN}, re-calculate cluster centroids U = {u1, u2, ..., uK} to optimize P(Y|X).
   c. M-step (B): Re-estimate the distance measure D to optimize P(Y|X).

Figure 6.4: EM algorithm

To optimize the posterior probability P(Y|X), the EM algorithm is employed. This algorithm, called HMRF-KMeans, is very similar to the KMeans algorithm. In the E-step, the HMRF-KMeans algorithm assigns cluster labels to the data points so as to optimize P(Y|X), and in the M-step it re-estimates the cluster centroids and the distance measure, again to optimize P(Y|X). The outline of the algorithm is presented in Figure 6.4. HMRF-KMeans also has some nice heuristics for coming up with a good estimate of the initial centroids based on the must-link and cannot-link constraints.
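The following is a simplified, illustrative sketch of the loop in Figure 6.4. It keeps only the E-step assignment with fixed constraint penalties and the centroid update (M-step A); the distance-measure re-estimation (M-step B), the distance-scaled penalties of the full objective, and the constraint-based centroid initialization of Basu et al. (2004) are all omitted, and the names are illustrative.

import numpy as np

def hmrf_kmeans(X, k, must, cannot, w=1.0, w_bar=1.0, iters=20):
    """Simplified sketch of HMRF-KMeans (Figure 6.4).
    X      : (N, d) array of feature vectors
    must   : list of index pairs that should share a cluster
    cannot : list of index pairs that should not share a cluster"""
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)

    def cos_dist(u, v):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

    for _ in range(iters):
        # E-step: assign each point the label minimizing distance + constraint penalties.
        for i in range(len(X)):
            costs = np.array([cos_dist(X[i], c) for c in centroids])
            for a, b in must:
                if i in (a, b):
                    other = b if i == a else a
                    costs[np.arange(k) != labels[other]] += w      # penalty for splitting a must-link pair
            for a, b in cannot:
                if i in (a, b):
                    other = b if i == a else a
                    costs[labels[other]] += w_bar                   # penalty for joining a cannot-link pair
            labels[i] = int(np.argmin(costs))
        # M-step (A): recompute centroids from the current assignment.
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return labels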


6.4

Experiments

In this section, we describe our evaluation criterion, the experimental methodology and the data that we used.

6.4.1 Evaluation Criterion

To measure the quality of clusters, we compare the clustering output against a gold standard answer class key. We use normalized mutual information (NMI), a commonly used extrinsic evaluation measure (Dom, 2001; Basu et al., 2004), to judge clustering quality. It is a good measure for clustering with a fixed number of clusters, and it measures how effectively the clustering regenerates the original classes (Dom, 2001). Let C be the random variable denoting the cluster assignments and A be the random variable denoting the true class labels. Let H(C) be the Shannon entropy (Shannon, 1948) of variable C, H(A) be the Shannon entropy of variable A, and H(C|A) be the conditional entropy of C given A. Then the mutual information I(C; A) between the random variables C and A is given by:

I(C; A) = H(C) − H(C|A)    (6.7)

The NMI is defined as:

NMI = 2 · I(C; A) / ( H(C) + H(A) )    (6.8)
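A small sketch of how the NMI of Equations 6.7 and 6.8 can be computed directly from cluster and class label sequences; the function name is illustrative.

import math
from collections import Counter

def nmi(clusters, classes):
    """Normalized mutual information of Equation 6.8 between a clustering
    and the gold class labels (two equal-length label sequences)."""
    n = len(clusters)
    pc = Counter(clusters)                     # cluster sizes
    pa = Counter(classes)                      # gold class sizes
    joint = Counter(zip(clusters, classes))

    entropy = lambda cnt: -sum((v / n) * math.log(v / n) for v in cnt.values())
    mi = sum((v / n) * math.log((v / n) / ((pc[c] / n) * (pa[a] / n)))
             for (c, a), v in joint.items())   # I(C; A) from the joint counts
    return 2.0 * mi / (entropy(pc) + entropy(pa))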


6.4.2 Data and Methodology

To verify the quality of the semantic classes learned, we used the verb classes from the publicly available resource VerbNet as our training and test set. VerbNet 2.1 has a total of 237 verb classes and a little over 3600 unique verbs. Following Lin and Pantel (2002), we represent each verb by its syntactic dependency-based features. We parsed the 3GB AQUAINT corpus (over 350 million tokens) using the dependency parser Minipar (Lin, 1994) and extracted the dependency features for each verb occurring in the corpus. We then obtained the frequency counts for the features and calculated the PMI score for each feature as described in Section 6.3. For our experiments, we then retained only those verbs from the corpus that are present in VerbNet, whose total mutual information equals or exceeds a threshold α, and whose VerbNet class size in our set equals or exceeds a threshold β. We constructed two such data sets:

1. S2500, consisting of 406 verbs belonging to 11 VerbNet classes (α = 2500, β = 20)
2. S250, consisting of 1691 verbs belonging to 39 VerbNet classes (α = 250, β = 20)

Table 6.1 shows the classes in the S2500 set.


Class          Frequency   Elements
admire         30          admire, appreciate, bear, cherish, favor, miss, prefer, ...
amalgamate     20          couple, incorporate, integrate, match, consolidate, pair, team, ...
amuse          90          affect, afflict, aggravate, alarm, alienate, amaze, amuse, ...
appear         24          appear, arise, awaken, break, burst, come, derive, ...
build          23          arrange, assemble, bake, blow, cast, compile, cook, ...
characterize   35          depict, detail, envision, interpret, picture, paint, specify, ...
fill           27          adorn, bind, bombard, choke, clog, coat, contaminate, ...
force          22          coax, compel, dare, draw, force, incite, induce, ...
judgement      35          acclaim, applaud, assail, assault, attack, blame, bless, ...
other cos      77          accelerate, activate, age, air, alter, animate, balance, ...
run            23          bowl, hike, hobble, hop, inch, leap, march, ...

Table 6.1: Classes from the S2500 set

For the purpose of this work, we have to assume that each verb belongs to only one semantic class. This is because both our algorithm and the baseline are hard clustering algorithms, which assign each element to only one cluster. However, the verbs in VerbNet can belong to more than one class. So whenever we came across such a verb in either of our data sets (S2500 or S250), we randomly assigned it to one of its classes. Note that this simplifying assumption has the effect of under-reporting our system performance, since sometimes a verb might get marked as wrong even though it is assigned to a correct class. But given that the baseline has the same disadvantage, we consider this a fair comparison. We do, however, keep in mind that the results we report here serve as a lower bound on the expected system performance.

Having built these sets, we randomly divided each of them into two equal-sized subsets (for the S250 set, one of the subsets had one element more than the other) for training and testing, and performed two-fold cross-validation.

The training subset was used to generate constraints by randomly picking pairs of elements and generating a must-link or cannot-link constraint between them based on whether they belonged to the same or different classes. Unit weights w = 1 and w̄ = 1 were assigned to all the must-link and cannot-link constraints. Each data set (containing both training and test subsets) was then clustered using HMRF-KMeans (we used the implementation from the WekaUT toolkit: http://www.cs.utexas.edu/users/ml/risc/code): S2500 with K = 11 and S250 with K = 39. The clustering performance, however, was measured only on the corresponding test subsets. The results were averaged over 10 runs of two-fold cross-validation.
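A minimal sketch of the constraint sampling just described, assuming each training verb has already been assigned a single class; the names are illustrative.

import random

def sample_constraints(class_of, n_pairs, seed=0):
    """Sample must-link / cannot-link constraints from a labeled training subset.
    class_of maps each training verb to its (single) VerbNet class."""
    rng = random.Random(seed)
    verbs = list(class_of)
    must, cannot = [], []
    for _ in range(n_pairs):
        a, b = rng.sample(verbs, 2)
        (must if class_of[a] == class_of[b] else cannot).append((a, b))
    return must, cannot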

6.5

Results and Discussion

This section presents the experimental results and the discussion of their possible implications.

6.5.1 Results

Figures 6.5 and 6.6 show the effect of varying the number of constraints on NMI in the HMRF-KMeans algorithm. The case with zero constraints corresponds to standard KMeans.

Figure 6.5: Learning curve for the S2500 set (NMI vs. # of constraints)

Figure 6.6: Learning curve for the S250 set (NMI vs. # of constraints)

From the figures, it is clear that adding constraints to HMRF-KMeans results in an improvement in NMI. For S2500, even with just 200 constraints (much less than 0.5% of the possible constraints), the algorithm performs better than the baseline. A similar result is found for the S250 set with only 1000 constraints. This is a good sign, indicating that even a small number of constraints can help guide the algorithm in the right direction. As we add more constraints, we expect the algorithm

to perform better. This general trend becomes clear as the number of constraints increases. The question then is how long this trend will continue. We found that for the S2500 set the average performance reached a plateau at 0.40 NMI with around 10,000 constraints. After that, even when we increased the number of constraints, we found only minimal improvement in NMI scores. On manual inspection, the clusters with 10,000 constraints looked fairly consistent, with the majority of the elements of each class always in the same cluster. For the S250 set, we found a similar plateau at around 50,000 constraints with an NMI score of around 0.43, and very slow improvement after that. Again, in this case the clusters looked quite consistent on manual inspection. Table 6.2 shows randomly selected elements from random clusters from one of the runs on the sets S2500 and S250. The VerbNet classes shown are those with which the corresponding clusters had the largest intersection.

Set     # Constraints   Cluster Elements                                                                          VerbNet Class
S2500   10,000          pound, stir, touch, adorn, lash, bind, cook, taint, bombard, soak, ...                    fill
S2500   10,000          envisage, diagnose, depict, report, laud, describe, suffer, select, remember, judge, ...  characterize
S250    50,000          esteem, treasure, hate, dislike, venerate, chaperone, doubt, propose, trust, like, ...    admire
S250    50,000          deflate, galvanize, mute, stagger, bereave, frighten, slim, exhaust, peeve, trick, ...    amuse

Table 6.2: Example clusters with the corresponding largest intersecting VerbNet classes


6.5.2 Discussion

In Section 6.5.1, we showed that adding constraints to the clustering algorithm improves its performance on both data sets, S2500 and S250. This is reflected in the improvement in the NMI scores. But what does this mean for the verb semantic classes? What do they look like? To answer these questions, we inspected the clusters and compared them to the semantic classes in VerbNet. For our analysis, we considered the HMRF-KMeans runs with 10,000 constraints for the S2500 set and with 50,000 constraints for the S250 set. The baseline is standard unsupervised KMeans. For the clustering, and hence the learned semantic classes, to be good, two properties are desirable:

1. Most of the elements in one semantic class should be present in only one cluster.
2. Most of the elements in one cluster should belong to only one semantic class.

These correspond to the traditional recall and precision criteria for the clusters. To get a sense of recall, we first consider each VerbNet (gold standard) class for the S2500 set and find the corresponding cluster with the highest number of intersecting elements. We find that, on average, around 53% of the elements of every class are present in its corresponding cluster. For the S250 set, this score is 38%. While these numbers aren't spectacular, they are a huge improvement over the baseline scores of 25% for the S2500 set and 17% for the S250 set.


We then considered the algorithm's clusters for precision. For the S2500 set, we find that, on average, around 54% of the elements in every cluster belong to only one VerbNet semantic class. For the S250 set, the corresponding score is 37%. This is again an improvement over the corresponding baseline scores of 28% for the S2500 set and 21% for the S250 set.

6.6

Conclusion

We have shown that the HMRF-KMeans algorithm performs well on the task of learning semantic classes. When using an unsupervised clustering algorithm for this task, we have little or no control over the kind of classes we learn. The presented algorithm provides a way of controlling this with some supervision. This kind of control is often desirable and goes a long way in tailoring the learned semantic classes to meet our requirements.


Chapter 7

Paraphrases for Learning Surface Patterns

7.1

Introduction

In Chapters 4 and 5, we presented methods for learning the contexts in which quasi-paraphrase rules are mutually replaceable, and for learning the directionality of these rules (using the phrase contexts) to filter "strict paraphrases". In that work, we presented our results on syntactic quasi-paraphrase rules, i.e., rules that contain phrases in the form of paths in a syntax tree. These can be obtained by parsing a corpus using a syntactic parser and using distributional similarity to find similar paths in the parse trees. An example paraphrase pair, as in example (1), is learned and represented (in the syntactic form) in DIRT (Lin & Pantel, 2001) as in example (2).

"X acquired Y" ⇔ "X completed the acquisition of Y"    (1)

"N:subj:V⟨acquire⟩V:obj:N" ⇔ "N:subj:V⟨complete⟩V:obj:N⟨acquisition⟩N:of:N"    (2)


But what if we did not want to parse the text, perhaps because it is too large or noisy? In such a case, can we learn just surface-level paraphrases, i.e., paraphrases that contain phrases only in the form of surface n-grams, as in example (1)? Here we present a method to acquire surface paraphrases from a single monolingual corpus. We use a large corpus (about 150GB, i.e., 25 billion words) to overcome the data sparseness problem. To overcome the scalability problem, we pre-process the text with a simple part-of-speech (POS) tagger and then apply locality sensitive hashing (LSH) to speed up the remaining computation for paraphrase acquisition. Our experiments show results to verify the following main claim:

Claim 1: Highly precise surface paraphrases can be obtained from a very large monolingual corpus.

With this result, we further show that these paraphrases can be used to obtain high-precision surface patterns that enable the discovery of relations in a minimally supervised way. Surface patterns are templates for extracting information from text. For example, if one wanted to extract a list of company acquisitions, "⟨ACQUIRER⟩ acquired ⟨ACQUIREE⟩" would be one surface pattern, with "⟨ACQUIRER⟩" and "⟨ACQUIREE⟩" as the slots to be extracted. Thus we can claim:

Claim 2: These paraphrases can be used for generating high-precision surface patterns for relation extraction.


7.2

Acquiring Paraphrases

This section describes our model for acquiring paraphrases from text.

7.2.1 Distributional Similarity Harris’s distributional hypothesis (Harris, 1954) has played a central role in many approaches to lexical semantics such as unsupervised word clustering. It states that words that appear in similar contexts tend to have similar meanings. In this chapter we apply the distributional hypothesis to phrases, i.e., word n-grams. For example, consider the phrase “acquired” of the form “X acquired Y ”. Considering the context of this phrase, we might find {Google, eBay, Yahoo,...} in position X and {YouTube, Skype, Overture,...} in position Y . Now consider another phrase “completed the acquisition of ”, again of the form “X completed the acquisition of Y ”. For this phrase, we might find {Google, eBay, Hilton Hotel corp.,...} in position X and {YouTube, Skype, Bally Entertainment Corp.,...} in position Y . Since the contexts of the two phrases are similar, our extension of the distributional hypothesis would assume that “acquired” and “completed the acquisition of ” have similar meanings.


7.2.2 Paraphrase Generation Model Let pi be a phrase in text of the form X pi Y , where X and Y are the placeholders for entities occurring on either side of pi . Our first task is to find the set of phrases that are similar in meaning to pi . Let P = {p1 , p2 , p3 , ..., pl } be the set of all phrases of the form X pi Y where pi ∈ P . Let Si,X be the set of entities that occur in position X of pi and Si,Y be the set of entities that occur in position Y of pi . Let Vi be the vector representing pi such that Vi = Si,X ∪ Si,Y . Each entity f ∈ Vi has an associated score that measures the strength of the association of the entity f with phrase pi ; as do many others, we employ pointwise mutual information (Cover & Thomas, 1991) to measure this strength of association.

pmi(pi; f) = log [ P(pi, f) / ( P(pi) P(f) ) ]    (7.1)

The probabilities in Equation 7.1 are calculated using the maximum likelihood estimate over our corpus. Once we have the vectors for each phrase pi ∈ P, we can find the paraphrases for each pi by finding its nearest neighbors. We use cosine similarity, which is a commonly used measure for finding the similarity between two vectors.


If we have two phrases pi ∈ P and pj ∈ P with the corresponding vectors Vi and Vj constructed as described above, the similarity between the two phrases is calculated as:

sim(pi; pj) = (Vi · Vj) / ( |Vi| ∗ |Vj| )    (7.2)

Each entity in Vi (and Vj) has with it an associated flag which indicates whether the entity came from Si,X or Si,Y. This ensures that the X and Y entities of pi are considered separate and do not get merged in Vi. Also, for each phrase pi of the form X pi Y, we have a corresponding phrase −pi that has the form Y pi X. For example, consider the sentences:

Google acquired YouTube.    (3)

YouTube was bought by Google.    (4)

From sentence (3), we obtain two phrases:

1. pi = acquired, which has the form "X acquired Y" where "X = Google" and "Y = YouTube"
2. −pi = −acquired, which has the form "Y acquired X" where "X = YouTube" and "Y = Google"

Similarly, from sentence (4) we obtain two phrases:

1. pj = was bought by, which has the form "X was bought by Y" where "X = YouTube" and "Y = Google"
2. −pj = −was bought by, which has the form "Y was bought by X" where "X = Google" and "Y = YouTube"

The switching of the X and Y positions in (3) and (4) ensures that "acquired" and "−was bought by" are found to be paraphrases by the algorithm.
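As an illustration, the sketch below builds PMI-weighted phrase vectors (Equation 7.1) from (phrase, X, Y) instances and compares two phrases with cosine similarity (Equation 7.2). It keeps the X/Y positional flag but omits the −pi reversal and all scaling machinery; the names are illustrative.

from collections import defaultdict
import math

def phrase_vectors(instances):
    """Build the vector Vi = Si,X ∪ Si,Y for each phrase, weighted by PMI
    (Equation 7.1). instances is a list of (phrase, x_entity, y_entity)
    tuples; a positional flag keeps X-side and Y-side entities distinct."""
    counts = defaultdict(lambda: defaultdict(float))
    phrase_tot, feat_tot, total = defaultdict(float), defaultdict(float), 0.0
    for phrase, x, y in instances:
        for feat in (("X", x), ("Y", y)):
            counts[phrase][feat] += 1.0
            phrase_tot[phrase] += 1.0
            feat_tot[feat] += 1.0
            total += 1.0
    return {
        phrase: {f: math.log((c / total) /
                             ((phrase_tot[phrase] / total) * (feat_tot[f] / total)))
                 for f, c in feats.items()}
        for phrase, feats in counts.items()
    }

def cosine(vi, vj):
    """Cosine similarity between two sparse vectors (Equation 7.2)."""
    dot = sum(wt * vj.get(f, 0.0) for f, wt in vi.items())
    norm = lambda v: math.sqrt(sum(wt * wt for wt in v.values()))
    return dot / (norm(vi) * norm(vj) + 1e-12)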

7.2.3 Locality Sensitive Hashing

As described in Section 7.2.2, we find paraphrases of a phrase pi by finding its nearest neighbors based on the cosine similarity between the feature vector of pi and those of other phrases. To do this for all the phrases in the corpus, we would have to compute the similarity between all vector pairs. If n is the number of vectors and d is the dimensionality of the vector space, finding the cosine similarity between each pair of vectors has time complexity O(n²d). This computation is infeasible for our corpus, since both n and d are large. To solve this problem, we make use of Locality Sensitive Hashing (LSH). The basic idea behind LSH is that an LSH function creates a fingerprint for each vector such that if two vectors are similar, they are likely to have similar fingerprints. The LSH function we use here was proposed by Charikar (2002). It has the property of preserving the cosine similarity between vectors, which is exactly what we want. The LSH represents


a d-dimensional vector U by a stream of b bits. The bit stream is obtained by using b hash functions, each defined as follows:

hR(U) = 1 if R · U > 0, and 0 if R · U < 0    (7.3)

where R is a d-dimensional random vector obtained from a d-dimensional Gaussian distribution. Then, for two vectors U and V, Charikar (2002) showed that:

cos(θ(U, V)) = cos( (1 − Pr[hR(U) = hR(V)]) ∗ π )    (7.4)

Ravichandran et al. (2005) have shown that by using LSH the nearest neighbors calculation can be done very rapidly (the details of the algorithm are omitted here, but interested readers are encouraged to read Charikar (2002) and Ravichandran et al. (2005)).
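A minimal sketch of the random-hyperplane LSH of Equations 7.3 and 7.4, assuming dense NumPy vectors; the hyperplane matrix, bit width, and names are illustrative, and the fast Hamming-distance search of Ravichandran et al. (2005) is not shown.

import numpy as np

def lsh_signature(v, hyperplanes):
    """b-bit signature of Equation 7.3: one bit per random hyperplane R."""
    return hyperplanes @ v > 0

def approx_cosine(sig_u, sig_v):
    """Estimate cos(U, V) from two signatures using Equation 7.4."""
    agreement = np.mean(sig_u == sig_v)       # estimate of Pr[h_R(U) = h_R(V)]
    return np.cos((1.0 - agreement) * np.pi)

# Usage sketch with b = 3000 bits (the setting used in Section 7.4.1):
# rng = np.random.default_rng(0)
# planes = rng.standard_normal((3000, d))     # d is the feature-space dimensionality
# sim = approx_cosine(lsh_signature(u, planes), lsh_signature(v, planes))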

7.3

Learning Surface Patterns

In this section we describe the model for learning surface patterns for a given target relation.


7.3.1 Surface Patterns Model

Let r be a target relation. Our task is to find a set of surface patterns S = {s1, s2, ..., sn} that express the target relation. For example, consider the relation r = "acquisition". We want to find the set of patterns S that express this relation: S = {⟨ACQUIRER⟩ acquired ⟨ACQUIREE⟩, ⟨ACQUIRER⟩ bought ⟨ACQUIREE⟩, ⟨ACQUIREE⟩ was bought by ⟨ACQUIRER⟩, ...}. Let SEED = {seed1, seed2, ..., seedn} be the set of seed patterns that express the target relation. For each seedi ∈ SEED, we obtain the corresponding set of new patterns PATi in two steps:

1. We find the surface phrase, pi, used in the seed and find its corresponding set of paraphrases, Pi = {pi,1, pi,2, ..., pi,m}. Each paraphrase pi,j ∈ Pi has with it an associated score, which is the similarity between pi and pi,j.
2. In the seed pattern seedi, we replace the surface phrase pi with each of its paraphrases and obtain the set of new patterns PATi = {pati,1, pati,2, ..., pati,m}. Each pattern has with it an associated score, which is the same as the score of the paraphrase from which it was obtained. The patterns are ranked in decreasing order of their scores.


After we obtain PATi for each seedi ∈ SEED, we obtain the complete set of patterns, PAT, for the target relation r as the union of all the individual pattern sets, i.e., PAT = PAT1 ∪ PAT2 ∪ ... ∪ PATn.
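A minimal sketch of the pattern generation step, assuming the paraphrase collection has already been built; reversed (−) paraphrases, which would require swapping the two slots, are left out, and the names are illustrative.

def patterns_from_seed(seed_pattern, seed_phrase, paraphrases):
    """Instantiate new surface patterns from one seed (Section 7.3.1).
    seed_pattern : e.g. "<ACQUIRER> acquired <ACQUIREE>"
    seed_phrase  : the surface phrase inside it, e.g. "acquired"
    paraphrases  : dict mapping each paraphrase of seed_phrase to its similarity score"""
    scored = [(seed_pattern.replace(seed_phrase, para), score)
              for para, score in paraphrases.items()]
    return sorted(scored, key=lambda item: -item[1])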

7.4

Experimental Methodology

In this section, we describe experiments to validate the main claims of the chapter. We first describe paraphrase acquisition, then summarize our method for learning surface patterns, and finally describe the use of the patterns for extracting relation instances.

7.4.1 Paraphrases

Finding surface variations in text requires a large corpus. The corpus needs to be at least one order of magnitude larger than that required for learning syntactic variations, since surface phrases are sparser than syntactic phrases. For our experiments, we used a corpus of about 150GB (25 billion words) obtained from Google News. It consists of a few years' worth of news data. We POS-tagged the corpus using the TNT tagger (Brants, 2000) and collected all phrases (n-grams) in the corpus that contained at least one verb and had a noun or a noun-noun compound on either side. We restricted the phrase length to at most five words.


We build a vector for each phrase as described in Section 7.2. To mitigate the problems of sparseness and co-reference to a certain extent, whenever we have a noun-noun compound in the X or Y position, we treat it as a bag of words. For example, in the sentence "Google Inc. acquired YouTube", "Google" and "Inc." will be treated as separate features in the vector. Once we have constructed all the vectors, we find the paraphrases for every phrase by finding its nearest neighbors, as described in Section 7.2. For our experiments, we set the number of random bits in the LSH function to 3000 and the similarity cut-off between vectors to 0.15. We eventually end up with a resource containing over 2.5 million phrases such that each phrase is connected to its paraphrases.

7.4.2 Surface Patterns

One claim of this chapter is that we can find good surface patterns for a target relation by starting with a seed pattern. To verify this, we study two target relations that have been used as examples by other researchers (Bunescu & Mooney, 2007; Banko & Etzioni, 2008):

1. Acquisition: We define this as the relation between two companies such that one company acquired the other.
2. Birthplace: We define this as the relation between a person and his/her birthplace.


For the "acquisition" relation, we start with the surface patterns containing only the words buy and acquire:

1. "⟨ACQUIRER⟩ bought ⟨ACQUIREE⟩" (and its variants, i.e., buy, buys and buying)
2. "⟨ACQUIRER⟩ acquired ⟨ACQUIREE⟩" (and its variants, i.e., acquire, acquires and acquiring)

This results in a total of eight seed patterns. For the "birthplace" relation, we start with two seed patterns:

1. "⟨PERSON⟩ was born in ⟨LOCATION⟩"
2. "⟨PERSON⟩ was born at ⟨LOCATION⟩"

We find other surface patterns for each of these relations by replacing the surface words in the seed patterns with their paraphrases, as described in Section 7.3.

7.4.3 Relation Extraction The purpose of learning surface patterns for a relation is to extract instances of that relation. We use the surface patterns obtained for the relations “acquisition” and “birthplace” to extract instances of these relations from the LDC North American News Corpus (Graff, 1995). This helps us to extrinsically evaluate the quality of the surface patterns.


7.5

Experimental Results

In this section, we present the results of the experiments and analyze them.

7.5.1 Baselines It is hard to construct a baseline for comparing the quality of paraphrases, as there isn’t much work in extracting surface level paraphrases using a monolingual corpus (see discussion in Chapter 2). To overcome this, we compare the results informally to the other methods that produce syntactic paraphrases. To compare the quality of the extraction patterns and relation instances, we use the method presented by Ravichandran and Hovy (2002) as the baseline. For each of the given relations, “acquisition” and “birthplace”, we use 10 instances and obtain the top 1000 results from the Google search engine for each query. We use these results to obtain the set of baseline patterns for each relation. We then apply these patterns to the test corpus and extract the corresponding baseline instances.

7.5.2 Evaluation Criteria Here we present the evaluation criteria we used to evaluate the performance on the different tasks.


7.5.2.1

Paraphrases

We estimate the quality of paraphrases by annotating a random sample as correct/incorrect and calculating the accuracy. However, estimating the recall is difficult given that we do not have a complete set of paraphrases for the input phrases. Following Szpektor et al. (2004), instead of measuring recall, we calculate the average number of correct paraphrases per input phrase.

7.5.2.2 Surface Patterns

We can calculate the precision (P) of the learned patterns for each relation by annotating the extracted patterns as correct/incorrect. Calculating the recall, however, is a problem for the same reason as above. We can nevertheless calculate the relative recall (RR) of the system against the baseline and vice versa. The relative recall RR_{S|B} of system S with respect to system B can be calculated as:

RR_{S|B} = |C_S ∩ C_B| / |C_B|

where C_S is the set of correct patterns found by our system and C_B is the set of correct patterns found by the baseline. RR_{B|S} can be computed in a similar way.
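As a quick illustration of the relative-recall computation (with invented pattern sets, not the actual system output):

def relative_recall(correct_s, correct_b):
    """RR_{S|B}: fraction of the baseline's correct patterns that system S also finds."""
    return len(correct_s & correct_b) / len(correct_b)

# Hypothetical correct-pattern sets for two systems.
system   = {"<X> acquired <Y>", "<X> agreed to buy <Y>", "<X> purchased <Y>"}
baseline = {"<X> acquired <Y>", "<X> and <Y>"}
print(relative_recall(system, baseline))   # RR_{S|B} = 0.5
print(relative_recall(baseline, system))   # RR_{B|S} = 1/3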


7.5.2.3 Relation Extraction

We estimate the precision (P) of the extracted instances by annotating a random sample of instances as correct/incorrect. While calculating the true recall here is not possible, even calculating the true relative recall of the system against the baseline is not possible, as we can annotate only a small sample. However, following Pantel et al. (2004), we assume that the recall of the baseline is 1 and estimate the relative recall RR_{S|B} of the system S with respect to the baseline B using their respective precision scores P_S and P_B and the numbers of instances extracted by them, |S| and |B|, as:

RR_{S|B} = (P_S ∗ |S|) / (P_B ∗ |B|)
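To make the estimate concrete, plugging in the Annotator 1 figures for the “acquisition” relation reported later in Table 7.5 (P_S = 88%, |S| = 3,875 for our system; P_B = 6%, |B| = 1,261,986 for the baseline) gives RR_{S|B} = (0.88 ∗ 3875) / (0.06 ∗ 1261986) ≈ 0.045, i.e., the 4.5% relative recall listed in that table.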

7.5.3 Gold Standard

In this section, we describe the creation of the gold standard for the different tasks.

7.5.3.1 Paraphrases

We created the gold standard paraphrase test set by randomly selecting 50 phrases and their corresponding paraphrases from our collection. For each test phrase, we asked two annotators to annotate its paraphrases as correct or incorrect. The annotators were instructed to look for strict paraphrases, i.e., phrases that express the same meaning using different words.


To obtain the inter-annotator agreement, the two annotators annotated the test set separately. The kappa statistic (Siegal & Castellan Jr., 1988) was κ = 0.63. It was interesting that the annotators obtained this respectable kappa score without any prior training, which is hard to achieve in annotation of a similar task like textual entailment. This indicates that the notion of (quasi) paraphrases was well defined.
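For reference, the kappa statistic used here corrects raw agreement for agreement expected by chance. A minimal computation over two annotators' correct/incorrect labels might look like the sketch below; the label lists are invented purely for illustration and do not reproduce the actual annotations.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy annotations of ten candidate paraphrases.
a1 = ["correct", "correct", "incorrect", "correct", "incorrect",
      "correct", "correct", "incorrect", "correct", "correct"]
a2 = ["correct", "incorrect", "incorrect", "correct", "incorrect",
      "correct", "correct", "correct", "correct", "correct"]
print(round(cohens_kappa(a1, a2), 2))   # 0.47 for this toy data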

7.5.3.2 Surface Patterns

For the target relations, we asked two annotators to annotate the patterns for each relation as either “precise” or “vague”. The annotators annotated outputs from the system as well as the baseline. We consider the “precise” patterns as correct and the “vague” patterns as incorrect. The intuition is that applying the vague patterns for extracting target relation instances might find some good instances, but will also find many bad ones. For example, consider the following two patterns for the “acquisition” relation:

<ACQUIRER> acquired <ACQUIREE>   (5)

<ACQUIRER> and <ACQUIREE>   (6)

Example (5) is a precise pattern, as it clearly identifies the acquisition relation, while example (6) is a vague pattern because it is too general and says nothing about the “acquisition” relation. The kappa statistic between the two annotators for this task was κ = 0.72.


7.5.3.3 Relation Extraction

We randomly sampled 50 instances of the “acquisition” and “birthplace” relations from the system as well as the baseline outputs. We asked two annotators to annotate the instances as correct or incorrect. The annotators marked an instance as correct only if both the entities and the relation between them were correct. To make their task easier, the annotators were provided the context for each instance, and were free to use any resources at their disposal (including a web search engine) to verify the correctness of the instances. The annotators found that the annotation for this task was much easier than the previous two; the few disagreements they had were due to ambiguity of some of the instances. The kappa statistic for this task was κ = 0.91.

7.5.4 Result Summary

Table 7.1 shows the results of annotating the paraphrases test set. We do not have a baseline to compare against, but we can analyze the numbers in light of those reported previously for syntactic paraphrases. DIRT (Lin & Pantel, 2001) and TEASE (Szpektor et al., 2004) report accuracies of 50.1% and 44.3% respectively, compared to our average accuracy across the two annotators of 70.79%. The average number of paraphrases per phrase is, however, 10.1 and 5.5 for DIRT and TEASE respectively, compared to our 4.2.


Table 7.2 shows some paraphrases generated by our system for the phrases “are being distributed to” and “approved a revision to the”.

Annotator      Accuracy    Average # correct paraphrases
Annotator 1    67.31%      4.2
Annotator 2    74.27%      4.28

Table 7.1: Quality of paraphrases

are being distributed to        approved a revision to the
have been distributed to        unanimously approved a new
are being handed out to         approved an annual
were distributed to             will consider adopting a
−are handing out                approved a revised
will be distributed to all      approved a new

Table 7.2: Example paraphrases

Table 7.3 shows the results on the quality of surface patterns for the two relations. It can be observed that our method outperforms the baseline by a wide margin in both precision and relative recall. Table 7.4 shows some example patterns learned by our system.

Relation      Method              # Patterns    Annotator 1 (P / RR)    Annotator 2 (P / RR)
Acquisition   Baseline            160           55% / 13.02%            60% / 11.16%
Acquisition   Paraphrase Method   231           83.11% / 28.40%         93.07% / 25%
Birthplace    Baseline            16            31.35% / 15.38%         31.25% / 15.38%
Birthplace    Paraphrase Method   16            81.25% / 40%            81.25% / 40%

Table 7.3: Quality of extraction patterns


acquisition                             birthplace
X agreed to buy Y                       X , who was born in Y
X , which acquired Y                    X , was born in Y
X completed its acquisition of Y        X was raised in Y
X has acquired Y                        X was born in NNNN in Y
X purchased Y                           X , born in Y

Table 7.4: Example extraction patterns

Table 7.5 shows the results on the quality of the extracted instances. Our system obtains very high precision scores but suffers in relative recall, given that the baseline, with its very general patterns, is likely to find a huge number of instances (though only a very small portion of them are correct). Table 7.6 shows some example instances we extracted.

Relation      Method              # Instances    Annotator 1 (P / RR)    Annotator 2 (P / RR)
Acquisition   Baseline            1,261,986      6% / 100%               2% / 100%
Acquisition   Paraphrase Method   3,875          88% / 4.5%              82% / 12.59%
Birthplace    Baseline            979,607        4% / 100%               2% / 100%
Birthplace    Paraphrase Method   1,811          98% / 4.53%             98% / 9.06%

Table 7.5: Quality of instances

acquisition
1. Huntington Bancshares Inc. agreed to acquire Reliance Bank
2. Sony bought Columbia Pictures
3. Hanson Industries buys Kidde Inc.
4. Casino America inc. agreed to buy Grand Palais
5. Tidewater inc. acquired Hornbeck Offshore Services Inc.

birthplace
1. Cyril Andrew Ponnamperuma was born in Galle
2. Cook was born in NNNN in Devonshire
3. Tansey was born in Cincinnati
4. Tsoi was born in NNNN in Uzbekistan
5. Mrs. Totenberg was born in San Francisco

Table 7.6: Example instances

7.5.5 Discussion and Error Analysis

We studied the effect of a decrease in the size of the available raw corpus on the quality of the acquired paraphrases. We used about 10% of our original corpus to learn the surface paraphrases and evaluated them. The precision and the average number of correct paraphrases are calculated on the same test set, as described in Section 7.5.2. The performance drop on using 10% of the original corpus is significant (11.41% precision and, on average, 1 correct paraphrase per phrase), which shows that we indeed need a large amount of data to learn good quality surface paraphrases. One reason for this drop is also that when we use only 10% of the original data, we do not find any paraphrases for some of the phrases from the test set (thus resulting in 0% accuracy for them). This is not unexpected, as the larger resource would have a much larger recall, which again points to the advantage of using a large data set. Another reason for this performance drop could be the parameter settings: we found that the quality of the learned paraphrases depended greatly on the various cut-offs used. While we adjusted our model parameters for working with the smaller data set, it is conceivable that we did not find the ideal setting for them. So we consider these numbers to be a lower bound. But even then, these numbers clearly indicate the advantage of using more data.


Moving to the task of relation extraction, we see from Table 7.5 that our system has a much lower relative recall compared to the baseline. This was expected, as the baseline method learns some very general patterns, which are likely to extract some good instances, even though they result in a huge hit to its precision. However, our system was able to obtain this performance using very few seeds. So an increase in the number of input seeds is likely to increase the relative recall of the resource. The question remains, however, as to what good seeds might be. It is clear that it is much harder to come up with good seed patterns (which our system needs) than seed instances (which the baseline needs). But there are some obvious ways to overcome this problem. One way is to bootstrap: we can look at the paraphrases of the seed patterns and use them to obtain more patterns. Our initial experiments with this method using handpicked seeds showed good promise. However, we need to investigate automating this approach. Another method is to pick good patterns from the baseline system by manual inspection and use them as seeds for our system. We plan to investigate this approach as well. One reason we have seen good preliminary results using these approaches (for improving recall), we believe, is that the precision of the paraphrases is good: either a seed does not produce any new patterns or it produces good patterns, thus keeping the precision of the system high while increasing relative recall.


7.6 Conclusion

Paraphrases are an important technique to handle variations in language. Given their utility in many NLP tasks, it is desirable that we come up with methods that produce good quality paraphrases. We believe that the paraphrase acquisition method presented here is a step towards this goal. We have shown that high precision surface paraphrases can be obtained by using distributional similarity on a large corpus. We made use of some recent advances in theoretical computer science to make this task scalable. We have also shown that these paraphrases can be used to obtain high precision extraction patterns for information extraction. While we believe that more work needs to be done to improve the system recall (some of which we are investigating), this seems to be a good first step towards developing a minimally supervised and scalable relation extraction system.


Chapter 8

Paraphrases for Domain-Specific Information Extraction

8.1 Introduction

While Information Extraction (IE) has evolved into a highly sophisticated application, one of its core methods remains based on patterns that match the sequences of words or syntactic units characteristic of each desired entity to be extracted. Generally, given the wide range of expressive possibility of language, especially English, better results are obtained when the system is focused on a particular application domain. The general approach to learning patterns for domain-specific IE is:

1. Obtain a domain-specific corpus.


2. For each type of entity or event-role to be extracted, build or learn the patterns from this corpus (possibly with the help of an out-of-domain corpus).

This procedure generally has to be repeated every time a new domain is addressed. Aside from the fact that this repetition is tedious, obtaining a domain-specific corpus for a new domain is hard, especially when the definition and scope of the domain are not clear. Thus, adapting an IE system to a new domain requires a lot of effort (at least on the order of days). However, experience with IE in a variety of domains has led to the observation that a broad-coverage corpus, if it is large enough, contains most of the domain-specific patterns. If it were possible to learn these domain-specific patterns from such a broad-coverage corpus, it would save the effort of collecting domain-specific corpora and re-learning IE patterns. IE in different domains would be easy. Adapting an IE system to a new domain would require very little effort (on the order of hours). In this chapter, we describe a very general method of learning IE patterns from a large broad-coverage corpus, using paraphrase learning techniques, at both surface and deeper levels. We then apply the patterns learned by these methods to various domain-specific test corpora and show results to demonstrate the following claim:

Claim 1: Lexico-syntactic paraphrase based patterns, learned using a shallow parser, outperform surface-level paraphrase based patterns for domain-specific IE.


Once the best paraphrase based pattern learning technique is determined, we compare its results to several domain-specific IE engines. We show results to verify the following claim:

Claim 2: Paraphrase based patterns, learned from a large broad-coverage corpus, perform at a level comparable to the patterns learned from a domain-specific corpus, for domain-specific IE.

8.2 Learning Broad-Coverage Paraphrase Patterns

In this section, we present a set of methods to learn domain-specific patterns from a broad-coverage corpus. Our focus here is on event-oriented IE, where the task is to identify facts related to specific events. Formally, given a domain d, a broad-coverage corpus b, and an event-role e in d, the aim is to learn (from b) a set of extraction patterns PAT = {pat_1, pat_2, ..., pat_n} that can extract instances of e.

8.2.1 Learning Surface-Level Paraphrase Patterns

In this section, we present our method to learn surface-level paraphrase patterns. Formally, given an event-role e in a domain d, the aim is to learn a set of surface-level extraction patterns


SURF = {surf_1, surf_2, ..., surf_n} that can extract instances of e. For example, given the event role e = “weapon” in the domain d = “terrorism”, the aim is to learn the following patterns:

SURF = {<SLOT> went off, <SLOT> exploded, <SLOT> that exploded, . . . }

Let p be a pattern of the form “<SLOT> n-gram” or “n-gram <SLOT>”, where “<SLOT>” contains a word or phrase that we expect p to extract, and “n-gram” is any n-gram in a corpus. Call the word or phrase that p extracts its slot-filler. Let P = {p_1, p_2, p_3, ..., p_l} be the set of all patterns of the form “<SLOT> n-gram” or “n-gram <SLOT>”. Define the context C_i of a pattern p_i ∈ P to be its slot-filler plus a two word (token) window on its other side. For example, assume that we have the following sentence in our corpus:

The bomb exploded prematurely.   (1)

Also assume that we have a pattern “<SLOT> exploded” in our set of patterns P. Given sentence (1) and p_i = “<SLOT> exploded”, its context is:

C_i = {<SLOT>:bomb, +1:prematurely, +2:.}

Each item that occurs in the context C_i of p_i is called a feature of p_i. For each feature f ∈ C_i, we calculate its strength of association with the pattern p_i using pointwise mutual information (PMI) (Cover & Thomas, 1991). We then construct the feature vector V_i associated with p_i, such that it contains each feature with its associated PMI


value. For example, given the pattern p_i = “<SLOT> exploded”, its feature vector could be:

V_i = {<SLOT>:bomb 3.53, +1:prematurely 1.74, +2:. 2.76}

Once we have vectors for all the patterns in a corpus, we can find paraphrases for any pattern by nearest-neighbor computation using cosine similarity (the cosine of the angle between the two vectors). Assume that we have a set of surface-level seed patterns SEED = {seed_1, seed_2, ..., seed_m} for the event role e. Given SEED, our model finds the paraphrase set PARA_i for each seed_i ∈ SEED. The set of surface-level extraction patterns SURF for e then is the union of the individual paraphrase sets, i.e., SURF = PARA_1 ∪ PARA_2 ∪ ... ∪ PARA_m. Each pattern in SURF comes with an associated score, which is the similarity between the learned pattern surf_i ∈ SURF and the seed pattern seed_j ∈ SEED that generated it (if a pattern is generated by two or more seed patterns, its score is the average of the scores it obtains from the different seeds; we also tried using maximum and sum, but the results were very similar). For example, given the event role e = “weapon” in the domain d = “terrorism” and the seeds:

SEED = {<SLOT> went off, <SLOT> exploded, . . . }


we might find a set of surface patterns:

SURF = {<SLOT> that exploded, <SLOT> blew up, was hit by <SLOT>, . . . }

This provides the surface-level paraphrase extraction patterns, SurfPara.
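A compact sketch of the vector construction and nearest-neighbor step described above is given below: it builds PMI-weighted context vectors for candidate patterns from (pattern, feature) counts and ranks paraphrase candidates for a seed by cosine similarity. The counts and pattern names are made up for illustration; the real system computes these statistics over the full broad-coverage corpus.

import math
from collections import Counter, defaultdict

# Toy (pattern, feature) co-occurrence counts standing in for corpus statistics.
counts = {
    ("<SLOT> exploded", "<SLOT>:bomb"): 40, ("<SLOT> exploded", "+1:prematurely"): 5,
    ("<SLOT> went off", "<SLOT>:bomb"): 30, ("<SLOT> went off", "+1:early"): 4,
    ("<SLOT> was bought", "<SLOT>:company"): 25,
}

total = sum(counts.values())
pat_tot, feat_tot = Counter(), Counter()
for (p, f), c in counts.items():
    pat_tot[p] += c
    feat_tot[f] += c

def pmi(p, f):
    # log P(p, f) / (P(p) * P(f)); zero if the pair was never seen.
    joint = counts.get((p, f), 0) / total
    if joint == 0:
        return 0.0
    return math.log(joint / ((pat_tot[p] / total) * (feat_tot[f] / total)))

vectors = defaultdict(dict)
for (p, f) in counts:
    vectors[p][f] = pmi(p, f)

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

seed = "<SLOT> exploded"
ranked = sorted(((cosine(vectors[seed], vectors[p]), p)
                 for p in vectors if p != seed), reverse=True)
print(ranked)   # '<SLOT> went off' scores highest; '<SLOT> was bought' scores 0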

8.2.2 Learning Lexico-Syntactic Paraphrase Patterns by Conversion

The method described in Section 8.2.1 generates surface-level patterns from a set of seed extraction patterns. These patterns are lexical in nature: they extract information by matching the exact sequence of words (tokens) in the pattern against the given text and extracting the noun phrase that acts as a slot-filler for that pattern (e.g., <SLOT> exploded). The lexical nature of these patterns makes them very specific. While this specificity often results in high precision extractions, it can be a serious disadvantage, especially when recall is important. For example, we need several variations of surface-level patterns to match all the active voice verb phrases containing the verb “exploded”: “<SLOT> exploded”, “<SLOT> recently exploded”, “<SLOT> suddenly exploded”, etc. However, a single lexico-syntactic pattern “Subject(<SLOT>) ActiveVP(exploded)” matches them all. Since extracting specific entities in a small domain-specific corpus is likely to require high recall, we automatically generalize the surface-level extraction patterns as described below.


Formally, let SURF = {surf_1, surf_2, ..., surf_n} be a set of surface-level extraction patterns. The aim is to convert them into a corresponding set of lexico-syntactic extraction patterns:

LEXSYN = {lexsyn_1, lexsyn_2, ..., lexsyn_m}

To convert the surface-level patterns into lexico-syntactic patterns, we use a shallow parser. We parse all the surface patterns surf_i ∈ SURF using this shallow parser and generalize them along certain lexical and syntactic dimensions:

• Voice: The different lexical variations of the active and the passive voice forms of a verb all map to the corresponding general active and passive representations. For example, “<SLOT> exploded”, “<SLOT> recently exploded”, and “<SLOT> suddenly exploded” all contain active voice forms of the verb “exploded”, and hence map to the lexico-syntactic pattern “Subject(<SLOT>) ActiveVP(exploded)”.

• Head word: All noun phrases, verb phrases, prepositional phrases, and adjectival phrases in the surface pattern are represented only by their respective heads. For example, the pattern “the recent explosion of <SLOT>” is generalized to the form “NP(explosion) of <SLOT>”, which is considered equivalent to “the loud explosion of <SLOT>” and “the explosion of <SLOT>”.


• Syntactic templates: We use a pre-defined set of 17 hand-built syntactic templates and add only those patterns that match these templates to our set of lexico-syntactic patterns LEXSYN. For example, “<Subject> ActiveVP” is a syntactic template. Using this template, the surface-level pattern “<SLOT> recently exploded” results in the pattern “Subject(<SLOT>) ActiveVP(exploded)” (if multiple templates match a given surface-level pattern, a lexico-syntactic pattern is generated for each matched construct), which would be added to LEXSYN.

After performing these generalizations, we obtain the first set of lexico-syntactic paraphrase extraction patterns, LexSynPara (Conv).
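The sketch below illustrates the kind of generalization the conversion step performs, using a toy rule-based stand-in for the shallow parser and template matcher. The real system uses Sundance and AutoSlog, described in Section 8.3.1; the single regular-expression "template" here is invented purely for the example.

import re

# A toy "template": slot followed by an (optionally adverb-modified) active past-tense verb.
ACTIVE_VP = re.compile(r"^<SLOT>\s+(?:\w+ly\s+)?(\w+ed)$")

def to_lexico_syntactic(surface_pattern):
    """Map a surface pattern like '<SLOT> recently exploded' to
    'Subject(<SLOT>) ActiveVP(exploded)', or return None if no template matches."""
    m = ACTIVE_VP.match(surface_pattern)
    if m:
        return f"Subject(<SLOT>) ActiveVP({m.group(1)})"
    return None

for p in ["<SLOT> exploded", "<SLOT> recently exploded", "<SLOT> suddenly exploded",
          "the explosion of <SLOT>"]:
    print(p, "->", to_lexico_syntactic(p))
# The first three all collapse to Subject(<SLOT>) ActiveVP(exploded);
# the last one does not match this particular template.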

8.2.3 Learning Lexico-Syntactic Paraphrase Patterns Directly

In this section, we present our method to learn lexico-syntactic paraphrase patterns directly from a corpus. Formally, given an event-role e in a domain d, the aim is to learn a set of lexico-syntactic patterns LEXSYN = {lexsyn_1, lexsyn_2, ..., lexsyn_n} that can extract instances of e. For example, given the event role e = “weapon” in the domain d = “terrorism”, the aim is to learn the following patterns:


LEXSYN = {Subject(<SLOT>) ActiveVP(exploded), Subject(<SLOT>) PassiveVP(defused), ActiveVP(detonated) DirObject(<SLOT>), . . . }

Let lp be a pattern of the form “<SLOT> syn-rel” or “syn-rel <SLOT>”, where “<SLOT>” contains a word or phrase that we expect lp to extract (its slot-filler), and “syn-rel” is a syntactic relation discovered by a shallow parser in a corpus. Let LP = {lp_1, lp_2, lp_3, ..., lp_l} be the set of all patterns of the form “<SLOT> syn-rel” or “syn-rel <SLOT>”. Define the context C_i of a pattern lp_i ∈ LP to be its slot-filler. For example, given sentence (1) in Section 8.2.1 and lp_i = “Subject(<SLOT>) ActiveVP(exploded)”, its context is:

C_i = {<SLOT>:bomb}

Using this definition of patterns and their contexts, feature vectors are constructed for all of the lexico-syntactic patterns in a corpus, as in Section 8.2.1. The set of lexico-syntactic extraction patterns LEXSYN for an event-role e in a domain d is then constructed using a set of lexico-syntactic seed patterns LSEED, again as in Section 8.2.1. For example, given the event role e = “weapon” in the domain d = “terrorism” and the set of seeds:

LSEED = {Subject(<SLOT>) ActiveVP(exploded), Subject(<SLOT>) PassiveVP(defused), . . . }


we might find a set of lexico-syntactic patterns:

LEXSYN = {ActiveVP(detonated) DirObject(<SLOT>), ActiveInfVP(wants defuse) DirObject(<SLOT>), . . . }

This provides the second set of lexico-syntactic paraphrase patterns, LexSynPara (Direct).

8.3 Experimental Methodology

In this section, we summarize the experiments and results of learning extraction patterns using broad-coverage paraphrases.

8.3.1 Paraphrase Patterns

For learning the broad-coverage paraphrases, we used a 2.2 billion word corpus consisting mainly of newswire data. It contains data from the English Gigaword corpus (Graff, 2003), which has about 1.75 billion tokens collected from various international news sources; the HARD 2004 text corpus (Kong et al., 2005), consisting of about 225 million words of newswire and web text; and the CSR-III text corpus (Graff et al., 1995), consisting of about 225 million words of newswire text. For learning surface-level paraphrase patterns, we part-of-speech (POS) tagged the corpus using the Brill POS tagger (Brill, 1994) and applied the method described in


Section 8.2.1 to the corpus. We assume that every noun (or sequence of nouns) that occurs in our corpus is a potential slot-filler. We enumerate every n-gram around that noun (or sequence of nouns), up to a maximum size of three, as a candidate pattern for paraphrase learning. We discard all patterns that occur fewer than 100 times in the corpus (for scalability). We eventually build a resource containing over two million patterns. We then create 10 seed patterns for each event-role we want to extract, and obtain paraphrases for each of these seed patterns (using a similarity threshold of 0.1, set based on data inspection and prior experience) to obtain the surface-level patterns. Once we have the surface-level paraphrase patterns, we convert them into lexico-syntactic patterns as described in Section 8.2.2. To perform this conversion, we use the pattern generation component of the AutoSlog system (Riloff, 1993). We use the Sundance shallow parser (Riloff & Phillips, 2004) and all of the default 17 pattern templates that are a part of this system package. Table 8.1 shows these templates.

<Subject> PassiveVP              ActiveVP <DirObject>           NP Prep <NP>
<Subject> ActiveVP               InfinitiveVP <DirObject>       ActiveVP Prep <NP>
<Subject> ActiveVP DirObject     ActiveInfVP <DirObject>        PassiveVP Prep <NP>
<Subject> ActiveInfVP            PassiveInfVP <DirObject>       InfVP Prep <NP>
<Subject> PassiveInfVP           Subject AuxVP <DirObject>      <Possessive> NP
<Subject> AuxVP DirObject
<Subject> AuxVP Adj

Table 8.1: List of pattern templates


Finally, for learning the second set of lexico-syntactic paraphrase patterns, we applied the method described in Section 8.2.3 to the corpus. We used the Sundance shallow parser (Riloff & Phillips, 2004), all of the default 17 pattern templates that are a part of this system package (Table 8.1), and the pattern generation component of the AutoSlog system (Riloff, 1993) to generate every pattern occurring in the corpus. We assume that every noun phrase that matches a pattern in our corpus is a potential slot-filler. We discard all patterns that occur fewer than 10 times in the corpus (for scalability). We eventually build a resource containing over two million patterns. We then convert the 10 surface-level seed patterns (above) for each event-role into lexico-syntactic forms (this conversion results in 8–10 lexico-syntactic seeds for each event role, because some surface seeds map to the same lexico-syntactic seed). We then obtain paraphrases for each of these lexico-syntactic seed patterns and learn the lexico-syntactic paraphrase patterns.
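The candidate-pattern enumeration for the surface-level resource can be illustrated as follows: every noun is treated as a potential slot-filler, the n-grams around it (up to length three) become candidate patterns, and low-frequency patterns are pruned. The POS-tagged toy sentences and the frequency threshold of 2 are illustrative only; the real pipeline runs the Brill tagger over the 2.2 billion word corpus and uses a threshold of 100.

from collections import Counter

def candidate_patterns(tagged_sentence, max_n=3):
    """Enumerate '<SLOT> n-gram' and 'n-gram <SLOT>' candidates around each noun."""
    tokens = [w for w, t in tagged_sentence]
    patterns = []
    for i, (word, tag) in enumerate(tagged_sentence):
        if not tag.startswith("NN"):
            continue
        for n in range(1, max_n + 1):
            right = tokens[i + 1:i + 1 + n]
            left = tokens[max(0, i - n):i]
            if len(right) == n:
                patterns.append("<SLOT> " + " ".join(right))
            if len(left) == n:
                patterns.append(" ".join(left) + " <SLOT>")
    return patterns

corpus = [
    [("The", "DT"), ("bomb", "NN"), ("exploded", "VBD"), ("prematurely", "RB")],
    [("A", "DT"), ("device", "NN"), ("exploded", "VBD"), ("nearby", "RB")],
]
counts = Counter(p for sent in corpus for p in candidate_patterns(sent))
frequent = {p for p, c in counts.items() if c >= 2}   # stand-in for the count >= 100 filter
print(frequent)   # {'<SLOT> exploded'}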

8.3.2 Domains

To test our IE systems, we used data from three domains: terrorism, disease-outbreaks, and corporate-acquisitions. For the terrorism domain, we used the MUC-4 terrorism corpus (Sundheim, 1992), which consists of reports of Latin American terrorist events. It has a total of 400 gold-standard annotated documents in its test portion, divided into four sets of 100 each (TST1, TST2, TST3, and TST4). Of these, the TST1 and TST2 documents were used for tuning, and the TST3


and TST4 documents were used for test. We focused on extracting five event roles in this domain: perpetrator individuals, perpetrator organizations, physical targets, victims, and weapons. For the disease-outbreaks domain, we used a ProMed-mail (http://www.promedmail.org) IE data set, which consists of reports about outbreaks of infectious diseases. This collection consists of 245 gold-standard annotated articles. Of these, 125 were used for tuning and 120 for test. We extracted two event roles in this domain: diseases and victims. For the corporate-acquisitions domain, we used the corporate-acquisitions corpus (Freitag, 1998b). It consists of 600 newspaper articles about acquisitions and mergers of companies. These articles have gold-standard annotations. Of these, we randomly set aside 300 documents for tuning and used the remaining 300 for test. We extracted two event roles in this domain: acquired and purchaser.

8.3.3 Evaluation

The complete event-oriented IE task involves the generation of event templates. Template generation is complicated: it requires discourse analysis to identify the different events in one article and coreference resolution to find coreferring entities. Our aim here, however, is to evaluate the quality of extraction patterns. Hence, following Patwardhan and Riloff (2007), we evaluate our methods on the quality of their extractions


rather than on template generation (template generation should logically follow the extraction step). Also, following Patwardhan and Riloff (2007), we merge duplicate extractions and employ a head-noun-matching based evaluation scheme: an extraction is considered correct only if its head noun matches the head noun of the gold standard answer.
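A minimal sketch of this scoring scheme is shown below: duplicate extractions are merged and an extraction counts as correct when its head noun matches the head noun of some gold answer. The head-noun heuristic (last token of the phrase) and the toy extractions are simplifications for illustration; the actual evaluation relies on the parser's head analysis.

def head_noun(phrase):
    # Crude stand-in for a syntactic head: the last token of the noun phrase.
    return phrase.lower().split()[-1]

def score(extractions, gold_answers):
    """Precision/recall over head nouns, after merging duplicate extractions."""
    sys_heads = {head_noun(e) for e in extractions}
    gold_heads = {head_noun(g) for g in gold_answers}
    correct = sys_heads & gold_heads
    precision = len(correct) / len(sys_heads) if sys_heads else 0.0
    recall = len(correct) / len(gold_heads) if gold_heads else 0.0
    return precision, recall

extracted = ["a powerful car bomb", "the car bomb", "the police station"]
gold = ["car bomb", "two grenades"]
print(score(extracted, gold))   # (0.5, 0.5): 'bomb' matches; 'station' and 'grenades' do not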

8.4 Result 1 and Discussion

8.4.1 Comparison of Broad-Coverage Paraphrase Patterns

In this section, we summarize the experiments and results of using the different broad-coverage paraphrase based systems. We compare the performance of these systems to two baselines:

• SurfSeeds: Uses only the surface-level seed patterns for extraction.

• LexSynSeeds: Uses only the lexico-syntactic seed patterns for extraction.

For the SurfPara, LexSynPara (Conv), and LexSynPara (Direct) systems, their performances were first measured using the top 25, 50, 75, ..., m patterns on the corresponding tuning sets. The configurations that performed the best on the tuning set were then applied to the test set. Figures 8.1, 8.2, and 8.3 summarize the results of the baselines and the paraphrase based systems for the terrorism, disease-outbreaks,


and corporate-acquisitions domains respectively, using macro-averaged precision, recall, and F-scores.

[Figure 8.1: Paraphrase patterns in terrorism domain. Bar chart of macro-averaged precision, recall, and F-score for SurfSeeds, LexSynSeeds, SurfPara, LexSynPara(Conv), and LexSynPara(Direct).]

[Figure 8.2: Paraphrase patterns in disease-outbreaks domain. Bar chart of macro-averaged precision, recall, and F-score for the same five systems.]

[Figure 8.3: Paraphrase patterns in corporate-acquisitions domain. Bar chart of macro-averaged precision, recall, and F-score for the same five systems.]

8.4.2 Discussion and Error Analysis

Looking at the figures, it is clear that the SurfPara system does not improve much over the LexSynSeeds baseline (in fact it sometimes performs much worse). The LexSynPara (Conv) and LexSynPara (Direct) systems, however, consistently perform at par with or improve over the baselines and the SurfPara system. This demonstrates the power of the generalization obtained by using the lexico-syntactic patterns. This generalization seems especially important in domain-specific IE: limited redundancy in a small domain-specific corpus means that the system must have a large number of patterns to achieve good recall (even for open-domain relation extraction, Bhagat and Ravichandran (2008) point out that using surface-level patterns for extraction results in low recall). Our analysis of the paraphrase patterns also confirms that generalization is crucial here (it should be noted, however, that automatic generalization sometimes produces very general patterns due to parser error, and hence at times harms precision). Also, the fact that LexSynPara (Direct) performs the best


indicates that data sparseness is a problem in learning surface-level paraphrases. Analysis of the learned patterns also confirms this. We discuss the data sparseness problem in Section 8.5.2.

8.5 Result 2 and Discussion

8.5.1 Comparison of Broad-Coverage and Domain-Specific Patterns

In this section, we present experiments and results to compare the performance of our broad-coverage patterns based IE system to some state-of-the-art domain-specific patterns based IE systems. For the terrorism and disease-outbreaks domains, we show results for the ASlog-TS system (Riloff, 1996) and the Semantic Affinity (SemAff) system (Patwardhan & Riloff, 2007). Both of these are weakly supervised systems that require a domain-specific corpus to learn extraction patterns, as explained below:

• ASlog-TS: ASlog-TS (Riloff, 1996) is a weakly supervised learner that relies on domain-specific training data to learn extraction patterns for the given IE task. The training data consists of a set of relevant documents and a set of irrelevant documents. The pattern learner extracts all the patterns that match some pre-specified templates from both the relevant and irrelevant documents. It then uses the distribution of patterns between the relevant and irrelevant documents to generate a


ranking for the patterns. This ranking enables a human expert to then map the top-ranked patterns to their corresponding event-roles and to discard patterns that do not map to any of the event-roles (or are obviously bad patterns).

• SemAff: The SemAff system uses the semantic affinity metric introduced by Patwardhan and Riloff (2007). Semantic affinity is a measure of the tendency of a pattern to extract noun phrases corresponding to a specific event-role. For example, if a large proportion of the extractions of a pattern are weapon words, the pattern is more likely to be a weapon event-role pattern. The learner uses the semantic class information of its extractions and a manually created mapping between semantic classes and event-roles to estimate the semantic affinity score for each pattern. The top-ranked patterns are then used to extract information from text.

We also show the results of applying these systems in relevant regions (Patwardhan & Riloff, 2007). The basic idea behind relevant regions is that certain sentences are more likely to contain the relevant event-roles than others. To automatically identify such relevant sentences, Patwardhan and Riloff (2007) developed a self-trained classifier that uses a set of documents relevant to the domain, a set of irrelevant documents, and a set of seed extraction patterns as training data. Extraction patterns from the different systems are then applied only in the relevant regions, i.e., the relevant sentences, to get better performance. We show results for two such systems as well:


• ASlog-TS (Rel): This system is obtained by applying the patterns from the above-mentioned ASlog-TS system in relevant regions.

• SemAff (Rel): This system is obtained by applying the patterns from the above-mentioned SemAff system in relevant regions.

The system scores for ASlog-TS, SemAff, ASlog-TS (Rel), and SemAff (Rel) are taken from Patwardhan and Riloff (2007). These numbers are directly comparable to ours, as we use the same test sets and evaluation methodology. For the corporate-acquisitions domain, we show the results of the SRV and SRVlng systems (Freitag, 1998b), both of which are supervised learning systems trained on domain-specific annotated data, as described below:

• SRV: SRV is a supervised classification algorithm that uses training data containing positive and negative examples to identify the relevant and non-relevant event-roles. It uses a set of basic features, such as token capitalization, the numeric/non-numeric nature of tokens, and the context of the token, among others, to learn the classifier.

• SRVlng: SRVlng is an incremental improvement over the SRV approach. In SRVlng, the feature space of the original SRV algorithm is enriched by adding syntactic features using a parser and semantic features using WordNet.


The system scores for SRV and SRVlng are taken from Freitag (1998b). These numbers are not directly comparable to ours, since we use a different evaluation methodology from theirs; they are shown here only to give the reader a rough idea of performance in this domain. Figures 8.4, 8.5, and 8.6 summarize the results for the terrorism, disease-outbreaks, and corporate-acquisitions domains respectively, using macro-averaged precision, recall, and F-scores.

[Figure 8.4: Paraphrase based vs traditional IE systems in terrorism domain. Bar chart of macro-averaged precision, recall, and F-score for ASlog-TS, ASlog-TS(Rel), SemAff, SemAff(Rel), and LexSynPara(Direct).]

[Figure 8.5: Paraphrase based vs traditional IE systems in disease-outbreaks domain. Bar chart of the same measures for the same five systems.]

[Figure 8.6: Paraphrase based vs traditional IE systems in corporate-acquisitions domain. Bar chart of macro-averaged precision, recall, and F-score for SRV, SRVlng, and LexSynPara(Direct).]

8.5.2 Discussion and Error Analysis

From Figures 8.4, 8.5, and 8.6, it can be seen that, overall, the performance of the best paraphrase patterns based system, LexSynPara (Direct), is comparable to the domain-specific IE systems. While we are satisfied by the overall performance of the LexSynPara (Direct) system, we found that our approach performed badly on two slots: perpetrator individuals in the terrorism domain and victims in the disease-outbreaks domain. We investigated the cause of its poor performance on these two slots. Intuition tells us that it would be much easier to learn patterns for slots that extract entities whose event-roles are less ambiguous: any context the entities occur in is likely to be a good disambiguator for them and hence a good extraction pattern. Another intuition tells us that it would be much easier to learn patterns for slots that extract entities belonging to smaller semantic classes, especially when using a broad-coverage corpus: the vectors for


at least the high-frequency patterns corresponding to these slots would be much denser than those of the patterns for slots that extract entities belonging to larger semantic classes. These two intuitions are confirmed by our analysis of the learned patterns and also explain our results. The first slot, perpetrator individuals in the terrorism domain, extracts entities belonging to the general semantic class people, a class whose members can play many event roles both within and across domains (e.g., members of the class people can also be victims, onlookers, security personnel, etc.) and which is large in size. Looking at the learned patterns for this slot, we find that they also contain patterns for the slot victims. The second slot, victims in the disease-outbreaks domain, extracts entities that can belong to the general semantic classes people, animals, birds, etc., again classes whose members play different event-roles and which are large in size. Our analysis of the patterns for this slot shows that some of the patterns are very general ones that extract animals, birds, etc. A third slot, victims in the terrorism domain, also suffers initially from the slot-filler ambiguity and sparse vector problems (when using the SurfPara and LexSynPara (Conv) methods), but the problem becomes much less pronounced with the LexSynPara (Direct) method, where the vectors are less sparse. We hypothesize that if we were to use a corpus one order of magnitude larger than the one we have used, and had the computational power to handle it, the paraphrase based method would perform much better. Unfortunately, we do not have access to such a corpus or the computational


power to handle it currently. However, working with such a data set is feasible (Bhagat & Ravichandran, 2008) and it is an avenue we want to explore. Finally, while it takes days to adapt any of the domain-specific methods to other domains, our method could be adapted to each of the new domains in a couple of hours.

8.6 Conclusion

Clearly, learning extraction patterns once from a broad-coverage corpus and being able to use them in any specific domain and application is preferable to the repetitive task of building a domain-specific corpus and applying some learning algorithm (and possibly some human annotation) to it every time a new domain is addressed. The paraphrase-based method described in this chapter performs well, makes moving to new domains easy, and obviates the need to worry about the scope of a domain and/or the type of training data required. This observation leads to the conclusion that learning broad-coverage paraphrases is an effective general methodology for domain-specific information extraction.


Chapter 9

Conclusion and Future Work

9.1 Contributions

In this thesis we have:

• Developed a typology of quasi-paraphrases which categorizes them based on lexical and structural changes in the sentences and phrases, and provided their relative frequencies.

• Presented a method for learning Inferential Selectional Preferences (ISPs), the contexts in which a pair of quasi-paraphrases are mutually replaceable. Using ISPs, we have shown that incorrect inferences can be filtered significantly better than several baselines.

• Developed an algorithm, LEDIR, that learns the directionality of quasi-paraphrases. Learning directionality allows one to separate the strong from the


weak paraphrases. We have shown that LEDIR performs significantly better than several baselines.

• Presented a method to learn high quality surface-level paraphrases from a large monolingual corpus. We then used these paraphrases to learn patterns for relation extraction and have shown that these patterns are better than those learned using a state-of-the-art baseline.

• Shown that broad-coverage paraphrases, learned using a large monolingual corpus, can be used to learn patterns for domain-specific information extraction. We have shown that the patterns learned using the broad-coverage paraphrases perform roughly at par with several state-of-the-art domain-specific information extraction systems.

Overall, through this work, we have shown that paraphrases can be learned from a monolingual corpus and can be used for information extraction, an important application area of NLP. There are, however, several open questions that need to be addressed and several areas that need to be explored. In the following section, we list a set of possible directions for future work based on the work done in this thesis.


9.2 Future Work

In this section, we present several issues that could be addressed in future work.

9.2.1 Inferential Selectional Preferences using Tailored Classes

In Chapter 4, we experimented with using two types of classes for learning Inferential Selectional Preferences (ISPs): classes from WordNet and classes learned by CBC (Pantel & Lin, 2002). The results there show that the ISPs learned from the CBC classes are more effective than the ISPs learned from WordNet classes for filtering inferences. Even though the WordNet classes are hand-created and accurate, they suffer from the problem of low recall. Recently, there has been work on adding new words to WordNet to improve its recall (Snow et al., 2006). The semi-supervised algorithm we used in Chapter 6 can also be used to increase the recall of WordNet based classes. However, the drawback of the algorithm from Chapter 6 is that it is a hard clustering algorithm and thus assigns each word to only a single class. It can, however, be extended using a simple strategy to allow words to be assigned to multiple classes, as follows:

• Run HMRF-KMeans using constraints from WordNet to obtain a hard clustering of the new words.


• Calculate centroids for the clusters, and for each cluster find the top-n elements that are most similar to its centroid. Let us call the small clusters formed by the top-n elements of each cluster the “representative-clusters”.

• Calculate centroids for the representative-clusters. Let us call these the “representative-centroids”.

• For each of the remaining words, find the top-m representative-centroids whose similarity to the word exceeds a certain threshold, and assign the word to the corresponding clusters (a sketch of this assignment step is given after this list).

Alternatively, words can be assigned to multiple clusters using stage III of the soft clustering version of the CBC algorithm. We believe this method will learn better WordNet based ISPs.
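The multi-class assignment step suggested above could look roughly like the following sketch, which assigns each remaining word to the clusters of its top-m most similar representative-centroids above a similarity threshold. The vectors, the threshold of 0.3, and m = 2 are all invented for illustration; the real setting would use distributional word vectors and tuned parameters.

import numpy as np

def soft_assign(word_vec, rep_centroids, m=2, threshold=0.3):
    """Return the cluster ids of the top-m representative-centroids whose
    cosine similarity with the word exceeds the threshold."""
    scores = []
    for cluster_id, centroid in rep_centroids.items():
        sim = float(word_vec @ centroid /
                    (np.linalg.norm(word_vec) * np.linalg.norm(centroid)))
        if sim >= threshold:
            scores.append((sim, cluster_id))
    return [cid for _, cid in sorted(scores, reverse=True)[:m]]

# Toy representative-centroids for three clusters (illustrative vectors only).
rep_centroids = {
    "people":  np.array([0.9, 0.1, 0.0]),
    "weapons": np.array([0.1, 0.9, 0.1]),
    "places":  np.array([0.0, 0.1, 0.9]),
}
word = np.array([0.7, 0.6, 0.05])   # a word that plausibly fits two classes
print(soft_assign(word, rep_centroids))   # ['people', 'weapons']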

9.2.2 Knowledge Acquisition

Recently, there has been growing interest among researchers in building large knowledge bases containing relations between entities (Paşca et al., 2006; Shinyama & Sekine, 2006; Banko & Etzioni, 2008). In Chapter 7, we have shown that paraphrases can be used to learn high precision patterns for extracting relations from text. For example, given the relation “acquisition” between two companies, i.e., the relation such that one company acquired the other, we can extract, with high precision, instances of companies that stand in this relation. An obvious use of paraphrases then is to


build a large knowledge base containing pairs of entities that exhibit useful relations. This can be done as follows:

• Obtain a set of relations. These can be specified either manually or discovered automatically, as in Banko and Etzioni (2008).

• For each relation in step 1, specify a few seed patterns.

• For each seed pattern, find its paraphrases using the method described in Chapter 7 and create a list of surface patterns for each relation. If needed, generalize the surface patterns using the method described in Chapter 8.

• Extract instances for each relation using the patterns from step 3 and create a database of relation instances (a sketch of such a pipeline follows this list).
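The proposed pipeline might be skeletonized as below. The paraphrase table, the toy corpus, and the simple pattern-matching regexes are all invented stand-ins for the Chapter 7 components; a real system would use the learned paraphrase resource and a proper matcher over the full corpus.

import re

# Stand-ins for the learned paraphrase resource and the corpus (illustrative only).
PATTERN_PARAPHRASES = {
    "<X> bought <Y>": ["<X> acquired <Y>", "<X> agreed to buy <Y>"],
}
CORPUS = [
    "Google acquired YouTube .",
    "Casino America agreed to buy Grand Palais .",
]

def expand(seed_patterns):
    """Step 3: expand each seed with its paraphrases to get surface patterns."""
    patterns = set(seed_patterns)
    for seed in seed_patterns:
        patterns.update(PATTERN_PARAPHRASES.get(seed, []))
    return patterns

def extract(patterns, corpus):
    """Step 4: match each surface pattern in the corpus and collect (X, Y) instances."""
    instances = set()
    slot = r"(\w+(?: \w+)?)"   # one or two word tokens per slot, for this toy example
    for pattern in patterns:
        regex = pattern.replace("<X>", slot).replace("<Y>", slot)
        for sentence in corpus:
            for x, y in re.findall(regex, sentence):
                instances.add(("acquisition", x, y))
    return instances

seeds = ["<X> bought <Y>"]                 # step 2: seed patterns for the relation
knowledge_base = extract(expand(seeds), CORPUS)
print(knowledge_base)
# Contains ('acquisition', 'Google', 'YouTube') and
# ('acquisition', 'Casino America', 'Grand Palais')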

9.2.3 Paraphrases for Machine Translation

Recently, there has been work in Machine Translation to address the problem of sparseness of training data using paraphrases (Callison-Burch et al., 2006). Callison-Burch et al. (2006) used bilingual data to learn paraphrases for source language phrases. They show that automatically paraphrasing source language phrases whose translations are not known into ones whose translations are known improves translation quality. The algorithm for learning surface paraphrases that we present in Chapter 7 can also be adapted for this purpose, as follows:


• Obtain a large monolingual corpus.

• Define the context of a word or phrase as a two word window to its right and a two word window to its left.

• Collect all n-grams in this corpus up to a size of 5, along with their contexts.

• Build a context vector for each n-gram using its context, as described in Chapter 7.

• Find paraphrases for unknown phrases using the method described in Chapter 7.

• If any paraphrase of an unknown phrase has a known translation, assign that as the translation for the unknown phrase (a sketch of this last step follows this list).

The method described above is language-independent. With the availability of large publicly available corpora for many languages (English, Spanish, French, Arabic, Chinese), paraphrases for these languages can be learned. There are, however, several details that will need to be worked out depending on the language. For example, for Chinese, tokenization might be a factor that determines the quality of paraphrases. Also, for languages with heavy morphology, e.g., Finnish, demorphing might be an important factor. While these and other language-specific issues will need to be addressed, the method presented in this thesis can be used as a general framework for learning paraphrases.
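A minimal sketch of that last substitution step is given below: for a source phrase with no entry in the phrase table, its paraphrases are tried in order of similarity and the first one with a known translation supplies the translation. The phrase table, the paraphrase list, and the similarity scores are invented for illustration; in a real system they would come from the bilingual training data and the monolingual paraphrase resource respectively.

# Hypothetical phrase table learned from bilingual data (source -> target).
PHRASE_TABLE = {
    "were distributed to": "fueron distribuidos a",
}
# Hypothetical paraphrases of an unseen source phrase, ranked by similarity.
PARAPHRASES = {
    "are being handed out to": [("are being distributed to", 0.42),
                                ("were distributed to", 0.31)],
}

def translate_phrase(phrase, phrase_table, paraphrases):
    """Fall back to the best-scoring paraphrase that has a known translation."""
    if phrase in phrase_table:
        return phrase_table[phrase]
    for alt, _score in sorted(paraphrases.get(phrase, []),
                              key=lambda p: p[1], reverse=True):
        if alt in phrase_table:
            return phrase_table[alt]
    return None  # still untranslatable

print(translate_phrase("are being handed out to", PHRASE_TABLE, PARAPHRASES))
# 'fueron distribuidos a', borrowed from the paraphrase 'were distributed to'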


9.3 Conclusion

This thesis presents methods for automatically learning quasi-paraphrases from text. While we have used these paraphrases for information extraction, there are other areas in Natural Language Processing (NLP) that can benefit from using them. We have listed some of these potential applications above. There are, however, many other applications, such as text summarization, information retrieval, and automatic evaluation for machine translation and summarization, that can also benefit from using paraphrases. We are only scratching the surface here. Much more work still needs to be done in the area of paraphrase learning to make effective use of paraphrases for NLP.


Bibliography Anick, P. G., & Tipirneni, S. (1999). The paraphrase search assistant: terminological feedback for iterative information seeking. ACM SIGIR (pp. 153–159). Berkeley, California, United States. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The berkeley framenet project. In Proceedings of international conference on Computational linguistics (pp. 86–90). Montreal, Quebec, Canada. Banko, M., & Etzioni, O. (2008). The tradeoffs between traditional and open relation extraction. Association for Computational Linguistics (pp. 28–36). Columbus, Ohio. Bannard, C., & Callison-Burch, C. (2005). Paraphrasing with bilingual parallel corpora. Association for Computational Linguistics (pp. 597–604). Ann Arbor, Michigan. Barzilay, Regina (2003). Information fusion for multidocument summarization: Paraphrasing and generation. Doctoral dissertation, Columbia University. Barzilay, R., & Lee, L. (2003). Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 16– 23). Edmonton, Canada. Barzilay, R., & McKeown, K. R. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of Association for Computational Linguistics (pp. 50–57). Toulouse, France. Barzilay, R., McKeown, K. R., & Elhadad, M. (1999). Information fusion in the context of multi-document summarization. Association for Computational Linguistics (pp. 550–557). College Park, Maryland. Basu, S., Banerjee, A., & Mooney, R.J. (2002). Semi-supervised clustering by seeding. 19th International Conference on Machine Learning (pp. 19–26).


Basu, S., Bilenko, M., & Mooney, R.J. (2004). A probabilistic framework for semisupervised clustering. 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 59–68). Seattle, WA, USA. Berland, M., & Charniak, E. (1999). Finding parts in very large corpora. In Proceedings of Association for Computational Linguistics (pp. 57–64). College Park, Maryland. Bhagat, R., Hovy, E.H., & Patwardhan, S. (2009). Acquiring paraphrases from text corpora. International Conference on Knowledge Capture (KCap). Redondo Beach, California, USA. Bhagat, R., Pantel, P., & Hovy, E.H. (2007). Ledir: An unsupervised algorithm for learning directionality of inference rules. Empirical Methods in Natural Language Processing (EMNLP). Prague, Czech Republic. Bhagat, R., & Ravichandran, D. (2008). Large scale acquisiton of paraphrases for learning surface patterns. Association for Computational Linguistics (ACL). Columbus, OH, USA. Brants, T. (2000). Tnt – a statistical part-of-speech tagger. In Proceedings of the Applied NLP Conference (ANLP). Seattle, WA. Brill, Eric (1994). Some advances in rule-based part of speech tagging. Proceedings of the Twelfth National Conference on Artificial Intelligence (pp. 722–727). Seattle, WA. Bunescu, Razvan, & Mooney, Raymond (2007). Learning to extract relations from the web using minimal supervision. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 576–583). Prague, Czech Republic: Association for Computational Linguistics. Califf, M., & Mooney, R. (2003). Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction. Journal of Machine Learning Research, 4, 177– 210. Callison-Burch, Chris (2007). Paraphrasing and translation. Doctoral dissertation, University of Edinburgh. Callison-Burch, Chris (2008). Syntactic constraints on paraphrases extracted from parallel corpora. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 196–205). Honolulu, Hawaii: Association for Computational Linguistics.


Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 17–24). New York, New York. Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing (pp. 380–388). Montreal, Quebec, Canada. Chklovski, T., & Pantel, P. (2004). Verbocean: Mining the web for fine-grained semantic verb relations. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP) (pp. 33–40). Barcelona, Spain. Chomsky, Noam (1957). Syntactic structures. The Hague: Mouton Publishers, Paris. Clark, Eve V. (1992). Conventionality and contrasts: pragmatic principles with lexical consequences. In A. Lehrer and E. F. Kittay (Eds.), Frame, fields, and contrasts: New essays in semantic lexical organization. Lawrence Erlbaum Associates. Cohn, Trevor, Callison-Burch, Chris, & Lapata, Mirella (2008). Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34, 597–614. Cover, T.M., & Thomas, J.A. (1991). Elements of information theory. John Wiley & Sons. De Beaugrande, R., & Dressler, W.. V. (1981). Introduction to text linguistics. New York, NY: Longman. Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In Proceedings of the conference on Computational Linguistics (COLING) (pp. 350–357). Geneva, Switzerland. Dom, B.E. (2001). An information-theoretic external cluster-validity measure. Fellbaum, C. (1998). An electronic lexical database. MIT Press. Freitag, Dayne (1998a). Information extraction from HTML: Application of a general learning approach. Proceedings of the Fifteenth National Conference on Artificial Intelligence (pp. 517–523). Madison, WI.


Freitag, D. (1998b). Toward General-Purpose Learning for Information Extraction. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (pp. 404– 408). Montreal, Quebec. Gale, William A., & Church, Kenneth W. (1991). A program for aligning sentences in bilingual corpora. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (pp. 177–184). Berkeley, California, USA: Association for Computational Linguistics. Geffet, M., & Dagan, I. (2005). The distributional inclusion hypotheses and lexical entailment. In Proceedings of Association for Computational Linguistics (pp. 107– 114). Ann Arbor, Michigan. Graff, D. (1995). North american news text corpus. Linguistic Data Consortium, Philadelphia, PA. Graff, D. (2003). English gigaword. Linguistic Data Consortium, Philadelphia, PA. Graff, D., Rosenfeld, R., & Pau, D. (1995). Csr-iii text. Linguistic Data Consortium, Philadelphia, PA. Harabagiu, S., & Hickl, A. (2006). Methods for using textual entailment in opendomain question answering. In Proceedings of the International Conference on Computational Linguistics and ACL (pp. 905–912). Sydney, Australia. Harris, Z. (1954). Distributional structure. Word, 10(23):146–162. Harris, Z. (1981). Co-occurence and transformation in linguistic structure. In H. Hiz (Ed.), Papers on syntax. D. Reidel Publishing Company. First published in 1957. Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. Proceedings of the conference on Computational linguistics (pp. 539–545). Nantes, France. Hirst, Graeme (2003). Paraphrasing paraphrased. Invited talk at the ACL International Workshop on Paraphrasing. Hobbs, J. R., Appelt, D., Bear, J., Israel, D., Kameyalna, M., & Tyson, M. (1993). Fastus: A system for extracting information from text. Proceedings of Human Language Technology Conference. Plainsboro, New Jersey. Honeck, Richard P. (1971). A study of paraphrases. Journal of Verbal Learning and Verbal Behavior, 10, 367–381.


Huang, Shudong, Graff, David, & Doddington, George (2002). Multiple-translation chinese corpus. Linguistic Data Consortium, Philadelphia, PA. Jacobs, P. S., Krupka, G., Rau, L., Mauldin, M. L., Mitamura, T., Kitani, T., Sider, I., Childs, L., & Marietta, M. (1993). Ge-cmu: Description of the shogun system used for muc-5. In Proceedings of the Fifth Message Understanding Conference (pp. 109–120). Katz, J.;, & Fodor, J.A. (1963). The structure of a semantic theory. Language, 39, 170–210. Kim, J., & Moldovan, D. (1993). Acquisition of semantic patterns for information extraction from corpora. Proceedings of the IEEE Conference on Artificial Intelligence for Applications (pp. 171–176). Orlando, FL, USA. Kipper, K. (2005). Verbnet: A broad-coverage comprehensive verb lexicon. Doctoral dissertation, Computer and Information Science Dept., University of Pennsylvania. Kong, J., Graff, D., Maeda, K., & Strassel, S. (2005). Hard 2004 text. Linguistic Data Consortium, Philadelphia, PA. Lenat, D. (1995). Cyc: A large-scale investment in knowledge infrastructure. 38(11), 33–38. Levin, B. (1993). English verb classes and alternations: a preliminary investigation. Chicago and London: University of Chicago Press. Light, M., & Greiff, W.R. (2002). Statistical models for the induction and use of selectional preferences. Cognitive Science, 26, 269–281. Lin, D. (1994). Principar — an efficient, broad-coverage, principle-based parser. Computational Linguistics (COLING) (pp. 42–48). Kyoto, Japan. Lin, D. (1998). Automatic retrieval and clustering of similar words. COLING/ACL (pp. 768–774). Montreal, Canada. Lin, D., & Pantel, P. (2001). Dirt: Discovery of inference rules from text. ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 323–328). San Francisco, California. Lin, D., & Pantel, P. (2002). Concept discovery from text. Computational Linguistics (COLING) (pp. 577–583). Taipei, Taiwan. Lin, D., Zhao, S., Qin, L., & Zhou, M. (2003). Identifying synonyms among distributionally similar words. In Proceedings of IJCAI (pp. 1492–1493). Acapulco, Mexico.


Manning, C. D., & Sch¨utze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press. McQueen, J. (1967). Some methods for classification and analysis of multivariate observations. 5th Berkeley Symposium on Mathematics, Statistics and Probability (pp. 281–298). Mel’cuk, Igor (1996). Lexical functions: A tool for description of lexical relations in a lexicon. In L. Wanner (Ed.), Lexical functions in lexicography and natural language processing. John Benjamin Publishing Company. Mel’cuk, Igor (to appear). Semantics: From meaning to text, chapter Deep-Syntactic Paraphrasing. Moldovan, D., Clark, C., Harabagiu, S., & Maiorano, S. (2003). Cogex: a logic prover for question answering. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 87–93). Edmonton, Canada. Pas¸ca, M., Lin, D., Bigham, J., Lifchits, A., & Jain, A. (2006). Organizing and searching the world wide web of facts - step one: The one-million fact extraction challenge. Proceedings of the National Conference on Artificial Intelligence. Boston, Massachusetts, USA. Pang, B., Knight, K., & Marcu, D. (2003). Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of HLT/NAACL. Pantel, P., Bhagat, R., Coppola, B., Chklovski, T., & Hovy, E.H. (2007). Isp: Learning inferential selectional preferences. Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. Rochester, NY, USA. Pantel, Patrick, & Lin, Dekang (2002). Discovering word senses from text. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 613–619). Edmonton, Canada. Pantel, P., Ravichandran, D., & Hovy, E.H. (2004). Towards terascale knowledge acquisition. Proceedings of the conference on Computational Linguistics (COLING) (pp. 771–778). Geneva, Switzerland. Patwardhan, S., & Riloff, E. (2007). Effective Information Extraction with Semantic Affinity Patterns and Relevant Regions. Proceedings of EMNLP (pp. 717–727). Prague, Czech Republic.

175

Quirk, Chris, Brockett, Chris, & Dolan, William (2004). Monolingual machine translation for paraphrase generation. Proceedings of EMNLP 2004 (pp. 142–149). Barcelona, Spain: Association for Computational Linguistics. Ravichandran, D., & Hovy, E.H. (2002). Learning surface text for a question answering system. Association for Computational Linguistics (ACL). Philadelphia, PA. Ravichandran, D., Pantel, P., & Hovy, E.H. (2005). Randomized algorithms and nlp: using locality sensitive hash function for high speed noun clustering. In Proceedings of Association for Computational Linguistics (pp. 622–629). Ann Arbor, Michigan. Resnik, Philip (1996). Selectional constraints: an information-theoretic model and its computational realization. Cognition, 61, 127–159. Riloff, E. (1993). Automatically Constructing a Dictionary for Information Extraction Tasks. Proceedings of the Eleventh National Conference on Artificial Intelligence (pp. 811–816). Washington, DC. Riloff, E. (1996). Automatically Generating Extraction Patterns from Untagged Text. Proceedings of the Thirteenth National Conference on Articial Intelligence (pp. 1044–1049). Portland, OR. Riloff, E., & Phillips, W. (2004). An Introduction to the Sundance and AutoSlog Systems (Technical Report UUCS-04-015). School of Computing, University of Utah. Romano, L., Kouylekov, M., Szpektor, I., Dagan, I., & Lavelli, A. (2006). Investigating a generic paraphrase-based approach for relation extraction. In Proceedings of the European Chapter of the Association for Computational Linguistics (EACL). Rosch, E. (1978). Human categorization. Cognition and Categorization. Schulte im Walde, S., & Brew, C. (2002). Inducing german verb semantic classes from purely syntactic subcategorization information. Association for Computational Linguistics (ACL). Philadelphia, PA. Sekine, S. (2006). On-demand information extraction. In Proceedings of COLING/ACL (pp. 731–738). Sydney, Australia. Shannon, C. E. (1948). A mathematical theory of communication. 27, 379–423 and 623–656. Shinyama, Y., & Sekine, S. (2006). Preemptive Information Extraction using Unrestricted Relation Discovery. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 304–311). New York, NY.

176

Shinyama, Y., Sekine, S., & Sudo, K. (2002). Automatic paraphrase acquisition from news articles. Proceedings of Human Language Technology Conference (pp. 40–46). Siegal, S., & Castellan Jr., N.J. (1988). Nonparametric statistics for the behavioral sciences. McGraw-Hill. Snow, Rion, Jurafsky, Daniel, & Ng, Andrew Y. (2006). Semantic taxonomy induction from heterogenous evidence. Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics (pp. 801–808). Morristown, NJ, USA: Association for Computational Linguistics. Sundheim, B. (1992). Overview of the Fourth Message Understanding Evaluation and Conference. Proceedings of the Fourth Message Understanding Conference (MUC4) (pp. 3–21). McLean, VA. Szpektor, I., & Dagan, I. (2008). Learning entailment rules for unary templates. Proceedings of the International Conference on Computational Linguistics (COLING) (pp. 849–856). Manchester, UK. Szpektor, I., Tanev, H., Dagan, I., & Coppola, B. (2004). Scaling web-based acquisition of entailment relations. In Proceedings of Empirical Methods in Natural Language Processing (pp. 41–48). Barcellona, Spain. Torisawa, K. (2006). Acquiring inference rules with temporal constraints by using japanese coordinated sentences and noun-verb co-occurrences. In Proceedings of Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 57–64). New York, New York. Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained k-means clustering with background knowledge. 18th International Conference on Machine Learning (pp. 577–584). Wilks, Y. (1975). Preference semantics. Cambridge, Massachusetts: Cambridge University Press. Wilks, Y.;, & Fass, D. (1992). Preference semantics: a family history. Computing and Mathematics with Applications, 23. Xing, E.P., Ng, A.Y., Jordan, M.I., & Russell, S. (2003). Distance metric learning, with application to clustering with side-information. 505512.

177

Zanzotto, F. M., Pennacchiotti, M., & Pazienza, M. T. (2006). Discovering asymmetric entailment relations between verbs using selectional preferences. In Proceedings of the International Conference on Computational Linguistics and ACL (pp. 849–856). Sydney, Australia. Zhou, L., Lin, C.Y., Munteanu, D., & Hovy, E.H. (2006). Paraeval: using paraphrases to evaluate summaries automatically. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 447–454).

178

Appendix

Example Paraphrases

1 Introduction

In this appendix, we present a random sample of 100 phrases, together with their paraphrases, drawn from the resource of 2.5 million phrases created in Chapter 7. As described there, we used the 150GB (25 billion words) Google News corpus to create this resource, and a cosine similarity cutoff of 0.15 was applied to the paraphrases. In the examples below, the cosine similarity of each paraphrase with the original phrase is given in parentheses, and at most 10 paraphrases are listed for each phrase. See Chapter 7 for details on how this resource was learned.
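To make the selection criteria above concrete, the following minimal sketch shows how such a listing could be produced once the resource is loaded in memory: keep only paraphrases whose cosine similarity with the original phrase is at least 0.15, sort by decreasing similarity, and list at most 10 per phrase. This is an illustrative sketch only; the dictionary format, function name, and toy data below are assumptions for exposition, not the actual implementation or data format of Chapter 7.

```python
from typing import Dict, List, Tuple

# Assumed (illustrative) in-memory format: each phrase maps to a list of
# (paraphrase, cosine similarity) pairs. This is not the released format
# of the resource described in Chapter 7.
Resource = Dict[str, List[Tuple[str, float]]]


def top_paraphrases(resource: Resource, phrase: str,
                    cutoff: float = 0.15, k: int = 10) -> List[Tuple[str, float]]:
    """Return at most k paraphrases of `phrase` with cosine >= cutoff,
    sorted by decreasing similarity (mirroring the listing in this appendix)."""
    candidates = resource.get(phrase, [])
    kept = [(p, sim) for p, sim in candidates if sim >= cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


if __name__ == "__main__":
    # Toy example reusing two paraphrases from the listing below; the third
    # entry falls under the 0.15 cutoff and is dropped.
    toy: Resource = {
        "X took on new Y": [
            ("X took on added Y", 0.31),
            ("X have taken on new Y", 0.28),
            ("X sold nnn Y", 0.02),
        ]
    }
    for para, sim in top_paraphrases(toy, "X took on new Y"):
        print(f"{para} ({sim:.2f})")
```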


2 List of Paraphrases

X took on new Y
    X took on added Y (0.31)
    X have taken on new Y (0.28)
    X has taken on renewed Y (0.26)
    X take on new Y (0.26)
    X took on greater Y (0.25)
    X has taken on new Y (0.25)
    X has been given added Y (0.25)
    X takes on added Y (0.23)
    X took on even more Y (0.23)
    X take on added Y (0.23)

X , will cost Y
    X - would cost Y (0.22)
    X it costs Y (0.21)
    X , which will cost Y (0.20)
    X , would cost Y (0.20)
    X expected to cost Y (0.19)
    X will probably cost Y (0.19)
    X and will cost Y (0.19)
    X say it would cost Y (0.18)
    X , which would cost Y (0.18)
    X are expected to cost Y (0.18)

X working with a Y
    X of working with a Y (0.20)
    X , hiring a Y (0.18)
    X by working as a Y (0.17)
    X working for a Y (0.17)
    X to work with a Y (0.16)
    X fired up a Y (0.15)

X and build more Y
    X , building more Y (0.34)
    X , build more Y (0.34)
    X and building more Y (0.32)
    X , build new Y (0.30)
    X and building new Y (0.30)
    X is to build more Y (0.29)
    X by building more Y (0.29)
    X to build more Y (0.29)
    X of building more Y (0.28)
    X for building new Y (0.28)

X expiring Y
    X expiring on Y (0.62)
    X that will expire on Y (0.51)
    X that expires on Y (0.50)
    X of fiscal nnnn , ended Y (0.48)
    X that expires Y (0.44)
    X that expired Y (0.43)
    Y nn were up n.n X (0.43)
    X which expired on Y (0.42)
    X , in the quarter ended Y (0.42)
    Y nn increased nn.n X (0.42)

X - established Y
    X - establish their Y (0.20)
    X of strengthening its Y (0.16)
    X - establishing its Y (0.15)

X priced at Y
    X , priced at Y (0.30)
    X is priced at Y (0.26)
    X priced from Y (0.25)
    X priced between Y (0.25)
    X will be priced at Y (0.25)
    X at prices ranging from Y (0.24)
    X was priced at Y (0.24)
    X costing Y (0.23)
    X with prices ranging from Y (0.22)
    X priced at rs n.nn Y (0.21)

X involved with this Y
    X involved in this Y (0.21)
    X involved in that Y (0.18)
    Y involves a number of X (0.15)

X , driving a Y
    Y driven by nn - X (0.36)
    X driving a Y (0.36)
    Y driven by a nn - X (0.35)
    Y parked in the nnnn X (0.34)
    Y parked in the nnn X (0.34)
    X say the driver of a Y (0.33)
    Y parked on the nnn X (0.33)
    X fled the scene in a Y (0.30)
    Y , driven by nn - X (0.30)
    X were driving a Y (0.30)

X of letting Y
    X of allowing Y (0.25)
    X of spending more Y (0.23)
    X of restricting Y (0.22)
    X of pursuing the Y (0.21)
    X of letting the Y (0.21)
    X of raising the minimum Y (0.21)
    X of giving a Y (0.21)
    X of letting a Y (0.21)
    X of holding a Y (0.20)
    X of spending public Y (0.20)

X to protect their Y
    X to protect our Y (0.29)
    X to protect their own Y (0.29)
    X to protect your Y (0.28)
    X to shield their Y (0.27)
    X to protect its Y (0.27)
    X protect their Y (0.27)
    X protecting their Y (0.26)
    X to safeguard their Y (0.25)
    X in protecting their Y (0.23)
    X to protect the Y (0.23)

X popped Y
    X was popping Y (0.16)
    X is popping Y (0.16)
    X by popping Y (0.16)
    Y popped and the X (0.15)

X will close on Y
    X will close on monday , Y (0.27)
    X will close on friday , Y (0.27)
    X closes on Y (0.27)
    X would close on Y (0.26)
    X is expected to close on Y (0.26)
    X , which closes on Y (0.26)
    X to close on or about Y (0.25)
    X closed through Y (0.25)
    X would open on Y (0.24)
    X will close on saturday , Y (0.24)

X and helped his Y
    X he joined his Y (0.18)
    X he helped his Y (0.16)
    X to open the fifth Y (0.15)
    X accused of helping his Y (0.15)
    X to lead the men ’s Y (0.15)

X of negotiating with Y
    X of not negotiating with Y (0.24)
    X of negotiating with the Y (0.22)
    X of spending more Y (0.22)
    X against negotiating with Y (0.22)
    X not to negotiate with Y (0.22)
    X to negotiate with Y (0.21)
    X ’s refusal to negotiate with Y (0.20)
    X of isolating Y (0.20)
    X of considering the Y (0.19)
    X of disarming Y (0.19)

X pitting Y
    X pitting the Y (0.22)
    X that pitted Y (0.18)
    X broke out between Y (0.18)
    X that has pitted Y (0.17)
    Y - takes - all X (0.17)
    Y to prevail in this X (0.16)
    X , pitting Y (0.16)
    Y - take - all X (0.16)
    X broke out between the Y (0.16)
    X erupted between the Y (0.16)

X adjusted its Y
    X cut its nnnn and nnnn Y (0.23)
    X on thursday lowered its Y (0.22)
    X on tuesday raised its Y (0.22)
    X has downgraded its Y (0.22)
    X on friday raised its Y (0.21)
    X lowering its Y (0.21)
    X also lowered its Y (0.21)
    X also downgraded its Y (0.21)
    X on wednesday raised its Y (0.21)
    X to adjust its Y (0.21)

X who send their Y
    X from sending their Y (0.31)
    X who want to send their Y (0.28)
    X who sent their Y (0.22)
    X teach their Y (0.21)
    X are worried their Y (0.20)
    X to enrol their Y (0.19)
    X do not want their Y (0.19)
    X who bring their Y (0.18)
    X still want their Y (0.18)
    X who pull their Y (0.18)

X gives me Y
    X that gave me Y (0.27)
    X , it gave me Y (0.26)
    X of gives me Y (0.26)
    X give me Y (0.26)
    X of gives you Y (0.25)
    X that gives me Y (0.25)
    X it gave me Y (0.24)
    X will give you Y (0.23)
    X and it gave me Y (0.22)
    X , it gives you Y (0.22)

X who get into Y
    X are not getting into Y (0.25)
    X who have gotten in Y (0.25)
    X get into Y (0.25)
    X were getting into Y (0.25)
    X or got into Y (0.23)
    X get into financial Y (0.21)
    X involved in physical Y (0.21)
    X after getting into Y (0.20)
    X have gotten into Y (0.20)
    X who try Y (0.20)

X and increased Y
    X will result in higher Y (0.19)
    X , and increased Y (0.18)
    X will be increased Y (0.16)
    X , maximising Y (0.16)
    Y coupled with higher X (0.16)
    X while improving Y (0.16)
    X while maximizing the Y (0.16)
    X due to lower Y (0.15)
    X and reducing Y (0.15)
    X , increased Y (0.15)

X is bound by Y
    X to abide by Y (0.18)
    X has not violated Y (0.16)
    X may have violated Y (0.16)
    X bound by Y (0.15)
    X has promulgated Y (0.15)

X is opposing the Y
    X strongly opposed the Y (0.18)
    X were opposing the Y (0.17)
    X will oppose the Y (0.15)
    X authorized a Y (0.15)
    X , which is supporting the Y (0.15)

X were affected by Y
    X affected by Y (0.30)
    X have been affected by Y (0.29)
    X hardest hit by Y (0.27)
    X hard hit by Y (0.27)
    X that were affected by Y (0.24)
    X hit hardest by Y (0.22)
    X had been affected by Y (0.21)
    X devastated by Y (0.21)
    X who were affected by Y (0.21)
    X already hit by Y (0.21)

X begin at Y
    X beginning at Y (0.29)
    X , which begin at Y (0.28)
    X that begin at Y (0.28)
    X start at Y (0.27)
    X start at n Y (0.26)
    X begin at nn Y (0.24)
    X begin at n am and Y (0.24)
    X start at nn Y (0.23)
    X begin at nn:nn Y (0.23)
    X start at nn:nn Y (0.22)

X to stop the Y
    X to halt the Y (0.36)
    X to prevent the Y (0.31)
    X to try to stop the Y (0.27)
    X and stop the Y (0.25)
    X in order to stop the Y (0.25)
    X to stop this Y (0.23)
    X to stem the Y (0.23)
    X to help stop the Y (0.22)
    X of stopping the Y (0.22)
    X to stop further Y (0.22)

X , updating Y
    X , update Y (0.25)
    X and updated Y (0.23)
    X and updating Y (0.20)
    X and update Y (0.20)
    X , upgrading Y (0.19)
    X to update their Y (0.18)
    X with updated Y (0.17)
    X or update Y (0.17)
    X , update their Y (0.17)
    X to update Y (0.17)

X announced by Y
    X , announced by Y (0.20)
    X unveiled by Y (0.20)
    Y announced a X (0.19)
    X , which was announced by Y (0.19)
    X recently announced by Y (0.18)
    X was announced by Y (0.17)
    X announced last week by Y (0.17)
    X announced yesterday by Y (0.16)
    Y has ruled out any X (0.16)
    Y has defended the government ’s X (0.16)

X took over at Y
    X was named coach at Y (0.20)
    X has done at Y (0.20)
    X took charge at Y (0.19)
    X took the job at Y (0.19)
    X to take over at Y (0.18)
    X taking over at Y (0.17)
    X was in charge at Y (0.17)
    X that will keep him at Y (0.17)
    Y , then managed by X (0.17)
    X , whose side entertain Y (0.16)

X reported fourth Y
    X reported first Y (0.38)
    X reported fourth quarter nnnn Y (0.37)
    X reported second Y (0.36)
    X reported fourth quarter net Y (0.36)
    X reported third Y (0.32)
    X reported fourth quarter nnnn net Y (0.29)
    X reported better than expected first Y (0.26)
    X posted second - Y (0.25)
    X said fourth - Y (0.24)
    X reported record third Y (0.23)

X offers low Y
    X by offering low Y (0.17)
    X offer low Y (0.17)
    X to offer lower Y (0.17)
    X will offer low Y (0.16)
    X and provide low Y (0.16)
    X enjoy low Y (0.15)
    X is offering special introductory Y (0.15)
    X offer lower Y (0.15)
    X can charge higher Y (0.15)

X clinched Y
    X clinched victory for the Y (0.18)
    Y came back through X (0.16)
    X clawed their Y (0.15)

X changed on Y
    X changed dramatically on Y (0.32)
    X and will last until Y (0.28)
    X , which was signed Y (0.28)
    X as scheduled on Y (0.26)
    Y nn , when he lost X (0.25)
    X he received on Y (0.24)
    X did not end on Y (0.23)
    X , which was submitted on Y (0.23)
    X and lasts until Y (0.23)
    X ratified on Y (0.23)

X is carried by Y
    X , which is carried by Y (0.33)
    X , which is transmitted by Y (0.32)
    X is transmitted to humans by Y (0.32)
    X is spread by Y (0.31)
    X , which is spread by Y (0.31)
    X is transmitted by infected Y (0.30)
    X is a mosquito - borne Y (0.30)
    Y , which can spread the X (0.29)
    Y carrying the deadly X (0.29)
    Y , which transmit the X (0.28)

X can close the Y
    X can close the gap on Y (0.21)
    Y clear of second - placed X (0.20)
    X can not close the Y (0.20)
    X could close the Y (0.19)
    Y ’s champions league win over X (0.17)
    X are second on nn Y (0.17)
    Y as united beat X (0.17)
    X not to lose this Y (0.17)
    X have closed the Y (0.17)
    Y of clubs including X (0.17)

X starts in Y
    X , which starts in Y (0.32)
    X which begins in Y (0.29)
    X which starts in Y (0.28)
    X begins in Y (0.27)
    X , which begins in Y (0.26)
    X gets underway in Y (0.22)
    X , which concludes in Y (0.19)
    X , will be played in Y (0.19)
    X starting in Y (0.18)
    X against australia starting in Y (0.18)

X come from the Y
    X are from the Y (0.18)
    Y , add and X (0.16)
    X - producing areas of the Y (0.16)
    X that come from the Y (0.16)
    X that come out of the Y (0.15)
    X are coming from the Y (0.15)

X varied by Y
    X vary significantly by Y (0.24)
    X did not differ by Y (0.23)
    X differed by Y (0.21)
    X were broken down by Y (0.20)
    X differ by Y (0.18)
    X may vary by Y (0.18)
    X differs by Y (0.18)
    X , broken down by Y (0.16)
    X include age , Y (0.16)
    X are broken down by Y (0.16)

X , are good Y
    X and are good Y (0.22)
    X that were good Y (0.19)
    X who were good Y (0.18)
    X that are good Y (0.18)
    X touched the wall in a Y (0.16)
    X are much better Y (0.16)
    X , and are great Y (0.15)
    X were good Y (0.15)
    X , they ’re good Y (0.15)
    X who are good Y (0.15)

X making nn Y
    X by making nn Y (0.21)
    X making nine Y (0.20)
    X , making three Y (0.17)
    X , making seven Y (0.16)
    X , making five Y (0.16)
    X had made nnn Y (0.15)
    X and made nn Y (0.15)
    X , making four Y (0.15)
    X that made nn Y (0.15)
    X and made just nn Y (0.15)

X have urged Y
    X are urging Y (0.23)
    X to urge Y (0.22)
    X have called on Y (0.21)
    X are also urging Y (0.21)
    X have warned Y (0.18)
    X are calling on Y (0.18)
    X have also urged Y (0.18)
    X have appealed to Y (0.17)
    X say other Y (0.17)
    X have blasted the Y (0.17)

X , a decorated Y
    X , himself a decorated Y (0.34)
    X , nn , a decorated Y (0.32)
    X is a decorated Y (0.28)
    X - a decorated Y (0.27)
    X , a highly decorated Y (0.25)
    X as a decorated Y (0.24)
    X was a decorated Y (0.24)
    X , was a decorated Y (0.23)
    X - old decorated Y (0.22)
    X , a decorated veteran of Y (0.21)

X can detect Y
    X to detect Y (0.38)
    X that can detect Y (0.35)
    X used to detect Y (0.33)
    X , which can detect Y (0.33)
    X for detecting Y (0.32)
    X detect Y (0.29)
    X detects Y (0.29)
    X that detects Y (0.28)
    X that detect Y (0.27)
    X to detect the Y (0.27)

X dedicated to providing Y
    X focused on providing Y (0.24)
    X that provides a range of Y (0.24)
    X devoted to providing Y (0.22)
    X committed to providing Y (0.20)
    X aimed at providing Y (0.19)
    X that offers Y (0.19)
    X that provides human Y (0.19)
    X that provides technology and Y (0.19)
    X that promotes sustainable Y (0.19)
    X that provides free Y (0.18)

X for handling the Y
    X on handling the Y (0.23)
    X for dealing with the Y (0.22)
    X for responding to Y (0.18)
    X for managing the Y (0.18)
    X to try to alleviate the Y (0.18)
    X for coping with the Y (0.17)
    X on handling Y (0.16)
    X for tackling the Y (0.16)
    X are prepared to handle the Y (0.16)
    X for securing Y (0.16)

X , evading Y
    X and felony evading Y (0.42)
    X and evading Y (0.37)
    X and fleeing and eluding Y (0.37)
    X and driving under Y (0.34)
    X , and resisting Y (0.34)
    X and eluding Y (0.34)
    X , attempting to elude Y (0.33)
    X and possession of a controlled Y (0.33)
    X , tampering with Y (0.33)
    X , evading arrest and Y (0.33)

X of suspending Y
    X of resuming Y (0.26)
    X would suspend Y (0.24)
    X of halting Y (0.22)
    X of restarting Y (0.20)
    X of establishing a separate Y (0.20)
    X nn to suspend Y (0.20)
    X of restricting Y (0.19)
    X including suspension of Y (0.19)
    X for halting Y (0.19)
    X of using eminent Y (0.19)

X has offered its Y
    X wishing to offer their Y (0.16)
    X and offered my Y (0.15)

X attached to the Y
    X attached to Y (0.24)
    X affixed to the Y (0.22)
    X mounted on the Y (0.21)
    X are attached to the Y (0.21)
    Y attached to a X (0.20)
    X is attached to the Y (0.19)
    X was attached to the Y (0.19)
    X attached to a Y (0.19)
    X attached to its Y (0.19)
    X attached to his Y (0.19)

X , matching Y
    X and matching Y (0.27)
    X with matching Y (0.21)
    X nn , matching Y (0.17)
    X and a ruffled Y (0.17)
    X that topped Y (0.15)
    X stained with Y (0.15)
    Y and matching X (0.15)

X can get Y
    X also can get Y (0.25)
    X can also get Y (0.23)
    X who are getting Y (0.20)
    X who get Y (0.20)
    X get Y (0.20)
    Y are provided to X (0.20)
    X can still get Y (0.20)
    X who obtain Y (0.19)
    X can obtain Y (0.19)
    X who have gotten Y (0.19)

X have blamed the Y
    X blame the Y (0.26)
    X have attributed the Y (0.24)
    X are warning that the Y (0.23)
    X have warned that the Y (0.22)
    X also blame the Y (0.20)
    X fear an Y (0.19)
    X warn the Y (0.19)
    X blame on the Y (0.19)
    X have blamed for the Y (0.19)
    X fear the Y (0.18)

X - ringed Y
    X - rimmed Y (0.19)
    X - fringed Y (0.18)
    Y ringed with X (0.18)
    X - circled Y (0.17)
    X - colored plastic Y (0.17)

X and interconnect Y
    X , interconnect Y (0.24)
    Y such as embedded X (0.22)
    X , interconnect and Y (0.20)
    X , interconnects , Y (0.19)
    Y enabling seamless X (0.18)
    Y interconnects , X (0.18)
    X interconnects , Y (0.17)
    X , printed circuit boards and Y (0.17)
    Y , such as embedded X (0.17)
    X and compute Y (0.17)

X and require Y
    X and requiring Y (0.26)
    X , and require Y (0.23)
    X by requiring Y (0.22)
    X , requiring Y (0.22)
    X , require Y (0.22)
    X to require Y (0.21)
    X , would require Y (0.20)
    X required Y (0.19)
    X would require Y (0.19)
    X that would require Y (0.19)

X , as acting Y
    X , will serve as acting Y (0.21)
    X , was named acting Y (0.20)
    X to serve as acting Y (0.20)
    X , to serve as interim Y (0.19)
    X , will become interim Y (0.19)
    X , was named interim Y (0.19)
    X was named acting Y (0.18)
    X , who became acting Y (0.18)
    X became acting Y (0.18)
    X would serve as acting Y (0.18)

X are due Y
    X are due by Y (0.51)
    X are due no later than Y (0.47)
    X were due by Y (0.42)
    X must be in by Y (0.42)
    X are due by friday , Y (0.42)
    X were due Y (0.41)
    X are due by n Y (0.41)
    X , which are due by Y (0.41)
    X are due by monday , Y (0.40)
    X , which are due Y (0.40)

X said it sees Y
    X said it expected Y (0.25)
    X said it saw Y (0.22)
    X still expects Y (0.22)
    X is expected to report Y (0.21)
    X reduced its nnnn Y (0.21)
    X said yesterday that fourth - Y (0.21)
    X raised its fourth - Y (0.21)
    X predicts nnnn Y (0.21)
    X also lowered its Y (0.21)
    X said it still expects Y (0.20)

X are always the Y
    X were definitely the Y (0.21)
    X were clearly the Y (0.18)
    X have always been the Y (0.17)
    X , are now the Y (0.17)
    X who are really the Y (0.17)
    X , but we were the Y (0.16)
    X must have been the Y (0.16)
    X have really been the Y (0.16)
    X they have been the Y (0.16)
    X we ’ve been the Y (0.15)

X sold nnn Y
    X sold n,nnn Y (0.32)
    X each sold nnn Y (0.24)
    X , which bought n,nnn Y (0.22)
    X each bought nnn Y (0.19)
    X , which sold n,nnn Y (0.18)
    X , sold nnn Y (0.17)
    X buying nnn Y (0.17)
    X reported sales of n,nnn Y (0.16)
    X is to sell nn,nnn Y (0.16)
    X expects to sell nn,nnn Y (0.16)

X , and operates Y
    X and operates Y (0.22)
    X ) and operates Y (0.21)
    X , and operates nn Y (0.18)
    Y located throughout X (0.17)
    X they have three Y (0.16)

X and adjusting Y
    X by adjusting Y (0.24)
    X , adjust Y (0.23)
    X to adjust Y (0.22)
    X and adjust Y (0.20)
    X , adjusting Y (0.20)
    X , and adjust Y (0.19)
    X or adjusting Y (0.19)
    X and adjusting the Y (0.18)
    X , adjust their Y (0.18)
    X are adjusting their Y (0.17)

X begins with the Y
    X , begins with the Y (0.18)
    X begins in nnnn with the Y (0.17)
    X starts with the Y (0.16)
    X begins today with the Y (0.16)
    X culminates with the Y (0.16)

X to speak about Y
    X to talk about Y (0.18)
    X to talk to students about Y (0.16)
    X to talk openly about Y (0.16)
    Y is a taboo X (0.16)

X find other Y
    X seek other Y (0.29)
    X to find other Y (0.28)
    X in finding new Y (0.27)
    X to find alternative Y (0.27)
    X have found other Y (0.27)
    X will find other Y (0.26)
    X to seek alternative Y (0.26)
    X had to find alternative Y (0.23)
    X will have to find other Y (0.23)
    X can find other Y (0.22)

X betting Y
    X - betting Y (0.42)
    X betting at Y (0.26)
    Y betting X (0.25)
    X - based betting Y (0.25)
    X ’s betting Y (0.23)
    X , betting Y (0.23)
    X ’s sports betting Y (0.23)
    X betting and Y (0.23)
    X and betting Y (0.22)
    X betting and gambling Y (0.22)

X to realign Y
    X for realigning Y (0.22)
    X for realigning the Y (0.20)
    X to re - align Y (0.18)
    X to reorganize the Y (0.16)
    Y would lose n,nnn X (0.16)
    X of realigning Y (0.16)
    X going westbound on Y (0.15)

X of cleaning up Y
    X for cleaning up Y (0.34)
    X to clean up Y (0.33)
    X on cleaning up Y (0.29)
    X of cleaning up the Y (0.27)
    X in cleaning up Y (0.27)
    X to cleaning up Y (0.26)
    X and cleaning up Y (0.26)
    X of dollars to clean up Y (0.26)
    X is to clean up Y (0.26)
    X cleaning up Y (0.26)

X recovered at the Y
    X recovered for the Y (0.23)
    X ( nn tackles , n.n Y (0.21)
    X fumble at the Y (0.20)
    X ( nn tackles , n Y (0.20)
    X lost nn-nn at Y (0.20)
    X recovered the ball at the Y (0.19)
    X recovered on the Y (0.19)
    X nn-nn win over the Y (0.18)
    X ( n-n ) play the Y (0.18)
    X recovered the fumble at the Y (0.18)

X due to rising Y
    X because of rising Y (0.44)
    X caused by rising Y (0.40)
    X as a result of rising Y (0.37)
    X caused by higher Y (0.35)
    X on rising Y (0.35)
    X as rising Y (0.33)
    X amid rising Y (0.33)
    X from rising Y (0.33)
    X resulting from rising Y (0.32)
    X , as rising Y (0.32)

X gained national Y
    X , who gained national Y (0.21)
    X gained national attention last Y (0.19)
    X gained international Y (0.19)
    X has gained national Y (0.18)
    X gained national attention in Y (0.18)
    X , which gained national Y (0.18)
    X and gained international Y (0.18)
    X gaining national Y (0.18)
    X gained global Y (0.17)
    X and has gained international Y (0.17)

X opened and Y
    X open a little Y (0.18)
    Y of opened the X (0.18)
    X did not open in Y (0.18)
    Y it will open the X (0.18)
    X opened when the Y (0.18)
    X opened and the Y (0.18)
    X opened , and Y (0.18)
    Y and it opened the X (0.17)
    X would open for the Y (0.16)
    X opened at n Y (0.16)

X reckon Y
    X reckon the Y (0.21)
    X wonder if the Y (0.18)
    X are predicting Y (0.17)
    X have been predicting for Y (0.16)
    X were suggesting that the Y (0.16)
    X predict that Y (0.16)
    X are predicting that Y (0.16)
    X believe the latest Y (0.15)
    X believe that the bank of Y (0.15)
    X have predicted that this Y (0.15)

X are finding Y
    X are still finding Y (0.26)
    X are catching Y (0.23)
    X are finding a few Y (0.21)
    X are doing well for Y (0.21)
    Y are being caught by X (0.20)
    X have been catching Y (0.20)
    X report good Y (0.19)
    X look for Y (0.19)
    X are finding plenty of Y (0.19)
    X invest in their Y (0.19)

X and synchronized Y
    X , synchronized Y (0.35)
    X , and synchronized Y (0.31)
    Y , tumbling , X (0.26)
    Y and synchronized X (0.25)
    Y , synchronized swimming and X (0.24)
    Y , synchronized swimming , X (0.24)
    X , synchronized swimming and Y (0.22)
    X , tumbling , Y (0.21)
    X and the synchronized Y (0.21)
    X synchronized Y (0.20)

X , called an Y
    X known as an Y (0.30)
    X called an Y (0.30)
    X , known as an Y (0.27)
    X is called an Y (0.19)
    X - called an Y (0.17)
    X called an Y (0.17)
    X , called an Y (0.16)

X now allow Y
    X already allow Y (0.18)
    X that are available to Y (0.18)
    X make it easier for Y (0.18)
    X restrict the use of Y (0.17)
    X do not allow Y (0.16)
    X also allow Y (0.16)
    Y have comprehensive X (0.16)
    Y show up on their X (0.15)
    X that prohibit a Y (0.15)

X that ’s got Y
    Y are suspected , X (0.16)
    X on speaking in Y (0.15)
    X else to do Y (0.15)

X registered on Y
    X were registered on Y (0.25)
    X ( ended Y (0.19)
    X in the nn months ending Y (0.19)
    X rose by n,nnn in Y (0.18)
    X have registered since Y (0.18)
    X registered between Y (0.18)
    X was lodged on Y (0.18)
    Y nnnn was nn.n X (0.18)
    X sent to him on Y (0.17)
    X ending in mid - Y (0.17)

X is to save Y
    X was to save Y (0.34)
    X is to get as much Y (0.29)
    X is to serve the Y (0.23)
    X is to attract Y (0.23)
    X and to save Y (0.22)
    X was to pursue Y (0.22)
    X is to safeguard the Y (0.21)
    X is to draw more Y (0.21)
    X is to care for the Y (0.20)
    X is to reassure the Y (0.20)

X charged by the Y
    Y charging higher X (0.25)
    X charged by Y (0.24)
    Y will charge higher X (0.22)
    X charged by most Y (0.21)
    Y charges high X (0.20)
    X charged by their Y (0.20)
    Y paying high X (0.20)
    X to be charged by the Y (0.20)
    X are going up nn Y (0.20)
    Y charge high X (0.20)

X and told Y
    X , telling Y (0.16)
    X when he told Y (0.15)
    X - old told Y (0.15)
    X , she told Y (0.15)

X unloading Y
    X loading and unloading Y (0.32)
    X unload Y (0.27)
    X and unload Y (0.27)
    X to unload Y (0.26)
    X and unloading Y (0.26)
    Y are loaded onto X (0.26)
    X , unloading Y (0.25)
    X to load and unload Y (0.25)
    X hauling Y (0.24)
    X carrying n,nnn tons of Y (0.24)

X before making Y
    X before you make Y (0.29)
    X before making any Y (0.26)
    X before they make Y (0.26)
    Y were made last X (0.25)
    X when they make Y (0.25)
    X prior to making Y (0.24)
    X and then making Y (0.24)
    X to make those Y (0.23)
    X before we make Y (0.23)
    X after making Y (0.23)

X has a diverse Y
    X has a diverse portfolio of Y (0.22)
    Y had become X (0.16)
    X has a diverse range of Y (0.15)

X starting with the Y
    X beginning with the Y (0.17)
    X scheduled this Y (0.16)
    X but has never won the Y (0.15)
    X , graduated to Y (0.15)

X made some Y
    X , made some Y (0.19)
    X and is making Y (0.18)
    Y were made , the X (0.18)
    X made a number of Y (0.18)
    X did make some Y (0.17)
    X , made a few Y (0.17)
    X or make Y (0.17)
    X and made a lot of Y (0.17)
    X had made some Y (0.16)
    X have made some Y (0.16)

X assisted with Y
    X assisting with Y (0.20)
    X and assist with Y (0.17)
    X helped with Y (0.15)

X that get Y
    X that receive Y (0.25)
    X that do not receive Y (0.24)
    X that currently receive Y (0.21)
    X that do not get Y (0.20)
    X that applied for Y (0.20)
    X that receive the Y (0.20)
    X that are receiving Y (0.19)
    X that received Y (0.19)
    X , which receive Y (0.18)
    X that need the Y (0.18)

X from reading Y
    X at reading Y (0.20)
    X of reading Y (0.19)
    X with reading Y (0.19)
    X who do not read Y (0.18)
    X for reading Y (0.16)
    X , they read Y (0.15)

X to simplify its Y
    X to optimize its Y (0.22)
    X simplifies our Y (0.22)
    X by expanding our Y (0.21)
    X to optimise its Y (0.20)
    Y , inc. has selected the X (0.19)
    X is a much simpler Y (0.18)
    X to restructure its Y (0.18)
    X to align the Y (0.18)
    X to overhaul its Y (0.17)
    X to modernise its Y (0.17)

X had a combined Y
    X posted a combined Y (0.21)
    X , which had a combined Y (0.19)
    X had an aggregate Y (0.19)
    X will have a combined Y (0.18)
    X have a combined Y (0.17)
    X generated a combined Y (0.16)
    X , who finished nnth Y (0.15)
    X , which currently has a Y (0.15)
    X , had a combined Y (0.15)

X to identify those Y
    X to better identify Y (0.17)
    X for identifying new Y (0.17)
    X to better understand the Y (0.16)
    X for identifying Y (0.16)
    X to work with those Y (0.15)

X fought off a Y
    X fought off a pair of Y (0.20)
    X fended off a Y (0.18)
    X had to save two Y (0.16)
    Y to defeat american X (0.15)
    X to fight off a Y (0.15)
    X fought off a series of Y (0.15)

X , was given Y
    X , has been granted Y (0.18)
    X and was given Y (0.18)
    X after being given Y (0.16)
    X is given Y (0.15)

X booked their Y
    X booked their place in the Y (0.53)
    Y that started against X (0.28)
    Y when they take on X (0.27)
    X were the better Y (0.27)
    X ran out n-n Y (0.26)
    X have completed the Y (0.25)
    X have been linked with a Y (0.25)
    Y - final win over X (0.25)
    X took the lead on nn Y (0.24)
    X are a very good Y (0.24)

X - handled Y
    X of long - handled Y (0.21)
    X had armed himself with a Y (0.20)
    X - handle Y (0.20)
    X wielding a Y (0.18)
    X stabbed with a Y (0.17)
    Y and stabbed the X (0.17)
    X swinging a Y (0.16)
    X found a bloody Y (0.16)
    X while brandishing a Y (0.16)
    X suspected of using a Y (0.16)

X of slowing the Y
    X of slowing down the Y (0.44)
    X of stimulating the Y (0.32)
    X of stopping the Y (0.31)
    X was to slow the Y (0.30)
    X of limiting the Y (0.29)
    X of controlling the Y (0.29)
    X in hopes of slowing the Y (0.28)
    X of significantly reducing the Y (0.27)
    X of accelerating the Y (0.27)
    X of raising the Y (0.25)

X and shaved Y
    X , shaved Y (0.49)
    X with shaved Y (0.39)
    X , grated Y (0.31)
    Y tossed with X (0.30)
    Y and served with X (0.30)
    X and grated Y (0.30)
    X tossed with Y (0.29)
    X , and shaved Y (0.29)
    X and chopped Y (0.29)
    Y - dried tomatoes , X (0.28)

X funded with Y
    X financed with Y (0.34)
    X that are funded with Y (0.33)
    X will be funded with Y (0.32)
    X were funded with Y (0.30)
    X funded through Y (0.30)
    X is funded with Y (0.29)
    Y used to fund X (0.28)
    X , funded with Y (0.27)
    Y are used to fund X (0.27)
    X are funded with Y (0.27)