Extracting Implicit Knowledge from Text

by

Benjamin D. Van Durme

Submitted in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Supervised by Professor Lenhart K. Schubert

Department of Computer Science
Arts, Sciences and Engineering
Edmund A. Hajim School of Engineering and Applied Sciences

and

Department of Linguistics
Arts, Sciences and Engineering
School of Arts and Sciences

University of Rochester Rochester, New York

2009


To whom I could have been otherwise.


Curriculum Vitae

Benjamin David Van Durme was born in Dansville, New York on November 13th, 1979. He began studies at the University of Rochester in 1997, graduating in 2001 with a Bachelor of Arts degree in the area of Cognitive Science and a Bachelor of Science degree in the area of Computer Science. From 2002 to 2004, he attended Carnegie Mellon University, and graduated with a Master of Science degree in Language Technologies. Benjamin returned to the University of Rochester in the Fall of 2004, pursuing research in the subjects of Computer Science and Linguistics, under the direction of Professor Lenhart K. Schubert. He received the Master of Science degree in Computer Science from the University of Rochester in 2006. During the Summer of 2006, as well as 2007, Benjamin performed research at Google Inc., under the direction of Marius Paşca.


Acknowledgments

To my committee as a whole, William Cohen, Gregory Carlson, Daniel Gildea, and Lenhart K. Schubert: thank you for your advice and guidance. Len Schubert, who is as broadly contemplative as he is exceedingly generous with his time, has over the last ten years fundamentally shaped my views on language and intelligence (synthetic or otherwise). Thank you Len. Dan Gildea provided critical feedback on the majority of my work as a graduate student, from which I benefited significantly. Marius Paşca taught me focus, and how to frame interesting problems at a manageable size. T. Florian Jaeger provided a welcoming environment for interdisciplinary work, along with a useful perspective on the world of a young faculty member, and academia in general. Thank you Dan, Marius, and Florian.

My parents, Michael and Mary Van Durme, and siblings, Jessica and Jordan, have supported me in anything and everything I've wanted. My father showed me that everything in life can be a puzzle, while my mother showed me that everything in life can be entertaining. Without both of these principles in mind I could not be a researcher today. Thank you family.

Sara Eleoff has been with me since before the beginning; her love has been a required element of my success. Thank you Sara.

I am glad to count so many of my classmates as friends. Some have become collaborators, such as Ashwin Lall and Austin Frank, leading to interesting work in areas I wouldn't otherwise have considered. Some friends in particular I relied upon heavily, at one point or another, and owe a special thanks: Anna Kupść, Paul Ogilvie, Kevyn Collins-Thompson, Craig Harman, Kirk Kelsey, and Matt Post. Thank you all.

Chapter 3 is the result of extensive discussions with Len Schubert on a variety of semantic phenomena. Chapter 4 derives from joint work with Len Schubert (Van Durme and Schubert, 2008). Chapter 5 is the result of joint work with Ting Qian and Len Schubert (Van Durme et al., 2008). Material from Chapter 6 is based on collaboration with Marius Paşca while at Google Inc. (Van Durme and Paşca, 2008). Chapter 7 is based on joint work with Phillip Michalak and Len Schubert (Van Durme et al., 2009b). Chapter 8 is based on work performed with Dan Gildea (Van Durme and Gildea, 2009).

This material is based upon research supported by National Science Foundation awards IIS-0328849, entitled "Deriving General World Knowledge from Texts by Abstraction of Logical Forms"; IIS-0535105, entitled "Knowledge Representation and Reasoning Mechanisms for Explicitly Self-Aware Communicative Agents"; and CCF-0910415, entitled "RI: Small: General Knowledge Bootstrapping from Text"; in addition to a University of Rochester Provost's Multidisciplinary Award (2008), entitled "Computational Psycholinguistics: Integrating Computational and Behavioral Methods to Study Human Language Processing". Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the above named organizations.


Abstract

The everyday intelligence of both humans and machines relies on a large store of background, or common-sense, knowledge. That such a knowledge base is not yet available to machines partially explains the community's inability to provide society with the sort of synthetic intelligence described by futurists such as Turing or Asimov. In response, a variety of methods for automated Knowledge Acquisition (KA) have emerged and are now being actively explored. Here I consider the extraction of knowledge that is conveyed implicitly, both within everyday texts and within queries posed to internet search engines. By recognizing certain forms of existential predicative patterns, and abstracting from these to more strongly quantifiable statements, I show that a significant amount of general knowledge can be gleaned based on how we talk about the world. I provide experimental results both for the direct extraction and strengthening of such knowledge, and for the automatic acquisition of supporting resources for this task. In addition, I draw attention to the relationship between automatically acquired background knowledge and natural language generic sentences. Humans use generics when they wish to directly assert the same sorts of "rules of the world" that are of concern to the KA community. And yet, there has been little recognition in applied circles that decades of work in formal linguistic semantics may have a role to play in the representation, and perhaps even the acquisition, of common knowledge.


Table of Contents

Curriculum Vitae
Acknowledgments
Abstract
List of Tables
List of Figures

1 Introduction
   1.1 Assumptions
   1.2 Overview of Thesis

2 Background
   2.1 Examples of Use
   2.2 Ways to Gather Knowledge
   2.3 Knowledge Acquisition
   2.4 Feasibility of Text-based Approach
   2.5 Knowledge from Documents
   2.6 Knowledge from Queries
   2.7 Summary

3 Framework and Challenges
   3.1 Framework
   3.2 Representation Language
   3.3 Resources
   3.4 Level of Abstraction
   3.5 The Interpretation Problem
   3.6 Causal Knowledge
   3.7 Reporting Bias and Quantifier Strength
   3.8 Summary

4 Comparison of Approaches
   4.1 TextRunner
   4.2 Dataset
   4.3 Extraction
   4.4 Evaluation
   4.5 Comparison
   4.6 Extracting from Core Sentences
   4.7 Class Properties
   4.8 Summary

5 Using Classes
   5.1 Attribute Extraction via Knext
   5.2 Experimental Setting
   5.3 Results
   5.4 Related Work
   5.5 Summary

6 Learning Classes
   6.1 Extraction Method
   6.2 Experimental Setting
   6.3 Evaluation
   6.4 Related Work
   6.5 Summary

7 Using Ontologies
   7.1 Resources
   7.2 Deriving Types
   7.3 Development
   7.4 Experiments
   7.5 Related Work
   7.6 Summary

8 Learning Soft Classes
   8.1 Generalizing Knowledge
   8.2 Experiments
   8.3 Related Work
   8.4 Summary

9 Conclusion
   9.1 Summary
   9.2 Looking Forward
   9.3 Closing Remarks

Bibliography

A Generics
   A.1 Individual and Stage-level Predication
   A.2 Formalization
   A.3 Truth Conditions

List of Tables

2.1  Classes used, their respective number of instances, and examples.
2.2  Precision at various ranks n.
2.3  Top 10 attributes per class.
2.4  Comparison with attributes from TREC questions.
2.5  Comparison with attributes from the CIA Factbook.
2.6  Comparison with attributes from user survey.
3.1  Car, Motorcycle, and Airplane crash statistics
3.2  Various Event Statistics
4.1  Top 10 Domains in TextRunner Corpus
4.2  Verbalized propositions concerning the class Person.
4.3  Results using original Knext metric
4.4  Average judgements for natural and core sentences
4.5  Top 10 attributes compared to Paşca and Van Durme (2007)
4.6  Average quality of discovered properties
5.1  Cut-off decision given the p/a ratio of an adjective
5.2  Extraction volume with and without using gazetteers
5.3  Average acceptability for top 10 attributes
5.4  Impact of filtering on volume
5.5  Average quality for unary attributes
5.6  Top 10 unary attributes for select classes
5.7  Example high-scoring phrases as ranked by Lin's metric
5.8  Top attributes extracted for the class Car
6.1  For J = 0.01, K = 30, number of classes whose size ≤ N
6.2  Size of sets for different J and K
6.3  Assessed quality of underlying instances.
6.4  Examples from instance assessment
6.5  Average quality of pairs
6.6  Interesting or questionable pairs.
6.7  Refinements for the classes writers and weapons
6.8  Number of classes with prenominal adjective
6.9  Examples of bad labels taken from small classes.
6.10 Comparison to coverage and precision results from the literature
7.1  Overly general WordNet senses
7.2  Development templates
7.3  Templates chosen for evaluation
7.4  Example results from WordNet based conditionals
7.5  Prec., Recall and F-score for coarse-grained WSD
7.6  Average quality for derived and alternative synsets
8.1  The 3 most probable arguments from topics 0, 1 and 6.
8.2  The 3 most probable templates from topics 27, 62, and 108.
8.3  Examples of propositional templates and arguments.
8.4  Results for drawing 100 arguments from various models.

List of Figures

2.1  Example from Epilog's logical-form generator
2.2  Example output from Knext
2.3  Precision of class attributes as a function of rank.
4.1  Instructions for categorical judging.
4.2  Instructions for scaled judging.
4.3  Assessments for natural and core sentences
4.4  Quality of attributes as function of diversity.
5.1  Instructions for scaled judging.
5.2  Comparisons to attributes from recent work
6.1  Algorithm for extracting ⟨instance, class label⟩ pairs.
6.2  Number of classes extracted.
7.1  Algorithm for deriving slot type restrictions
7.2  Example of a context provided for evaluation
7.3  Instructions for evaluating Knext propositions.
8.1  LDA Model in plate notation.
8.2  Gibbs sampling procedure.
8.3  Cross entropy of topic models with 3, 10 and 200 topics.
8.4  Histogram of evaluation results.
8.5  Pruning all but k-best topics.

1 Introduction

In this dissertation I consider the problem of using a machine to automatically acquire common-sense knowledge from textual resources. This knowledge takes the form of general tendencies or characterizing properties, gathered both from fully formed natural language sentences and from individual queries posed to an internet search engine. In particular, I focus on acquiring knowledge that is implicit in the data; this dissertation is not concerned with, for example, the direct interpretation of entries from encyclopedias and dictionaries. For example, in the following sentence:[1]

(1) John walked up to the house, and knocked on the door.

we have no trouble interpreting the final nominal as the door (of the house which John walked up to), because we all know that houses have doors, and that doors tend to be things a person may knock on, especially after having walked up to the house they are part of. These facts are commonly known, and are thus referred to as common (sense) knowledge. Notice there is nothing in the syntax of (1) that would directly lead us to the proper interpretation, where the second definite nominal is physically-part-of the first definite nominal. This is clear even before considering a syntactically similar sentence, one carrying different implicit assumptions:

(2) John sat at the table, and ordered from the menu.

[1] Due to Len Schubert, p.c.

In sentences (3-5) we can imagine having access to miscellaneous related facts, such as those respectively given in sentences (6-8).

(3) My dog ate his dinner quickly.
(4) The capital is full of traffic today.
(5) John can afford his medication.

(6) A dog's dinner is something I could, but wouldn't usually, consume.
(7) Capitals are the centers of government for things like states and countries.
(8) Medications usually have a cost, and they can be high.

Looking just at (3), we know in addition to (6) that: Dinner is the last meal of the day; Dogs bark; Dogs are pets; Eating involves physical ingestion of material; Eating quickly can lead to indigestion; and so on. When we focus consciously on our ability to understand language, we realize there is simply a large variety of things we know that are being actively made use of or ignored (such as the consumability of dog food), in both production and comprehension. If we want machines to mimic our own abilities, then they'll need access to this same large variety of knowledge.

That communication is rooted in a shared understanding that extends beyond basic rules of the language has been commented on by many, including Nunberg (1987):

    Suppose we put the problem in a schematic way. On the one hand, you have this extensive body of knowledge and assumptions–the collective sense–which underlies the use of natural-language expressions. A part of this knowledge is actually possessed by all discourse participants when they interpret utterances–this is what constitutes their "common-sense beliefs" in the accepted use of the term. [...] the collective sense does play a role in the interpretation of all utterances, even when I am ignorant of it. Whatever my internal state vis-a-vis the world, I make certain social commitments about the world when I use an expression, and these are determined by the collective sense.

Clark (1975), concerned with a certain type of Gricean implicature (Grice, 1975) that he took to be used in bridging inferences, wrote:

    These implicatures, though conveyed by language and a necessary part of the intended message, draw on one's knowledge of natural objects and events that goes beyond one's knowledge of language itself.

Prince (1978), in sketching out a system for distinguishing discourse commitments from an interlocutor's private beliefs, referred to common knowledge as stereotypical tacit assumptions (STAs), giving as examples:

    People have: parents, siblings, a spouse, relatives, a home, a job, a television, a clock, neighbors.

    Countries have: a leader, a president, a queen, a duke, citizens, land, borders, a language, a history.

Prince took these as a type of tacit assumption (TA) that was most likely to be shared by an arbitrary discourse participant, and thus could be taken as given, without ever being directly asserted. I discuss in Chapter 2 why we might expect to be able to get at such knowledge if it is, in fact, strictly assumed. Later chapters provide experimental results that overlap remarkably with the examples above, showing that at least some of this common sense can be gleaned automatically by a machine.

Within AI, the need to represent such general, sometimes complicated, and often context-dependent knowledge motivated Minsky (1974) to introduce frames:

    A frame is a data-structure for representing a stereotyped situation, like being in a certain kind of living room, or going to a child's birthday party. Attached to each frame are several kinds of information. Some of this information is about how to use the frame. Some is about what one can expect to happen next. Some is about what to do if these expectations are not confirmed.

Schank (1975), who had been working on similar notions independently, extended the terminology of frames to include scripts and plans: types of frames respectively meant for reasoning about events, and goals or behaviors. Schank wrote:

    What is a frame anyway? It has been apparent to researchers within the domain of natural language understanding for some time that the eventual limit to our solution of that problem would be our ability to characterize world knowledge. In order to build a real understanding system it will be necessary to organize the knowledge that facilitates understanding. [...] a frame is a general name for a class of knowledge organizing techniques that guide and enable understanding. Two types of frames that are necessary are SCRIPTS and PLANS. Scripts and plans are used to understand and generate stories and actions, and there can be little understanding without them.

For example, we might say there exists one or more bridging inferences, or that we are making use of a particular stored script, in understanding the two-sentence narrative:

(9) John heard steps behind him.
(10) He began to run.

This comes from Schubert and Hwang (2000), as an example of what they called implicit question-answering, which involves searching for corroborative or antagonistic connections between tentative explanations and predictions evoked by a new sentence and those evoked by prior sentences. This agrees with Schank (1975), who wrote: The inference process that is the core of understanding is not random but rather is guided by knowledge of the situation one is trying to understand; and with McCarthy (1959): a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.

From the linguistic pragmatics community, Simmons (2009) gave the following example in order to illustrate the complexity of knowledge that may be assumed in a discourse:

(11) Ann: Are we going to have a picnic?
(12) Bob: It's raining.

Here, in order for speaker Ann to take Bob's response as relevant, Ann's reasoning must include: Suppose Bob assumes that one doesn't picnic in the rain, an assumption licensed by the fact that this weather contingency is common knowledge. Taken as a whole, there is widespread agreement on the need to adequately represent and acquire large amounts of general world knowledge.

1.1 Assumptions

Within the field of Artificial Intelligence (AI), automatically acquiring common sense is described as an attack on the knowledge acquisition bottleneck: while we can design machines with sophisticated capabilities for planning and reasoning, many believe that synthetic human-level intelligence requires access to the same sorts (and scale) of knowledge that we as humans take for granted in our everyday interactions with each other and the world. While this bottleneck applies to knowledge pertaining to all aspects of intelligence, the work contained within this thesis is biased towards the knowledge we employ in our interactions via natural language.

At the same time, I follow common wisdom in assuming language understanding is AI-complete,[2] and as such, my focus on text is not meant to imply that the knowledge I am concerned with acquiring is limited in use only to Natural Language Processing (NLP). As said by Hobbs (1987):

    We use words to talk about the world. Therefore, to understand what words mean, we must have a prior explication of how we view the world.

Or more succinctly, to understand language means to understand the world; here I take for granted that an expansive knowledge base (KB) covering everyday facts, as derived from textual resources, will be of use to many within AI, both within NLP and beyond. Chapter 2 contains a few examples of such knowledge being used within AI, which I provide as motivation, not as definitive proof of utility.

In referring to knowledge I have in mind the sorts of things we might encode symbolically, using some type of formal representation, such as first-order logic. Missing from this document will be any epistemological discussion of what it really means to know something, and whether such knowledge can ever be captured symbolically. As with the utility of a common-sense KB, I will take it as a given that there are at least some areas of knowledge that can be symbolically represented, with an accompanying semantics that at least roughly corresponds to human intuition.

[2] An informal term which borrows the notion of completeness from computational complexity theory; an AI-complete problem is one such that, if you had a synthetic agent capable of handling that task, you would then be able to solve any other AI-complete problem. The commonly given example is computer vision (acknowledging that vision requires more complex sensory equipment than text-based interaction with a computer).

1.2 Overview of Thesis

Relevant background is provided in Chapter 2: a summary of the approaches to acquiring knowledge, an argument for the feasibility of the text-based approach, and a description of the two extraction systems I base the experimental results of this dissertation upon.

Chapter 3 places my work within the larger picture of learning and understanding generic knowledge. This dissertation is focused on establishing what is at least occasionally true about the world, which can then be strengthened based on what is most regularly stated or asked about. For example,[3] while both sentences (13) and (14) may be possible in a strict sense, I am concerned particularly with cases like (13).

(13) People may eat a hamburger.
(14) People may be abducted by aliens.

[3] Due to Len Schubert.

Gathering these possibilistic facts can be viewed as just one part of a larger, long-running problem of how to represent and acquire contextual descriptions that determine just when a given rule is applicable, how likely that rule is to apply, and what we expect as the consequences. The knowledge I am concerned with here can be considered "top level": properties or tendencies that tend to hold in typical situations, without further contextualization needed. But as I'll explain, typical or usually is a slippery notion, one that will eventually need to be better understood, even for the seemingly context-independent knowledge explored here. Within her examples, Prince claimed that we all know that Countries have a president, and also that Countries have a queen: respectively, these are only applicable in the context of democracies or monarchies, and yet we tend to omit these particulars when stating the generalization.

Chapter 4 describes the differences in goals and methodology between various text-based extraction systems. Chapter 5 shows the importance of gazetteers (collections of class/instance pairings) in extraction, especially in gathering what I call unary attributes. Chapter 6 gives a method for acquiring these gazetteers using Web-scale resources. In Chapter 7 I show how we can use existing word sense hierarchies in determining the proper level of concept abstraction, something we're unable to do using "flat" gazetteers. The results of this method can be viewed as constructing a limited form of conditional knowledge. In Chapter 8 I present a topic model based approach for performing soft class assignment, rather than the "hard" clusters or hierarchies represented by gazetteers and traditional ontologies. Finally, I close with a summary and concluding remarks on future directions for study.

2 Background

This chapter begins with examples from the AI community where researchers have made beneficial use of common-sense knowledge. I then provide an overview of the three established avenues for building up such a knowledge collection, one of which is text-based extraction. Focusing on text, I subdivide this approach into work that relies on understanding direct assertions (such as found in reference materials) versus systems that try to glean from natural discourse what is there only implicitly (or at least, not directly asserted). I then address a claim that pragmatic principles argue against the feasibility of finding common-sense knowledge in natural discourse. I transition from a linguistic counter-argument to a sort of existence proof, in the form of summary descriptions of two existing systems that perform text-based extraction. First I present Knext, a document-based system created by Lenhart Schubert, and then I summarize the initial work of Marius Paşca, along with myself and collaborators, focused on extracting class attributes from search engine query logs.

2.1 Examples of Use

Common-sense knowledge has already begun to be put to use within AI, a trend which will rise as the extent, accuracy, and complexity of knowledge acquisition techniques improve. Here I give just a few motivating examples, taken from the disciplines of vision, language, and assisted cognition.


Vision. Rabinovich et al. (2007) describe experiments in Computer Vision that made use of knowledge in object classification. The small number of labels available in their training data was expanded by using Google Sets[1] (Tong and Dean, 2003), a tool allowing one to dynamically create clusters of related terms based on a small set of initial seeds (examples provided to an algorithm designed to "grow" the size of the set). For example, knowing that a tennis racket is more likely to be in the same context as a tennis ball than a lemon was shown to be useful in differentiating a collection of pixels comprising a small, round, green-yellow blob.

[1] http://labs.google.com/sets

Language. Koo et al. (2008) were interested in improving the quality of a syntactic dependency parser through the use of large quantities of unlabelled data. This was done by building a binary tree over all unique words in a corpus through agglomerative clustering, where clusters were merged so as to minimize the KL divergence[2] between the pre- and post-merge vocabularies. While their approach relied exclusively on a bigram language model to validate merging decisions, examples provided by the authors showed resultant sub-trees with clear signs of semantic separation. Because large scale textual resources were used to induce a (lightweight) ontology that then led to measurable gains in parsing accuracy, I take this as one of the first examples of automatically acquired (simple) knowledge being used to help process language. Work is currently under way at a number of research institutions to make use of more structured symbolic knowledge to improve parsing.

[2] An entropy-based measure, the Kullback-Leibler (KL) divergence is usually defined using base-2 log: KL(p ∥ q) = Σ_x p(x) log₂ (p(x) / q(x)), with p and q representing probability distributions over x ∈ X.
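To make the footnote's definition concrete, here is a minimal sketch in Python; the two example distributions are invented for illustration:

```python
import math

def kl_divergence(p, q):
    """Base-2 Kullback-Leibler divergence between two discrete
    distributions, given as dicts mapping outcomes to probabilities.
    Assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# Invented example: two distributions over a three-word vocabulary.
p = {"dog": 0.5, "cat": 0.3, "lemon": 0.2}
q = {"dog": 0.4, "cat": 0.4, "lemon": 0.2}
print(kl_divergence(p, q))  # positive, and 0 only when p equals q
```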

Assisted Cognition. Pentney et al. (2006) described a monitoring system for assisting the elderly or disabled in their homes, showing that basic knowledge about household items and activities could be used to improve the results of a probabilistic inference model for performing scene understanding. Their data consisted of readings from RFID tags placed on a large number of everyday items, such as would be found in a typical household. Subjects wearing a special sensor bracelet would then move about a mock household performing tasks the system was meant to categorize. From the Open Mind Indoor Common Sense collection (Gupta and Kochenderfer, 2004) their system knew things like, People eat when they are hungry, with likelihood assignments given to these facts based on frequencies derived from web-based knowledge extracted by the KnowItAll system (Etzioni et al., 2004). The authors showed an 8% improvement in labelling accuracy (from 80 to 88%) when their system used this outside knowledge to initialize their model.

2.2 Ways to Gather Knowledge

If common sense is a useful thing for our applications to have, how then should we go about collecting it? Methods for building up such repositories may be divided into three categories:

• Knowledge Engineering: pay knowledge engineers to enter it manually.

• Crowd Sourcing: pay or otherwise encourage a large number of volunteers to each make minor, manual contributions.

• Information Extraction: acquire the knowledge automatically from existing texts.

2.2.1 Knowledge Engineers

The Cyc project (Lenat, 1995) is the best-known attempt to encode a comprehensive common-sense KB by employing professional knowledge engineers. Unfortunately, decades into this effort the community has yet to see a blossoming of AI based on their work. This reflects a difficulty in relying on expert human efforts: people can write down knowledge only so quickly, and there is a lot of it that needs to be encoded. This motivates the crowd sourcing approach, described next: figure out a way to greatly increase the number of annotators (without having to pay them all full-time salaries).

A separate difficulty, touched on by Morbini and Schubert (2009), involves not the comprehensiveness, but the form in which Cyc's knowledge is encoded, which bears little in common with natural language, and thus appears difficult to integrate into systems that have been built assuming "language based" knowledge.

The problems inherent in professional knowledge engineering become especially challenging when considering the sorts of knowledge that require experts in a given domain. Results from Project Halo (Friedland and Allen, 2003) suggest the cost of properly encoding a single page of knowledge from a textbook on Chemistry to be $10,000. While I believe these costs will continue to decline as better knowledge authoring tools are developed, it remains the case that relying on human annotators gives up the modern-day advantage of being able to run hundreds or even thousands of computers autonomously, day and night.

2.2.2 Crowd Sourcing

An alternative method, still reliant on human effort, is to try crowd sourcing[3] the problem: replacing professional knowledge engineers with the wisdom of the crowd, as advocated by Stork (1999). Since everyone knows common-sense things, anyone should be able to contribute, especially if one uses a transmission medium that is easily understood: natural language, instead of, e.g., CycL (the Cyc knowledge representation language). The hope is that one can get a very large number of volunteers to each make a small contribution that, when added up, will greatly surpass the quantity of knowledge one could acquire through hiring a small number of people professionally. Singh (2002) describes the Open Mind Common Sense project, which is the best-known of these efforts, while more recently, the Verbosity project (von Ahn et al., 2006) was introduced under the growing set of Games with a Purpose.[4]

[3] A relatively new term, usually meaning to put a problem online, where (potentially anonymous) volunteers may each perform a small amount of effort, together totaling a massive contribution towards its solution. Amazon's Mechanical Turk (https://www.mturk.com/mturk/) is the best-known platform for general crowd sourcing.

[4] http://www.gwap.com/gwap/

2.2.3 More Automation Needed

These efforts show that human feedback is a useful tool in knowledge acquisition, but they cannot escape the fact that even in the case of crowd sourcing, human time is vastly more scarce than computational cycles. A better strategy involves letting our computers do as much of the work for us as is deemed possible, and only then employing humans to filter and provide additional suggestions. This is especially true in the case of simple relations like hypernym patterns (e.g., a dog is-a animal), which have been known for years to be amenable to automatic extraction (Hearst, 1992; Snow et al., 2005), as sketched below. Hobbs and Navarretta (1993) provided a nice description of how experts and machines might work together, laying out a pipeline for KA similar to that used today.[5]

More recently, Chklovski (2003b) instilled the capability of simple reflection into the Open Mind system, allowing it to prioritize the questions it would pose to volunteers. For example, given that newspapers and books both have pages, contain information, etc., then if the system is told that a book may be burned, it becomes worthwhile to ask whether this also holds for newspapers. This thread of work continued in the Learner project (Chklovski, 2003a), with the goal of maximizing the impact of whatever time volunteers were willing to share.

Note as well the relevance of research on building tools to assist lexicographers and grammar writers (see, e.g., Sekine et al. (1992), or Mitamura et al. (1993), and more recently the OntoNotes project of Hovy et al. (2006)). By considering the distributional properties of terms or linguistic structures that appear in a corpus, systems can suggest or partially fill in information that seems most likely to follow when compared to previous annotations.

As the quality and complexity of fully automatic knowledge extraction improves, we should expect to see more such "mob/machine" collaboration efforts along these lines; this is evidenced by work such as that of Snow et al. (2008), and recently Hoffman et al. (2009), directly aimed at this hybrid arrangement for managing content in Wikipedia.

[5] Although that work pre-dates the exceptionally large datasets and computational horsepower currently available, and thus relies more heavily on the human component than seen in contemporary research.
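As an illustration of how such hypernym patterns operate, the following is a toy sketch of a single Hearst-style pattern. The regular expression and example sentence are invented simplifications; Hearst's method, and later systems such as Snow et al.'s, match many patterns over parsed noun phrases rather than raw strings:

```python
import re

# One classic Hearst-style pattern, "Y such as X", yielding the
# candidate relation (X, is-a, Y). This regex version is only a toy:
# real systems match many such patterns over parsed noun phrases.
SUCH_AS = re.compile(r"(\w+(?: \w+)?) such as ((?:the )?\w+(?: \w+){0,2})")

def hearst_pairs(sentence):
    """Return (instance, 'is-a', class) candidates from one sentence."""
    return [(x.lower(), "is-a", y.lower())
            for y, x in SUCH_AS.findall(sentence)]

print(hearst_pairs("He plays bow lutes such as the Bambara ndang."))
# -> [('the bambara ndang', 'is-a', 'bow lutes')]
```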

2.2.4 Information Extraction

The last few decades have resulted in an internet carrying massive amounts of textual data, along with powerful computers for storage and processing. This combination gave birth[6] to the field of large scale Information Extraction (IE), an area of research spanning Information Retrieval, Machine Learning/Applied Statistics, Databases, Knowledge Representation and Natural Language Processing.

To date, the majority of the IE community has focused on fact extraction, which looks for information about specific individuals: locations, dates, times, etc. This sort of knowledge is useful for tasks such as answering simple questions, e.g., Who invented the traffic cone? (Voorhees and Tice, 2000). The goal here is acquiring more general sorts of knowledge, dealing with how the world operates or tends to be. For example, rather than being able to fill the empty slot in X invented the traffic cone, I am concerned with knowing that People may invent things. I will refer to this pursuit as a subtopic of IE, called Knowledge Acquisition (KA).

[6] However, projects now describable as open domain IE have existed at least as far back as Frump (DeJong, 1982), which was built prior to the easy availability of massive datasets and the horsepower to exploit them.

2.3 Knowledge Acquisition

(Text-based) Knowledge Acquisition is also known as Machine Reading, Knowledge Base Formation, Learning by Reading (LbR), and Open Information Extraction; it strongly overlaps with the area of Computational (Lexical) Semantics.

Topics within KA include the discovery of: hypernym relations, e.g., Bambara ndang is-a bow lute (Hearst, 1992; Ponzetto and Strube, 2007); general propositions, e.g., Children may live with relatives (Schubert, 2002; Liakata and Pulman, 2002; Clark et al., 2003); characteristic attributes of concept classes, e.g., Countries may have presidents (Paşca and Van Durme, 2007; Almuhareb and Poesio, 2004); paraphrase rules, e.g., X wrote Y if and only if X is the author of Y (Lin and Pantel, 2001; Bhagat, 2009); and common verb-verb relations, e.g., buy happens-before sell (Chklovski and Pantel, 2004). Closely related are the tasks of Semantic Role Labelling (SRL) (Gildea and Jurafsky, 2002; Swier and Stevenson, 2004); deriving selectional preferences (Zernik, 1992; Resnik, 1993b; Clark and Weir, 1999); extracting partial predicate-argument structure (Abney, 1996); and mapping natural language into database queries (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007).

2.3.1 Approaches

Approaches to text-based Knowledge Acquisition can be divided into two categories:

• Understanding Direct Assertions: such as reading the contents of a dictionary (e.g., eXtended WordNet[7] (Harabagiu et al., 1999)) or an encyclopedia (e.g., MindNet (Richardson et al., 1998)).

• Understanding Implicit Assertions: either through indirect transmission, such as via presuppositional contexts, or by abstracting from examples.

It is this latter method that I explore in this thesis: acquiring knowledge that is taken to be implicit by the writer, where we hope to glean an understanding of the world through the way people talk about it, without their telling us directly. This is not to discount acquiring knowledge from direct assertions, which is an area of active study (see in addition, e.g., Weld et al. (2008)).

[7] http://xwn.hlt.utdallas.edu/

2.4 Feasibility of Text-based Approach

It has been previously suggested (Havasi et al., 2007) that the Cooperative Principle implies that common-sense knowledge would be rarely expressed, and thus too difficult to extract from text:[8]

    Grice's theory of pragmatics [here: (Grice, 1975)] states that when communicating, people tend not to provide information which is obvious or extraneous. If someone says "I bought groceries", he is unlikely to add that he used money to do so, unless the context made this fact surprising or in question. This means that it is difficult to automatically extract common-sense statements from text, and the results tend to be unreliable and need to be checked by a human.

We should first note that Gricean principles lead to different operating constraints depending on the type of discourse being considered; encyclopedic content, such as found in Wikipedia, is certainly textual, but is of a different sort than the natural dialogue apparently referred to by Havasi et al. (2007). In particular, what can or should be taken for granted changes when authoring a reference work, as compared to speaking with a friend. The following are the first sentences of the Wikipedia pages[9] for Groceries,[10] Retailing and Sale.

(15) A grocery store is a store established primarily for the retailing of food.

(16) Retailing consists of the sale of goods or merchandise from a fixed location, such as a department store, boutique or kiosk, or by mail, in small or individual lots for direct consumption by the purchaser.

(17) A sale is the pinnacle activity involved in selling products or services in return for money or other compensation.

[8] Havasi (p.c.) explains that Havasi et al. (2007) should be interpreted as saying merely that we shouldn't expect to acquire all common knowledge from text. My reply here, whether to a straw man or not, remains a useful exercise.

[9] As referenced at 5:48 PM EST, September 19, 2009.

[10] The term Groceries redirects to the page for Grocery Store.

While I would agree that the current state of NLP technology should not be trusted to adequately put those sentences together in order to ascertain People buy groceries with money, it is still the case that the correct knowledge is expressed in (a chain of) direct assertions in the above sentences. Thus we should at least not blame implicitness in this example for our inability to automatically acquire this proposition.

But even for more natural discourse, the point is moot. According to a Gricean analysis, we should not expect speakers to provide in a discourse utterances whose primary intent is to convey basic shared knowledge, such as in sentence (18) or (19). However, this does not prevent such knowledge from being expressed through indirect means, such as through presuppositions arising from referring expressions, as seen in (20) and (21).

(18) ? I used money to buy groceries.
(19) ? I bought groceries using money.
(20) I forgot money to buy groceries.
(21) I forgot money for buying groceries.

It may be argued in reply that only direct assertions of common-sense knowledge may be clearly and unambiguously interpreted. In this case, used money to buy is no more meaningful than money for buying. That is, (18) and (19) leave as much room for the interpretation of the role of money in a (grocery) buying act as do (20) or (21).[11]

Note as well that a one-word addition to (18) gives us sentence (22), an acceptable statement, with a definite NP bearing an existential presupposition (Prince, 1978). It is hard to see the difficulty in getting from this to Money may be used to buy groceries.

(22) I used the money to buy groceries.

[11] One might then try to argue that neither indirect nor direct mentions may explicitly convey common-sense assumptions, owing to a possible disconnect between language and the internal representation of knowledge. This is a more complicated line, which I will not follow here, except to state directly that I am operating under the assumption that the internal representation of knowledge is strongly aligned with its expression in language (see Schubert and Hwang (2000) and Chapter 3 for discussion).

In order to show that this point extends beyond referring expressions, consider sentences (23) and (24). With respect to most groups, the first statement is common sense, and thus its utterance would not be cooperative. This as opposed to the second sentence, which in most contexts would convey novel information (excepting discourse such as held at an ornithological conference), and entails the more general fact conveyed by sentence (23).

(23) ? Birds fly.
(24) Birds fly on average 10.7 meters per second.

We can verify that the main assertion of sentence (23), implicit in sentence (24), is not the primary assertion of the utterance by relying on a test employed by von Fintel (2004), adapted from Shannon (1976). Under the Wait a minute! I didn't know that S! test, sentence (25) succeeds, while sentence (26) does not. This suggests that the assertion, S = Birds fly, is presuppositional.[12]

(25) (In response to (24)) Hey wait a minute! I didn't know that birds could fly!

(26) # Hey wait a minute! I didn't know that birds could fly on average 10.7 meters per second!

This verifies that there exist types of common-sense knowledge, beyond that found in referring expressions, that can be cooperatively expressed in natural discourse without being directly asserted. Thus it seems feasible that we might automatically extract at least some quantity of common-sense knowledge through semantic interpretation of text. I will further support this by describing a system in the following section, Knext, which does in fact exploit these existential and modifying contexts to automatically mine common knowledge.

[12] In judging the acceptability of (25) I am of course ignoring the pragmatic issue that everyone knows birds can fly.

A seemingly distinct approach, used both by Knext (in some cases) and relied on entirely by the second system given below, takes direct assertions or queries about individuals, and abstracts from these examples to form general statements about broader concept classes. For example, in sentence (27), if we know Tweety is-a bird, then we know it is possible that Birds fly.

(27) Tweety flew out of the room.

I would suggest that these approaches are in fact based on the same underlying principle. In order to utter (27) I need knowledge of the language, in the form of sortal restrictions,[13] that authorizes things of type bird to be in subject position to the verbal predicate fly. Further, I assume that as a native speaker, the hearer has this same knowledge of the language. Therefore most sortal restrictions, as well as many examples of what are formally classified as presuppositions, might be labelled, under the terminology of Prince (1978), as tacit assumptions by the speaker, or here, knowledge that is being expressed, or assumed, implicitly. In short, speakers assume hearers bring certain knowledge to bear in a discourse. As shown here, it seems we should be able to glean evidence of these assumptions by analyzing sentences of certain forms, even if they do not state these assumptions directly.

[13] Sortal incorrectness is exemplified by sentences such as The color of copper is forgetful (Thomason, 1972).

2.5 Knowledge from Documents

The Knext project is aimed at extracting structured knowledge from natural text,[14] and was first described by Schubert (2002). Logical forms are based on Episodic Logic (Schubert and Hwang, 2000), a formalism designed to accommodate in a straightforward way the semantic phenomena observed in all languages, such as predication, logical compounding, generalized quantification, modification and reification of predicates and propositions, and event reference. An implementation of EL exists as the Epilog system (Schaeffer et al., 1993), which supports both forward and backward inference, along with various specialized routines for dealing with, e.g., color, time, class subsumption, etc. Epilog is under current development as a platform for studying a notion of explicit self-awareness as defined by Schubert (2005).

[14] As compared to, e.g., tables and lists that may be spidered and scraped in bulk from the internet.

As an indication of EL's NL-like syntax, Figure 2.1 contains the output of Epilog's parser/logical-form generator for the sentence, Many athletic youngsters want to become professional athletes.

    (Some e0: [e0 at-about Now0]
       [(Many x: [x ((attr athletic.a) (plur youngster.n))]
           [x want.v (Ka (become.v (plur ((attr professional.a) athlete.n))))])
        ** e0])

    Figure 2.1: Example EL formula: square brackets indicate a sentential infix syntax of form [subject pred object ...], Ka reifies action predicates, and attr "raises" adjectival predicates to predicate modifiers; e0 is the situation characterized by the sentence.

Knext operates by attempting to abstract world knowledge "factoids" from documents, based on the logical forms derived from parsed sentences. The idea is that nominal pre- and post-modifiers, along with subject-verb-object relations, captured in logical forms similar to those seen below, give a glimpse of the common properties and relationships in the world – even if the source sentences describe invented situations. By focusing on those syntactic elements that are most likely to be correctly analyzed by state-of-the-art parsers, Knext is able to extract knowledge from a wide variety of open domain text.

An example from Schubert and Tong (2003) is provided in Figure 2.2, giving the factoids (both pre- and post-verbalization) obtained from the following sentence (stemming from the Brown corpus): Rilly or Glendora had entered her room while she slept, bringing back her washed clothes.

    A NAMED-ENTITY MAY ENTER A ROOM.
    A FEMALE-INDIVIDUAL MAY HAVE A ROOM.
    A FEMALE-INDIVIDUAL MAY SLEEP.
    A FEMALE-INDIVIDUAL MAY HAVE CLOTHES.
    CLOTHES CAN BE WASHED.

    (:I (:Q DET NAMED-ENTITY) ENTER[V] (:Q THE ROOM[N]))
    (:I (:Q DET FEMALE-INDIVIDUAL) HAVE[V] (:Q DET ROOM[N]))
    (:I (:Q DET FEMALE-INDIVIDUAL) SLEEP[V])
    (:I (:Q DET FEMALE-INDIVIDUAL) HAVE[V] (:Q DET (:F PLUR CLOTHE[N])))
    (:I (:Q DET (:F PLUR CLOTHE[N])) WASHED[A])

    Figure 2.2: Example output from Knext: upper-case sentences are automatically generated verbalizations of the abstracted LFs shown beneath them. Keywords like :i, :q, and :f are used to indicate infix predication, unscoped quantification, and function application.

Knext's extraction procedure is as follows (a toy illustration appears below):

1. Parse each sentence using a Treebank-trained parser (Collins, 1997; Charniak, 2000).

2. Preprocess the parse tree, for better interpretability (e.g., distinguish different types of SBAR phrases and different types of PPs, identify temporal phrases, etc.).

3. Apply a set of 80 interpretive rules for computing unscoped logical forms (ULFs) of the sentence and all lower-level constituents in a bottom-up sweep; at the same time, abstract and collect phrasal logical forms that promise to yield stand-alone propositions (e.g., ULFs of clauses and of pre- or post-modified nominals are prime candidates). The ULFs are rendered in Episodic Logic (e.g., Schubert and Hwang (2000)), a highly expressive representation allowing for generalized quantifiers, predicate modifiers, predicate and sentence reification operators, and other devices found in NL. The abstraction process drops modifiers present in lower-level ULFs (e.g., adjectival premodifiers of nominal predicates) in constructing higher-level ULFs (e.g., for clauses). In addition, named entities are generalized as far as possible using several gazetteers (e.g., for male and female given names, US states, world cities, actors, etc.) and some morphological processing.

4. Construct complete sentential ULFs from the phrasal ULFs collected in the previous step; here some filtering is performed to exclude vacuous or ill-formed results.

5. Render the propositions from the previous step in (approximate) English; again, significant heuristic filtering is done here.

The initial development of Knext was based on the use of hand-constructed parse trees in the Penn Treebank version of the Brown corpus. In later chapters I describe experiments with Knext that required extending the system to make use of parse trees obtained with statistical parsers applied to larger corpora, such as the British National Corpus (BNC), a 100 million-word, mixed genre collection, along with Web corpora of comparable size (see Chapters 4 and 5 for details). Today Knext may be considered a fully automatic open domain knowledge extraction system.[15] Methods used in the evaluation of Knext are first discussed in Chapter 4.

[15] As the required extensions all fall under the label of straightforward engineering, the term "fully automatic" or at least "almost fully automatic" could be applied even to the original 2002 system of Schubert.
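Knext itself interprets full parse trees into Episodic Logic, but the flavor of the generalization in step 3 and the verbalization in step 5 can be suggested with a greatly simplified sketch. Everything below (the triple-shaped input, the two gazetteer entries, the verbalization template) is an invented stand-in, not the system's actual machinery:

```python
# Toy illustration only: generalize argument terms via a small
# gazetteer, then verbalize the result as an English factoid.
# The gazetteer entries and template below are invented stand-ins;
# Knext itself abstracts over Episodic Logic forms, not raw triples.
GAZETTEER = {"glendora": "female-individual", "rilly": "named-entity"}

def abstract_and_verbalize(subj, verb, obj):
    subj = GAZETTEER.get(subj.lower(), subj)
    obj = GAZETTEER.get(obj.lower(), obj)
    return f"A {subj} may {verb} a {obj}.".upper()

print(abstract_and_verbalize("Glendora", "enter", "room"))
# -> A FEMALE-INDIVIDUAL MAY ENTER A ROOM.
```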

2.6 Knowledge from Queries

The following summarizes the efforts of Paşca and Van Durme (2007) and Paşca et al. (2007), work which differs from the sentence-based, deeper linguistic analysis discussed in the previous section.

2.6.1 Initial Investigation

In contrast to prior work in pattern-based information extraction (as compared to the interpretive approach of, e.g., Knext), which was primarily focused on the recognition of predefined relations as expressed in documents, Paşca and Van Durme (2007) built a data-driven system for the acquisition of arbitrary properties, or attributes, of concept classes, using search engine query logs. For example, instances of the class Country may have a Capital City, or instances of the class Painter may have the attribute Works. By searching for the attributes ascribed to individual members of a specified class, it should be possible to determine those attributes that are most commonly shared. This is especially true when those instances are named entities such as France or President Carter, which tend to refer unambiguously.

The use of search engine queries as a data source was motivated by the hypothesis that the attributes that define a class tend to be what people ask questions about. Which is to say, the relative frequency of queries posed to a search engine, for instances of some semantic class, implies the possession and relative importance of certain attributes by that class. If users frequently search for the capital city of things we know to be countries, then this strongly suggests it as a conceptual attribute. The approach was as follows (a toy sketch of steps 1-3 appears after the list):

1. Specify target classes, e.g., (Linux, MacOS, Solaris, ...) for the class OperatingSystem.

2. Collect candidate attributes through domain-independent patterns, e.g., X of Y. For example, the query string "kernel size of Linux" gives Kernel Size for class instance Linux.

3. Rank candidates based on weighted aggregate frequency, where full fledged natural language queries such as "what is the population of iquitos" contribute more than patterns manually determined a priori to be less promising.

4. Heuristically filter the candidate list: discard a candidate if it is part of a known proper name, e.g., (Toyota) of Boston; and discard attributes subsumed by one of higher rank, where subsumption was determined by substring matching,[16] e.g., Street Map vs. Map.

[16] Which usually translates to an assumption of attributes being head final, with modifiers being subsective.

Class

Size

Examples of Instances

Drug

346

Ibuprofen, Tobradex, Prilosec

Company

738

Ford, Xerox, Longs Drugs Stores

Painter

1011

Georgia O’Keefe, Ossip Zadkine

City

591

Hyderabad, Albuquerque, Tokyo

Country

265

Yemen, India, Paraguay, Egypt

Table 2.1: Classes used, their respective number of instances, and examples.

Map vs. Map. Initial experiments as reported in Pa¸sca and Van Durme (2007) focused on five concept classes, as seen in Table 2.1. Drug, Company, Painter, City, and Country were used as representative classes that correspond respectively to the general categories of artifacts, organizations, roles, and geographical entities. We had the goal of discovered attributes being relatively distinct for a given class (not be shared by a large number of other concepts). For example, users regularly search for the photo or picture of all sorts of instances, yet we did not believe these to be attributes of much importance. This notion of importance was based on trying to find those attributes that were most characteristic of a given class. For example, we would prefer Studio for Painter as compared to Daughter. Note that basing the supposed importance of an attribute on distinctiveness is an imperfect solution. Consider the attribute Cost for the class Drug; most would consider cost an important attribute for a wide variety of classes, e.g., Car, Food, etc., and therefore a selection criterion based purely on specificity could remove attributes that humans presumably would not. I describe in Chapter 7 a hierarchical method for determining the proper level of conceptual abstraction that an attribute should apply, which can be seen as a more informed version of the distinctiveness approach. I believe determining relative attribute importance using something other than occurrence frequency will require application-specific metrics (how the knowledge is actually being used), a topic not explored in this dissertation.

24

Figure 2.3: Precision as a function of rank. Dotted lines represent lists that have been heuristically filtered; black stands for lists reranked according to Scorereranked ; grey represents Scoref requency .

2.6.2

Measuring Accuracy

The score of an attribute, A for a given class, C was originally calculated based on weighted query log frequency, with weights determined ad hoc during development:

Scoref requency (C, A) = Wf (C, A). In order to filter non-class specific attributes, a Pointwise Mutual Informaiton (PMI)like factor (Turney, 2001) was added:

Scorereranked (C, A) = Wf (C, A) × Sf (C, A) × log

Wf (C, A) × N , Wf (C) × Wf (A)

where N is the total frequency over all (class, attribute) pairs and Sf (C, A) is a smoothing factor to prevent emphasis of rare attributes (also determined through development). Precision results at various rank n for the classes City and Drug are presented in Figure 2.3. We found that heuristic filtering had a dramatic effect on accuracy, with reranking by specificity having little effect.

25

Precision Class @10

@20

@30

@40

@50

Country

0.95

0.93

0.90

0.89

0.82

Drug

0.90

0.93

0.83

0.79

0.79

Company

0.90

0.75

0.63

0.56

0.50

City

0.85

0.73

0.72

0.66

0.65

Painter

0.95

0.93

0.85

0.84

0.75

Table 2.2: Precision at various ranks n.

Accuracy was determined based on labeling by hand the top 500 entries returned for each class in our initial experiments, then adding supplemental labels as needed if later revisions of the system brought previously unseen attributes into the top 50. Attribute candidates were labeled according to a simple, sliding scale; the labels vital, OK, wrong were given an accuracy weight of 1, 0.5 and 0. For example, (Country, President) would be vital, while (City, Restaurant) would merely be OK. Attributes such as Users for Drug or Diary for Painter were considered wrong. Quality was deemed as high across all classes at rank 10 (see Table 2.2). Going to rank 50 we notice a significant drop in precision. Especially in the case of Company, a number of errors at deeper rank come from the lesser senses of class instances. For example, Volkswagen may refer to the company, or the cars that the company produces (or the dealerships that sell the cars, etc.). Table 2.3 gives the top 10 attributes as extracted for each class. Note in particular the overlap between the attributes for Country and the examples17 I gave earlier from Prince (1978). 17

Countries have: a leader, a president, a queen, a duke, citizens, land, borders, a language and a

history.

26

Country

capital, population, president, map, capital city, currency, climate, flag, culture, leader

Drug

side effects, cost, structure, benefits, mechanism of action, overdose, long term use, price, synthesis, pharmacology

Company

ceo, future, president, competitors, mission statement, owner, website,organizational structure, logo, market share

City

population, map, mayor, climate, location, geography, best, culture, capital, latitude

Painter

paintings, works, portrait, death, style, artwork, bibliography, bio, autobiography, childhood Table 2.3: Top 10 attributes per class.

2.6.3

Measuring Coverage

We build systems for acquiring knowledge for the obvious reason that we don’t currently have it all collected and stored. This makes it difficult to assess the coverage (recall) of an extraction algorithm. Downey et al. (2005) proposed a generative model with strong assumptions on the correlation between the relative frequency of a fact expressed in text with its real-world likelihood. While promising, this model is meant for filtering errors from results that actually appear in text; there’s nothing in the model that suggests whether there remains additional knowledge that remains unmentioned in a corpus. Lacking a standard methodology for evaluating coverage, we performed a number of limited comparisons against established resources and user responses. While the results below qualitatively suggest reasonable coverage by our method, it is important to note that these are with respect to somewhat “clean” or “standard” classes, such as Country. It may be that user queries are less informative for less popular concepts (see Pa¸sca et al. (2007) and Pa¸sca and Van Durme (2008) for results on more diverse concept collections), and we simply do not have a proper gold standard to inform us.

27

Attribute

Rank

Attribute

capital

1

emperor

population

2

date of independence

Rank 50

Prime Minister

19

currency

6

leader

10

area

23

capital city

5

Queen

76

President

3

GDP

32

size

20

Table 2.4: Attributes from TREC questions regarding Country, ranked by frequency (top to bottom, left to right), with their corresponding positions in our rank list.

TREC Questions Figure 2.4 lists the 13 most queried for attributes of Country in the first 1,893 questions used in the Question Answering track of TREC (Voorhees and Tice, 2000). As seen, six of the top ten (1, 2, 10, 5, 3, 6) of our discovered attributes are direct string matches to the entries within the top ten TREC queries based on countries.

CIA Factbook As another evaluation for the class Country, we compared our results against the attributes defined in the country tables of the CIA Factbook18 . Unsurprisingly, we find that simple attributes such as Flag or Map are discovered, as compared to those such as Household income consumption by percentage share or Manpower reaching military service age annually. Figure 2.5 gives examples of this comparison. 18

http://www.cia.gov/cia/publications/factbook

28

Attribute

Rank

natural resources

21

terrain

132

climate

7

Attribute

Rank

irrigated land

-

administrative divisions

-

infant mortality rate

-

Table 2.5: Examples of our comparison for the class Country, against the CIA Factbook.

Class

Attribute

Rank

Class

Attribute

Painter

nationality

Drug

Rank

30

Painter

influences

is it addictive

-

Painter

awards

-

Drug

side effects

1

Country

income

88

Company

is it nonprofit

-

Country

neighbors

-

Company

competitors

4

City

mayor

3

City

quality of living

-

City

taxes

30

11

Table 2.6: Examples from user survey, matched against our results.

User Survey Finally, we performed a limited survey in which participants were given two example classes with associated attributes,19 then asked to do the same for the five target classes. Users were not constrained to any minimum or maximum length. The most useful result of this survey was to highlight the lack of attention we had paid to unary attributes,20 such as Addictive? for Drug or Nonprofit? for Company. The example results shown in Figure 2.6 are based on conservatively equating the free-form responses of participants to those in our result set. 19

Dog and Planet, with examples such as {Size, Color, Breed} and {Composition, Distance from

Earth, Moon(s)}. 20

So-called because of their correspondence to unary predications, e.g., Addictive(x).

29

2.6.4

Comparison of Data Sources

That search engine query logs may provide better results for the task of pattern-based class attribute extraction is supported the work of Pa¸sca et al. (2007). For an increased set of 20 classes,21 extraction results were compared between a system based on 50 million queries versus one run over 100 million web documents. Both systems were based on the same manually defined query patterns. The one meaningful difference in extractors was that, in the case of documents, more effort was needed in matching entities. Specifically, simple rules for determining membership within a larger NP were defined to limit faulty matches. As reported, the sentence “Human activity has affected Earth’s surface temperature during the last 130 years.” provides a valid match to the HeavenlyBody instance, Earth, while the sentence “The market share of France Telecom for local traffic was 80.9% in December 2002.” does not provide a match for the Country instance France. The accuracy of the query log-based system was higher for all classes. At rank ten, the most dramatic differences came from such classes as BasicFood (0.25, 1.0), or HeavenlyBody (0.35, 1.0), with the given pairs referring to average precision for documents and queries, respectively. It would appear that this difference in accuracy can be most strongly attributed to two factors: (1) the primary sense of a term appears to dominate more strongly in queries than in web documents; (2) our extraction patterns, such as X’s Y, or Y of X less ambiguously denote attributes when used in a query. For example, in reference to point (1), the incorrect attribute Temple extracted from documents for the class HeavenlyBody suggests a greater volume of document-based text dedicated to Roman gods than in our query logs. This effect does not necessarily show queries to be better than documents for all classes; a separate study may be beneficial to compare accuracy for classes with a high concentration of strongly ambiguous terms. 21

Actor, BasicFood, CarModel, CartoonCharacter, City, Company, Country, Drug, Flower, Heav-

enlyBody, Mountain, Movie, NationalPark, Painter, ProgLanguage, Religion, SoccerTeam, University, VideoGame, and Wine.

30

And yet, these results do suggest that at least for whichever sense is dominant according to users of search engines, query logs are a more useful resource for attribute extraction, when relying on shallow extraction patterns. As to (2), the document derived attribute Bowl for the class BasicFood reinforces our hypothesis that query logs are specifically biased towards attributive uses of our extraction patterns.

2.7

Summary

In this chapter I first provided an overview of methods for building a large scale commonsense knowledge base: (A) pay professionals; (B) ask volunteers; or (C) acquire it automatically from text. While these methods may be combined, this dissertation focuses purely on the third option. Within the text-based approach I made a distinction between interpreting knowledge directly stated – including both factoids and more general knowledge (e.g., Albany is the capital of New York, or, as taken from a hypothetical reference entry, A bus is a type of automobile) – and knowledge that is implicit, to be acquired through means such as induction from examples, understanding presuppositional contexts, and exploiting certain textual modalities (e.g., search engine query logs). I presented a counter-argument to statements made by proponents of option (B), and then gave descriptions for the two extraction projects I base the work of later chapters upon.

31

3

Framework and Challenges

The previous chapter provided background on the task of implicit knowledge acquisition from textual resources, including a summary of two projects on which I base subsequent chapters. Here I lay out the framework these efforts assume, which I then use to describe some of the challenges in implicit KA, including the need to decide how to interpret what is being extracted. In the next chapter I will give more details concerning the differences in extraction methodology, both in how the knowledge is acquired, and in how it is interpreted. Appendix A provides a short overview of natural language generic sentences, a basic knowledge of which is assumed at various points of this, and subsequent chapters. To summarize: generics are used by humans to express rules that they take to underlie patterns observed in the world, and usually have strong quantificational force. In the many cases where this would seem to lead to error, such as in: A bird lays eggs, we assume the existence of implicit constraints (e.g., (female) bird ) in the domain restrictor that allows for the strong reading. Cases such as: Lightning rarely strikes people, show that generics are not universally strongly quantified. The key property of a generic are their nomic, or rule-like, character. In comparison, and as will be described in more detail here, systems such as Knext give results that can be taken to possess at least weak quantificational force. That is, KA systems capture patterns of various levels of regularity. These patterns, whether or not they reflect underlying rules of the world, can be useful in applied systems

32

for performing predictions. Taking an example from the Appendix, it may be useful to know that Dogs are born on Earth, even though this doesn’t express an essential property (either of dogs, or of Earth). In order to improve the ability of a system to make predictions, KA researchers would like to strengthen their patterns in order that they have as strong a quantificational force as possible,1 which would then satisfy one of the two main properties that seem to underlie a generic sentence. This requires understanding the contexts under which a given pattern holds, otherwise described as fleshing out the constraints captured in a generic’s domain restrictor.

3.1

Framework

Natural language allows for complicated phenomena that humans, in some manner, manage to deal with and understand. It follows that machines meant for interacting with humans using natural language should therefore have facilities for representing and reasoning with the same range of complexities seen in natural language. This premise underlies the development of Episodic Logic (EL) (Schubert and Hwang, 2000), first mentioned in Chapter 2. For the work presented in this dissertation, as well as the majority of related, contemporary efforts, the representative power of full EL goes beyond what is immediately required. For example, the current version of the Knext system uses a pseudo-logical form which can be viewed as an underspecified form of EL, omitting some of the more distinctive elements of that representation (e.g., the situation characterizing relation: **). This is not to discount the use of a rich representation: as I discuss in the next chapter, since language understanding is one of the primary goals motivating knowledge acquisition, it may be a wise engineering decision to base one’s extraction system on a framework of rich, compositional interpretation from the start. That is, extraction through understanding is a design path allowing for future extensions to handle additional complexities found in natural language sentences.2 1

That is, It is usually, or always the case that ... tends to be more useful than It is sometimes, or

rarely the case that .... 2

For instance, preliminary extensions to Knext have been attempted for extracting causal relations,

33

Whether using a more or less expressive representation, most existing work relies on some variant of these basic components: • conceptual classes (e.g., City), • class instances (e.g., Boston, or a city), and • predications that express properties of, or relate, instances or classes (e.g., Capital, or Mayor Of ). In the terminology of Carlson (1977a) (see Appendix A), conceptual classes can be thought of as kinds, with class instances corresponding to objects, or in some cases, object stages.3 Under this framework we can describe implicit knowledge acquisition as discovering predications holding of class instances (objects), and abstracting from these instances to more general statements about the class (kind). The results of this abstraction process take the form of rules, bearing some form of generic quantification. The presence of this quantification tends to go unmentioned in most contemporary work in KA, and even when acknowledged (such as in papers dealing with Knext), the strength of the quantifier is left unspecified. In the following section I make this framework more precise, including the introduction of additional components, such as generic quantification and situational variables.

3.2

Representation Language

For the remainder of this dissertation I will formalize examples using the symbolic notation laid out in this section, borrowing heavily from Episodic Logic (Schubert and which take greater advantage of EL’s representative power. I do not cover that work here, but see Van Durme (2008) for related discussion. 3

While the individual/stage-level distinction is important here, note that I am not concerned in this

thesis with acquiring knowledge that would be verbalized as a generic sentence pertaining to an objectlevel NP, containing a stage-level predicate, as in Rover barks, or Obama plays basketball (i.e., habituals pertaining to a specific individual).

34

Hwang, 2000). The majority of what follows should be readily understood by anyone with an acquaintance with First Order Logic (FOL).

3.2.1

Terms vs. Predicates

Proper noun phrases will be treated as constant terms, e.g., Jack, while common nouns will be treated as individual-level predicates, applied to variable terms, e.g., Dog(x).

3.2.2

Brackets

Brackets follow a Lisp-like syntax, where predicates fall within syntactic scope of enclosing parenthesis. If we let Π be an arbitrary predicate symbol, then (Π x), is what might commonly be written as Π (x). For example, (Dog x), as compared to Dog(x). Following Episodic Logic, examples will at times make use of an infix syntax, where the first term of a predication is written to the left of the predicate symbol.4 This will be marked by the use of square brackets instead of parenthesis, as in [x Π y], as compared to (Π x y). 4

Strictly, the EL equivalence is [x Π y] ≡ (Π y x) ≡ ((Π y) x), because of the “curried” predicate

semantics assumed in EL. However, I am assuming a more standard relational semantics here; a curried predicate Πc can be expressed in terms of its relational counterpart Πr as Πc = λyλx(Πr x y).

35

This syntax can be helpful in cases where a standard prefix syntax can lead to confusion, as in the logical form for John loves Mary:5 (Love John Mary), which might mistakenly be read as Mary loves John. Using square brackets we have [John Love Mary]. Logical forms expressed using this format can often be read (almost) as easily as the English sentences from which they are derived.

3.2.3

Formula Symbols

The symbols Φ and Ψ will be used to refer to arbitrary formulas. When I wish to make clear that such a formula has free variables x1 through xn , I will write Φ x1 ...xn .

3.2.4

Quantification

Formulas with quantifiers are written in a uniform syntax, with a variable restrictor and nuclear scope (matrix clause). The following: (∀x : Φ x Ψ x ), is equivalent to a universally quantified conditional, i.e., (∀x Φ x ⇒Ψ x ), while 5

Where tense in this example is ignored, Love is taken as a verbal predicate, and John and Mary

are taken as constant terms.

36

(∃x : Φ x Ψ x ) is equivalent to existentially quantified conjunction, i.e., (∃x Φ x ∧ Ψ x ). However, restricted quantifiers are more general than the unrestricted quantifiers of FOL. For example, (Most x: Φ x Ψ x ), has no representation in terms of unrestricted quantification and standard connectives.

3.2.5

Underspecified Quantifier Scope

The relative ordering of quantifier scopes for variables introduced by quantified nominals are often underspecified in the results presented here, with variables left implicit. This is represented through use of angle brackets, h i . For example, the sentence A man owns a dog, when construed nongenerically, would be represented as (ignoring tense): [hA Mani Own hA Dogi ], which, if interpreting the indefinite as introducing existential quantification, could expand to either of the following: (∃x: [x Man] (∃y: [y Dog] [x Own y])), or (∃y: [y Dog] (∃x: [x Man] [x Own y])).

37

3.2.6

Situation Variables

A defining element of Episodic Logic is that sentences as a whole are taken to characterize situations. Situation variables are needed here in order to represent stage-level predicates, such as Read, or Eat. For ease of presentation, I will forgo use of the full EL treatment of situations and instead make use of a Davidsonian approach, associating event variables directly with verbal predicates.6 For example, the logical form,7 [hA Mani Read hA Booki ] construed as meaning something like, A man may occasionally read a book, would carry an implicit event variable for the book-reading events bound by the underspecified quantifier on the event variable implicit in the verbal predicate. If we for now leave out the quantifier that binds it, then when this variable is added it would appear as [hA Mani Read hA Booki e]. Motivated by Musan (1999), I take the primary nominal in these generics as hearernew, and therefore introduce a quantifier over portions 8 of an individual. The stage6

As noted in Schubert and Hwang (2000), there is an equivalence between this representation and

in EL, when strictly considering atomic sentences in positive-polarity environments. All results within this dissertation fall under this heading (i.e., they are simple positive sentences whose arguments thus lie in upward-entailing environments, as in Drugs may have a cost entails that Products may have a cost), and thus their EL representation would be equivalent to the Davidsonian one. 7

Later in the chapter I work through examples that explicitly contain kind terms as arguments, which

in those cases derive from bare plurals (e.g., Dogs). It is accepted that singular nouns with indefinite determiners, especially when paired with a simple present verb, behave similarly to kind referring, bare plurals. This rough equivalence is implicitly relied on in a variety of examples in this dissertation. However, the exact semantics remain a thorny issue, and examples exist strongly suggesting that they can’t be taken as strictly equivalent. 8

A term due to Len Schubert (p.c.), which refers to Musan’s maximal temporal segments of indi-

viduals, picked out by quantified nominals. I use portions here to prevent potential confusion of terms borrowed from Carlson.

38

versus individual-level predicate distinction of Carlson (1977a) is treated here as a distinction in which predicates give rise to situation variables in the logical form (stage-level predications do, individual-level predications do not). This is not to deny a temporal component to (some) individual-level predicates such as Smart. Rather, in this case the temporal reference of the subject and predicate are assumed to be mutually constrained via meaning postulates (which would come into play in the understanding process9 ). For example, in the (non-generic) sentence, There is a smart woman, (∃x: [x Woman] [x Smart]), the situation of x being smart is constrained to lie temporally within the womanportion x (of some individual) introduced by the restricted quantifier, because Smart is existence-implying (Musan, 1999). Note that if individual-level predicates were treated as uniformly situation- (or time-)dependent, then even mathematical statements such as An even integer has a prime factor, would be construed situationally (or temporally). Neither of the KA systems described in the previous chapter currently have the ability to distinguish individual- versus stage-level predications. Therefore situational variables are left implicit by default, lest they be introduced gratuitously into individuallevel predications.

3.2.7

Adverbial Quantifiers

Lewis (1975) provided the following listing of adverbial quantifiers, where bracketed items were considered to have comparable strength, though Lewis acknowledged their semantics differed in some ways from the non-bracketed examples. 1. always, invariably, universally, without-exception 2. sometimes, occasionally, [once] 9

E.g., the Epilog system (Schaeffer et al., 1993) (but note that automated understanding of, or

reasoning with, these logical forms is outside the scope of this dissertation).

39

3. never 4. usually, mostly, generally, almost always, with few exceptions, [ordinarily], [normally] 5. often, frequently, commonly 6. seldom, infrequently, rarely, almost never These adverbs naturally correspond to various levels of quantificational force. In this dissertation I am primarily concerned with the groups (2) and (4), which I use here to describe what I will call weak, versus strong, patterns of predication. A weak pattern is one such as Sometimes, a man may read a book, as compared to a strong pattern: Usually, countries have elections. As discussed in Appendix A, generic sentences tend to be thought of as containing an element from (4), e.g., Dogs (usually) bark, or, (Mostly,) birds fly. Thus, natural language generic sentences, being stronger statements than many of predication patterns we can currently extract, constitute a subset of the background knowledge I am after here. Syntactically, these quantifiers will rarely make an appearance in provided examples, as they arise from the unrolling of the underspecified quantifiers already present. Lewis took these adverbials to be unselective quantifiers, quantifying over cases, which are effectively tuples of all free variables in the quantified sentence.10 One motivation for this approach was to sidestep the problem of donkey anaphora, discussed below. von Fintel (1994) gave an approach dependent on the quantifier being exclusively bound to a situation variable, with constraints bundled into the domain restrictor.11 I take adverbials to quantify over single variables, but do not require that they be of a situational type: as said earlier, some propositions simply do not have a situational reading, such as mathematical truths. 10

The “logician’s” quantifiers, ∀ and ∃ are, in their standard treatment, selective in that they bind

only a single variable. 11

In the case of existence-implying, individual-level predicates, which I’ve taken to be non-situational,

von Fintel might add to the restrictor something of the effect, [x Extant].

40

Consider the sentence, A woman is smart, which carries an individual-level predicate, Smart, leading to the logical form [hA Womani Smart]. If we take this to have moderate generic force we have the expanded form (Often x: [x Woman] [x Smart]). As a more complicated example, start with the underspecified logical form for the sentence, A man reads a book : [hA Mani Read hA Booki ], and then assume it carries weak generic force. The generic aspect may be construed in several ways, either as purely situational, or as involving quantification both over a type of situation and over one of the two entity types (the syntactic subject and object): (Occasional e: Φ e [hA Mani Read hA Booki e]) (Occasional x: [Ψ x ∧ [x Man]] (Occasional e: Φ e [x Read hA Booki e])) (Occasional y: [Ψ ’y ∧ [y Book]] (Occasional e: Φ e [hA Mani Read y e]))

In the second and third of these logical forms, the outermost domain restrictor is made up of some presupposed constraint Ψ , conjoined with the explicitly stated constraint (e.g., that the individual is of kind Man), whereas in all three logical forms the situational restrictor Φ is entirely presuppositional, since no explicit situational constraints are given.12 12

An example of a generic sentence with an explicit situational constraint would be ”A man reads a

book when he is bored”, but the logical forms of general factoids considered in this thesis do not contain modifying material.

41

See Appendix A, but especially von Fintel (1994) and Ahn (2004) for a linguistic semantic/pragmatic discussion on the difficulties of properly defining an generic restrictor, especially when dealing with stage-level predicates.

Deriving a Habitual Reading A generic sentence consisting of a stage-level predicate applied to a kind leads to a habitual reading. For example, Dogs bark, is read as, All or most dogs at least occasionally bark. There exists an elegant process13 for understanding this reading as the result of online type-shifting.14 In the following I spell out two examples. The first deals with an individual-level predicate applied to a kind. This is in order to highlight the type coercion that gets us from an individual-level predicate over objects to one over kinds. In the second example I give a type-shifting operator that signifies a habitual reading, getting us from a stage- to an individual-level predicate. This rule in addition to that given in the first example will allow us compositional treat habitual generic sentences dealing with kinds. This process is provided in order to show more clearly the underlying processes being assumed in the examples provided elsewhere in the dissertation, at least for those focused on a conversion from a generic into logical form. When background knowledge is extracted automatically from examples, different processes are used, as we are not interpreting generic assertions directly. The examples below help to verify that it is possible to mechanically get from a generic sentence to the sorts of logical forms targeted in automatic extraction.

Example 1

To begin, the sentence, Dogs are furry, gives the logical form: [(K Dog) Furry],

13

This process owes to Len Schubert.

14

The use of such online type-shifting to preserve a compositional semantics has a long tradition, seen

(for example) in the work of Partee and Rooth (1983) on correcting type-mismatches in conjunctions.

42

where (K Dog) is dog-kind, and Furry, like Smart, is taken as individual-level.15 More precisely, Furry is an individual-level predicate that applies to objects: we do not take dog-kind to itself possess the property of being furry, it is the elements of the class that each (usually) have that property. An individual-level predicate, Π , that applies to objects, may coerced to apply to kinds by the general rule: Π → (λk (G x: [x Instance-of k] [x Π ])). As applied to our example we have: [(K Dog) (λk (G x: [x Instance-of k] [x Furry]))], which, after lambda conversion, gives: (G x: [x Instance-of (K Dog)] [x Furry]). Through the use of a meaning postulate relating kinds to conceptual classes, we can rewrite this more succinctly as: (G x: [x Dog] [x Furry]).

Example 2

Taking the sentence, Dogs bark, we start with the logical form: [(K Dog) Bark].

Stage-level predicates in simple present typically carry a habitual reading. This is represented as the addition of an adverbial operator applied to predicate, which we take to be implicit in the English sentence: [(K Dog) (occasionally Bark)]. 15

The operator K comes from EL, which maps predicates to their respective kinds. Note that the

elements of the conceptual class, Dog, correspond to those individuals that satisfy an “instance-of” relationship with (K Dog), e.g., [Rover Instance-of (K Dog)].

43

This adverbial maps stage- to individual-level predicates: (occasionally Bark) is an individual-level predicate which applies to objects. As in the first example, we can coerce this predicate into one that applies to kinds: [(K Dog) (λk (G x: [x Instance-of k] [x (occasionally Bark)]))], which leads to: (G x: [x Dog] [x (occasionally Bark)]). We can expand (occasionally Bark) through the introduction of event variables: (G x: [x Dog] [x (λy (Occasional e [y Bark e]))]), which gives: (G x: [x Dog] (Occasional e [x Bark e])). This can be verbalized as: Most dogs are such that there exists at least occasional events where they bark, or, as given earlier: All or most dogs at least occasionally bark, which is the most natural reading of the original sentence, Dogs bark. If we wished to make this more clear in the expansion of (occasionally Bark), we might instead use the quantifier (with the same intended semantics): exists-occasional.

3.2.8

Donkey Anaphora and Skolemized Scripts

In order to handle the donkey anaphora16 that arise from some generic sentences, I assume the mechanics proposed by Schubert (1999) (improved in Schubert (2009)), who 16

So-called because of the examples employed by Geach (1962), foremost being: Every farmer who

owns a donkey beats it. Such sentences are problematic as the pronoun lies outside the binding scope of the donkey-variable, yet seems to refer to that variable, this cannot be directly represented in FOL. Approaches to representing these types of sentences focus on the dynamic nature of quantified variables being introduced within the restrictor, where these variables then need some formal mechanism in order to be made available within the matrix, seemingly (when using FOL) outside the scope of their quantifier.

44

starts from a position of wanting to directly interpret generic sentences and passages, and ends with representations that strongly resemble the frames or scripts of Minksy and Schank. In essence, he enables dynamic Skolemization in non-positive polarity environments (such as the restrictor of a generically quantified description or story) via automatically generated definition-like statements called Skolem conditionals. The corresponding Skolem functions allow representations of donkey anaphora without resort to dynamic semantics (e.g., DRT (Kamp, 1981) or File Change Semantics (Heim, 1982)), and align closely with (semantic) roles or slots as traditionally understood in frames and scripts. As compared to Minsky’s and Schank’s early work, the Skolemized scripts have a precise semantics, while still retaining the intuition of a structure that bundles together expected events or properties as stereotypical contexts. In contrast with formalisms such as DRT that depend on dynamic semantics, and thus make meanings dependent on “left context”, the use of Skolem functions in Schubert’s proposal give scripts and their parts a modularity that is conducive to building the sort of large-scale, order-independent KB we are interested in here. Below is an example of Schubert’s, similar to the opening example in the introduction: (28)

John ran to a nearby house. The door was locked.

We can imagine a frame that says that in general, a house has certain parts such as windows, a kitchen, etc. These parts become accessible as Skolem functions (dependent on whichever house is under consideration). Reference to the particular house in (28) implicitly introduces a number of objects related to the house, through application of those Skolem functions. In particular, the door can be interpreted as the value of the Skolem function for the door introduced in the frame for the house, but applied to the particular house at issue. Start with the underspecified logical form for Houses have doors: [hA Housei Have hA Doori ].

45

If I use G to stand for a generic quantifier of undetermined strength, and assume that it is the house that is generically quantified, then we can expand the above into (G x: [Φ x ∧ [x House]] (∃y: [y Door] [x Have y])). The methods explored in this thesis do not give rise to the more complex forms of generic knowledge that motivated the development of Skolemized scripts; their description here is provided in part to confirm that the earlier representative components rest within a larger framework meant to adequately scale to more complex knowledge. However, Skolem functions do offer a naturalness of expression with respect to rolelike reference in normal language (e.g., the door of the house): to gain this benefit, and to retain a level of uniformity with future work, it would be possible to make liberal use of Skolem functions in the examples throughout this dissertation. For example, the previous logical form could be transformed into the following simply by Skolemizing the existentials contained within the nuclear scope: (G x: [Φ x ∧ [x House]] [[(D x) Door] ∧ [x Have (D x)]]), where the Skolem function, D, may be read as door-of, and where the above could then be explicitly verbalized as, Generally, for a house, x, (D x) is a door, and x has (D x). In order to minimize the use of constructs that may be unfamiliar to some readers, I forgo such a path, and will make use of Skolemization only when needed. To close this section, I offer an example of a Skolemized script for the sentence, A person who buys a book will read it. Throughout the example I assume a non-universal reading, and ignore event variables,17 In order to highlight the problem posed by donkey sentences, first consider the following improper logical form: (Usual x: [[x Person] ∧ (∃y: [y Book] [x Buy y])] [x Read y]). 17

In this case: each individual that purchases multiple books may each read just one, and this

would still satisfy a possible reading of the conditional. As another example, consider the similarly constructed: A farmer who buys a donkey will ride it home, where a farmer might naturally purchase multiple donkeys, but ride just one of them home (with the rest in tow).

46

Note the variable y in the nuclear scope, which appears outside the quantifier found within the restrictor: this free variable has no meaning in a non-dynamic logic. Skolemizing y in the restrictor with respect to x, and replacing the free occurrence of y with this Skolem term leaves a well formed formula: (Usual x: [[x Person] ∧ [(B x) Book] [x Buy (B x)]] [x Read (B x)]), but this is still not what we want. Consider, for example, a model where no-one who buys a book reads it,18 and (B x) stands for “body of x”. Such a model supports the above generic because of the (trivial) falsity in the restrictor.19 Schubert’s Skolem conditionals are rules meant to be automatically introduced, to “nail down” the introduced Skolem function(s) such that they stand for their intended purpose. In this case: (∀x: (∃y: [y Book] [x Buy y]) [[(B x) Book] ∧ [x Buy (B x)]]). With this rule defined in parallel to the introduction of the Skolem function in the previous logical form, the generic has the intended meaning. For a discussion on making this less stylistically cumbersome, through the use of concept definitions, see Schubert (2009).

3.3

Resources

Instances are often referred to by name, such as John or Lower Manhattan. To abstract from such instances we need gazetteers: resources that map instances to classes. Automatically acquiring gazetteers is discussed in Chapter 6. In other cases, instances are referred to using a common noun phrase, such as a city, or the teacher, which carry their class label on their sleeve. I discuss the importance of these object-level common noun phrases (NPs) in Chapter 5, as well as providing ways for filtering certain forms of predications that arise from prenominal adjectival modification. 18

E.g., everyone buys a book just to then destroy it at a book-burning.

19

Assuming that all else is defined intuitively, in particular, a body is not a book.

47

3.4

Level of Abstraction

Within AI, classes are usually considered to be structured within a hierarchical ontology, where if class A subsumes class B, then instances of B are also considered instances of A. For example: Dogs are Mammals implies that if Rover is a dog, then Rover is a mammal. If given access to such an ontology we can work towards determining the proper level of class abstraction for a given predication, e.g., is it that Beagles bark, Dogs bark, or should we say that Mammals bark ?20 This has been explored by, for example, Pantel et al. (2007), Pa¸sca (2008), and Reisinger and Pa¸sca (2009). Further, we might try to form generic conditionals that state, for a given predication, it is usually the case that it will hold of instances a certain small number of classes. This is explored in Chapter 7, where I provide examples such as, if a male builds X, then X is probably a structure, a business, or a group.

3.5

The Interpretation Problem

The central task of implicit KA is to go from predications concerning instances to a logical proposition that can be interpreted as a generalization about some class. As said, this challenge shares the troubles associated with representing and interpreting generic sentences, and while there is yet to be a robust formal account of generics, there are a number of practical ways we may choose to interpret our results in light of these challenges. If we want our results to be interpretable as knowledge, then we need to take into account potential contextual constraints, and the troubles in determining the appropriate strength of quantification. Experiments described in this dissertation can be viewed as taking a variety of approaches:

No Interpretation Rather than acquiring propositions with formal truth values, we can instead claim to be extracting semantic patterns that can be used as evidence 20

The first is true, but the second is preferable, with the last being true only if we use a weaker

quantifier than in the other two cases (e.g., Some v.s. Most).

48

to predict future such patterns that we may see in documents or search queries, i.e., semantic language modelling. This view could be appropriate for the engineer concerned specifically with immediately improving the quality of NLP systems, and is one way to view the work of Chapter 8.

Assert Weak Quantification

The direct output from the Knext system can be

verbalized as weak generics, specifying properties or event descriptions that are occasionally instantiated. For example, the sentence A person may panic does not attempt to express a characteristic property about people, but only that we shouldn’t be surprised if such an event were to occur. Statements of this form are “safe” in that they can be interpreted as true knowledge, but they carry less predictive power than what I earlier termed a strong generic. Evaluations have shown that most such Knext extracted propositions that are supported by at least two unique sentences can be properly interpreted as claims that at least occasionally hold. The statements can be thought of as establishing the “ground rules” of what might happen in the world, on which we can then base further work. If we continue to take occasionally as an adverb expressing weak generic force, then the additional qualification of at least can be given a more formal definition: Knext statements can be taken as generics that minimally have weak quantificational force, but in some cases may express rules or properties that hold more strongly. For example, it is a true statement that At least occasionally, trees have branches, and yet we know this can be made stronger, e.g., Usually, trees have branches, or even, Almost always, trees have branches.21

Focus on Characteristic Properties

Pa¸sca and Van Durme (2007), by focusing on

finding the most characteristic attributes for a given class, can be thought of as trying to find those propositions that would hold for as many of a class’s instances as possible, with no contextual restrictions other than those contained within the proposition itself.22 21

The students of the course CS 444 at the University of Rochester are currently investigating methods

for doing this automatically, for at least some types of Knext output. 22

The additional assumption underlying the notion of a characteristic attribute, that it be somehow

distinguishing with respect to other conceptual classes, was not a main point in the initial work (Pa¸sca

49

Since the results are taken to have no hidden contextual restrictions, and are taken to hold for a majority of the instances of the kind, they can be safely interpreted as strong generics. For example, the class attribute leader for the concept Country corresponds to the generic sentence, Usually countries have leaders, which can be represented as (Usual x: [x Country] (∃y: [[y Leader] ∧ [x Have y]])), where the domain restrictor simply requires that x must be an instance of the class Country. As seen in those results, attributes like president were deemed acceptable, despite holding only for a well defined subset of countries. As pointed out in the Introduction, this attribute in particular was used by Prince (1978) as an example of stereotypical knowledge, and yet, as said, it carries a contextual restriction left implicit in the verbalization. Subsequent work (e.g., Pa¸sca (2008)) can be viewed as increasing the likelihood that a given (class,attribute) combination could be interpreted as a strong generic, by seeking to find just the level within an ontology such that the property expressed is, in fact, distinguishing (hypothetically, such an approach might revise the example attribute to apply only to Democratic Countries).

Specify Context As said earlier, Chapter 7 can be viewed as a step towards building conditionals with more interesting contextual restrictions. In this way, at least some types of weaker propositions may be strengthened by determined the full extent of contextual constraints (i.e., the contents of the domain restrictor). Recent efforts have attempted to capture episodic contexts, such as the work of Chambers and Jurafsky (2008). The authors, following work such as Lin (1998) and Chklovski and Pantel (2004), introduced a probabilistic model for learning script-like structures which they termed narrative event chains. As an example, if in a given document someone pleaded, admits and was convicted, then it is likely they were also sentenced, or paroled. Regarding the representation of context, in the sorts of Skolem script examples Schubert has given so far, these script- and frame-like structures make no distinction among and Van Durme, 2007), where characteristic was cashed out as meaning something like highly likely. For instance, both a Country and a City may have had Leader as a “characteristic” attribute.

50

the likelihoods of the various assertions in the body of the structure, given the truth of their antecedents. While the fact that all existentials are Skolemized in principle allows splitting of the frame/script assertions into separate “mini-frames/scripts” (all with the same generic quantifier), the question is how we would model the dependence of the likelihood of one assertion on the truth or falsity of another, (e.g., the likelihood of a restaurant patron leaving a tip depends on whether the food is served by waiters or is self-served). This would apparently require introducing separate generic quantifiers for the various assertions, in each case potentially with inclusion of assumptions about other aspects of the overall frame or script (enabled by the Skolem functions). Such an approach may be compatible with piecemeal acquisition of Markov-chain like representations of scripts from large text corpora (as opposed to direct interpretation of generic passage as Schubert originally intended), such as hinted at in the results of Chambers and Jurafsky (2008).

3.6

Causal Knowledge

Episodic Logic was designed with an eye towards representing causality.23 An assertion laying out a causal rule is a special sort of generic, involving not just underspecified contexts and quantifiers of ambiguous strength, but also a metaphysical causal relation being attributed to some pairs of items, such as an antecedent situation and a consequent situation. For example, we would interpret sentence (29) to say that in the context of a car driving along a road in a normal manner, if that road is wet, then in some comparatively large number of such situations the car will get into an accident. Further, the situation of the road being wet within that context is accorded special status as the cause of the consequent event. (29)

Wet roads cause car crashes.

Acquiring such causal generics goes beyond the types of knowledge targeted in this dissertation. See Girju (2002) and Hobbs (2005) as introductions to the causal literature, 23

Example analyses can be found in Van Durme (2008).

51

Type car

Miles Travelled

Crashes

Miles/Crash

Google

Teraword

1,682,671 million

4,341,688

387,562

30,700,000

1,748,832

12,401 million

101,474

122,209

2,070,000

269,158

6,619 million

83

79,746,988

5,100,000

603,933

motorcycle airplane

Table 3.1: Values for Miles Travelled, Crashes, and Miles/Crash are for travel in the United States, in the year 2006 (U.S. Department of Transportation, 2009), where plane crashes are considered any event in which the plane was damaged (whether or not there were human injuries). Google gives the estimated number of documents returned by Google search, between 10-10:30 AM EST Oct. 21, 2009, for the queries: “car (crash|accident)”, “(airplane|plane) (crash|accident)”, and “motorcycle (crash|accident)”. Teraword results are the sum of bigram statistics extracted from Brants and Franz (2006) using case insensitive matching of the Google search queries.

Word

Teraword

Word

spoke

11,577,917

breathed

725,034

Teraword

laughed

3,904,519

hugged

610,040

murdered

2,843,529

blinked

390,692

984,613

exhaled

168,985

inhaled

Table 3.2: Unigram frequencies taken from Brants and Franz (2006), using case insensitive matching.

aimed at computational linguists.

3.7

Reporting Bias and Quantifier Strength

The issue of quantifier strength relates to the discussion by Carlson (1995) (see Appendix A), on whether the truth of natural language generics rests on inductive evidence. If so, then the concern is determining how many examples we need to observe in order to form a belief about future observed instances of a class. Making the induction problem even more challenging, note that the frequency with which situations of a certain type are described in text do not necessarily correspond

52

to their relative likelihood in the world, or even the subjective frequencies captured in human beliefs.I will refer to this potential discrepancy between reality and its description in text as reporting bias. For example, in Table 3.1 we see that the likelihood of a motorcycle crash in the United States is more likely than a car crash, which itself is much more likely than a plane crash, for each mile travelled.24 However, if we consider the respective raw frequencies taken from either internet searches or corpus data, the mentions of motorcycle accidents are half as frequent as plane crashes, despite being far more likely. In Table 3.2 we find many more mentions of murdered than we do for hugged, breathed, or blinked. In short, Occurrences in a corpus cannot necessarily be thought of in the same way as occurrences in the world. This point is particularly relevant for work such as Chambers and Jurafsky (2008), where it is not clear whether such methods can capture real world tendencies, as compared to expected paths in (textual) narration. Alternatively, it may be that such a distinction is not meaningful for at least some sorts of event chains, where human expectations as to real world likelihoods are found to correlate with textually derived statistics. Tables 3.1 and 3.2 are meant as suggestive. However, if reporting bias is in fact not a real problem for KA, then it remains for the KA community to show this to be the case. Otherwise, future work remains in determining whether, and how, it can be corrected for. In the worst case, implicit KA faces an upper bound on the extent of human knowledge that may be accurately captured (prompting the sorts of hybrid approaches mentioned in the previous chapter).

3.8

Summary

This dissertation assumes a framework minimally consisting of: (A) collections of instances, corresponding to object-referring NPs, (B) classes, corresponding to kind24

Preferable for this argument would be statistics broken down by the average number of trips taken,

but unfortunately such statistics are difficult to come by.

53

referring NPs, where known instances are members of one or more known classes, and (C) the ability to detect predications holding about these known instances or relating them. The experiments in subsequent chapters deal with how to acquire the necessary resources (such as gazetteers, which map instances to their class labels), and how to improve the quality of the propositions being extracted. After introducing a representation language for formalizing the output of a knowledge acquisition system, I discussed various ways in which we can interpret such results, while noting the relation to the problems associated with interpreting generic sentences. In order to enable automated inference, we need to consider the circumstances or context in which a particular item of common knowledge holds. This has long been recognized in AI and linguistics, with a variety of attempts made at formulating a proper representation. One of these attempts, presented in Schubert (2009), can be viewed as providing a formal account of scripts (Schank, 1975). One of the key areas of future work in knowledge acquisition will be to take a representation such as Schubert’s and work to acquire more complex contextual restrictions than considered in this dissertation, including those of a causal nature. Formalizing a notion of context has been attempted by researchers in a variety of fields (linguistics, AI, philosophy, etc.), and I have not attempted here a comprehensive survey. Within AI, in addition to the earlier referenced work of Minsky and Schank, see work such as McCarthy (1993) and efforts that have followed in these traditions, both theoretical and applied. For an example of the later, Clark et al. (2005) lays out extensions for representing context within LCC’s25 textual Question Answering (QA) system.

25

http://languagecomputer.com

54

4

Comparison of Approaches

Several early studies in large-scale text processing, such as Liakata and Pulman (2002), Gildea and Palmer (2002) and Schubert (2002), showed that having access to a sentence's syntax enables credible, automated semantic analysis. These studies suggest that the use of increasingly sophisticated linguistic analysis tools could enable an explosion in available symbolic knowledge. Nonetheless, much of the subsequent work in extraction has remained averse to using the linguistic deep structure of text; this decision is typically justified by a desire to keep the extraction system as computationally lightweight as possible. The acquisition of background knowledge is not an activity that needs to occur online: as long as the extractor will finish in a reasonable period of time, the speed of such a system is an issue of secondary importance.1 Accuracy and usefulness of the acquired knowledge should be of paramount concern, especially as the increase in available computational power makes such "heavy" processing less of an issue. In this chapter I compare the methodology behind Knext, a system built for acquiring knowledge, to the open-domain information extraction system TextRunner, as well as to the query-log-based system described in Chapter 2. The experiments were aimed at a comparative assessment of linguistically based knowledge extraction and pattern-based information extraction. The intent is to examine the qualitative similarity of results between these systems, given the differing goals of the various system developers.

1 An argument can be made, such as for the emerging area of real-time search, that factoid-centric IE does in fact need to be done in a rapid manner. That sort of extraction is not what is aimed at in this thesis.

4.1 TextRunner

TextRunner (Banko et al., 2007) is a recent project aimed at large-scale, open extraction of tuples of text fragments representing verbal predicates and their arguments. The system part-of-speech tags a corpus, identifies noun phrases with a noun phrase chunker, and then uses tuples of nearby noun phrases within sentences to form apparent relations, with the intervening material representing the relation. Apparent modifiers, such as prepositional phrases after a noun or adverbs, are dropped. Every candidate relational tuple is classified as trustworthy (or not) by a Bayesian classifier, using features such as the parts of speech and the number of relevant words between the noun phrases. The Bayesian classifier is obtained through training on a parsed corpus, where a set of heuristic rules determines the trustworthiness of apparent relations between noun phrases in that corpus. As a preview of an example I will discuss later, here are two relational tuples in the format extracted by TextRunner:2

(the people) use (force),
(the people) use (force) to impose (a government).

No attempt is made to convert text fragments such as "the people" or "use - to impose" into logically formal terms or predicates. Thus, much like semantic role-labelling systems, TextRunner is an information extraction system under the terminology used here; however, it comes closer to knowledge extraction than such systems, in that it often strips away much of the modifying information of complex terms (e.g., leaving just a head noun phrase).

2 Boldface indicates items recognized as head nouns.
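To make the pipeline concrete, the following is a minimal sketch of TextRunner-style candidate generation. It is an illustration only, not the actual TextRunner implementation (whose chunker, features, and trained classifier are more sophisticated), and the Bayesian filtering step is omitted entirely.

    # A toy approximation of TextRunner-style tuple generation: POS-tag
    # a sentence, chunk noun phrases with a simple regex grammar, and
    # treat the material between adjacent NPs as a candidate relation.
    # Requires the standard NLTK tokenizer and tagger models.
    import nltk

    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

    def candidate_tuples(sentence):
        tree = chunker.parse(nltk.pos_tag(nltk.word_tokenize(sentence)))
        spans = []  # list of (is_np, [tokens])
        for node in tree:
            if isinstance(node, nltk.Tree):       # an NP chunk
                spans.append((True, [w for w, _ in node.leaves()]))
            elif spans and not spans[-1][0]:      # extend a gap between NPs
                spans[-1][1].append(node[0])
            else:                                 # start a new gap
                spans.append((False, [node[0]]))
        tuples = []
        for i in range(len(spans) - 2):           # pattern: NP, gap, NP
            if spans[i][0] and not spans[i + 1][0] and spans[i + 2][0]:
                tuples.append(tuple(" ".join(s[1]) for s in spans[i:i + 3]))
        return tuples

    # e.g. "The people use force to impose a government." yields
    # ('The people', 'use', 'force') and ('force', 'to impose', 'a government')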


Domain                Num. Pages    %      Type
en.wikipedia.org       1,616,279    13.8   ref
www.answers.com        1,318,097    11.3   ref
www.amazon.com           257,868     2.2   shop
www.imdb.com             182,087     1.6   ent
www.britannica.com        59,269     0.5   ref
findarticles.com          56,173     0.5   misc
www.geocities.com         52,262     0.4   misc
www.city-data.com         50,891     0.4   ref
www.tv.com                41,699     0.4   ent
www.cduniverse.com        40,859     0.3   shop

Table 4.1: Top 10 most frequent domains and their relative percent contribution to the corpus (by document count), where I have labelled their type as one of: reference material (ref), online shopping (shop), entertainment facts or news (ent), and miscellaneous (misc).


4.2 Dataset

Experiments were based on sampling 1% of the sentences from each document contained within a corpus of 11,684,774 web pages harvested from 1,354,123 unique top-level domains. The top five contributing domains made up 30% of the documents in the collection (Table 4.1). There were 310,463,012 sentences in all, the sample containing 3,000,736. Of these, 1,373 were longer than a preset limit of 100 tokens and were discarded.3 Sentences containing individual tokens of length greater than 500 characters were similarly removed.4 As this corpus derives from the work of Banko et al. (2007), each sentence in the collection is paired with zero or more tuples as extracted by the TextRunner system. Note that while websites such as Wikipedia.org contain large quantities of (semi-)structured information stored in lists and tables, the focus here is entirely on natural language sentences. In addition, as the extraction methods discussed here do not make use of intersentential features, the lack of sentence-to-sentence coherence resulting from random sampling had no effect on the results.

3 Typically enumerations, e.g., There have been 29 MET deployments in the city of Florida since the inception of the program : three in Ft. Pierce , Collier County , Opa Locka , ... .
4 For example, Kellnull phenotypes can occur through splice site and splice-site / frameshift mutations301,302 450039003[...]3000 premature stop codons and missense mutations.
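The sentence-level preprocessing just described amounts to a pair of simple filters; a minimal sketch, using the sampling rate and limits from the text, might look as follows.

    # A sketch of the corpus sampling and filtering described above:
    # sample roughly 1% of sentences per document, discarding sentences
    # longer than 100 tokens or containing a token over 500 characters.
    import random

    MAX_TOKENS, MAX_TOKEN_LEN = 100, 500

    def usable(sentence):
        tokens = sentence.split()
        return (len(tokens) <= MAX_TOKENS
                and all(len(t) <= MAX_TOKEN_LEN for t in tokens))

    def sample(doc_sentences, rate=0.01, rng=random.Random(0)):
        return [s for s in doc_sentences
                if rng.random() < rate and usable(s)]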

4.3 Extraction

Sentences were processed using the syntactic parser of Charniak (2000). From the resultant trees, Knext extracted 7,406,371 propositions, giving a raw average of 2.47 per sentence. Of these, 4,151,779 were unique, so that each unique proposition was extracted 1.78 times on average. Post-processing left 3,975,197 items, giving a per-sentence expectation of 1.32 unique, filtered propositions. Selected examples regarding knowledge about people appear in Table 4.2.


A PERSON MAY ...
SING TO A GIRLFRIEND                PICK UP A PHONE
EXPERIENCE A FEELING                CARRY IMAGES OF A WOMAN
BURN A SAWMILL                      FEIGN A DISABILITY
WALK WITH A FRIEND                  PRESENT A PAPER
DOWNLOAD AN ALBUM                   RESPOND TO A QUESTION
LIKE (POP CULTURE)                  BUY FOOD
RECEIVE AN ORDER FROM A GENERAL     KNOW STUFF
CHAT WITH A MALE-INDIVIDUAL         MUSH A TEAM OF (SEASONED SLED DOGS)
OBTAIN SOME NUMBER OF (PERCULA CLOWNFISH)
LOOK FOR A (QUALITY SHAMPOO PRODUCT)

Table 4.2: Verbalized propositions concerning the class Person.

For the same sample, TextRunner extracted 6,053,983 tuples, leading to a raw average of 2.02 tuples per sentence. As described by its designers, TextRunner is an information extraction system; one would be mistaken in using these results to say that Knext "wins" in raw extraction volume, as these numbers are not in fact directly comparable (see the section on Comparison).

4.4 Evaluation

Extraction quality was determined through manual assessment of verbalized propositions drawn randomly from the results. Initial evaluation was done using the method proposed in Schubert and Tong (2003), in which judges were asked to label propositions according to their category of acceptability; abbreviated instructions may be seen in Figure 4.1.5 Under this framework, category one corresponds to a strict assessment of acceptability, while an assignment to any of the categories between one and three may be interpreted as a weaker level of acceptance.

5 Judges consisted of the authors of Van Durme and Schubert (2008) and two volunteers, each with a background in linguistics and knowledge representation.


1. A REASONABLE GENERAL CLAIM
   e.g., A grand-jury may say a proposition
2. TRUE BUT TOO SPECIFIC TO BE USEFUL
   e.g., Bunker walls may be decorated with seashells
3. TRUE BUT TOO GENERAL TO BE USEFUL
   e.g., A person can be nearest an entity
4. SEEMS FALSE
   e.g., A square can be round
5. SOMETHING IS OBVIOUSLY MISSING
   e.g., A person may ask
6. HARD TO JUDGE
   e.g., Supervision can be with a company

Figure 4.1: Instructions for categorical judging.

              judges                   judges w/ volunteers
Category      % Selected    Kappa      % Selected    Kappa
1             49%           0.4017     50%           0.2822
1, 2, or 3    54%           0.4766     60%           0.3360

Table 4.3: Percent of propositions labeled under the given category(s), paired with Fleiss' Kappa scores. Results are reported both for the primary judges (one and two) and with the two additional volunteers.

As seen in Table 4.3, average acceptability was judged to be roughly 50 to 60%, with associated Kappa scores signalling fair (0.28) to moderate (0.48) agreement. Judgement categories at this level of specificity are useful both for system analysis at the development stage and for training judges to recognize the disparate ways in which a proposition may fail to be acceptable. However, because of the rates of agreement observed, evaluation moved to the use of a five-point sliding scale (Figure 4.2). This scale allows for only a single axis of comparison, collapsing the various ways in which a proposition may or may not be flawed into a single, general notion of acceptability.

THE STATEMENT ABOVE IS A REASONABLY CLEAR, ENTIRELY PLAUSIBLE GENERAL CLAIM AND SEEMS NEITHER TOO SPECIFIC NOR TOO GENERAL OR VAGUE TO BE USEFUL:

1. I agree.
2. I lean towards agreement.
3. I'm not sure.
4. I lean towards disagreement.
5. I disagree.

Figure 4.2: Instructions for scaled judging.

The primary judges evaluated 480 propositions sampled randomly from bins corresponding to frequency of support (i.e., the number of times a given proposition was extracted); 60 propositions were sampled from each of 8 such ranges.6 As seen in the first graph of Figure 4.3, propositions that were extracted at least twice were judged to be more acceptable than those extracted only once. Less intuitive is that as frequency of support increased further, the level of judged acceptability remained roughly the same. This suggests that methods for implicit KA require only a small number of examples to make reasonable, if tentative, generalizations; but it also means that errors based on a particular parse error and/or rule combination will repeat themselves.

6 Bin boundaries: (0, 2^0, 2^1, 2^3, 2^4, 2^6, 2^8, 2^10, 2^12), i.e., (0,1], (1,2], (2,8], ... .
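For reference, the assignment of a proposition to one of the eight support ranges of footnote 6 can be expressed directly; the following is a small sketch using those boundaries.

    # Assigning a proposition to one of the 8 frequency-of-support bins
    # from footnote 6: (0,1], (1,2], (2,8], (8,16], (16,64], (64,256],
    # (256,1024], (1024,4096].
    import bisect

    BOUNDS = [1, 2, 8, 16, 64, 256, 1024, 4096]  # 2^0,2^1,2^3,2^4,2^6,2^8,2^10,2^12

    def bin_of(freq):
        # returns 0..7; frequencies above 4096 fall outside the sampled ranges
        return bisect.bisect_left(BOUNDS, freq)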

4.5 Comparison

To highlight differences between an extraction system targeting knowledge (represented as logical statements) and one targeting information (represented as segmented text fragments), the output of Knext is compared to that of TextRunner for two select inputs. These examples were not randomly chosen, and as such, errors (from either system) should not be taken as representative of overall extraction quality: that syntax is beneficial for the sort of semantic argument bracketing done by TextRunner has been argued previously by Gildea and Palmer (2002), Punyakanok et al. (2008), and in particular Poon and Domingos (2009), whose syntax-reliant system showed improved accuracy in a QA task as directly compared to TextRunner.7

7 Arguing a similar line to that given here, Poon and Domingos (2009) wrote: "[Relying on a syntactic analysis] enables us to leverage advanced syntactic parsers and (indirectly) the available rich resources for them. More importantly, it separates the complexity in syntactic analysis from the semantic one, and makes the latter much easier to perform."

4.5.1 Basic

A defining quote from the book, “An armed society is a polite society”, is very popular with those in the United States who support the personal right to bear arms.

From this sentence TextRunner extracts the tuples:8

(A defining quote) is a (polite society "),
(the personal right) to bear (arms).

We might manually translate this into a crude sort of logical form:

is-a(a-defining-quote, polite-society-"),
to-bear(the-personal-right, arms).

8 Tuple arguments are enclosed in parentheses, with the items recognized as head given in bold. All non-enclosed, conjoining text makes up the tuple predicate.

Better would be to consider only those terms classified as head, and make the assumption that each tuple argument implicitly introduces its own quantified variable:

∃x,y. quote(x) ∧ society(y) ∧ is-a(x,y),
∃x,y. right(x) ∧ arms(y) ∧ to-bear(x,y).
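As a toy illustration of this head-term translation (not Knext itself), the mapping from a head-only tuple to such a form is mechanical:

    # Each tuple argument contributes a unary predicate over a fresh
    # existential variable; the tuple predicate relates the variables.
    def head_tuple_to_lf(head1, predicate, head2):
        pred = predicate.strip().replace(" ", "-")
        return f"∃x,y. {head1}(x) ∧ {head2}(y) ∧ {pred}(x,y)"

    # head_tuple_to_lf("right", "to bear", "arms")
    #   -> "∃x,y. right(x) ∧ arms(y) ∧ to-bear(x,y)"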

Compare this to the output of Knext:9

∃x. society(x) ∧ polite(x),
∃x,y,z. thing-referred-to(x) ∧ country(y) ∧ exemplar-of(z,y) ∧ in(x,z),
∃x. right(x) ∧ personal(x),
∃x,y. quote(x) ∧ book(y) ∧ from(x,y),
∃x. society(x) ∧ armed(x),

which is automatically verbalized as:

a society can be polite,
a thing-referred-to can be in an exemplar-of a country,
a right can be personal,
a quote can be from a book,
a society can be armed.

9 For comparison purposes, scoped, simplified versions of Knext's ULFs are shown. As discussed in Chapter 3, Knext propositions are in fact viewed as generic conditionals with a minimum of "occasional" quantificational force. For example, (G x: [x Quote] [[(B x) Book] ∧ [x From (B x)]]).

4.5.2 Extended Tuples

While Knext uniquely recognizes, e.g., adjectival modification and various types of possessive constructions, TextRunner more aggressively captures constructions with extended cardinality. For example, consider the following:

James Harrington in The Commonwealth of Oceana uses the term anarchy to describe a situation where the people use force to impose a government on an economic base composed of either solitary land ownership, or land in the ownership of a few.

From this sentence TextRunner extracts 19 tuples, some with three or even four arguments, thus aiming beyond the binary relations to which most current systems are limited. That so many tuples were extracted for a single sentence is explained by the fact that for most tuples containing N > 2 arguments, TextRunner will also output the same tuple with N − 1 arguments, such as:

(the people) use (force),
(the people) use (force) to impose (a government),
(the people) use (force) to impose (a government) on (an economic base).

In addition, tuples may overlap without one being a proper subset of another:

(a situation) where (the people) use (force),
(force) to impose (a government),
(a government) on (an economic base) composed of (either solitary land ownership).


This overlap raises the question of how to accurately quantify system performance. When measuring average extraction quality, should samples be drawn randomly across tuples, or from originating sentences? If from tuples, then sample sets will be biased (for good or ill) towards fragments derived from complex syntactic constructions. If sentence-based, the system fails to be rewarded for extracting as much from an input as possible, as it may conservatively target only those constructions most likely to be correct. With regard to volume, it is not clear whether adjuncts should each give rise to additional facts added to a final total; optimal would be the recognition of such optionality. Failing this, perhaps a tally may be based on unique predicate head terms?

As a point of merit according to its designers, TextRunner does not utilize a parser (though, as mentioned, it does perform part-of-speech tagging and noun phrase chunking). This is said to be justified in view of the known difficulties in reliably parsing open-domain text, as well as the additional computational costs. However, a serious consequence of ignoring syntactic structure is that incorrect bracketing across clausal boundaries becomes all too likely, as seen for instance in the following tuple:

(James Harrington) uses (the term anarchy) to describe (a situation) where (the people),

or in the earlier example, where from the book, "An armed society" appears to have been erroneously treated as a post-nominal modifier, intervening between the first argument and the is-a predicate. Knext extracted the following six propositions, the first of which was automatically filtered in post-processing for being overly vague:10

*a male-individual can be in a named-entity of a named-entity,
a male-individual may use a (term anarchy),
persons may use force,
a base may be composed in some way,
a base can be economic,
a (land ownership) can be solitary.

10 The primary judges viewed the third, fifth and sixth propositions to be both well-formed and useful.


[Figure: two panels, "Natural" and "Core"; x-axis: Frequency (lg scale), 0 to 12; y-axis: Average Assessment, 1 to 5; curves for judge 1 and judge 2.]

Figure 4.3: As a function of frequency of support, average assessment for propositions derived from natural and core sentences.

4.6 Extracting from Core Sentences

We have noted the common argument against the use of syntactic analysis when performing large-scale extraction, viz. that it is too time-consuming to be worthwhile. We are skeptical of such a view, but decided to investigate whether an argument-bracketing system such as TextRunner might be used as an extraction preprocessor, to limit what needed to be parsed. For each TextRunner tuple extracted from the sampled corpus, core sentences were constructed from the predicate and noun phrase arguments,11 and these were then used as input to Knext. From 6,053,981 tuples came an equivalent number of core sentences. Note that since TextRunner tuples may overlap, use of these reconstructed sentences may lead to skewed propositional frequencies relative to "normal" text. This bias was very much in evidence: of the 10,507,573 propositions extracted from the core sentences, only 3,787,701 remained after automatic post-processing and the elimination of duplicates.

11 Minor automated heuristics were used to recover, e.g., missing articles dropped during tuple construction.


           Natural    Core    Overlap
judge 1    3.35       3.85    2.96
judge 2    2.95       3.59    2.55

Table 4.4: Mean judgements (lower is better) on propositions sampled from those supported either exclusively by natural or core sentences, or those supported by both.

This gives a per-sentence average of 0.63, as compared to 1.32 for the original text. While the raw numbers of propositions extracted from each version of the underlying data look similar, 3,975,197 (natural) vs. 3,787,701 (core), the actual overlap was less than might be expected: just 2,163,377 propositions were extracted jointly from both natural and core sentences, representing a percent overlap of 54% and 57%, respectively. Quality was evaluated by each judge assessing 240 randomly sampled propositions for each of: those extracted exclusively from natural sentences, those extracted exclusively from core sentences, and those extracted from both (Table 4.4). Results show that propositions exclusively derived from core sentences were most likely to be judged poorly, while propositions obtained both by Knext alone and by Knext processing of TextRunner-derived core sentences (the overlap set) were particularly likely to be judged favorably. On the one hand, many sentential fragments ignored by TextRunner yield Knext propositions; on the other, TextRunner's output may be assembled to produce sentences yielding propositions that Knext otherwise would have missed. Ad hoc analysis suggests these new propositions derived with the help of TextRunner are a mix of noise stemming from bad tuples (usually a result of the aforementioned incorrect clausal bracketing), along with genuinely useful propositions coming from sentences with constructions such as appositives or conjunctive enumerations, where TextRunner outguessed the syntactic parser as to the correct argument layout. Future work may consider whether (syntactic) language models can be used to help prune core sentences before they are given to Knext.


COUNTRY     government, war, team, history, rest, coast, census, economy, population, independence
DRUG*       side effects, influence, uses, doses, manufacturer, efficacy, release, graduates, plasma levels, safety
CITY*       makeup, heart, center, population, history, side, places, name, edge, area
PAINTER*    works, art, brush, skill, lives, sons, friend, order quantity, muse, eye
COMPANY     windows, products, word, page, review, film, team, award, studio, director

Table 4.5: By frequency, the top ten attributes a class MAY HAVE. Emphasis added to entries overlapping with those reported by Paşca and Van Durme. Results for starred classes were derived without the use of prespecified lists of instances.

The second graph of Figure 4.3 differs from the first at low frequency of support. This is the result of the partially redundant tuples extracted by TextRunner for complex sentences: the core verb-argument structures are those most likely to be correctly interpreted by Knext, while also being those most likely to be repeated across tuples for the same sentence.

4.7 Class Properties

While TextRunner is perhaps the extraction system most closely related to Knext in terms of generality, there is also significant overlap with work on class attribute extraction. Paşca and Van Durme (2007) recently described this task, going on to detail an approach for collecting such attributes from search engine query logs. As an example, the search query "president of Spain" suggests that a Country may have a president. If one were to consider attributes to correspond, at least in part, to things a class MAY HAVE, CAN BE, or MAY BE, then a subset of Knext's results may be discussed in terms of this specialized task. For example, for the five classes used in those authors' experiments, Table 4.5 contains the top ten most frequently extracted things each class MAY HAVE, as determined by Knext, without any targeted filtering or adaptation to the task.


            judge 1           judge 2
            1       2+        1       2+       corr.
MAY HAVE    2.80    2.35      2.50    2.28     0.68
MAY BE      3.20    2.85      2.35    2.13     0.59
CAN BE      3.78    3.58      3.28    2.75     0.76

Table 4.6: Mean assessed acceptability for properties occurring for a single class (1), and more than a single class (2+). Final column contains Pearson correlation scores.

[Figure: x-axis: Number of Classes (lg scale), 0 to 12; y-axis: Average Assessment, 1 to 5; curves for judge 1 and judge 2.]

Figure 4.4: Mean quality of class attributes as a function of the number of classes sharing a given property.

For each of these three types of attributive categories, the primary judges evaluated 80 randomly drawn propositions, constrained such that half (40 for each) were supported by a single sentence, while the other half were required only to have been extracted at least twice, but potentially many hundreds or even thousands of times. As seen in Table 4.6, the judges were strongly correlated in their assessments, where for MAY HAVE and MAY BE they were lukewarm (3.0) or better on the majority of those seen.

In a separate evaluation, judges considered whether the number of classes sharing a given attribute was indicative of its acceptability. For each unique attributive proposition, the class in "subject" position was removed, leaving fragments such as that bracketed in: a robot [can be subhuman]. These attribute fragments were tallied and binned by frequency,12 with 40 then sampled from each bin. For a given attribute selected, a single attributive proposition matching that fragment was randomly drawn. For example, having selected the attribute can be from a us-city, the proposition SOME NUMBER OF SHERIFFS CAN BE FROM A US-CITY was drawn from the 390 classes sharing this property. As seen in Figure 4.4, acceptability rose as a property became more common.

12 Ranges: (0, 2^0, 2^1, 2^3, 2^6, ∞).

4.8 Summary

Work such as TextRunner (Banko et al., 2007) has pushed extraction researchers to consider larger and larger datasets. This represents significant progress towards the greater community's goal of having access to large, expansive stores of general world knowledge. However, if "deep" language understanding and common-sense reasoning involve items as complex and structured as the EL example provided in Chapter 2, then automated knowledge acquisition cannot simply be a matter of accumulating rough associations between word strings, along the lines of "(Youngster) (want become) (professional athlete)". Rather, acquired knowledge needs to conform to a systematic, highly expressive KR syntax such as EL. The results presented here support the position that advances made over decades of research in parsing and semantic interpretation do have a role to play in large-scale knowledge acquisition from text. The price paid for linguistic processing is not excessive, and among its advantages are the logical formality of the results and their versatility, as indicated by the application to class attribute extraction.



5 Using Classes

As described in Chapter 2, recent work on the task of acquiring attributes for concept classes has focused on the use of pre-compiled lists of class-representative instances, where attributes recognized as applying to multiple instances of the same class are inferred as being likely to apply to most, or all, members of that class. For example, the class US President might be represented as a list containing the entries Bill Clinton, George Bush, Jimmy Carter, etc. Phrases found in a document, such as Bill Clinton's chief of staff ..., or search queries such as chief of staff bush, provide evidence that the class US President has as an attribute chief of staff. Usually the focus of such systems has been on binary attributes, such as the example chief of staff, while less attention has been paid to unary class attributes such as illegal for the class Drug, or warm-blooded for the class Animal.1 These attributes are most typically expressed in English through prenominal adjectival modification, with the nominal serving as a class designator. When attribute extraction is based entirely on instances and not the class labels themselves, this form of modification goes undiscovered. In what follows I explore both the impact of gazetteers in attribute extraction and the acquisition and filtering of unary class attributes, through a process based on logical form generation from syntactic parses derived from the British National Corpus (BNC).

1 Almuhareb and Poesio (2004) treat unary attributes as values of binary attributes; e.g., illegal might be the value of a legality attribute. But for many unary attributes, this is a stretch.


5.1 Attribute Extraction via Knext

In order to study the contribution of lists of instances (i.e., generalized gazetteers) to the task of attribute extraction, the version of Knext presented by Schubert (2002) was modified to provide output of a form similar to that of the extraction work of Paşca and Van Durme (2007); both systems are described in Chapter 2. Knext's abstracted, propositional output was automatically verbalized into English, with any resultant statement of the form A(N) X MAY HAVE A(N) Y taken to suggest that the class X has as an attribute the property Y. Knext was designed from the beginning to make use of gazetteers if available, so that a phrase such as Bill Clinton vetoed the bill supports the (verbalized) proposition A PRESIDENT MAY VETO A BILL, just as would The president vetoed the bill. I instrumented the system to record which propositions did or did not require gazetteers in their construction, allowing for a numerical breakdown of the respective contributions of known instances of a class versus the class label itself.

Paşca and Van Durme (2007) described the results of an informal survey asking participants to enumerate what they felt to be important attributes for a small set of example classes. Some of the resultant attributes were not of the form targeted by the authors' system. For example, nonprofit was given as an important potential attribute for the class Company, as was legal for the class Drug. These attributes correspond to unary predicates, as compared to the binary predications underlying such attributes as cost for the class Drug. We extracted such unary attributes by focusing on verbalizations of the form A(N) X CAN BE Y, as in AN ANIMAL CAN BE WARM-BLOODED, which would lead to the logical form (G x: [x Animal] [x Warm-blooded]).
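Reading attribute candidates off these verbalizations amounts to simple pattern matching; the following is a minimal sketch, not the actual Knext post-processing.

    # Binary attributes from "A(N) X MAY HAVE A(N) Y"; unary attributes
    # from "A(N) X CAN BE Y".
    import re

    MAY_HAVE = re.compile(r"^an? (?P<cls>.+?) may have an? (?P<attr>.+)$", re.I)
    CAN_BE = re.compile(r"^an? (?P<cls>.+?) can be (?P<attr>.+)$", re.I)

    def attribute_of(verbalization):
        for kind, pat in (("binary", MAY_HAVE), ("unary", CAN_BE)):
            m = pat.match(verbalization.strip())
            if m:
                return kind, m.group("cls"), m.group("attr")
        return None

    # attribute_of("AN ANIMAL CAN BE WARM-BLOODED")
    #   -> ("unary", "ANIMAL", "WARM-BLOODED")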


5.2 Experimental Setting

5.2.1 Corpus Processing

Initial reports on the use of Knext were focused on the processing of manually created parse trees, over a corpus of limited size (the Brown corpus of Kucera and Francis (1967)). For the experiments presented here, Knext was run over the British National Corpus (BNC Consortium, 2001), after each sentence had been processed with the parser of Collins (1997).2 The BNC was chosen because of its breadth of genre, its substantial size (100 million words), and its familiarity (and accessibility) to the community.

2 Thanks to Daniel Gildea for providing this parsed data.

5.2.2 Gazetteers

Knext's gazetteers were used as-is; they were defined based on a variety of sources, including miscellaneous publicly available lists as well as manual enumeration. The classes covered can be seen in Table 5.2 in the Results section, where the minimum, maximum and mean sizes were 2, 249, and 41, respectively.

5.2.3 Filtering out Non-predicative Adjectives

Beyond the pre-existing Knext framework, additional processing was introduced for the extraction of unary attributes, in order to filter out vacuous or unsupported propositions derived from non-compositional phrases. This filtering was performed through the creation of three lists: a whitelist of accepted predicative adjectives; a graylist of adjectives that are meaningful as unary predicates only when applied to plural nouns; and a blacklist, derived from Wikipedia topic titles, representing lexicalized, non-compositional phrases.

Whitelist  The creation of the whitelist began with calculating part-of-speech (POS) tagged bigram counts using the Brown corpus. The advantage of using a POS-tagged bigram model lies in the saliency of phrase structures, which enabled frequency calculations for both attributive and predicative uses of a given adjective. Attributive counts were based on instances where an adjective appears in the pre-nominal position and modifies a noun. Predicative counts were derived by summing over occurrences of a given adjective after all possible copulas. These counts were used to compute a p/a ratio (the quotient of predicative count over attributive count) for each word classified by WordNet (Fellbaum, 1998) as having an adjectival use. After manual inspection, two cut-off points were chosen, at ratios of .06 and 0, as seen in Table 5.1. Words not appearing in the Brown corpus (i.e., having 0 count for both uses) were sampled and inspected, with the decision made to place the majority within the whitelist, excluding just those with suffixes including -al, -c, -an, -st, -ion, -th, -o, -ese, -er, -on, -i, -x, -v, and -ing. This process resulted in a combined whitelist of 14,249 (usually) predicative adjectives.
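The decision rule, summarized in Table 5.1 below, can be sketched directly, assuming the bigram counts have been gathered as described; hand-selected exceptions and the suffix heuristics for unseen words are not shown.

    # p/a ratio and the Table 5.1 cut-offs.
    def keep_on_whitelist(pred_count, attr_count):
        if attr_count == 0:
            return True              # ratio undefined: falls under "otherwise"
        r = pred_count / attr_count
        return r >= 0.06 or r == 0   # remove only when 0 < r < .06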

p/a ratio (r)    Cut-off decision
r ≥ .06          keep the adjective*
0 < r < .06      remove the adjective*
otherwise        keep the adjective*

Table 5.1: Cut-off decision given the p/a ratio of an adjective. *Note: except for hand-selected cases.

Graylist  A short list (33 words) was constructed, containing adjectives that are generally inappropriate as whitelist entries but could be acceptable when applied to plurals. For example, the verbalized proposition OBJECTS CAN BE SIMILAR was deemed acceptable, while a statement such as AN OBJECT CAN BE SIMILAR is erroneous because of a missing argument.

Blacklist  From an exhaustive set of Wikipedia topic titles was derived a blacklist consisting of entries that had to satisfy four criteria: 1) no more than three words in length; 2) contains no closed-class words, such as prepositions or adverbs; 3) begins with an adjective and ends with a noun (as determined by WordNet); and 4) does not contain any numerical characters or miscellaneous symbols that are usually not meaningful in English.


Therefore, each title in the resultant list is a short noun phrase with adjectives as pre-modifiers. It was observed that in these encyclopedia titles the role of adjectives is predominantly to restrict the scope of the object being named (e.g., CRIMINAL LAW), rather than to describe its attributes or features (e.g., DARK EYES). More often than not, only cases similar to the second example can be safely verbalized as X CAN BE Y from a noun phrase Y X, with Y being the pre-nominal adjective. This list was further refined by examining trigram frequencies as reported in the web-derived n-gram collection of Brants and Franz (2006). For each title of the form (Adj N) ..., trigram frequencies were gathered for adverbial modifications such as (very Adj N) ... and (truly Adj N) .... Intuitively, a high relative frequency of such modification with respect to the non-modified bigram supports removal of the given title from the blacklist. Trigram counts were collected using the modifiers absolutely, almost, entirely, highly, nearly, perfectly, truly and very; these counts were summed for a given title, then divided by the aforementioned bigram score. Upon sampled inspection, all three-word titles were kept on the blacklist, along with any two-word title with a resultant ratio less than 0.028. For example, the titles Hardy Fish, Young Galaxy, and Sad Book were removed, while Common Cause, Bouncy Ball, and Heavy Oil were retained.
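A sketch of this refinement step follows; ngram_count is a hypothetical lookup into the n-gram collection, standing in for whatever access method was actually used.

    MODIFIERS = ["absolutely", "almost", "entirely", "highly",
                 "nearly", "perfectly", "truly", "very"]

    def keep_on_blacklist(adj, noun, ngram_count, threshold=0.028):
        # Sum trigram counts for "<adverb> adj noun" and compare against
        # the bare "adj noun" bigram; a high relative frequency suggests
        # compositional (gradable) use, so the title is removed.
        bigram = ngram_count((adj, noun))
        if bigram == 0:
            return True
        ratio = sum(ngram_count((m, adj, noun)) for m in MODIFIERS) / bigram
        return ratio < threshold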

5.3 Results

From the parsed BNC, 6,205,877 propositions were extracted, giving an average of 1.396 propositions per input sentence. These results were then used to explore the necessity of gazetteers, and the potential for extracting unary attributes. Quality judgements were performed using the 5-point evaluation scale described in Chapter 4, repeated here in Figure 5.1.

THE STATEMENT ABOVE IS A REASONABLY CLEAR, ENTIRELY PLAUSIBLE GENERAL CLAIM AND SEEMS NEITHER TOO SPECIFIC NOR TOO GENERAL OR VAGUE TO BE USEFUL:

1. I agree.
2. I lean towards agreement.
3. I'm not sure.
4. I lean towards disagreement.
5. I disagree.

Figure 5.1: Instructions for scaled judging.

5.3.1 Necessity of Gazetteers

From the total set of extracted propositions, 638,809 could be verbalized as statements of the form X MAY HAVE Y. There were 71,531 unique classes (X) for which at least a single candidate attribute (Y) was extracted, with 9,743 of those having at least one such attribute supported by a minimum of two distinct sentences. Table 5.2 gives the number of attributes extracted for the given classes when using only gazetteers, when using only the given names as class labels, and when using both together. While instance-based extraction generated more unique attributes, a significant number of results were derived exclusively from class labels. Further, as can be seen for cases such as Artist, class-driven extraction provided a large number of attribute candidates not observed when relying only on gazetteers (701 total candidate attributes were gathered from the union of the 441 and 303 candidates extracted with and without a gazetteer for Artist, respectively).

75

Class                    Both      Gaz.      Label
Continent                777       698       96
Composer                 722       651       98
Country                  7,285     5,993     1,696
Humanitarian             5         5         0
US State                 1,289     1,286     609*
Religious Leader         127       127       0
US City                  2,216     2,120     813*
Criminal/Outlaw          30        30        6/4*
World City               4,780     4,747     813*
Company                  3,968     1,553     2,941
Beverage                 53        53        0
Service Agency           85        83        2
Tycoon                   19        10        10
Scientist                798       750       60
TV Network               71        71        0
Religious Holiday        594       593       65*
Artist                   706       441       303
Civic Holiday            3         3         65*
Medicine                 29        2         27
Military Commander       71        71        26*
Weekday                  1,234     1,232     2
Intl Political Entity    673       673       0
Month                    2,282     1,875     474
Sports Celebrity         45        45        0
Dictator                 533       509       28
Activist Organization    63        63        0
Conqueror                103       84        19
Martial Art              3         3         0
Philosopher              672       649       37
Government Agency        295       294       2
Conductor                118       74        45
Criminal Organization    0         0         0
Singer                   220       179       49
US President             596       596       1,421*
Band                     349       58        303
Political Leader         568       568       170*
King                     811       208       664
Supreme Court Justice    0         0         18*
Queen                    541       17        532
Emperor                  436       211       259
Pope                     235       123       113
Fictitious Character     227       227       180*
Adventurer               32        27        5
Literary Work            9         9         0
Planet                   289       163       141
Engineer/Inventor        10        10        73/13*
River                    402       168       253
Famous Lawyer            0         0         72*
Deity                    1,037     1,027     19
Writer                   1,116     957       236
Architect                72        67        63
Film Maker               42        33        9
Show Biz Star            82        82        0
TOTAL                    35,723    29,518    8,506

Table 5.2: Extraction volume with and without using gazetteers. *Note: When results are zero after gaz. omission, values are reported for super-types, such as Holiday for the sub-type Civic Holiday, or City for US City. A/B scores are reported for each class used separately, e.g., Engineer/Inventor.


BasicFood
K (Food): quality, part, taste, value, portion..
D: species, pounds, cup, kinds, lbs, bowl..
Q: nutritional value, health benefits, glycemic index, varieties, nutrition facts, calories..

Religion
K: basis, influence, name, truths, symbols, principles, strength, practice, origin, adherent, god, defence..
D: teachings, practice, beliefs, religion spread, principles, emergence, doctrines..
Q: basic beliefs, teachings, holy book, practices, rise, branches, spread, sects..

HeavenlyBody
KG (Planet): surface, orbit, bars, history, atmosphere..
K (Planet): surface, history, future, orbit, mass, field..
K (Star): surface, mass, field, regions..
D: observations, spectrum, planet, spectra, conjunction, transit, temple, surface..
Q: atmosphere, surface, gravity, diameter, mass, rotation, revolution, moons, radius..

Painter
KG (Artist): works, life, career, painting, impression, drawings, paintings, studio, exhibition..
K (Artist): works, impression, career, life, studio..
K (Painter): works, life, wife, eye..
Q': paintings, works, portrait, death, style, artwork, bibliography, bio, autobiography, childhood..

Figure 5.2: Qualitative comparison of top extracted attributes; KG is Knext using gazetteers, K (class) is Knext for a class label similar to the heading, D and Q are document- and query-based results as reported in (Paşca et al., 2007), and Q' is query-based results reported in (Paşca and Van Durme, 2007).


Class        Both    Gazetteer    Class Label
King         1.2     1.9          1.3
Composer     1.5     1.5          2.1
River        1.9     1.9          1.5
Continent    1.5     1.9          2.0
Planet       1.9     3.2          1.6

Table 5.3: Average judged acceptability for the top ten attributes extracted for the given classes when using/not-using gazetteer information.

As can be seen, class-driven extraction can produce attributes of quality assessed on par with attributes extracted using only gazetteers. The noticeable drop in quality for the class Planet when only using gazetteers (3.2 mean judged acceptability) highlights the recurring problem of word sense ambiguity in extraction. The names of Roman deities, such as Mars or Mercury, are used to refer to a number of conceptually distinct items, such as planets within our solar system. Two of the attributes judged as poor quality for this class were bars and customers, respectively derived from the noun phrases (NP (NNP Mars) (NNS bars)) and (NP (NNP Mercury) (NNS customers)). Note that in both cases the underlying extraction is correctly performed; the error comes from abstracting to the wrong class. These NPs may arguably support verbalized propositions such as A CANDY-COMPANY MAY HAVE BARS and A CAR-COMPANY MAY HAVE CUSTOMERS. These examples point to additional areas for improvement beyond sense disambiguation: non-compositional phrase filtering for all NPs, rather than just in cases of adjectival modification (Mars bar is a Wikipedia topic); and relative discounting of the patterns used in the extraction process.3 This latter technique is commonly used in specialized extraction systems, such as those constructed by Snow et al. (2005), who fit a logistic regression model for hypernym (X is-a Y) classification based on WordNet, and Girju et al. (2003), who trained a classifier to look specifically for part-whole relations.

3 For example, (NP (NNP X) (NNS Y)) may be more semantically ambiguous than, e.g., the possessive construction (NP (NP (NNP X) (POS 's)) (NP (NNS Y))).


5.3.2 Unary Attributes

                   Size         % of Collection
                                Original    CAN BE
Original total     6,204,184    100         -
Filtered total     5,382,282    87          -
Original CAN BE    2,895,325    46          100
Filtered CAN BE    2,073,417    33          72
Whitelist          812,146      15          28
Blacklist          19,786       1
J × |S|: If |C_L| < K: Set P_JK = P_JK ∪ {⟨I, L⟩ | I ∈ S, ⟨I, L⟩ ∈ P}

Figure 6.1: Algorithm for extracting ⟨instance, class label⟩ pairs.

instances paired with L is at least J of the size of S. For example, if 37 instances in a cluster of 50 elements each had the label president, then if 37/50 > J, president would be viable. If a label is viable based on the intra-cluster constraint, we then verify whether it is acceptable according to the inter-cluster constraint K. C_L is the set of all clusters where at least one member of each cluster is paired with the label L. If the number of such clusters, |C_L|, is less than K, we consider L to be a good label for the supporting instances in S. Each instance in S that is paired with L is added to our filtered collection P_JK, representing an assignment of those instances to the class specified by L. Continuing our example, if there were 5 clusters that each had at least one element labeled president, and K > 5, then each of the elements in the cluster under consideration having the label president would be recorded as true instances of that class.
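Under the notation just given, the procedure of Figure 6.1 can be rendered as a short program; the following is a sketch assuming P is a set of (instance, label) pairs and clusters is a list of instance sets grouped by distributional similarity.

    from collections import defaultdict

    def extract_pairs(P, clusters, J, K):
        labels_of = defaultdict(set)
        for inst, label in P:
            labels_of[inst].add(label)

        # |C_L|: number of clusters containing some instance with label L
        n_clusters_with = defaultdict(int)
        for S in clusters:
            for label in {l for i in S for l in labels_of[i]}:
                n_clusters_with[label] += 1

        P_JK = set()
        for S in clusters:
            support = defaultdict(int)  # label -> supporting instances in S
            for inst in S:
                for label in labels_of[inst]:
                    support[label] += 1
            for label, n in support.items():
                # intra-cluster constraint (J), inter-cluster constraint (K)
                if n > J * len(S) and n_clusters_with[label] < K:
                    P_JK |= {(i, label) for i in S if label in labels_of[i]}
        return P_JK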

6.1.2 Discussion

From an information retrieval perspective, the clusters provided as input to the extraction algorithm can be seen as documents, whereas the class labels are equivalent to document terms. In this light, the extraction algorithm offers what the traditional TF×IDF weighting scheme offers in information retrieval. The normalized term frequency, TF, is the number of instances in a cluster initially assigned a given label, divided by the total number of instances in that cluster. Note that while this TF-like score is a 0-1 measure of relative frequency, it is possible in our case for the TFs of distinct labels assigned to members of the same cluster to sum to a value greater than one.3 Our parameter J directly constrains the TF as described. IDF is usually taken to be the log of the total number of documents divided by the number of documents containing the given term; this value is used based on the belief that terms (labels) with wide distribution across documents (clusters) are less significant than those occurring more narrowly. As done with IDF values, our use of the parameter K allows for limiting the "spread" of a term (class label). However, we desired the ability to regulate this spread directly in terms of the number of clusters covered (e.g., 1, 2, ..., 30, ...), rather than the log of the relative percentage.

3 That is, it may occur that many of the instances in a single cluster share the same labels. For example, the labels politician, public speaker, and elected official may each be assigned to the same 50% of a cluster (giving them each a TF of 0.5).

6.2 Experimental Setting

6.2.1 Data

Experiments relied on the unstructured text available within a collection of approximately 100 million Web documents in English, as available in a Web repository snapshot from 2006 maintained by the Google search engine. The textual portion of the documents was cleaned of HTML, tokenized, split into sentences, and part-of-speech tagged using the TnT tagger (Brants, 2000). Clusters of related terms were collected similarly to Lin and Pantel (2002). Initial ⟨instance, label⟩ pairs were extracted in a manner similar to Hearst (1992).

[Figure: two plots of the number of classes extracted as a function of K; one panel for J = 0.1, 0.2, 0.3, 0.4 with K ranging from 1 to 5, and one for J = 0.01, 0.05, 0.09 with K ranging from 0 to 30.]

Figure 6.2: Number of classes extracted at strict and less prohibitive settings of J and K.

Size     Number    Size    Number
≤ ∞      8,572     ≤ 25    4,322
≤ 500    8,311     ≤ 10    1,681
≤ 50     6,089     ≤ 5     438

Table 6.1: For J = 0.01, K = 30, the number of classes whose size ≤ N.


6.2.2 Extraction

Classes were extracted across a range of parameter settings. As expected, Figure 6.2 shows that as J (the requirement on the number of instances within a class that must share a label for it to be viable) is lowered, one sees a corresponding increase in the number of resultant classes. Similarly, the more distributed across clusters a label is allowed to be (K), the larger the return. Table 6.1 shows the distribution of class sizes at a particular parameter setting.


J       K     |P_JK|        |I_JK|     |I_JK \ I_400k|    |L_JK|
0       ∞     44,178,689    880,535    744,890            7,266,464
0.01    30    715,135       262,837    191,012            8,572
0.20    5     52,373        36,797     21,309             440

Table 6.2: For given values of J and K, the number of: instance-label pairs (P_JK); instances (I_JK); instances after removing those also appearing in WN400k (I_JK \ I_400k); and class labels (L_JK). Note that when J = 0 and K = ∞, P_JK = P.

6.3 Evaluation

Unless otherwise stated, evaluations were performed on results gathered using either wide (J = 0.01, K = 30) or narrow (J = 0.20, K = 5) settings of J and K, under the procedure described. Certain experiments made use of an automatically expanded version of WordNet (Snow et al., 2006) containing 400,000+ synsets, referred to here as WN400k. In those cases where the judgement task was binary (i.e., good or bad), the subscripts given for precision scores reflect Normal-based 95% confidence intervals.
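These intervals follow the usual Normal approximation for a proportion; for example:

    # Normal-based 95% confidence half-width for a judged precision,
    # matching the subscripts reported below (e.g., 97/100 -> ±3.3%).
    import math

    def ci_half_width(good, total, z=1.96):
        p = good / total
        return z * math.sqrt(p * (1 - p) / total)

    # 100 * ci_half_width(97, 100) -> 3.3, so 97±3.3%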

6.3.1 Instance Vocabulary

To determine the quality of the underlying instance vocabulary, one hundred randomly selected instances were assessed for the narrow and wide settings, independent of their proposed class labels (relevant population sizes are shown in Table 6.2). As can be seen in Table 6.3, the vocabulary was judged to be near perfect. Quality was determined by both judges4 coming to agreement on the usefulness of each term. As a control, an additional set of ten elements drawn from WN400k was mixed within each sample (for a total of 110). Of these control items, only one was deemed questionable.5 Examples of both positive and negative judgements can be seen in Table 6.4.

4 The authors of Van Durme and Paşca (2008).
5 The questionable item was christopher reeves, whose proper spelling, christopher reeve, is not contained in WN400k.


J       K     Good      %
0.01    30    97/100    97±3.3%
0.20    5     98/100    98±2.7%

Table 6.3: Assessed quality of underlying instances.

Instance           Good?    Instance              Good?
fast heart rate    yes      electric bulb         yes
local produce      yes      south east            yes
severe itching     yes      finding directions    yes
moles and voles    no       h power               no

Table 6.4: Select examples from the instance assessment.

Instances such as fast heart rate, local produce, and severe itching were considered proper, despite adjectival modification, as they are common enough terms to warrant being treated as unique vocabulary items (each appears in unmodified form in WordNet 3.0). Allowing instances such as moles and voles is complicated by the worry of needing to allow an exponential number of such conjunctive pairings. While the presence of a pair in text gives evidence towards considering it as a stand-alone instance, conjunctives were conservatively rejected unless they formed a proper name, such as a movie title or a band name.6 Types of events, such as finding directions or swimming instruction, were explicitly allowed, as these may have their own attributes.7

6 Thus excluding such gems as oil and water, which (as popularly known) don't mix.
7 E.g., Finding directions [is frustrating], or Swimming instruction [is inexpensive].

6.3.2 Class Labels

Table 6.5 summarizes the results of manually evaluating sets of 100 randomly selected pairs for both the wide (J = 0.01, K = 30) and narrow (J = 0.20, K = 5) parameter settings. In order to establish a baseline, a similar sample was assessed for pairs taken directly from the input data.


              P_JK                   P_JK \ {⟨I_400k, ·⟩}
J       K     Eval      Precision    Eval      Precision
0       ∞     34/100    34±9.3%      27/100    27±8.1%
0.01    30    86/100    86±6.9%      75/100    75±8.5%
0.20    5     91/100    91±5.6%      95/100    95±4.3%

Table 6.5: Quality of pairs, before and after removing instances already in WN400k.

Instance                 Class                   Good?
go-karting               outdoor activities      yes
ian and sylvia           performers              yes
italian foods            foods                   yes
international journal    professional journal    no
laws of florida          applicable laws         no
farnsfield               main settlements        no
ellroy                   favorite authors        no

Table 6.6: Interesting or questionable pairs.

To evaluate novelty, an additional assessment was done after removing all pairs containing a term occurring in WN400k. As shown, the method was successful in separating high-quality class/instance pairs from the lower average quality input data. Even when removing many of the more common instances found in WN400k, quality remained high; in this case there also appeared a statistically significant difference between the wide and narrow settings. That the number of classes dramatically rose as J decreased, yet still retained high quality at 0.01, supports the intuition that the labeling method of Pantel and Ravichandran (2004) might ignore potentially useful sets of instances that have the misfortune of being scattered across a small number of semantically related clusters.

Examples from the evaluation on pairs can be seen in Table 6.6. Subjective labels such as favorite authors were considered bad because of difficulties in assigning a clear interpretation.8


Similarly disallowed were class labels overly reliant on context, such as main settlements and applicable laws. The instance international journal is an example of imperfect input data, being most likely a substring of, e.g., international journal of epidemiology. Conversely, ian and sylvia are a pair of singers who performed as a group, which was allowed. One observed pair, ⟨wild turkeys, small mammals⟩, led to a manual search to determine possible source sentences. The double-quoted query "mammals such as * wild turkeys"9 was submitted to the Google search engine, giving a total of six results, including:

- [White Ash] Provides a late winter food source for birds and small mammals such as wild turkey, evening grosbeak, cedar waxwings and squirrels.
- A wide diversity of songbirds and birds of prey, as well as mammals such as deer, wild turkeys, raccoons, and skunks, benefit from forest fragmentation.
- Large mammals such as deer and wild turkeys can be seen nearly year-round.

The first sentence highlights the previously mentioned pitfalls inherent in using template-based patterns. The second sentence is arguably ambiguous, although the most natural interpretation is false. The final sentence is worse yet, exemplifying the intuition that Web text is not always trustworthy.

8 For instance, Albert Einstein may be a famous scientist, suggesting the label as a worthwhile class, but what about Alan Turing? Under a conjunctive reading, [[x Famous] ∧ [x Scientist]], we might say no, but under a functional reading (using here the EL operator attr), [x ((attr Famous) Scientist)], it may be more appropriate (e.g., Alan Turing is famous amongst scientists, or Compared to other scientists, Alan Turing is famous).
9 The query was performed with and without the wildcard.

6.3.3 Expanding a Class

In some cases it may be necessary or desired to expand the size of a given class through select relaxation of constraints (i.e., less restrictive values of J and/or K for a pre-specified label L). To understand the potential effects such relaxation may have on quality, three classes were randomly selected from each of three size ranges: small classes (< 50): prestigious private schools, telfair homebuilders, plant tissues; medium classes (< 500): goddesses, organisms, enzymes; and large classes (≥ 500): flavors, critics, dishes. Each of these classes was required to have shown a growth in size greater than 50% between the most and least restrictive parameter settings explored. Up to 50 instances were sampled from the minimum-sized version of each class. From the respective maximum-sized versions, a similar number of instances were sampled from those remaining once elements also appearing in the minimum set were removed (i.e., the sample came only from the instances added as a result of loosening parameter constraints).10

For each of the three small classes, accuracy was judged 100% both before and after expansion, showing that even small, precise classes do not always land together in distributionally similar clusters. If a class expands as constraints are loosened, then new members must derive from clusters previously not contributing to the class. In some cases this might result in a class being incorrectly "spread out" over clusters of instances that are related to, but not precise members of, the given class. For example, for the class goddesses, many of the additional instances were actually male deities; in the case of enzymes, most of the newly added instances were amino acids, which make up, but do not fully constitute, an enzyme. Quantitatively, the proportion of such nearly correct instances increased from 40% to 66% for the class enzymes and from 29% to 44% for the class goddesses.

10 For example, a class of 10 members under restrictive settings that grew to a size of 16 as constraints were relaxed would be a viable "small" candidate, with 8 elements then sampled each from the min (=10) and max (=16) versions of the class.
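The sampling scheme of this comparison is summarized by the following sketch; the names are illustrative, with min_version and max_version standing for the instance sets under the most and least restrictive settings.

    import random

    def expansion_samples(min_version, max_version, n=50,
                          rng=random.Random(0)):
        # sample from the original class, and separately from only the
        # instances gained by relaxing J and K
        gained = sorted(set(max_version) - set(min_version))
        before = rng.sample(sorted(min_version), min(n, len(min_version)))
        after = rng.sample(gained, min(n, len(gained)))
        return before, after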

6.3.4 Handling Pre-nominal Adjectives

Many of the classes obtained by this method, especially those with few members, had labels containing pre-nominal adjectival modification. For example, Table 6.7 gives each of the discovered refinements for the classes writers and weapons, most of which would be evaluated as distinct, useful subclasses.11

11 In the context of this project, prenominal modification was treated as creating subclasses, in contrast to the work discussed in Chapter 5, where it was taken as evidence for unary attributes.

writers    american, ancient, british, christian, classical, contemporary, english, famous, favorite, french, great, greek, indian, prominent, roman, romance, spanish, talented, veteran

weapons    advanced, bladed, current, dangerous, deadly, lethal, powerful, projectile, small, smart, sophisticated, traditional

Table 6.7: Candidate refinements discovered for the classes writers and weapons.

S       Ratio      %      S      Ratio       %
2^12    0/4        0%     2^6    342/961     36%
2^11    1/32       3%     2^5    627/1566    40%
2^10    4/73       5%     2^4    860/1994    43%
2^9     19/143     13%    2^3    852/1820    47%
2^8     68/259     26%    2^2    455/936     49%
2^7     170/541    31%    2^1    108/243     44%

Table 6.8: For each size range, the number of classes with a label whose first term has an adjective reading (S = ⌊size⌋).

Table 6.8 shows the number of classes, by size, whose label contained two or more words and whose first word appeared in the adjective listing of WordNet 3.0. For instance, of the 32 classes containing between 2^11 and 2^12 instances, only one label (3%) was adjective-initial. Compare this to the 936 classes containing between 4 and 8 instances, where 455 (49%) of the labels began with a term with an adjective reading.

Instance                         Class
lamborghini murcielago           real cars
spanish primera division         domestic leagues
dufferin mall                    nearby landmarks
colegio de san juan de letran    notable institutions
fitness exercise                 similar health topics

Table 6.9: Examples of ⟨instance, class⟩ pairs sampled from amongst classes with fewer than 10 members, where the label was deemed unacceptable.

By limiting the number of clusters (via K) that may contain instances of a class, we necessarily penalize classes that may legitimately be of exceptionally large size. Discovery of these classes is sacrificed so that large numbers of bad labels, which characteristically tend to occur across many clusters, are filtered out. However, it is possible to recover some of these large classes by recognizing and removing adjectival refinements from accepted labels and merging the results into a larger, coarser class; for example, each of the sub-types of writers can be considered a member of a single, simplified class.12 When leading adjectives were removed, with the exception of lexicalized terms found in WordNet 3.0,13 the 8,572 classes reduced to 3,397.

Informal sampling gave evidence that the smallest classes being extracted tended to be of lower quality than those of larger size, primarily because of leading subjective adjectives (examples are seen in Table 6.9). This prompted an evaluation of 100 pairs taken from classes with fewer than 10 elements, which were judged to have an average quality of just 71±5%. After removing initial adjectives this improved to 91%; adding a check for lexicalized terms raised this to 92% (where a 1% gain is not statistically significant for a sample of this size).

12 This heuristic fails for nonsubsective adjectives, such as those known as privatives, exemplified by fake flower and former democracy (Kamp and Partee, 1995).
13 E.g., {new york} clubs, {romance languages}, {gold medal} winners.


6.3.5 Potentially Noisy Web Documents

Work such as Gordon et al. (2009) has considered the relative quality of KA results extracted from online materials as compared to traditional corpora.14 Here, an example of the impact of weblogs could be seen in the extracted pair ⟨nascars, race cars⟩. This pair was initially judged to be an error, until a post to a web forum was discovered which discussed why nascars was not a real word, despite however many people use it. Not wishing to be prescriptivist, we deemed the pair valid.

14 The intuition is that much of the bulk of online content is user-generated, without editorial supervision (such as in books or newspapers), and therefore more likely to contain both grammatical and factual errors. The potential for genre bias is a related worry, something I've considered in the context of using web data in psycholinguistic experiments (e.g., Van Durme et al. (2009a)).

6.3.6 Task-Based Evaluation

For a better understanding of the usefulness of the extracted classes beyond the high accuracy scores achieved in manual evaluations, a separate set of experiments used the extracted classes as input data for the task of extracting attributes (e.g., circulatory system, life cycle, evolution and food chain) of various classes (e.g., marine animals). The experiments followed an approach introduced in Paşca (2007), which acquires ranked lists of class attributes from query logs, based on a set of instances and a set of seed attributes provided as input for each class. The only modification was a tweak in the internal representation and ranking of candidate attributes, to allow for the extraction of attributes when five seed attributes are provided for only one class, rather than for each input class as required in the original approach. Thus, a ranked list of attributes was extracted automatically for each of the classes generated by our algorithm when J is set to 0.01 and K is set to 30, from a random sample of 50 million unique, fully anonymized queries in English submitted by Web users to the Google search engine in 2006. As in previous work (summarized in Chapter 2), each attribute in the extracted ranked lists was assigned a score of 1, if the attribute was vital, i.e., it must be present in an ideal list of attributes of the class; 0.5, if the attribute was okay, as it provides useful but non-essential information; or 0, if the attribute was incorrect. Precision at some rank N in a list was thus measured as the sum of the assigned values of the first N candidate attributes, divided by N. When evaluated over a random sample of 25 classes out of the larger set of classes acquired from text, the open-domain classes extracted in this chapter produced attributes at accuracy levels reaching 70% at rank 10, and 67% at rank 20. For example, for the class label forages, which was associated with a set of instances containing alsike clover, rye grass, tall fescue, sericea lespedeza, etc., the ranked list of extracted attributes was [types, picture, weed control, planting, uses, information, herbicide, germination, care, fertilizer, ...]. Further results are found in Paşca and Van Durme (2008).
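This rank-based measure is straightforward to compute; a minimal Python sketch (the function name is mine):

    def precision_at(n, scores):
        # scores: assessed values for a ranked attribute list, best-first,
        # using 1.0 (vital), 0.5 (okay), 0.0 (incorrect).
        return sum(scores[:n]) / n

    # e.g., precision_at(4, [1.0, 0.5, 1.0, 0.0]) == 0.625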

6.4 Related Work

The given algorithm is most similar to that described by Pantel and Ravichandran (2004), in that both begin with clusters of instances grouped by distributional similarity. However, where the goal of those authors was to assign labels that best fit a given cluster, the method presented here is meant to assign instances that best fit a given label. To highlight this distinction, consider a hypothetical cluster of proper nouns, each standing for a US politician. Some of these may be senators, some of them presidents, and a few may be persons of political importance who have never actually held office. The most coherent label for this set may be politician or representative, which Pantel and Ravichandran apply universally to all members of the set. Meanwhile, a second cluster may exist containing mostly historical American leaders such as Abe Lincoln, George Washington, and Ben Franklin, this set having the dominant label of president. The method presented here is aimed at teasing apart these related sets in order to assign a label, e.g., president, to instances of multiple clusters, and only to those instances where there is direct evidence to support it. This allows for more conservative class assignment (leading to higher precision), and a greater diversity of classes. On 1,432 clusters, Pantel and Ravichandran reported a labelling precision of 72%, with the average cluster size not provided. The relevant entries in Table 6.10 refer to their success at hypernym labeling based on labels extracted for the top three members of each cluster, a task more directly comparable to the focus here.


All refers to their precision for all hypernym labels collected, while Proper is for the subset of All dealing with proper nouns. Snow et al. (2006) gave a model for instance tagging with the added benefit of possibly choosing the proper sense of a proposed label. By taking advantage of the pre-existing structural constraints imposed by WordNet, the authors report a fine-grained (sense differentiated) labelling precision of 68% on 20,000 pairs. While not directly reported, one may derive the non-sense-differentiated precision from the given fine-grained and disambiguated precision scores. For the task targeted in this chapter their system achieved 69.4% accuracy on 20,000 pairs (termed WN20k in Table 6.10).15 By comparison, at narrow settings of J and K, over 50,000 pairs were labeled with a judged precision of 91%, and without the use of a manually specified taxonomy. The comparative evaluation scores reported in Table 6.10 derive from judgements made by the respective judges independently, on different datasets. As such, the scores inherently include subjective components, suggesting that they should be taken only as reference points with respect to the estimated coverage and precision of the different methods. It is reasonable to interpret the results from Table 6.10 as an indication that the method given here is competitive with the state of the art (as of Van Durme and Paşca (2008)), though the results should not be used for strict objective ranking against previous work. Wang and Cohen (2007) gave state of the art results for a seed-based approach to class clustering, a sub-part of the problem considered here. The authors need only specify three examples for a given class in order for their system to automatically induce a wrapper for scraping similar instances from semi-structured content in Web documents. Average precision scores of 93-95% were reported for experiments on 12 classes, across English, Chinese and Japanese texts. A trade-off for accuracy is the need to specify seeds. In comparison, the method here requires no seeds, does not make use of semi-structured data, and provides class labels along with representative instances.

15 Snow et al. give c1/total = 68/100 as fine-grained precision when total = 20,000, with an associated disambiguated precision of c1/(c1 + c2) = 98/100. Label precision prior to sense selection must therefore be (c1 + c2)/total = 69.4%.


Source       |P|       Precision
WN1k         1,000     93.0%
WN20k        20,000    69.4%
CBC Proper   65,000    81.5%
CBC All      159,000   68.0%
JK Narrow    52,373    91.0%
JK Wide      715,135   86.0%

Table 6.10: Comparison to coverage and precision results reported in the literature: CBC (Pantel and Ravichandran, 2004), WN (Snow et al., 2006).

Table 6.10: Comparison to coverage and precision results reported in the literature, CBC (Pantel and Ravichandran, 2004), WN (Snow et al., 2006). one would be required to specify roughly 10,000 examples for this seed-based method to acquire even the 3,397 merged, adjective pruned classes described earlier. A hybrid approach was seen in later work (Talukdar et al., 2008), which used the results reported on in this chapter, along with class/instance pairs extracted from semistructured content, as seeds in a larger bootstrapping system.

6.5 Summary

This chapter details a method for the extraction of large numbers of concept classes along with their corresponding instances. Through the use of two simply regulated constraints, imperfect collections of semantically related terms and ⟨instance, label⟩ pairs may be used together to generate large numbers of classes with state of the art precision. These gazetteers are a primary component of many knowledge extraction systems, such as those described in Chapter 2.


7 Using Ontologies

The ultimate goal motivating this dissertation is to enable the construction of systems with common-sense reasoning and language understanding abilities. As has been discussed, such systems will require large amounts of general world knowledge, and large text corpora are an attractive potential source of such knowledge. However, current natural language understanding methods are not general and reliable enough to enable broad assimilation, in a formalized representation, of explicitly stated knowledge in encyclopedias or similar sources. As well, such sources typically do not cover the most obvious facts of the world, such as that ice cream may be delicious and may be coated with chocolate, or that children may play in parks. As described in Chapter 3, Knext output describes properties or situations that "at least occasionally" obtain in the world, like those about ice cream and children just mentioned, but these are quite weak as general claims, and – being unconditional – are unsuitable for inference chaining. Consider however the fact that when something is said, it is generally said by a person, organization or text source; this is a conditional statement dealing with the potential agents of saying, and could enable useful inferences. For example, in the sentence, The tires were worn and they said I had to replace them, they might be mistakenly identified with the tires, without the knowledge that saying is something done primarily by persons, organizations or text sources. Similarly, looking into the future one can imagine telling a household robot, The cat needs to drink something, with the expectation that the robot will take into account that if a cat drinks something, it is usually water or milk (whereas people would often have broader options).


The work reported here is aimed at deriving generalizations of the latter sort from large sets of weaker propositions, by examining the hierarchical relations among sets of types that occur in the argument positions of verbal or other predicates. The generalizations aimed at here are certainly not the only kinds derivable from text corpora (as the extensive literature referenced on finding isa-relations, partonomic relations, paraphrase relations, etc. attests), but as just indicated they do seem potentially useful. Also, thanks to their grounding in factoids obtained by open knowledge extraction from large corpora, the propositions obtained are very broad in scope, unlike knowledge extracted in a more targeted way. In the following I outline an approach to obtaining strengthened propositions from large sets of automatically acquired factoids. Positive results are reported, while making only limited use of standard corpus statistics, leading to the conclusion that future endeavors exploring knowledge extraction and WordNet should go beyond the heuristics employed in recent work. The goal in this work, with respect to the example given, would be to derive, with the use of a large collection of Knext outputs, a general statement such as If something sleeps, it is probably either an animal or a person, which gives the logical form1

(Usual x: (∃e: [x Sleep e]) [[x Animal] ∨ [x Person]]).

7.1 Resources

7.1.1 WordNet and Senses

While the community continues to make gains in the automatic construction of reliable, general ontologies, the WordNet sense hierarchy (Fellbaum, 1998) remains the resource of choice for many computational linguists requiring an ontology-like structure. In the work discussed here I explore the potential of WordNet as an underlying concept hierarchy on which to base generalization decisions.

1 Where I have taken "sleeps" to mean simply that x has slept on some occasion, e.


The use of WordNet raises the challenge of dealing with multiple semantic concepts associated with the same word; i.e., employing WordNet requires word sense disambiguation in order to associate terms observed in text with concepts (synsets) within the hierarchy. In their work on determining selectional preferences, both Resnik (1997) and Li and Abe (1998) relied on uniformly distributing observed frequencies for a given word across all its senses, an approach later followed by Pantel et al. (2007).2 Others within the knowledge acquisition community have favored taking the first, most dominant sense of each word (e.g., see Suchanek et al. (2007) and Paşca (2008)). As will be seen, the algorithm given here does not select word senses prior to generalizing them, but rather as a byproduct of the abstraction process. Moreover, it potentially selects multiple senses of a word deemed equally appropriate in a given context, and in that sense provides coarse-grained disambiguation. This also prevents the contribution of a term to the abstraction from being exaggerated as a result of the term being lexicalized in a particularly fine-grained way.
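To make the contrast concrete, the two counting strategies can be sketched in Python with NLTK's WordNet interface (illustrative only; not the code of any of the cited systems):

    from collections import defaultdict
    from nltk.corpus import wordnet as wn

    def add_counts(word, counts, strategy="uniform"):
        # Credit one observation of `word` to its WordNet noun synsets:
        # either spread uniformly over all senses, or given entirely to
        # the first, most dominant sense.
        senses = wn.synsets(word, pos=wn.NOUN)
        if not senses:
            return
        if strategy == "first":
            counts[senses[0]] += 1.0
        else:
            for s in senses:
                counts[s] += 1.0 / len(senses)

    counts = defaultdict(float)
    for w in ["bank", "bank", "company"]:
        add_counts(w, counts)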

7.1.2 Propositional Templates

While the procedure given here is not tied to a particular formalism for representing semantic context, in these experiments I make use of propositional templates, based on the verbalizations arising from Knext logical forms. Specifically, a proposition F with m argument positions generates m templates, each with one of the arguments replaced by an empty slot. Hence, the statement a man may give a speech gives rise to two templates, a man may give a ___ and a ___ may give a speech. Such templates match statements with identical structure except at the template's slots. Thus, the factoid a politician may give a speech would match the second template. The slot-fillers from matching factoids (e.g., man and politician) form the input lemmas to the abstraction algorithm described below.2

2 Rahul Bhagat, personal communication.


Additional templates are generated by further weakening predicate argument restrictions. Nouns in a template that have not been replaced by a free slot can be replaced with a wild-card, indicating that anything may fill that position. While slots accumulate their arguments, these wild-cards do not, serving simply as relaxed interpretive constraints on the original proposition. For the running example we would have a ___ may give a ? and a ? may give a ___, yielding observation sets pertaining to things that may give, and things that may be given.3 Attention is not restricted solely to two-argument verbal predicates: examples such as a person can be happy with a ___ and a ___ can be magical can be seen in Section 7.4.
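As an illustration, template generation can be sketched as follows, using a simplified tuple representation in Python rather than Knext's actual logical forms (the names are mine):

    def templates(pred, args):
        # From a verbalized proposition with m argument positions, produce
        # the m slotted templates, plus variants with other nouns wildcarded.
        out = []
        for i in range(len(args)):
            slotted = list(args)
            slotted[i] = "___"            # the free slot accumulates arguments
            out.append((pred, tuple(slotted)))
            for j in range(len(args)):
                if j != i:
                    relaxed = list(slotted)
                    relaxed[j] = "?"      # wild-card: relaxed constraint only
                    out.append((pred, tuple(relaxed)))
        return out

    # templates("may give", ("a man", "a speech")) yields
    # ("may give", ("___", "a speech")), ("may give", ("___", "?")),
    # ("may give", ("a man", "___")), and ("may give", ("?", "___")).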

7.2 Deriving Types

The method for type derivation assumes access to a word sense taxonomy, providing:

𝒲 : set of words, potentially multi-token
N : set of nodes, e.g., word senses, or synsets
P : N → {N*} : parent function
S : 𝒲 → (N+) : sense function
L : N × N → Q≥0 : path length function

L is a distance function based on P that gives the length of the shortest path from a node to a dominating node, with base case L(n, n) = 1. When appropriate, I write L(w, n) to stand for the arithmetic mean over L(n′, n) for all senses n′ of w that are dominated by n.4 In the definition of S, (N+) stands for an ordered list of nodes. I refer to a given predicate argument position for a specified propositional template simply as a slot.

3 It is these most general templates that best correlate with existing work in verb argument preference selection; however, a given Knext logical form may arise from multiple distinct syntactic constructs.
4 E.g., both senses of female in WN are dominated by the node for (organism, being), but have different path lengths.


function Score(n ∈ N, α ∈ R+, C ⊆ W ⊆ 𝒲):
    C′ ← D(n) \ C
    return ( Σ_{w∈C′} L(w, n) ) / |C′|^α

function DeriveTypes(W ⊆ 𝒲, m ∈ N+, p ∈ (0, 1]):
    α ← 1, C ← {}, R ← {}
    while |C| < p × |W| :                      ▹ while too few words covered
        n′ ← argmin_{n ∈ N \ R} Score(n, α, C)
        R ← R ∪ {n′}
        C ← C ∪ D(n′)
        if |R| > m :                           ▹ cardinality bound exceeded – restart
            α ← α + δ, C ← {}, R ← {}
    return R

Figure 7.1: Algorithm for deriving slot type restrictions, with δ representing a fixed step size.

W ⊆ 𝒲 will stand for the set of words found to occupy a given slot (in the corpus employed), and D : N → W* is a function mapping a node to the words it (partially) sense dominates. That is, for all n ∈ N and w ∈ W, if w ∈ D(n) then there is at least one sense n′ ∈ S(w) such that n is an ancestor of n′, as determined through use of P. For example, we would expect the word bank to be dominated by a node standing for a class such as company, as well as by a separate node standing for, e.g., location. Based on this model I give a greedy search algorithm in Figure 7.1 for deriving slot type restrictions. The algorithm attempts to find a set of dominating word senses that cover at least one sense of each of a majority of the words in the given set of observations. The idea is to keep the number of nodes in the dominating set small, while maintaining high coverage and not abstracting too far upward.


For a given slot I start with a set of observed words W, an upper bound m on the number of types allowed in the result R, and a parameter p setting a lower bound on the fraction of items in W that a valid solution must dominate. For example, m = 3 and p = 0.9 require that the solution consist of no more than 3 nodes, which together must dominate at least 90% of W. The search begins by initializing the cover set C and the result set R as empty, with the variable α set to 1. Observe that at any point in the execution of DeriveTypes, C represents the set of all words from W with at least one sense having as an ancestor a node in R. While C remains smaller than the percentage required for a solution, nodes are added to R based on whichever element of N has the smallest score. The Score function first computes the modified coverage of n, setting C′ to be all words in W that are dominated by n and that haven't yet been "spoken for" by a previously selected (and thus lower scoring) node. Score returns the sum of the path lengths between the elements of the modified set of dominated words and n, divided by that set's size raised to the exponent α. Note that when α = 1, Score simply returns the average path length of the words dominated by n. If the size of the result grows beyond the specified threshold m, R and C are reset, α is incremented by a step size δ, and the search starts again. As α grows, the function increasingly favors the coverage of a node over the summed path length. Each iteration of DeriveTypes thus represents a further relaxation of the desire to have the returned nodes be as specific as possible. Eventually, α will be such that the minimum scoring nodes are found high enough in the tree to cover enough of the observations to satisfy the threshold p, at which point R is returned.
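For concreteness, a compact Python rendering of DeriveTypes follows. This is a sketch rather than the original implementation: the taxonomy accessors N (candidate nodes), D (node to dominated-word set), and L (word-to-node path length) are assumed to be supplied, e.g., computed from WordNet.

    def derive_types(W, m, p, N, D, L, delta=0.1):
        # Greedy search of Figure 7.1. Assumes N is large enough that a
        # covering solution of size <= m exists for some alpha.
        alpha = 1.0
        while True:
            C, R = set(), set()
            within_bound = True
            while len(C) < p * len(W):
                def score(n):
                    newly = D(n) - C     # words not yet spoken for
                    if not newly:
                        return float("inf")
                    return sum(L(w, n) for w in newly) / (len(newly) ** alpha)
                n_best = min((n for n in N if n not in R), key=score)
                R.add(n_best)
                C |= D(n_best)
                if len(R) > m:           # cardinality bound exceeded
                    within_bound = False
                    break
            if within_bound:
                return R
            alpha += delta               # relax: favor coverage, restart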

7.2.1 Non-reliance on Frequency

As can be observed, this approach makes no use of the relative or absolute frequencies of the words in W, even though such frequencies could be added as, e.g., relative weights on length in Score. This is a purposeful decision, motivated by both practical and theoretical concerns.


Practically, a large portion of the knowledge observed in Knext output is infrequently expressed, and yet many of these factoids tend to be reasonable claims about the world (despite their textual rarity). For example, a template shown in Section 7.3, a ___ may wear a crash helmet, was supported by just two sentences in the BNC. However, based on those two observations this method was able to conclude that usually If something wears a crash helmet, it is probably a male person. Initially this project began as an application of the closely related MDL approach of Li and Abe (1998), but was hindered by sparse data. It was observed that the absolute frequencies from the data were often too low to perform meaningful comparisons of relative frequency, and that different examples in development tended to call for different trade-offs between model cost and coverage. This was due as much to the sometimes idiosyncratic structure of WordNet as it was to lack of evidence.5

Theoretically, the goal of this work is distinct from related efforts in acquiring, e.g., verb argument selectional preferences. That work is based on the desire to reproduce the distributional statistics underlying the text, and thus relative differences in frequency are the essential characteristic. In contrast, this work aims for general statements about the real world, for which text serves only as a limited proxy view. E.g., given 40 hypothetical sentences supporting a man may eat a taco, and just 2 sentences supporting a woman may eat a taco, I would like to conclude simply that a person may eat a taco, remaining agnostic as to relative frequency, as there is no reason to view corpus-derived counts as being (strongly) tied to the likelihood of corresponding situations in the world; they simply tell us what is generally possible and worth mentioning.6

5 For the given example, this method (along with the constraints of Table 7.1) led to the overly general type living thing.
6 See discussion in Chapter 3 on reporting bias.


Word              #   Gloss
abstraction       6   a general concept formed by extracting common features from specific examples
attribute         2   an abstraction belonging to or characteristic of an entity
matter            3   that which has mass and occupies space
physical entity   1   an entity that has physical existence
whole             2   an assemblage of parts that is regarded as a single entity

Table 7.1: ⟨word, sense #⟩ pairs in WordNet 3.0 considered overly general for these experiments.

7.3 Development

7.3.1 Tuning to WordNet

The method as described thus far is not tied to a particular word sense taxonomy. Experiments reported here relied on the following model adjustments in order to make use of WordNet (version 3.0). The function P was set to return the union of a synset's hypernym and instance hypernym relations.7 Regarding the function L, WordNet is constructed such that always picking the first sense of a given nominal tends to be correct more often than not (see discussion by McCarthy et al. (2004)). To exploit this structural bias, I employed a modified version of L that results in a preference for nodes corresponding to the first sense of the words to be covered, especially when the number of distinct observations is low (such as earlier, with crash helmet):

L(n, n) = 1 − 1/|W|   if ∃w ∈ W : S(w) = (n, ...)
L(n, n) = 1           otherwise

Note that when |W| = 1, L returns 0 for the term's first sense, resulting in a score of 0 for that synset. This will be the unique minimum, leading DeriveTypes to act as the first-sense heuristic when used with single observations.

7 See Miller and Hristea (2006) for details on this distinction.


Propositional Template            Number
A ___ CAN BE WHISKERED            4
GOVERNORS MAY HAVE ___-S          28
A ___ CAN BE PREGNANT             105
A PERSON MAY BUY A ___            6
A ___ MAY BARK                    713
A COMPANY MAY HAVE A ___          8
A ___ MAY SMOKE                   33
A ___ CAN BE TASTY                4
A SONG MAY HAVE A ___             31
A ___ CAN BE SUCCESSFUL           664
A ___ CAN BE AT A ROAD            20
A ___ CAN BE MAGICAL              96
A ___ CAN BE FOR A DICTATOR       5
A ___ MAY FLOAT                   5
GUIDELINES CAN BE FOR ___-S       4
A ___ MAY WEAR A CRASH HELMET     2
A ___ MAY CRASH                   12

Table 7.2: Development templates, paired with the number of distinct words observed to appear in the given slot.

Parameters were set based on manual experimentation using the templates seen in Table 7.2. Acceptable results were found when using a threshold of p = 70% and a step size of δ = 0.1. The cardinality bound m was set to 4 when |W| > 4, and otherwise m = 2. In addition, a small number of hard restrictions on the maximum level of generality were determined manually. Nodes corresponding to the word/sense pairs given in Table 7.1 were not allowed as abstraction candidates, nor were their ancestors; this was implemented by giving infinite length to any path that crossed one of these synsets.


7.3.2 Observations

The method assumes that if multiple words occurring in the same slot can be subsumed under the same abstract class, then this information should be used to bias the sense interpretation of those observed words, even when it means not picking the first sense. In general this bias is crucial to the approach, and tends to select correct senses of the words in an argument set W. But an example where this strategy errs was observed for the template a ___ may bark, which yielded the generalization that If something barks, then it is probably a person. This was because there were numerous textual occurrences of various types of people "barking" (speaking loudly and aggressively), and so the occurrences of dogs barking, which showed no type variability, were interpreted as involving the unusual sense of dog as a slur applied to certain people.

The template a ___ can be whiskered had observations including both face and head. This prompted experiments in allowing part holonym relations (e.g., a face is part of a head) as part of the definition of P, with the final decision being that such relations led to less intuitive generalizations rather than more, and thus these relation types were not included. The remaining relation types within WordNet were individually examined via inspection of randomly selected examples from the hierarchy. As with holonyms, it was decided that using any of these additional relation types would degrade performance.

A shortcoming was noted in WordNet regarding its ability to represent binary-valued attributes, based on the template a ___ can be pregnant. While it was possible to successfully generalize to female person, there were a number of observed words which unexpectedly fell outside that associated synset. For example, a queen and a duchess may each be a female aristocrat, a mum may be a female parent,8 and a fiancee has the exclusive interpretation of being synonymous with the gender-entailing bride-to-be; again, none of these terms is subsumed by female person.


Propositional Template              Number
A ___ MAY HAVE A BROTHER            28
A ? MAY ATTACK A ___                23
A FISH MAY HAVE A ___               38
A ___ CAN BE FAMOUS                 665
A ? MAY ENTERTAIN A ___             8
A ___ MAY HAVE A CURRENCY           18
A MALE MAY BUILD A ___              42
A ___ CAN BE FAST-GROWING           15
A PERSON MAY WRITE A ___            47
A ? MAY WRITE A ___                 99
A PERSON MAY TRY TO GET A ___       11
A ? MAY TRY TO GET A ___            17
A ___ MAY FALL DOWN                 5
A PERSON CAN BE HAPPY WITH A ___    36
A ? MAY OBSERVE A ___               38
A MESSAGE MAY UNDERGO A ___         14
A ? MAY WASH A ___                  5
A PERSON MAY PAINT A ___            8
A ___ MAY FLY TO A ?                9
A ? MAY FLY TO A ___                4
A ___ CAN BE NERVOUS                131

Table 7.3: Templates chosen for evaluation, paired with the number of distinct words observed to appear in the given slot.


If something is famous, it is probably a person1, an artifact1, or a communication2.
If ? writes something, it is probably a communication2.
If a person is happy with something, it is probably a communication2, a work1, a final result1, or a state of affairs1.
If a fish has something, it is probably a cognition1, a torso1, an interior2, or a state2.
If something is fast growing, it is probably a group1 or a business3.
If a message undergoes something, it is probably a message2, a transmission2, a happening1, or a creation1.
If a male builds something, it is probably a structure1, a business3, or a group1.

Table 7.4: Examples, both good and bad, of resultant statements able to be made post-derivation. I have manually selected one word from each derived synset, with subscripts referring to sense number. Types are given in order of support, and thus the first are examples of "Primary" in Table 7.5.

7.4 Experiments

From the entire set of BNC-derived Knext propositional templates, evaluations were performed on a set of 21 manually selected examples, together representing the sorts of knowledge for which I am most interested in deriving strengthened argument type restrictions. All modification of the system ceased prior to the selection of these templates,9 which was performed with no knowledge of the underlying words observed for any particular slot. Further, some of the templates were purposefully chosen as potentially problematic, such as a ? may observe a ___, or a person may paint a ___. Without additional context, templates such as these were expected to allow for exceptionally broad sorts of arguments. For these 21 templates, 65 types were derived, giving an average of 3.1 types per slot, and allowing for statements such as those seen in Table 7.4.

8 Serving as a good example of distributional preferencing, the primary sense of mum is as a flower.
9 Len Schubert and I each chose half the templates, independently.


                     ∪j                      ∩j
Type     Method      Prec   Recall  F.5     Prec   Recall  F.5
All      derived     80.2   39.2    66.4    61.5   47.5    58.1
         first       81.5   28.5    59.4    63.1   34.7    54.2
         all         59.2   100.0   64.5    37.6   100.0   42.9
Primary  derived     90.0   50.0    77.6    73.3   71.0    72.8
         first       85.7   33.3    65.2    66.7   45.2    60.9
         all         69.2   100.0   73.8    39.7   100.0   45.2

Table 7.5: Precision, Recall and F-score (β = 0.5) for coarse-grained WSD labels using the methods: derived from corpus data, first-sense heuristic, and all-senses heuristic. Results are calculated against both the union (∪j) and intersection (∩j) of manual judgements, for all derived argument types (All) as well as for Primary derived types exclusively.

One way to measure the quality of an argument abstraction is to go back to the underlying observed words and evaluate the resultant sense(s) implied by the chosen abstraction, noting that the majority of Knext propositions select for senses that are more coarse-grained than WordNet synsets. That is, coarse-grained word sense disambiguation (WSD) may serve as a proxy measure for the quality of the argument abstraction.10 This evaluation was performed using the first-sense and all-senses heuristics as comparisons. The first-sense heuristic can be thought of as striving for maximal specificity at the risk of precluding some admissible senses (reduced recall), while the all-senses heuristic insists on including all admissible senses (perfect recall) at the risk of including inadmissible ones. Table 7.5 gives the results of two judges evaluating 314 ⟨word, sense⟩ pairs across the 21 selected templates. These sense pairs correspond to picking one word at random for each abstracted type selected for each template slot. Judges were presented with a sampled word, the originating template, and the glosses for each possible word sense (see Figure 7.2).

10 Allowing multiple fine-grained senses to be judged as appropriate in a given context goes back at least to Sussna (1993); it is discussed more recently by, e.g., Navigli (2006).


A ___ MAY HAVE A BROTHER

1 WOMAN: an adult female person (as opposed to a man); "the woman kept house while the man hunted"
2 WOMAN: a female person who plays a significant role (wife or mistress or girlfriend) in the life of a particular man; "he was faithful to his woman"
3 WOMAN: a human female employed to do housework; "the char will clean the carpet"; "I have a woman who comes in four hours a day while I write"
*4 WOMAN: women as a class; "it's an insult to American womanhood"; "woman is the glory of creation"; "the fair sex gathered on the veranda"

Figure 7.2: Example of a context and senses provided for evaluation, with the fourth (starred) sense being judged as inappropriate.

Judges did not know ahead of time the subset of senses selected by the system (as entailed by the derived type abstraction). Taking the judges' annotations as the gold standard, I report precision, recall and F-score with β = 0.5 (favoring precision over recall, owing to a preference for reliable knowledge over more of it). In all cases the method gives precision results comparable or superior to the first-sense heuristic, while always giving higher recall. In particular, for the case of Primary types, corresponding to the derived type that accounted for the largest number of observations for the given argument slot, the method shows strong performance across the board, suggesting that the derived abstractions are general enough to pick up multiple acceptable senses for observed words, but not so general as to allow unrelated senses.
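For reference, the standard F-score definition used here is

F_β = (1 + β²) · P · R / (β² · P + R),

which with β = 0.5 weights precision more heavily than recall. As a worked check against the first row of Table 7.5 (derived, All, ∪j): 1.25 × 80.2 × 39.2 / (0.25 × 80.2 + 39.2) ≈ 66.3, in line with the reported 66.4 (the small difference is rounding in P and R).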


THE STATEMENT ABOVE IS A REASONABLY CLEAR, ENTIRELY PLAUSIBLE GENERAL CLAIM AND SEEMS NEITHER TOO SPECIFIC NOR TOO GENERAL OR VAGUE TO BE USEFUL:
1. I agree.
2. I lean towards agreement.
3. I'm not sure.
4. I lean towards disagreement.
5. I disagree.

Figure 7.3: Instructions for evaluating Knext propositions.

Finally, a test was designed to help determine whether the distinction between admissible and inadmissible senses entailed by the type abstractions was in accord with human judgement. To this end, I automatically chose for each template the observed word that had the greatest number of senses not dominated by a derived type restriction. For each of these alternative (non-dominated) senses, I selected the ancestor lying at the same distance towards the root from the given sense as the average distance from the dominated senses to the derived type restriction. In cases where going this far from an alternative sense towards the root would reach a path passing through the derived type and one of its subsumed senses, the distance was cut back until this was no longer the case. These alternative senses, guaranteed not to be dominated by derived type restrictions, were then presented along with the derived type and the original template to two judges, who were given the same instructions as used in the earlier reported experiments (Chapters 4 and 5), repeated here in Figure 7.3. Results for this evaluation are found in Table 7.6, where we see that the automatically derived type restrictions are strongly favored over alternative abstracted types that were possible given the word. Achieving even stronger rejection of alternative types would be difficult, since Knext templates often provide insufficient context for full sense disambiguation of all their constituents, and judges were allowed to base their assessments on any interpretation of the verbalization that they could reasonably come up with.


              Judge 1   Judge 2   Corr.
derived       1.76      2.10      0.60
alternative   3.63      3.54      0.58

Table 7.6: Average assessed quality for derived and alternative synsets, paired with Pearson correlation values.

7.5 Related Work

There is a wealth of existing research focused on learning probabilistic models of selectional restrictions on syntactic arguments. Resnik (1993a) used a measure he referred to as selectional preference strength, based on the KL-divergence between the probability of a class and that of the class given a predicate, with variants explored by Ribas (1995). Li and Abe (1998) used a tree cut model over WordNet, based on the principle of Minimum Description Length (MDL). McCarthy has performed extensive work in the areas of selectional preference and WSD, e.g., McCarthy (1997, 2001). Calling the generalization problem a case of engineering in the face of sparse data, Clark and Weir (2002) examined a number of previous methods, one conclusion being that the approach of Li and Abe appears to over-generalize. Cao et al. (2008) gave a distributional method for deriving semantic restrictions for FrameNet frames, with the aim of building an Italian FrameNet. While the goals are related, their work can be summarized as taking a pre-existing gold standard and extending it via distributional similarity measures based on shallow contexts (in this case, n-gram contexts up to length 5). Here, in contrast, I have presented results on strengthening type restrictions on arbitrary predicate argument structures derived directly from text. In describing Alice, a system for lifelong learning, Banko and Etzioni (2007) gave a summary of an independently developed proposition abstraction algorithm that is in some ways similar to DeriveTypes.


Beyond differences in node scoring and their use of the first-sense heuristic, the approach taken here differs in that it makes no use of relative term frequency, nor of contextual information outside a particular propositional template.11 Further, while this work is concerned with general knowledge acquired over diverse texts, Alice was built as an agent meant for constructing domain-specific theories, evaluated on a 2.5-million-page collection of Web documents pertaining specifically to nutrition. Minimizing word sense ambiguity by focusing on a specific domain was later seen in the work of Liakata and Pulman (2008), who performed hierarchical clustering using output from their Knext-like system first described in Liakata and Pulman (2002). Terminal nodes of the resultant structure were used as the basis for inferring semantic type restrictions, reminiscent of the use of CBC clusters (Pantel and Lin, 2002) by Pantel et al. (2007) for typing the arguments of paraphrase rules. Assigning pre-compiled instances to their first-sense reading in WordNet, Paşca (2008) then generalized class attributes extracted for these terms, using Google search engine query logs as a resource (as was described in Chapter 3). Katrenko and Adriaans (2008) explored a constrained version of the task considered here. Using manually annotated semantic relation data from SemEval-2007, pre-tagged with correct argument senses, those authors chose the least common subsumer for each argument of each relation considered. The approach here keeps with the intuition of preferring specific over general concepts in WordNet, but allows for the handling of automatically discovered relations, whose arguments are not pre-tagged for sense and tend to be more wide-ranging. I note that the least common subsumer for many of the predicate arguments would in most cases be far too abstract.

11 Banko and Etzioni abstracted over subsets of pre-clustered terms, built using corpus-wide distributional frequencies.

7.6 Summary

As the volume of automatically acquired knowledge grows, it becomes more feasible to abstract from existential statements to stronger, more general claims about what usually obtains in the real world. Using a method motivated by those used in deriving selectional preferences for verb arguments, I have shown results in deriving semantic type restrictions for arbitrary predicate argument positions, with no prior knowledge of sense information, and with no training data other than a handful of examples used to tune a few simple parameters. Future work may include a return to the MDL approach of Li and Abe (1998), but using a frequency model that "corrects" for the biases in texts relative to world knowledge – for example, correcting for the preponderance of people as subjects of textual assertions, even for verbs like bark, glow, or fall, which we know to be applicable to numerous non-human entities.


8 Learning Soft Classes

Many previous efforts in generalizing knowledge extracted from text (e.g., Suchanek et al. (2007), Banko and Etzioni (2007), Paşca (2008)) have relied on the use of manually created word sense hierarchies, such as WordNet. Unfortunately, as these hierarchies are constructed based on the intuitions of lexicographers or knowledge engineers, rather than with respect to the underlying distributional frequencies in a given corpus, the resultant knowledge collections may be less applicable to language processing tasks. Beyond the trouble in sense selection, the basic underlying hierarchical structure of resources like WordNet can be a source of error. Recall the example given in Chapter 7, regarding the (lack of) ancestry link between female person and female aristocrat. It was this sort of case that was meant by Hobbs and Navarretta (1993), in discussing the KA pipeline (emphasis my own):

    The top-down phase will of course be only as sophisticated as the representation scheme that is used. There are several varieties of representation schemes and kinds of knowledge that have been used. Most systems have a sort hierarchy in which concepts are placed. Ideally, this hierarchy should emerge from a study of the data. One has the feeling, however, that they are often pre-determined—pre-empirical, so to speak—and that they consequently force on the knowledge enterer choices that are inappropriate to the domain.

In this chapter I provide results for the application of the


Latent Dirichlet Allocation (LDA) topic model framework of Blei et al. (2003) in order to automatically derive a set of classes based on underlying semantic data. These classes are "soft" in the sense that they are represented as multinomial distributions over instances, rather than as sets with hard membership. As will be explained, these classes are constructed based on the semantic contexts in which they appear, and can therefore be viewed as the results of probabilistic knowledge abstraction, or generalization. As compared to work such as that of Cimiano et al. (2005) or Liakata and Pulman (2008), terms here are not restricted to a single underlying word sense; instead, "soft" sense allocation is determined by the data.

8.1 Generalizing Knowledge

8.1.1 Background Knowledge

The experiments are based on the data from the previous chapter, consisting of a large collection of general knowledge extracted from the British National Corpus (BNC) by the Knext system. The problem specification starts similarly: given a collection of individual propositions, e.g., A Male May Build A House, automatically extracted from some corpus, construct conditional statements that can be viewed as stronger claims about the world, e.g., If A Male Builds Something, Then It Is Probably A Structure, A Business, Or A Group. Here this is revised to: given such a collection, construct conditional probability distributions that reflect the likelihood of a proposition being expressed with a particular argument, in text similar to that of the original corpus. For example, if we were to sample values for X, conditioned on a propositional template such as, A Male May Build X, we might see examples such as: A House, Some Steps, An Animal, A House, A Lyricism, A Church, Kardamili, Services, A Cage, A Camp, ... .1 1

1 These are the actual first 10 arguments randomly sampled from a 100-topic model during development.


8.1.2 Model Description

Let each propositional template or contextual relation be indexed by r ∈ R = {1, ..., M}, limited in this work to having a single free argument a ∈ A, where A is finite. Assume some set of observations, taking the form of pairs ranging over R × A. The list of non-unique arguments seen occurring with relation r is then written as a_r = (a_r1, ..., a_rN_r) ∈ A^(N_r). For example, the indices in a pair ⟨r, a⟩ might correspond to ⟨A Male May Build, A House⟩, while the indices in an argument list a_r might correspond to (A House, A House, A House, Some Steps, ...). We are concerned here with Pr(a | r): the probability of an argument, a, given a relation, r. Let

c_r(a) = Σ_{i=1..N_r} δ(a = a_ri)

be the number of times argument a was observed with r. The maximum likelihood estimate (MLE) is then

P̂r(a | r) = c_r(a) / N_r.

Assume the observation set is sparse, such that the resultant MLE may incorrectly assign zero mass to events that have low (but non-zero) probability. Further assume that the distributions associated with distinct relations are not independent. For example, I expect the context A Firefighter May Eat X to have an argument pattern similar to A Secret Service Agent May Eat X. To capture this intuition I introduce a set of hidden topics, which here represent semantic classes as probability distributions over unique arguments. Under this model, we imagine a given argument is generated by first selecting some topic based on the relation, and then selecting an argument based on the topic. Where z ∈ Z = {1, ..., T} is a set of topics, let φ_z(a) = Pr(a | z) be the probability of an argument given a topic, and θ_r(z) = Pr(z | r) be the probability of a topic given a relation. Both θ_r and φ_z represent multinomial distributions, whose parameters we will estimate based on training data.
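For concreteness, the counting and MLE just defined amount to the following Python sketch (the observation pairs here are invented for illustration):

    from collections import Counter, defaultdict

    # counts[r][a] corresponds to c_r(a) above.
    counts = defaultdict(Counter)
    observations = [("A Male May Build", "A House"),
                    ("A Male May Build", "A House"),
                    ("A Male May Build", "A Church")]
    for r, a in observations:
        counts[r][a] += 1

    def p_mle(a, r):
        # Maximum likelihood estimate: c_r(a) / N_r.
        return counts[r][a] / sum(counts[r].values())

    # p_mle("A House", "A Male May Build") == 2/3; any unseen argument
    # gets probability 0, motivating the smoothing of the topic model.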


Figure 8.1: The smoothed LDA model of Blei et al. (2003), in plate notation: (non)shaded circles represent (un)observed variables, arrows represent dependence, and a box with some term x in the lower right corner represents a process repeated x times. (The diagram relates α, θ_r, z, a, β, and φ_z, with plates of size N_r, M, and T.)

The revised formula for the probability of a given r becomes

Pr(a | r) = Σ_{z∈Z} φ_z(a) θ_r(z) = Σ_{z∈Z} Pr(a | z) Pr(z | r).

I use here the Latent Dirichlet Allocation (LDA) framework2 introduced by Blei et al. (2003), a generative model for describing the distribution of elements within, e.g., a document collection. Using the terminology of this model, documents are represented as a weighted mixture of underlying topics, with each topic representing a multinomial distribution over the vocabulary (see Figure 8.1). This model assumes observations are generated by a process in which each document’s topic distribution θr is first sampled from an underlying symmetric3 Dirichlet distribution with parameter α and then each word of the document is generated conditioned on θr and the multinomial distributions represented by each topic. Those topic distributions are themselves taken as draws from a symmetric Dirichlet distribution with parameter β: 2

θ ∼ Dirichlet(α)
z | θ_r ∼ Multinomial(θ_r)
φ ∼ Dirichlet(β)
a | z, φ ∼ Multinomial(φ_z)

With respect to topic model terminology, R may be considered a set of indices over documents, each associated with a list of observed words. In this case N_r is the size of document r.

2 Specifically, I use the smoothed LDA model of Blei et al. (2003), where the topic multinomials, φ, are also taken to be draws from a Dirichlet prior, in addition to the document multinomials, θ.
3 A Dirichlet distribution over an n-dimensional multinomial is defined by n separate parameters, α_1 to α_n. A symmetric Dirichlet distribution constrains all n parameters to equal a single value, α.

8.1.3 Parameter Inference

Based on observations, we must estimate T + M distinct multinomial distributions represented by the probability functions φ1 , ..., φT and θ1 , ..., θM . Parameter inference was carried out using the Gibbs sampling procedure described by Steyvers and Griffiths (2007), the implementation of which I spell out in Figure 8.2.

8.2 Experiments

8.2.1 Data

The dataset was constructed from the propositional templates described in the previous chapter. I considered just those templates with a frequency of 10 or more, which gave approximately 140,000 unique templates (relations), over a vocabulary of roughly 500,000 unique arguments. I held out 5% of this collection, randomly sampled, to use for evaluation, with the rest being used to train models with varying numbers of underlying topics.

8.2.2 Building Models

Following suggestions by Steyvers and Griffiths (2007), I fixed α = 50/T and β = 0.01, then constructed models across a variety of fixed values for T.


Given:
  R : set of relations
  a_1, ..., a_M : arguments seen for each relation
  C^AZ : an |A| × T matrix
  C^RZ : an M × T matrix
  C^Z : a vector mapping topic ids to argument counts
  C^R : a vector mapping relation ids to argument counts, i.e., |a_r|
  c(r, a, z) : a map from a ⟨relation, argument, topic⟩ triple to a count

Parameters:
  T : number of topics
  α, β : Dirichlet parameters
  n : number of iterations

Initialize:
  Initialize c(·, ·, ·) and the cells of C^Z, C^AZ, and C^RZ to 0
  For each r ∈ R:
    For each a ∈ a_r:
      Draw a topic id z from (1, ..., T), uniformly at random
      Increment c(r, a, z), C^Z_z, C^AZ_az, and C^RZ_rz

Inline-Function Sample:
  Let s be 0
  Let V be a vector of size T
  For t from 1 to T:
    Let V_t be ((C^AZ_at + β) / (C^Z_t + |A|β)) × ((C^RZ_rt + α) / (C^R_r + Tα))
    Increment s by V_t
  Normalize V by s
  Draw a topic id z′ from (1, ..., T), at random according to the multinomial V

Algorithm:
  For e from 1 to n:
    For each r ∈ R:
      For each unique a ∈ a_r:
        For each z such that c(r, a, z) > 0:
          Let x be c(r, a, z)
          For i from 1 to x:
            Decrement c(r, a, z), C^Z_z, C^R_r, C^AZ_az, and C^RZ_rz
            Sample a new topic id z′
            Increment c(r, a, z′), C^Z_z′, C^R_r, C^AZ_az′, and C^RZ_rz′

Figure 8.2: Gibbs sampling procedure used for parameter inference, where snapshots of the count matrices C^AZ and C^RZ may be taken at even intervals following burn-in, then averaged and normalized to provide estimates of φ_1, ..., φ_T and θ_1, ..., θ_M.
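A compact Python rendering of Figure 8.2 may make the bookkeeping clearer. This is a sketch, not the original C++ implementation; relation and argument ids are assumed to be dense integers.

    import numpy as np

    def gibbs_lda(pairs, A, T, alpha, beta, n_iter, seed=0):
        # Collapsed Gibbs sampler in the spirit of Figure 8.2. `pairs` is
        # a list of (r, a) observations, with relation ids in [0, M) and
        # argument ids in [0, A).
        rng = np.random.default_rng(seed)
        M = max(r for r, _ in pairs) + 1
        C_az = np.zeros((A, T)); C_rz = np.zeros((M, T))
        C_z = np.zeros(T); C_r = np.zeros(M)
        z_of = rng.integers(T, size=len(pairs))    # random initialization
        for i, (r, a) in enumerate(pairs):
            z = z_of[i]
            C_az[a, z] += 1; C_rz[r, z] += 1; C_z[z] += 1; C_r[r] += 1
        for _ in range(n_iter):
            for i, (r, a) in enumerate(pairs):
                z = z_of[i]                        # remove current assignment
                C_az[a, z] -= 1; C_rz[r, z] -= 1; C_z[z] -= 1; C_r[r] -= 1
                # The sampling distribution of the Inline-Function Sample:
                V = ((C_az[a] + beta) / (C_z + A * beta)) * \
                    ((C_rz[r] + alpha) / (C_r[r] + T * alpha))
                z = rng.choice(T, p=V / V.sum())   # draw a new topic id
                z_of[i] = z
                C_az[a, z] += 1; C_rz[r, z] += 1; C_z[z] += 1; C_r[r] += 1
        phi = (C_az + beta) / (C_z + A * beta)     # estimate of Pr(a | z)
        theta = (C_rz + alpha) / (C_r[:, None] + T * alpha)  # Pr(z | r)
        return phi, theta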

Figure 8.3: Cross entropy of topic models with 3, 10 and 200 topics, evaluated on held-out data, presented as a function of iterations of Gibbs sampling.

Topic 0: ⟨a.d end.n⟩ (An End); ⟨a.d part.n⟩ (A Part); ⟨a.d problem.n⟩ (A Problem)
Topic 1: (k ⟨plur person.n⟩) (People); (k ⟨plur child.n⟩) (Children); (k ⟨plur woman.n⟩) (Women)
Topic 6: ⟨sm.q ⟨plur eye.n⟩⟩ (Eyes); ⟨a.d head.n⟩ (A Head); ⟨a.d life.n⟩ (A Life)

Table 8.1: From the model with 10 underlying topics, the 3 most probable arguments from topics 0, 1 and 6, presented both in the underlying pseudo logical form representation used by Knext and in the associated English verbalization.

As topic distributions showed little variation after burn-in, for the exploratory work reported here I took θ and φ to be the final configurations of the model after the last iteration of the chain.4 Figure 8.3 shows the cross entropy of three models on the held-out data, as a function of iteration. As seen, more underlying topics allow for better predictions of argument patterns in unseen text. Increasing the number of topics significantly beyond 200 was computationally problematic for this dataset, under the given implementation.5

4 Post-hoc analysis of results using estimation based on parameter samples taken from evenly spaced intervals after burn-in showed negligible difference.
5 The algorithm in Figure 8.2 was implemented in C++, where available physical memory (~2GB on the machines used for experimentation) limited the size of C^AZ and C^RZ. Sparse matrices or hash-maps would allow for a larger number of topics, at the cost of less efficient access.


Topic 27: An Interest Can Be In X; X May Undergo A Teaching; A Book Can Be On X
Topic 62: X May Have A Region; X May Have An Area; X May Have A Coast
Topic 108: A Person May Hear X; X May Ring; X May Have A Sound

Table 8.2: From the model with 200 underlying topics, the 3 most probable templates when conditioned on topics 27, 62, and 108.

Template                   3 Topics            10 Topics       200 Topics
X May Be Cooked            Implications (5)    A Web (5)       Fish (1)
A Person May Pay X         A Display (5)       A Worker (1)    A Fare (1)
A Diversity Can Be In X    Opportunities (1)   A Century (3)   Volume (1)

Table 8.3: Examples of propositional templates and arguments drawn randomly for evaluation, along with their judgements from 1 to 5 (lower is better).

For the resultant models, examples of the most probable arguments given a topic can be seen in Table 8.1. Table 8.2 gives examples of the most probable templates to be observed in the training data, once the topic is fixed.

8.2.3 Evaluating Models

From the training corpus, 100 templates were sampled based on their frequency. Each template was evaluated according to the 5-point scale described in earlier chapters. Recall that a score of 1 corresponds to a template that can be combined with some argument in order to form a "Reasonably clear, entirely plausible general claim", while a score of 5 corresponds to bad propositions. From the sample, templates such as X May Ask were judged poorly, as they appear to be missing a central argument (e.g., ... A Question, or ... A Person), while a small number were given low scores because of noise in Knext's processing or the underlying corpus (e.g., X May Hm). Examples of high quality templates can be seen as part of Table 8.3. Average assessment of the first 100 templates was 1.99, suggesting that for the majority of the data there does exist at least one argument for which the fully instantiated proposition would be a reasonable claim about the world. The distribution of assessments may be seen in Figure 8.4.

Figure 8.4: From a sample of 100 templates, the number of those assessed at: 1 (64 of 100), 2 (7), 3 (7), 4 (10) and 5 (12).

# Topics          3      10     200
Avg. Assessment   2.39   2.09   1.73

Table 8.4: Starting with 100 templates assessed at 2 or better, results of drawing 100 arguments (1 per template) from models built with 3, 10 and 200 topics.

A further 38 propositions were sampled until we had 100 templates judged as 2 or better (this led to a final mean of 1.94 over the extended sample of 138 templates). For each of these high quality templates, one argument was drawn from each of three models. These arguments were then used to instantiate complete propositions, presented for evaluation without the information of which argument came from which model. Table 8.3 gives three such ⟨template, argument⟩ pairs, along with the assessed quality. In Table 8.4 we see that just as cross entropy decreased with the addition of more topics, the quality of sampled arguments improved as well.

8.2.4 Topic Pruning per Relation

The previous chapter gave an algorithm that assumed the missing argument for a given propositional template was type-restricted by at most a small number of distinct categories. For example, usually if there is some X such that A Person May Try To Find X, then according to evidence gathered from the BNC, we expect X to usually be either A Person or A Location. If we take topics to represent underlying semantic categories, then in order to enforce a similar restriction here, we need some way to constrain the model such that for each relation we keep track of at most k relevant topics. This could be achieved simply by post-processing each θ̂_r, setting to 0 all but the top k topics and re-normalizing, but this would be overly severe: the assumption is that an argument is usually of one of a few categories, while the proposed post-processing technique would equate to a claim that it is always of this restricted set. Here I investigated the effect of modifying the posterior probability such that Pr(a | r) is determined by using θ_r for just the relative top k, with the leftover mass distributed amongst the remaining categories, weighted by the respective category probability. Let Z_r^k be those z ∈ Z such that there are no more than k − 1 distinct z′ ∈ Z where θ_r(z′) > θ_r(z). Then I define the k-constrained probability of a given r as:

Pr_k(a | r) = Σ_{z∈Z_r^k} φ_z(a) θ_r(z) + λ_1 Σ_{z′∉Z_r^k} φ_{z′}(a) Pr(z′)

            = Σ_{z∈Z_r^k} φ_z(a) θ_r(z) + (1 − Σ_{z∈Z_r^k} θ_r(z)) · ( Σ_{z′∉Z_r^k} φ_{z′}(a) Pr(z′) ) / ( Σ_{z′∉Z_r^k} Pr(z′) ).

For a model built with 200 underlying topics, Figure 8.5 gives cross entropy results on held-out test data when considering various levels of k. As seen, the majority of the predictive power for which arguments to expect for a given relation comes from the first two or three topics. If these topics are taken as (rough) semantic categories, then this agrees with the soft restriction employed in the previous chapter (where the parameter m in that algorithm is similar in function to k here). As an aside, note the relationship between that soft restriction, the k-constrained conditional probabilities here, and the use of an abnormality predicate when performing circumscription (McCarthy, 1980, 1986). McCarthy (1986) wrote:


Figure 8.5: On a 200 topic model, for k = 0, 1, ..., 10, cross entropy scores on held out data using k-constrained pruning.

Nonmonotonic reasoning has several uses. [... Such as] a very streamlined expression of probabilistic information when numerical probabilities, especially conditional probabilities, are unobtainable.

As said at points throughout the dissertation: between problems of sparse data and worries of reporting bias, I do not assume that precise conditional probabilities of human beliefs regarding the world are obtainable solely through existing extraction methods. Soft restriction, k-constrained probabilities, and the abnormality predicate are all aimed at getting the "main gist" correct, succinctly, while isolating the remainder as "other", or abnormal.
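As a concrete rendering of the k-constrained computation defined earlier, the following sketch assumes estimates phi, theta, and a topic marginal pz (illustrative variable names) as produced by a sampler like the one given above, with k < T:

    import numpy as np

    def p_k(a, r, phi, theta, pz, k):
        # k-constrained Pr_k(a | r): exact mixture over the top-k topics
        # for relation r, with the leftover probability mass spread over
        # the remaining topics in proportion to their marginals pz.
        T = theta.shape[1]
        top = np.argsort(theta[r])[::-1][:k]        # Z_r^k
        rest = np.setdiff1d(np.arange(T), top)
        head = float(np.sum(phi[a, top] * theta[r, top]))
        leftover = 1.0 - float(np.sum(theta[r, top]))
        tail = leftover * float(np.sum(phi[a, rest] * pz[rest])
                                / np.sum(pz[rest]))
        return head + tail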

8.3 Related Work

Cimiano et al. (2005) automatically constructed ontologies based on extracted knowledge using the principles of Formal Concept Analysis (FCA), while Liakata and Pulman (2008) performed hierarchical clustering to derive a set of semantic classes as leaf nodes. In each case the respective authors derived a single word sense per term, which was justified by focusing on domain-specific texts. As described earlier in the thesis, Koo et al. (2008) presented results showing that a word hierarchy built in an unsupervised manner could be used to improve accuracy in syntactic parsing: this is evidence that (at least shallow) semantic information is useful in NLP tasks.


Although evaluated on syntactic constructions rather than logical forms, the work of Pantel and Lin (2002) on the Clustering By Committee (CBC) algorithm is perhaps the most similar in motivation to what I have presented here. A comparative strength of the model given here is that it is fully generative, which should allow for principled integration into existing text processing frameworks (in particular, as a semantic component in language modeling). Havasi (2009) explored the use of Singular Value Decomposition (SVD) techniques for clustering terms based on the semantic contexts from the Open Mind Common Sense (OMCS) project. As discussed in Chapter 2, the motivations behind that project are similar to those behind Knext, despite differences in acquisition methodology. Recently, Brody and Lapata (2009) independently developed a semantic topic model framework for the task of word sense disambiguation. The authors built distinct models for each word, with individual topics standing for underlying word senses.

8.4

Summary

For applied tasks in language processing, there is a desire for probabilistic ontologies whose structure derives from the same corpora from which the underlying semantic patterns were extracted. Here I have confirmed that Knext output can be used in the application of the LDA topic model framework to derive “soft” semantic classes, a step on the way towards building such ontologies. By treating the knowledge generalization problem described in the previous chapter as one of constructing conditional probability distributions, I have provided a data-driven method for the automatic evaluation of Knext results. If one may assume that the average quality of an underlying set of extracted knowledge is high, then measuring the cross entropy of constructed models against held-out data (Knext assertions) serves as a powerful time-saving device as compared to human evaluation.6

6 Keeping in mind the potential negative impact of reporting bias on this approach.


9

Conclusion

9.1

Summary

I began this dissertation with a summary of opinions from the greater cognitive science community on the fundamental need for background knowledge in understanding intelligence. In particular, the focus of this thesis has been on how we might gather a collection of background knowledge in order to enable the construction of synthetic intelligent systems. I divided the methodologies for knowledge acquisition into three areas: knowledge engineering, crowdsourcing, and automated text-based extraction. Within the text-based approach, systems may target either explicitly provided information, such as is found in reference materials, or knowledge implicit in natural discourse. The focus here has been on the latter strategy. Implicit knowledge acquisition from text may be performed by generalizing from existential statements, as well as through recognition of knowledge stated indirectly, such as in presuppositional contexts. I have shown the feasibility of constructing systems that perform this sort of extraction, both through structured analysis of natural sentences and through the use of shallower techniques applied to search engine query logs. This establishes implicit knowledge acquisition as a valid technique for collecting common sense. I compared these systems to each other, as well as to related work in the community (e.g., TextRunner), with an eye towards the relative level of depth in targeted representation.


As seen in my discussion of natural language generic sentences, and in the description of the representational formalism employed here, an underlying focus of the work presented in this dissertation has been on acquiring well-formed statements represented in a symbolic form conducive to general inference. When abstracting from existential statements, extraction systems require access to a resource that maps instances (linguistic semantic objects) to their concept class (linguistic semantic kinds). I established a method for acquiring such gazetteers automatically, on a large scale. This directly enabled the extraction of knowledge pertaining to thousands (as compared to dozens) of semantic categories. Progressing beyond the use of flat gazetteers, I developed an approach for using a lexical-semantic ontology (i.e., WordNet) to find the appropriate level of conceptual generalization. As compared to prior work on determining verb-argument selectional preferences, my approach enables robust generalization over sparse contexts: an essential characteristic given the power-law distribution of textually acquired knowledge (i.e., most of it is sparse). Finally, I established a connection between general knowledge acquisition and the idea of building probabilistic, semantic language models.

9.2

Looking Forward

The core task of knowledge acquisition is to first discover basic semantic relations along with their arguments from text, and then to abstract beyond argument instances to more general statements about the conceptual classes we expect the predications to hold over. This was the focus of the dissertation. Future work on refining these basic relations needs to address the issues of determining proper quantifier strength, and of what, if anything, is being left implicit in a rule’s quantifier domain restrictor. Restated, the first problem involves determining whether a given rule, e.g., A dog may bark, should be strongly (Most dogs bark) or weakly (Some dogs bark) quantified, or something in between (e.g., Many). The second problem, which is closely related to the first, is to determine the proper contexts in which a given rule is applicable.


For example, while it is the case that Some countries have presidents, if we appropriately refine the domain restrictor, we end up with a stronger rule, Most democratic countries have presidents. As was pointed out, a focus on finding characteristic attributes for concept classes (with the use of a resource such as WordNet) may be seen as a limited version of addressing the problem of domain restriction in order to derive more strongly quantifiable statements. Since linguistic semanticists have yet to come up with a comprehensive way to formally interpret generic sentences (the natural analog of the knowledge I have focused on extracting automatically), we should not expect problems of quantifier strength and domain restriction to be easily solved. Indeed, it seems likely that once a machine has the same level of understanding or comprehension of generic-like background knowledge as we do, we will at that point have made very significant progress towards the general problem of creating true Artificial Intelligence. It should be kept in mind, however, that the logical forms considered in this thesis arise from extraction over non-generic text: there may be novel approaches available to researchers in automatic knowledge acquisition that are not applicable to the case of direct generic interpretation. For example, hybrid involvement of human and automated systems, such as the previously cited project by Hoffman et al. (2009), may allow for fleshing out details normally left implicit in a conventional generic sentence, perhaps by selective querying of human knowledge engineers. Looking beyond further refinement of basic relational knowledge, we can view the work presented in this dissertation as a step on the path towards the acquisition of more general conditional rules, meant to enable chained inferences dealing with the everyday world. Work such as Chambers and Jurafsky (2008) can also be viewed as movement in this direction. With knowledge of the form extracted by Knext, along with data-driven techniques such as those used by Chambers and Jurafsky, it may be possible to automatically build up the sorts of complex knowledge motivating Schubert’s Skolemized scripts. The challenge of reporting bias is an issue that researchers in computational psycholinguistics and knowledge acquisition need to address.


In particular, researchers need to be aware that the frequency of occurrence of particular types of events or relations in text may represent (sometimes drastic) distortions of real-world frequencies. Further, there may be a significant amount of general knowledge that is simply never alluded to in natural discourse. This further motivates investigation into hybrid acquisition systems. For example, repositories such as Cyc and OpenMind may contain certain categories of facts that are unlikely to ever be automatically gleaned from the New York Times;1 if we knew which categories these were, we could apportion different methodologies accordingly. Finally, the experimental results I have presented here were primarily based on manual evaluation of output samples from an automated system. Ideally, the results of an acquisition system would be evaluated through their use in an applied task. I have provided little in this dissertation in the way of how this might be done, beyond minor motivational background (such as the work by Koo et al. (2008), which relied on relatively shallow corpus-based knowledge to improve a syntactic parser). In the short term, work related to the previous chapter on semantic language modeling will be explored more broadly, and applied to tasks such as machine translation (MT) as an extension to current work in syntactic models for MT. Looking further ahead, new tasks may need to be developed to properly “stress test” the quality of automatically acquired knowledge. An example could be something in the area of story understanding, perhaps along the lines of the complex question answering seen in Schubert and Hwang (2000), pertaining to the story of Little Red Riding Hood, but without the narrow focus of such earlier work on a few short narrative segments.2

1 If so, this would validate a weakened version of the claim by Havasi et al. (2007), which I addressed earlier.

2 This would go beyond the so-called narrative cloze of Chambers and Jurafsky, which looks specifically at episodic slot-filling, and does not require chained inference.

9.3

Closing Remarks

In this dissertation I have shown the feasibility of acquiring common-sense knowledge from natural texts, represented in a logical form, with a semantics informed by, and aligned with, the natural language version of such knowledge (generic sentences). Further work is needed to improve these methods, and to extend the target goal to more complex forms of knowledge. Considerable challenges remain, including some, such as the general problem of generic interpretation, that still have no obvious solution after decades of investigation. Nevertheless, this work represents progress on one piece of the overall puzzle of Artificial Intelligence, one of the most important endeavors of modern science.


Bibliography

Steven Abney. Partial Parsing via Finite-State Cascades. In Proceedings of ESSLLI, 1996. David D. Ahn. The Role of Situations and Presuppositions in Restricting Adverbial Quantification. PhD thesis, University of Rochester, Department of Computer Science, Rochester, NY 14627-0226, 2004. Abdulrahman Almuhareb and Massimo Poesio. Attribute-based and value-based clustering: an evaluation. In Proceedings of EMNLP, 2004. Abdulrahman Almuhareb and Massimo Poesio. Finding concept attributes in the web using a parser. In Proceedings of Corpus Linguistics Conference, 2005. Michele Banko and Oren Etzioni. Strategies for Lifelong Knowledge Extraction from the Web. In Proceedings of K-CAP, 2007. Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. Open Information Extraction from the Web. In Proceedings of IJCAI, 2007. Rahul Bhagat. Learning Paraphrases from Text. PhD thesis, University of Southern California, April 2009. David Blei, Andrew Ng, and Michael Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 2003. BNC Consortium. The British National Corpus, version 2 (BNC World). Distributed by Oxford University Computing Services, 2001.


Thorsten Brants and Alex Franz. Web 1T 5-gram Version 1. Distributed by the Linguistic Data Consortium, 2006. Thorsten Brants. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of ANLP, 2000. Samuel Brody and Mirella Lapata. Bayesian Word Sense Induction. In Proceedings of EACL, 2009. Diego De Cao, Danilo Croce, Marco Pennacchiotti, and Roberto Basili. Combining Word Sense and Usage for Modeling Frame Semantics. In Proceedings of Semantics in Text Processing (STEP), 2008. Sharon A. Caraballo. Automatic construction of a hypernym-labeled noun hierarchy from text. In Proceedings of ACL, 1999. Gregory N. Carlson and Francis Jeffry Pelletier, editors. The Generic Book. University of Chicago Press, 1995. Gregory N. Carlson. A Unified Analysis of the English Bare Plural. Linguistics and Philosophy, 1(3):413–458, 1977. Gregory N. Carlson. References to Kinds in English. PhD thesis, University of Massachusetts, Amherst, 1977. Published 1980 by Garland Press, New York. Gregory N. Carlson. Truth conditions of generic sentences: Two contrasting views. In G. Carlson and F. J. Pelletier, editors, The Generic Book, pages 224–237. University of Chicago Press, 1995. Gregory N. Carlson. Patterns in the Semantics of Generic Sentences. In Jacqueline Gu´eron and Jacqueline Lecarme, editors, Time and Modality, volume 75 of Studies in Natural Language and Linguistic Theory, pages 17–38. Springer Netherlands, 2008. Nathanael Chambers and Dan Jurafsky. Unsupervised Learning of Narrative Event Chains. In Proceedings of ACL, 2008.


Eugene Charniak. A Maximum-Entropy-Inspired Parser. In Proceedings of NAACL, 2000. Timothy Chklovski and Patrick Pantel. Verbocean: Mining the web for fine-grained semantic verb relations. In Proceedings of EMNLP, Barcelona, Spain, 2004. Timothy Chklovski. LEARNER: A System for Acquiring Commonsense Knowledge by Analogy. In Proceedings of Second International Conference on Knowledge Capture (K-CAP 2003), 2003. Timothy Chklovski. Using Analogy to Acquire Commonsense Knowledge from Human Contributors. PhD thesis, MIT Artificial Intelligence Laboratory, February 2003. Philipp Cimiano, Andreas Hotho, and Steffen Staab. Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research, 2005. Stephen Clark and David Weir. An iterative approach to estimating frequencies over a semantic hierarchy. In Proceedings of EMNLP, 1999. Stephen Clark and David Weir. Class-based probability estimation using a semantic hierarchy. Computational Linguistics, 28(2), 2002. Peter Clark, Phil Harrison, and John Thompson. A Knowledge-Driven Approach to Text Meaning Processing. In Proceedings of the HLT-NAACL Workshop on Text Meaning, 2003. Christine Clark, Daniel Hodges, Jens Stephan, and Dan Moldovan. Moving QA Towards Reading Comprehension Using Context and Default Reasoning. In Proceedings of AAAI Workshop on Inference for Textual Question Answering, 2005. Michael Collins. Three generative, lexicalised models for statistical parsing. In Proceedings of ACL, 1997.


Gerald F. DeJong. An Overview of the FRUMP System. In Wendy G. Lehnert and Martin H. Ringle, editors, Strategies for Natural Language Processing, pages 149–176. Lawrence Erlbaum, Hillsdale, NJ, 1982. Doug Downey, Oren Etzioni, and Stephen Soderland. A Probabilistic Model of Redundancy in Information Extraction. In Proceedings of IJCAI, 2005. Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Web-scale Information Extraction in KnowItAll. In Proceedings of WWW, 2004. Christiane Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998. Noah S. Friedland and Paul G. Allen. The Halo Pilot: Towards a Digital Aristotle. http://projecthalo.com/content/docs/halopilot vulcan finalreport.pdf, 2003. Peter Geach. Reference and Generality: An Examination of Some Medieval and Modern Theories. Ithaca, New York: Cornell University Press, 1962. Daniel Gildea and Daniel Jurafsky. Automatic Labeling of Semantic Roles. Computational Linguistics, 28(3), 2002. Daniel Gildea and Martha Palmer. The Necessity of Syntactic Parsing for Predicate Argument Recognition. In Proceedings of ACL, 2002. Roxana Girju, Adriana Badulescu, and Dan Moldovan. Learning semantic constraints for the automatic discovery of part-whole relations. In Proceedings of HLT-NAACL, 2003. Roxana Girju. Text Mining for Semantic Relations. PhD thesis, The University of Texas at Dallas, 2002. Jonathan Gordon, Benjamin Van Durme, and Lenhart K. Schubert. Weblogs as a Source for Extracting General World Knowledge. In Proceedings of K-CAP, September 2009. Paul Grice. Logic and Conversation. In Speech Acts. Academic Press, 1975.


Rakesh Gupta and Mykel J. Kochenderfer. Common Sense Data Acquisition for Indoor Mobile Robots. In Proceedings of AAAI, 2004. Sanda M. Harabagiu, George A. Miller, and Dan I. Moldovan. WordNet 2 - A Morphologically and Semantically Enhanced Resource. In Proceedings of SIGLEX, 1999. Catherine Havasi, Robert Speer, and Jason Alonso. ConceptNet 3: a Flexible, Multilingual Semantic Network for Common Sense Knowledge. In Proceedings of RANLP, 2007. Catherine Havasi. Discovering Semantic Relations Using Singular Value Decomposition Based Techniques. PhD thesis, Brandeis University, June 2009. Marti Hearst. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of COLING, 1992. Irene Heim. The Semantics of Definite and Indefinite Noun Phrases. PhD thesis, University of Massachusetts Amherst, 1982. Jerry R. Hobbs and Costanza Navarretta. Methodology for knowledge acquisition (unpublished manuscript). http://www.isi.edu/hobbs/damage.text, 1993. Jerry R. Hobbs. World Knowledge And Word Meaning. Theoretical Issues In Natural Language Processing, 1987. Jerry R. Hobbs. Toward a Useful Concept of Causality for Lexical Semantics. Journal of Semantics, 22(2):181–209, 2005. Raphael Hoffman, Saleema Amershi, Kayur Patel, Fei Wu, James Fogarty, and Daniel S. Weld. Amplifying Community Content Creation with Mixed-Initiative Information Extraction. In Proceedings of CHI, 2009. Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. OntoNotes: The 90% Solution. In Proceedings of NAACL, 2006. Hans Kamp and Barbara Partee. Prototype theory and compositionality. Cognition, 57(2):129–191, 1995.


Hans Kamp. A Theory of Truth and Semantic Representation. In J.A.G. Groenendijk, T.M.V. Janssen, and M.B.J. Stokhof, editors, Formal Methods in the Study of Language, pages 277–322. 1981. Sophia Katrenko and Pieter Adriaans. Semantic Types of Some Generic Relation Arguments: Detection and Evaluation. In Proceedings of ACL, 2008. Terry Koo, Xavier Carreras, and Michael Collins. Simple semi-supervised dependency parsing. In Proceedings of ACL, 2008. Manfred Krifka, Francis Jeffry Pelletier, Gregory N. Carlson, Alice ter Meulen, Gennaro Chierchia, and Godehard Link. Genericity: An Introduction. In Gregory N. Carlson and Francis Jeffry Pelletier, editors, The Generic Book, pages 1–124. University of Chicago Press, 1995. H. Kucera and W. N. Francis. Computational Analysis of Present-Day American English. Brown University Press, Providence, RI, 1967. Douglas B. Lenat. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–48, 1995. David Lewis. Adverbs of Quantification. In Edward L. Keenan, editor, Formal Semantics of Natural Language, pages 3–15. Cambridge: Cambridge University Press, 1975. Hang Li and Naoki Abe. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2), 1998. Maria Liakata and Stephen Pulman. From Trees to Predicate Argument Structures. In Proceedings of COLING, 2002. Maria Liakata and Stephen Pulman. Automatic Fine-Grained Semantic Classification for Domain Adaptation. In Proceedings of Semantics in Text Processing (STEP), 2008. Dekang Lin and Patrick Pantel. DIRT - Discovery of Inference Rules from Text. In Proceedings of KDD, 2001.


Dekang Lin and Patrick Pantel. Concept discovery from text. In Proceedings of COLING, 2002. Dekang Lin. Automatic Retrieval and Clustering of Similar Words. In Proceedings of COLING-ACL, 1998. Dekang Lin. Automatic identification of non-compositional phrases. In Proceedings of ACL, 1999. Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. Using automatically acquired predominant senses for Word Sense Disambiguation. In Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, 2004. John McCarthy. Programs with common sense. In Proceedings of the Teddington Conference on the Mechanization of Thought Processes, London: Her Majesty’s Stationery Office, 1959. John McCarthy. Circumscription—a form of non-monotonic reasoning. Artificial Intelligence, 13:27–39, 1980. John McCarthy. Applications of circumscription to formalizing common sense knowledge. Artificial Intelligence, 26(3):89–116, 1986. John McCarthy. Notes on formalizing context. In Proceedings of IJCAI, 1993. Diana McCarthy. Estimation of a probability distribution over a hierarchical classification. In The Tenth White House Papers COGS - CSRP 440, 1997. Diana McCarthy. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex, 2001. George A. Miller and Florentina Hristea. WordNet Nouns: Classes and Instances. Computational Linguistics, 32(1):1–3, 2006.


Marvin Minsky. A Framework for Representing Knowledge. MIT-AI Laboratory Memo 306, June 1974. Teruko Mitamura, Eric H. Nyberg, and Jaime G. Carbonell. Automated Corpus Analysis and the Acquisition of Large, Multi-Lingual Knowledge Bases for MT. In 5th International Conference on Theoretical and Methodological Issues in Machine Translation, 1993. Fabrizio Morbini and Lenhart K. Schubert. Evaluation of Epilog: a Reasoner for Episodic Logic. In Proceedings of Commonsense 09, Toronto, Canada, 2009. Renate Musan. Temporal Interpretation and Information-Status of Noun Phrases. Linguistics and Philosophy, 22:621–661, 1999. Roberto Navigli. Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance. In Proceedings of COLING-ACL, 2006. Geoffrey Nunberg. Position paper on common-sense and formal semantics. Pages 129–133, Las Cruces, New Mexico, 1987. Marius Pa¸sca and Benjamin Van Durme. What You Seek is What You Get: Extraction of Class Attributes from Query Logs. In Proceedings of IJCAI, 2007. Marius Pa¸sca and Benjamin Van Durme. Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs. In Proceedings of ACL, 2008. Marius Pa¸sca, Benjamin Van Durme, and Nikesh Garera. The Role of Documents vs. Queries in Extracting Class Attributes from Text. In Proceedings of CIKM, 2007. Marius Pa¸sca. Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds. In Proceedings of the 16th World Wide Web Conference (WWW-07), 2007. Marius Pa¸sca. Turning Web Text and Search Queries into Factual Knowledge: Hierarchical Class Attribute Extraction. In Proceedings of AAAI, 2008.


Patrick Pantel and Dekang Lin. Discovering Word Senses from Text. In Proceedings of KDD, 2002. Patrick Pantel and Deepak Ravichandran. Automatically labeling semantic classes. In Proceedings of NAACL, 2004. Patrick Pantel, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard Hovy. ISP: Learning Inferential Selectional Preferences. In Proceedings of NAACL, 2007. Barbara Partee and Mats Rooth. Generalized conjunction and type ambiguity. In R. Bäuerle, C. Schwarze, and A. von Stechow, editors, Meaning, Use and Interpretation of Language, pages 361–383. Walter de Gruyter, Berlin, 1983. William Pentney, Ana-Maria Popescu, Shiaokai Wang, Henry Kautz, and Matthai Philipose. Sensor-based understanding of daily life via large-scale use of common sense. In Proceedings of AAAI, 2006. Simone Paolo Ponzetto and Michael Strube. Deriving a Large Scale Taxonomy from Wikipedia. In Proceedings of AAAI, 2007. Hoifung Poon and Pedro Domingos. Unsupervised Semantic Parsing. In Proceedings of EMNLP, 2009. Ellen F. Prince. On the function of existential presupposition in discourse. In D. Farkas, W. Jacobsen, and K. Todrys, editors, Papers from the Fourteenth Regional Meeting of the Chicago Linguistic Society, pages 362–376. Department of Linguistics, University of Chicago, 1978. V. Punyakanok, D. Roth, and W. Yih. The Importance of Syntactic Parsing and Inference in Semantic Role Labeling. Computational Linguistics, 34(2), 2008. Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge Belongie. Objects in Context. In Proceedings of ICCV, 2007. Joseph Reisinger and Marius Pa¸sca. Latent Variable Models of Concept-Attribute Attachment. In Proceedings of ACL, 2009.


Philip Resnik. Selection and Information: A Class-Based Approach to Lexical Relationships. PhD thesis, University of Pennsylvania, 1993. Philip Resnik. Semantic classes and syntactic ambiguity. In Proceedings of ARPA Workshop on Human Language Technology, 1993. Philip Resnik. Selectional preference and sense disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, 1997. Francesc Ribas. On Learning more Appropriate Selectional Restrictions. In Proceedings of EACL, 1995. Stephen D. Richardson, William B. Dolan, and Lucy Vanderwende. MindNet: Acquiring and Structuring Semantic Information from Text. In Proceedings of ACL, 1998. Stephanie A. Schaeffer, Chung Hee Hwang, Johannes de Haan, and Lenhart K. Schubert. Epilog, the computational system for episodic logic: User’s guide. Technical report, Dept. of Computing Science, Univ. of Alberta, August 1993. Roger C. Schank. Using knowledge to understand. In TINLAP ’75: Proceedings of the 1975 workshop on Theoretical issues in natural language processing, 1975. Lenhart K. Schubert and Chung Hee Hwang. Episodic Logic meets Little Red Riding Hood: A comprehensive, natural representation for language understanding. In L. Iwanska and S.C. Shapiro, editors, Natural Language Processing and Knowledge Representation: Language for Knowledge and Knowledge for Language. MIT/AAAI Press, 2000. Lenhart K. Schubert and Matthew H. Tong. Extracting and evaluating general world knowledge from the Brown corpus. In Proceedings of the HLT-NAACL Workshop on Text Meaning, 2003.


Lenhart K. Schubert. Dynamic Skolemization. In H. Bunt and R. Muskens, editors, Computing Meaning, volume 1 of Studies in Linguistics & Philosophy Series, pages 219–253. Kluwer Academic Press, Dordrecht (also Boston, London), 1999. Lenhart K. Schubert. Can we derive general world knowledge from texts? In Proceedings of HLT, 2002. Lenhart K. Schubert. Some Knowledge Representation and Reasoning Requirements for Self-awareness. In Proceedings of AAAI Spring Symposium on Metacognition in Computation, 2005. Lenhart K. Schubert. From generic sentences to scripts. In Proceedings of IJCAI’09, Workshop on Logic and the Simulation of Interaction and Reasoning, 2009. Satoshi Sekine, Sofia Ananiadou, Jeremy J. Carroll, and Jun’ichi Tsujii. Linguistic Knowledge Generator. In Proceedings of COLING, 1992. Benny Shanon. On the Two Kinds of Presuppositions in Natural Language. Foundations of Language, 14:247–249, 1976. Mandy Simons.

Presupposition without Common Ground (unpublished manuscript). http://www.hss.cmu.edu/philosophy/faculty-simons.php, 2009. Push Singh. The public acquisition of commonsense knowledge. In Proceedings of AAAI Spring Symposium: Acquiring (and Using) Linguistic (and World) Knowledge for Information Access. AAAI, 2002. Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. Learning syntactic patterns for automatic hypernym discovery. In Proceedings of NIPS, 2005. Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. Semantic Taxonomy Induction from Heterogenous Evidence. In Proceedings of COLING-ACL, 2006. Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y. Ng. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of EMNLP, 2008.


Mark Steyvers and Tom Griffiths. Probabilistic Topic Models. In Thomas K. Landauer, Walter Kintsch, Danielle S. McNamara, and Simon Dennis, editors, Handbook of Latent Semantic Analysis. Lawrence Erlbaum Associates, Inc., 2007. David G. Stork. The Open Mind Initiative. IEEE Expert Systems and Their Applications, pages 16–20, May/June 1999. Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In Proceedings of WWW, 2007. Michael Sussna. Word sense disambiguation for free-text indexing using a massive semantic network. In Proceedings of CIKM, 1993. Robert S. Swier and Suzanne Stevenson. Unsupervised semantic role labelling. In Proceedings of EMNLP, 2004. Partha Pratim Talukdar, Joseph Reisinger, Marius Pa¸sca, Deepak Ravichandran, Rahul Bhagat, and Fernando Pereira. Weakly-supervised acquisition of labeled class instances using graph random walks. In Proceedings of EMNLP, 2008. Richmond H. Thomason. A Semantic Theory of Sortal Incorrectness. Journal of Philosophical Logic, 1(2):209–258, May 1972. Simon Tong and Jeff Dean. System and methods for automatically creating lists. US Patent 7,350,187. Assignee: Google Inc., April 2003. Judith Tonhauser. A Dynamic Semantic Account of the Temporal Interpretation of Noun Phrases. In Brendan Jackson, editor, Proceedings of Semantics and Linguistic Theory (SALT) XII. CLC Publications, Cornell, 2002. Peter Turney. Mining the Web for synonyms: PMI-IR vs. LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (ECML-01), 2001. U.S. Department of Transportation. National transportation statistics. http://www.bts.gov/publications/national transportation statistics, October 2009.


Benjamin Van Durme and Daniel Gildea. Topic Models for Corpus-centric Knowledge Generalization. Technical Report TR-946, Department of Computer Science, University of Rochester, Rochester, NY 14627, June 2009. Benjamin Van Durme and Marius Pa¸sca. Finding Cars, Goddesses and Enzymes: Parametrizable Acquisition of Labeled Instances for Open-Domain Information Extraction. In Proceedings of AAAI, 2008. Benjamin Van Durme and Lenhart K. Schubert. Open Knowledge Extraction through Compositional Language Processing. In Proceedings of Semantics in Text Processing (STEP), 2008. Benjamin Van Durme, Ting Qian, and Lenhart K. Schubert. Class-driven Attribute Extraction. In Proceedings of COLING, 2008. Benjamin Van Durme, Austin Frank, and T. Florian Jaeger. Comparing Sources of Corpus Frequency Information. In The 22nd Annual Meeting of the CUNY Conference on Human Sentence Processing (CUNY-09), 2009. Benjamin Van Durme, Phillip Michalak, and Lenhart K. Schubert. Deriving Generalized Knowledge from Corpora using WordNet Abstraction. In Proceedings of EACL, 2009. Benjamin Van Durme. Notes on the Acquisition of Conditional Knowledge. Technical Report TR-937, Department of Computer Science, University of Rochester, Rochester, NY 14627, June 2008. Luis von Ahn, Mihir Kedia, and Manuel Blum. Verbosity: A Game for Collecting Common-Sense Knowledge. In Proceedings of the ACM Conference on Human Factors Computing Systems, CHI Notes, 2006. Kai von Fintel. Restrictions on Quantifier Domains. PhD thesis, University of Massachusetts at Amherst, May 1994.


Kai von Fintel. Would you believe it? The king of France is back! Presuppositions and truth-value intuitions. In Marga Reimer and Anne Bezuidenhout, editors, Descriptions and Beyond. Oxford University Press, 2004. Ellen M. Voorhees and Dawn M. Tice. Building a Question Answering Test Collection. In Proceedings of SIGIR, 2000. Richard C. Wang and William W. Cohen. Language-Independent Set Expansion of Named Entities using the Web. In Proceedings of IEEE International Conference on Data Mining (ICDM), 2007. Daniel S. Weld, Fei Wu, Eytan Adar, Saleema Amershi, James Fogarty, Raphael Hoffmann, Kayur Patel, and Michael Skinner. Intelligence in Wikipedia. In Proceedings of AAAI, 2008. Yuk Wah Wong and Raymond J. Mooney. Learning Synchronous Grammars for Semantic Parsing with Lambda Calculus. In Proceedings of ACL, 2007. John M. Zelle and Raymond J. Mooney. Learning to Parse Database Queries using Inductive Logic Programming. In Proceedings of AAAI, 1996. Uri Zernik. Closed yesterday and closed minds: Asking the right questions of the corpus to distinguish thematic from sentential relations. In Proceedings of COLING, 1992. Luke Zettlemoyer and Michael Collins. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of UAI, 2005.


A

Generics

When common-sense knowledge is directly expressed in natural language, it usually takes the form of what are known in linguistics as generic sentences. In the following I give a short overview of what a generic sentence is, and of the problems associated with their interpretation, borrowing heavily from Krifka et al. (1995), the initial chapter of The Generic Book (Carlson and Pelletier, 1995). Linguists have used the term generic to refer to two different, but often co-occurring phenomena: kind-referring noun phrases (NPs), as in Dogs when used in (30) but not in (31); and sentences that refer to tendencies, or patterns, such as in (32), but not in (33).

(30) Dogs were domesticated from wolves.

(31) Dogs are in my yard.

(32) Rover (usually) barks.

(33) Rover barked at me this morning.

Kind-referring NPs1 often occur together within generic, or characterizing, sentences, and it is this combination I have most in mind in this dissertation whenever using the term generic. Such NPs are revealed in sentences bearing kind-only predicates, such as Extinct. For example, Dodos are extinct is natural, while Rover is extinct can only be interpreted if the main predicate is somehow coerced.2

Generic sentences tend to satisfy the following three conditions:

• Usually A generic sentence by default carries strong quantificational force, which means that it may usually be modified by adverbs such as usually with little to no change in meaning. For example, Dogs (usually) bark. Note that a generic may contain explicit quantification of any strength,3 such as in: Bears are rarely dangerous. Not all generics have this property; witness: Lightning rarely strikes people.4

• Stative Generic sentences are usually stative, in contrast to non-generics. One consequence is the lack of generic sentences in the English progressive, as in Rover is barking, versus the generic, Rover barks.5

• Nomic According to Krifka et al. (1995), generic sentences tend to express essential, as compared to accidental, properties. For example, while I can truthfully say that All dogs are (or have been) born on Earth, it is a little odd to say, generically, Dogs are born on Earth, because this has law-like overtones, clashing with the intuition that many future dogs might be born elsewhere. By contrast, it is harder to imagine a future in which Dogs bark is falsified. In this dissertation I will refer to this as the nomic, or rule-like, character of generics, which I take to be the key property that defines a generic assertion. This is further discussed in a later section.

1 Krifka et al. (1995) additionally allow for well-established kinds, such as Coke bottles as compared to green bottles; this potential distinction does not concern us here.

2 For instance, we might take this sentence to have greater emphatic force, but express the same meaning, as Rover is dead.

3 Such as captured by the full range of adverbial quantifiers discussed by Lewis (1975), and presented here in Chapter 3.

4 Both examples come from Len Schubert (p.c.). Note that a strong quantifier is still present in: (Usually) bears are rarely dangerous. In cases that retain this strong implicit quantification, but also carry overt adverbial quantification, the presence of multiple quantifiers leads to multiple readings, depending on which variables the respective quantifiers are taken to bind.

5 As also noted in Chapter 3, this dissertation is not concerned with the extraction of such habituals predicated of object-referring NPs.


A.1

Individual and Stage-level Predication

Generic sentences may be divided into two categories, labelled by Krifka et al. (1995) as: habituals, which assert some pattern of activity; and lexicals, whose verbal predicates do not morphologically relate to time-bounded situations (such as love, cost, etc.). Note that lexical generics are not necessarily assumed to be forever unchanging, as in Aspirin costs a nickel, but they do act as if they will continue to hold, ceteris paribus (all else being equal). I will refer to this as a distinction between stage- versus individual-level predication, terminology introduced by Carlson (1977b) (see also Carlson (1977a)). An individual may be either a kind, such as dog-kind, or an object, as in the case of Rover, which itself is of the kind Dog. A stage is a spatio-temporal part of an individual.6 This distinction was used by Carlson to analyze the behavior of bare English plurals.7 Previous analyses had struggled to account for the fact that these bare plurals could be used both in the generic sense under consideration here, as well as in indefinite, non-generic plurals, as in Dogs are in the next room. Carlson (in short) suggested that a more uniform interpretation of bare plurals could be achieved by moving the burden to verbal predicates, which themselves select for different readings when composed with the nominal. For example, sentences (34, 35, 36) all carry the kind-referring term, Dogs, as their syntactic subject. When composed with the verbal predicate in (34), which is stage-level and refers to a specific event, the result is to pick out particular stages of particular dogs that were involved in the barking event. Sentence (35) does not denote a specific event, but instead has a habitual reading: all or most dogs bark at least occasionally. Sentence (36) contains an individual-level verbal predicate that applies to objects, which will pick out objects of the kind Dog, and assign the property of furriness. After specifying a representation language, a more formal treatment of (35) and (36) is provided in Chapter 3.

6 I follow Tonhauser (2002) in this phrasing, but acknowledge that this incorrectly allows for, e.g., Rover’s left ear, from yesterday until today, to qualify as a stage. More precisely (while still deferring to Carlson as the proper definitional source), stages are the complete physical embodiment of an individual over some temporal “slice”.

7 Bare in that they lack a determiner, e.g., Dogs bark, versus A dog barks.

Dogs barked.

(35)

Dogs bark.

(36)

Dogs are furry.

(37)

Rover barked.

(38)

Rover barks.

(39)

Rover is furry.

Sentences (37, 38, 39) all contain an object-referring NP as their syntactic subject. In (37), we again have a stage-level verbal predicate that denotes a specific event, and thus picks out a particular stage of the given object (some spatio-temporal part of Rover ). Sentence (38) expresses a habitual, where there is a pattern of Rover stages that are involved in barking events.8 Finally, sentence (39), as in (36) applies an individuallevel predicate to its subject, which in this case is an object, and thus assigns the enduring property of furriness to Rover as an individual. Sentences (35, 36) are generic, with (38) classified by some as generic, depending on the treatment of these sorts of habituals. Sentences of the sort illustrated by (35,36) are the form of knowledge of concern here.

A.2

Formalization

Generics involve quantification over some set, and they are usually taken to ascribe some essential characteristic to those elements (in the case of habituals, the set would be comprised of situations). The problem in interpreting generics boils down to how you determine what this set is and the relative size of the subset that is required to satisfy the given predication. Krifka et al. (1995) settles on the following syntax for representing generics: 8

Here I gloss over alternate readings, such as that Rover has a capability for barking.

153

Gen [x1 , ..., xn ;y1 , ..., ym ] (Restrictor[x1 , ..., xn ]; Matrix[{x1 }, ..., {xn };y1 , ..., ym ]), where Gen is a generic quantifier of ambiguous strength, which combines with the domain Restrictor to constrain the set of elements the Matrix9 applies to. Variables x1 through xn are written as {x1 } through {xn } in the matrix (nuclear scope) to reflect the optional occurrence of these variables there. Variables y1 through ym are bound implicitly by an existential quantifier in the matrix.10 There is much left unsaid in this definition. For instance, what makes up the restrictor in the utterance, Rover barks? It is more specific than: every moment that Rover exists, otherwise Rover would need to be in a constant state of barking. Nor is it enough to restrict the utterance to mean every moment Rover is awake, or every moment he is awake and not eating, or every moment he is awake and not eating and not taking a bath. Within AI this was recognized as the qualification problem, McCarthy (1980): in order to fully represent the conditions for the successful performance of an action, an impractical and implausible number of qualifications would have to be included in the sentences expressing them. Owing to the implausible number of qualifications we would need in order to be crisply precise in our generic assertions (which, again, specify rules about the world), we as humans leave these restrictions as implicit. As often said by Len Schubert (p.c.), language is telegraphic, in that the majority of our intent is not given explicitly. Determining these implicit constraints – i.e., the content of the domain restrictor – is one of the primary motivations for AI researchers to acquire large amounts of background knowledge. As said above, generics with no explicit quantification can be modified by Usually with little change in meaning (for the generic reading of the given sentence). This appears problematic for sentences such as (40) and (41), taken from Krifka et al. (1995) and attributed to Carlson (1977b). In (40) the assumption is that roughly one half, at most, birds can lay eggs (which would require a quantifier of, e.g., roughly half ). In the 9 10

Otherwise known as the scope, or nuclear scope. Note that in Chapter 3 I commit to a generic quantifier that binds to just a single variable, as

compared to the Lewis-style quantification over cases that Krifka et al. (1995) seemed to have in mind.

154

case of (41): most turtles die young because of predation, and yet the sentence is taken by most as being truthful. (40)

A bird lays eggs.

(41)

A turtle lives a long life.

These sentences highlight the impact of the underspecified domain restrictor. For examples, Sentence (41) has a natural expansion as A turtle (that reaches adulthood) lives a long life. For (40) we might say, Most (female) birds lay eggs.

A.3

Truth Conditions

How do humans interpret generic statements? Carlson (1995) argued that the possibilities fall along a spectrum between the following two approaches: • Inductive The truth of a generic is based on observing, then characterizing, patterns of behavior. We agree on the assertion, Dogs bark, because of our personal experiences with dogs and their barking. A less strict view is that someone had enough such personal experiences, and it is those experiences on which we rely, when later repeating the assertion. • Realist Also termed the rules-and-regulations approach. In his exposition, Carlson uses the metaphor of a chess manual: we know that Bishops move diagonally because the rules say this to be so. This view takes support in part from sentences such as, This machine crushes oranges (Carlson, 1977b), which can be considered truthful even if the machine is brand new, and is demolished immediately subsequent to the utterance. In that case, there are no orange-crushing episodes on which to hang an inductive interpretation. Carlson supports the intuition of the later approach, saying that sentences such as, John is intelligent, suggest that humans behave, or at least speak, as if the world has some sort of inherent, causal, rule-like structure. This is made more clear in Carlson

155

(2008), who writes: the real world consists of a course or sequence of events (many of which may overlap one another), and that temporally or otherwise circumscribed portions of the world exemplify (or not) patterns that form the basis for the truthconditions of the sentences. This isn’t simply a philosophical issue, as arguments for the realist approach are potentially problematic for applied systems described in this dissertation. Assume a realist position, and then consider: • We want to extract knowledge from examples seen in text. • The resultant output can be verbalized as natural language generic sentences. • Humans interpret the truth of (at least some) such sentences through noninductive means. That is, systems such as Knext use an inductive process to acquire human knowledge about the world,11 and yet humans themselves rely on non-inductive means to ascertain the truth of those same statements (under a realist view). We might argue that the issue does not concern us, as systems such as Knext do not base their results on observations from the world, but instead, they are “eavesdropping” on humans that have already committed to some set of beliefs, and we’re simply trying to glean those commitments based on what humans may reveal through discourse. This argument carries weight for those that involved in KA using direct interpretation of declared knowledge (such as Wikipedia). In the case of implicit KA it depends on how one interprets the raw output of the extraction system, not to mention the problem of reporting bias (both of these points are discussed in Chapter 3).

11

Such an approach is even hinted at by Krifka et al. (1995): The former type of situations - where

John is speaking French - can be considered evidence for the truth of the characterizing reading [John speaks French]. (pg. 37).