RETRIEVAL OF PASSAGES FOR INFORMATION REDUCTION

A Dissertation Presented by JODY J. DANIELS

Submitted to the Graduate School of the University of Massachusetts Amherst in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY September 1997 Department of Computer Science

© Copyright by Jody J. Daniels 1997
All Rights Reserved

RETRIEVAL OF PASSAGES FOR INFORMATION REDUCTION A Dissertation Presented by JODY J. DANIELS

Approved as to style and content by:

Edwina L. Rissland, Chair
Nicholas J. Belkin, Member
Jamie Callan, Member
W. Bruce Croft, Member
Ethan Katsh, Member

David W. Stemple, Department Chair
Department of Computer Science

To my parents, Jean and John (MB and DOD), who have repeatedly said: "So what if you failed this time. Go try again; you can do it." And to my sibs (and their families): JJ, Jerry, and Rosey, who sent numerous messages and placed many a phone call, always offering encouragement.

ACKNOWLEDGMENTS

Tell me what you know . . . tell me what you don't know . . . tell me what you think . . . always distinguish which is which.
    - General Colin Powell, USA, Chairman of the Joint Chiefs of Staff

My thanks go out to a large number of people. First, the undergraduate readers: Anita Parameswaran, Allana Todman, Joanna Grieco, and Eileen O'Dea, who spent hours making all the relevance judgements. There are many other people I would like to thank personally for making my seemingly endless stay in graduate school a better experience. Unfortunately, I know I would have missed someone, so I've generalized (in no particular order): the bridge crew for helping me learn how to play competition-level bridge; housemates who endured my many moods and times of total insanity; the Outing Club folks that introduced me to white water and reintroduced me to hiking; softball teammates; the RCF/CSCF (whoever you all are!) tech support crew, my thanks for graciously repairing all the broken disks and curing the other problems that ailed my machines over the years; those lovely TeX hackers who helped make my publications look great; the members of the two Reserve units that made sure I always had something to do in my "spare" time; the various secretaries that helped immensely with all the "small" stuff; and fellow graduate students. There are a few people I feel I must single out by name. These people helped me immeasurably during the past n years. My thanks: for guiding me into the swamp of relevance feedback code and answering those thousands of questions when INQUERY

stared mockingly back at me, Michelle Lamar; for sharing living space and thoughts, Celeste; for Boating trips I and II, always being ready at a moment's notice to play bridge with six decks of cards (and now bid boxes) in tow, and for listening to me talk about everything, Rob Brooks; for adding balance to my life outside of CS, Deaun; for letting me escape to Camp Allan, helping me wend my way through the INQUERY code, all those crazy discussions about work and life, and for letting me kick the cat when needed, James Allan. Thanks to my committee members: Nick Belkin, Jamie Callan, Bruce Croft, Ethan Katsh, and the chair, Edwina Rissland, for letting me wander off my own way. A huge thanks in particular to Jamie for rereading the many parts of this dissertation and always diplomatically making comments and suggestions. My eternal gratitude for your caring. It is a much better piece of work due to your advice and assistance. Any mistakes are purely mine. And finally, to all those who listened to me complain, be hyper, be happy, be manic, to all those who had more faith in me than I did, my thanks for your unbounded support. I know there were others whose paths I have crossed and who have helped me along the way. Please forgive my omission as the deadline to turn this in draws near and what sanity I had disappears.


ABSTRACT

RETRIEVAL OF PASSAGES FOR INFORMATION REDUCTION

SEPTEMBER 1997

JODY J. DANIELS
B.S., CARNEGIE MELLON UNIVERSITY
M.S., UNIVERSITY OF MASSACHUSETTS AMHERST
Ph.D., UNIVERSITY OF MASSACHUSETTS AMHERST

Directed by: Professor Edwina L. Rissland

Information Retrieval (IR) typically retrieves entire documents in response to a user's information need. However, many times a user would prefer to examine smaller portions of a document. One example of this is when building a frame-based representation of a text. The user would like to read all and only those portions of the text that are about predefined important features. This research addresses the problem of automatically locating text about these features, where the important features are those defined for use by a case-based reasoning (CBR) system in the form of features and values or slots and fillers. To locate important text pieces we gathered a small set of "excerpts", textual segments, when creating the original case-base representations. Each segment contains

the local context for a particular feature within a document. We used these excerpts to generate queries that retrieve relevant passages. By locating passages for display to the user, we winnow a text down to sets of several sentences, greatly reducing the time and effort expended searching through each text for important features.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

Chapter

1. INTRODUCTION
   1.1 Current case-based reasoning methods
   1.2 Current information extraction methods
   1.3 Information extraction/location challenges
   1.4 Sample search problem
   1.5 Current information retrieval methods
   1.6 Research goals/Problem being addressed
   1.7 Guide to the dissertation

2. BACKGROUND
   2.1 Case-based reasoning
   2.2 Information extraction
   2.3 Information retrieval
       2.3.1 General IR methods
       2.3.2 INQUERY specific background
   2.4 Retrieval of passages
   2.5 Summary

3. SYSTEM DESCRIPTION
   3.1 System overview
   3.2 Detailed system description
   3.3 Example problem
       3.3.1 Document retrieval
       3.3.2 Passage retrieval
       3.3.3 Manual passage retrieval
   3.4 Summary

4. EXPERIMENTAL METHODOLOGY
   4.1 Features
       4.1.1 Types of features
       4.1.2 Finding values
   4.2 Features examined
   4.3 Excerpts
   4.4 Collection and test documents
   4.5 Retrieval
   4.6 Answer keys
       4.6.1 Defining relevance
       4.6.2 Relevance judgements
   4.7 Evaluation metrics
       4.7.1 Standard measures
       4.7.2 Expected search length
       4.7.3 Summary

5. EXPERIMENTS
   5.1 Excerpt-based queries
       5.1.1 Base queries
       5.1.2 Kwok-based weighting
       5.1.3 Semi-random sets of terms
   5.2 Manual queries
   5.3 Second set of excerpts
   5.4 Summary

6. RESULTS
   6.1 Comparison of excerpt-based queries
       6.1.1 Base queries
       6.1.2 Kwok-based weighting
       6.1.3 Semi-random sets of terms
       6.1.4 Second set of excerpts
   6.2 Comparison to the manual queries
       6.2.1 Procedural status
       6.2.2 Future income
       6.2.3 Sincerity
       6.2.4 Loan due date
       6.2.5 Debt type
       6.2.6 Duration
       6.2.7 Monthly income
       6.2.8 Plan filing date
       6.2.9 Profession
       6.2.10 Special circumstances
       6.2.11 Summary
   6.3 Summary

7. CONCLUSIONS AND FUTURE WORK
   7.1 Conclusions
   7.2 Future work
       7.2.1 Query construction
           7.2.1.1 Refining the definition of relevant
           7.2.1.2 Query expansion
           7.2.1.3 Learning
           7.2.1.4 Stopping and stemming
           7.2.1.5 Concept recognizers
       7.2.2 Relevance feedback
       7.2.3 Display of context
       7.2.4 Classification of documents
   7.3 Long term outlook

APPENDICES

A. SAMPLE CASE-FRAME
B. INSTRUCTIONS
   B.1 Sincerity
   B.2 Special circumstances
   B.3 Plan duration
   B.4 Monthly income
   B.5 Loan due date
   B.6 Plan filing date
   B.7 Debt type
   B.8 Profession
   B.9 Procedural status
   B.10 Future income
C. EXCERPT-CKBS
   C.1 Debt type
   C.2 Duration
   C.3 Future income
   C.4 Loan due date
   C.5 Monthly income
   C.6 Plan filing date
   C.7 Procedural status
   C.8 Profession
   C.9 Sincerity
   C.10 Special circumstances
D. MANUAL QUERIES

BIBLIOGRAPHY

LIST OF TABLES

3.1 The most highly ranked documents for the Rasmussen problem.
3.2 Top passages retrieved from the Sellers opinion for the bag of words and sum queries for duration.
3.3 Top ranked passages from the Sellers opinion for the manual query on duration.
4.1 Test features from the good faith personal bankruptcy domain.
4.2 Number of terms contained in the original excerpt-ckbs.
4.3 Number of terms contained in the excerpt-ckbs, where features labeled with an * give values for the second excerpt-ckb.
4.4 The most highly ranked documents for the Rasmussen problem.
4.5 The most highly ranked documents for the Makarchuk problem.
4.6 The most highly ranked documents for the Easley problem.
4.7 Test documents and their length in passages and words.
4.8 Number of passages judged relevant for each feature.
4.9 Relevance of the top passages from a sample query. Italicized lines are those that share a belief score.
5.1 The top 20 terms and their weights for the Kwok-based queries for special circumstances and sincerity.
5.2 Sample semi-random queries for sincerity.
5.3 Definitions of the different query types used in the experiments.
6.1 Average esl values for the bag of words and sum queries for all features. Features with a * list values from the second excerpt-ckb.
6.2 Average esl scores for the bag of words and set of words queries. Features with an * used the second excerpt-ckb for comparisons.
6.3 Instances when the bag of words query was significantly better than the set of words query using average precision at six cutoff points. Features with a * used the second excerpt-ckb for comparisons.
6.4 Instances where the bag of words query was significantly better than the best Kwok-based query.
6.5 Instances where the bag of words query was significantly better than the semi-random query with 1/2 of the excerpt terms.
6.6 Comparison at esl3 between manual and the SPIRE-generated bag of words and sum queries. An "SP" indicates that both SPIRE queries performed better than the manual. An "M" indicates that the manual query performed better. If the manual fell between the two, the SPIRE query performing the best is given: "b" for bag of words and "s" for sum. If all three queries performed equally well, an "=" is shown.
6.7 Average esl values for the manual and SPIRE-generated queries for all features. Features with a * list values from the second excerpt-ckb.
6.8 Average esl values, only including those documents where all three queries were able to retrieve the requested number of passages. Features with a * list values from the second excerpt-ckb.
6.9 Average esl values when all passages have a default belief. Features with a * list values from the second excerpt-ckb.
6.10 Average precision (non-interpolated) for the manual and excerpt-based queries. Features with a * list values from the second excerpt-ckb. Values with a * are significant.
6.11 Average precision at 10 passages for the manual and excerpt-based queries. Features with a * list values from the second excerpt-ckb. Values with a * are significant.
6.12 ESL values for the manual and SPIRE-generated queries for procedural status.
6.13 Average reduction in esl for the two SPIRE queries for procedural status when using the shortened excerpt set.
6.14 ESL values for the manual and SPIRE-generated queries for the second excerpt-ckb for procedural status.
6.15 Average precision at cutoffs and non-interpolated precision for procedural status. Values with a * are significant.
6.16 ESL values for the manual and SPIRE-generated queries for the second excerpt-ckb for future income.
6.17 Average precision at cutoffs and non-interpolated precision for future income. Values with a * are significant.
6.18 ESL values for manual and SPIRE-generated queries for sincerity.
6.19 Average precision at cutoffs and non-interpolated precision for sincerity. Values with a * are significant.
6.20 ESL values for the manual and SPIRE-generated queries for loan due date where there were between one and four relevant passages.
6.21 ESL values for the manual and SPIRE-generated queries for loan due date.
6.22 Average precision at cutoffs and non-interpolated precision for loan due date.
6.23 Average reduction in esl for the bag of words and sum for debt type when using the second excerpt-ckb.
6.24 ESL values for the manual and SPIRE-generated queries for the second excerpt-ckb for debt type.
6.25 Average precision at cutoffs and non-interpolated precision for debt type. Values with a * are significant.
6.26 Manual and list query esl scores for debt type.
6.27 ESL values for manual and SPIRE-generated queries for duration.
6.28 Average reduction in esl for the bag of words and sum for duration when using the shortened excerpt set.
6.29 ESL values for the manual and SPIRE-generated queries for the second excerpt-ckb for duration.
6.30 Average precision at six cutoffs and non-interpolated precision for duration.
6.31 ESL values for the manual and SPIRE-generated queries for monthly income.
6.32 Changes between the two excerpt case-bases for monthly income.
6.33 Average precision at cutoffs and non-interpolated precision for monthly income.
6.34 Average esl values for the manual and three excerpt-ckbs. The second line lists the average for only the documents where all three queries retrieved the requested number of passages.
6.35 ESL values for the manual and SPIRE-generated queries for the second excerpt-ckb for plan filing date.
6.36 Average precision at cutoffs and non-interpolated precision for plan filing date.
6.37 Average reduction in esl for the bag of words and sum for plan filing date when using the second excerpt-ckb.
6.38 Differences between the base and second set of excerpts at esl3 for plan filing date.
6.39 ESL values for manual and SPIRE-generated queries for profession.
6.40 Average precision at cutoffs and non-interpolated precision for profession.
6.41 List queries for profession.
6.42 ESL values for the original and three list queries for profession.
6.43 ESL values for the manual and SPIRE-generated queries for special circumstances.
6.44 Average precision at cutoffs and non-interpolated precision for special circumstances.
6.45 Review of results for the two best SPIRE query types and the manual query set. SP denotes the SPIRE queries and M denotes the manual.
B.1 Previously encountered debt types.
B.2 Previously encountered professions.
B.3 Previously encountered values for procedural status.

LIST OF FIGURES

1.1 Part of a bankruptcy case representation along with sample values.
1.2 Portions of the Sellers court opinion discussing the feature of monthly income.
2.1 Overview of the CBR process. Optional elements are dashed and in grey.
2.2 Top layers of the claim lattice for the Easley case.
2.3 Overview of an "Information Retrieval" event.
2.4 Sample inference network for document retrieval.
2.5 Sample document broken into overlapping 20-word passages.
3.1 Overview of SPIRE.
3.2 SPIRE's retrieval process for novel documents.
3.3 SPIRE's passage retrieval subsystem.
3.4 Top layers of the claim lattice for the Rasmussen case.
3.5 Passages 1420, 1430, and 1440.
3.6 Passages 2650 and 2660.
3.7 Passage 2460.
3.8 Passages 2610 and 2620.
6.1 Passages 720, 730, and 740 [court opinion 764].
6.2 The original and list query for debt type.
A.1 Top-level legal case representation along with object class and sample values.
A.2 Top-level bankruptcy case representation.
A.3 Estus-factors representation for plan duration.
A.4 Estus-factors representation for payments and surplus.
A.5 Estus-factors representation for employment history.
A.6 Estus-factors representation for plan accuracy.
A.7 Estus-factors representation for debt type.
A.8 Estus-factors representation for motivation and sincerity.
A.9 The various Makarchuk factors.
A.10 Plan payment representation.
A.11 Debt type and amount representation.
A.12 Student loan bankruptcy case representation.
A.13 Honest debtor bankruptcy case representation.
A.14 Judgment debtor bankruptcy case representation.

CHAPTER 1

INTRODUCTION

When confronted with a new problem, we frequently try to relate the situation to one that we have dealt with previously. If we have knowledge of a similar prior problem, we might apply the same or a related solution to the case at hand. By analogizing the new problem, or parts of it, to past experiences, we save ourselves from having to solve every problem from scratch. Similarly, we might want to evaluate a scenario and hypothesize the likelihood of various outcomes. Again, we could rely on prior experiences to assist in this endeavor. For both of these processes we are employing "case-based reasoning". Case-based reasoners solve problems or examine and explain possible outcomes to a scenario by relying on prior similar experiences. Case-based reasoning (CBR) systems can be found in such diverse domains as cooking [21], medical diagnostics [32], manufacturing [26], game playing [36, 37, 9], and legal applications [54, 58, 57]. In the legal domain, finding similar cases is crucial because of the importance precedents play when arguing a case. By examining the outcomes and features of previous cases, a legal reasoner (e.g., judge, advocate) can decide how to handle a current problem. Similarly, a social worker may want to ensure fairness in the handling of a new situation and may look back on prior cases to ensure the treatment plans are equitable. In the medical domain, remembering or finding cases similar to the current patient may be key to making a correct diagnosis. Previous cases may provide insight as to how an illness should be treated and which treatments may prove to be the most effective. Game players examine old scenarios and their outcomes to

decide how to play the current position. A researcher can argue the merits of funding a research proposal by showing the differences between this proposal and what has been done in the past. In order for CBR systems to generate useful and reliable results, these systems need to have reasonably sized "case knowledge bases" (CKBs). For some domains this is not much of a problem since the representations are either fairly simple [18, 37] or can be easily generated automatically [36, 76]. However, when the domain representation for a case is complex, building a case knowledge base with large numbers of cases typically becomes an expensive proposition. Converting case information into a symbolic representation frequently requires the time and experience of a subject matter expert, which is costly when there are large numbers of cases to be represented. If the case material is contained in a textual format, then there are a variety of ways to convert the document into a symbolic representation. At one end of the spectrum for automation is natural language processing (NLP) and at the other is manual search. In between are information extraction (IE), information retrieval (IR), and string-based searches. We would like the case acquisition process to be as automated as possible. At this point in time, natural language processing is beyond the state of the art and manual processing is too time-intensive. Therefore, we hypothesized that we could use a semi-automated or user-aided/interactive approach to focus attention on selected portions of the document. We explored this approach by using an information retrieval (IR) system to locate relevant passages within a set of domain documents. The passages retrieved were those that described the important features for the domain, and the features were those that make up the domain representation used by the CBR system in the form of feature-value pairs. We validated this approach using documents drawn from a statutory legal domain: court opinions addressing the

personal bankruptcy "good faith" issue, specifically §1325(a)(3) of the U.S. Bankruptcy Code. Figure 1.1 gives part of a case-frame for the personal bankruptcy domain (as used in the BankXX system [57]).

Slot Name                 Slot Type
legal-case ( )
  citation                free text: In re Rasmussen, 888 F.2d 703 (6th Cir. 1988)
  year                    numeric: 1985
  level                   category: :bankruptcy-court :appeals-court
  judge                   proper name: David A. Scholl
  summary                 free text
  procedural-status       category: 'plan-confirmation 'plan-filing
  decision-for            category: 'debtor 'creditor
  factual-prototype-link  category: 'student-loan 'farm
  legal-prototype-link    category: 'Estus 'Kitchens
  legal-theory-link       set: 'memphis-theory 'estus-theory

Figure 1.1. Part of a bankruptcy case representation along with sample values.

We used a case-base of "excerpts" derived from a small set of court opinions to form a query for each feature. Using the IR system we tested queries formed in various ways from the excerpts to see if they could perform as well as or better than a set of expertly crafted manual queries. The objective was to retrieve the most informative passages for each feature. By doing so, we did not automatically fill in the case representation, but we were able to focus the extraction system, whether human or machine, on the places in the text that were most likely to contain feature information. This should reduce the effort required to search through lengthy documents when building a comprehensive case knowledge base.
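As an aside for readers who want to see such a slot-and-filler frame in executable form, the following is a minimal sketch of how the representation in Figure 1.1 might be encoded. The slot names follow the figure (with underscores in place of hyphens); the class, its helper method, and the particular values filled in are illustrative only and are not the actual BankXX or SPIRE data structures.

    # A minimal, illustrative encoding of a slot-and-filler case frame.
    # Slot names mirror Figure 1.1; the class itself is hypothetical.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class LegalCaseFrame:
        citation: str                              # free text
        year: int                                  # numeric
        level: Optional[str] = None                # category, e.g. "bankruptcy-court"
        judge: Optional[str] = None                # proper name
        summary: Optional[str] = None              # free text
        procedural_status: Optional[str] = None    # category
        decision_for: Optional[str] = None         # category
        factual_prototype_link: Optional[str] = None
        legal_prototype_link: Optional[str] = None
        legal_theory_link: List[str] = field(default_factory=list)  # set-valued slot

        def unfilled_slots(self) -> List[str]:
            """Slots still needing a value; the passages SPIRE locates are meant
            to help a human (or IE system) fill these in."""
            return [name for name, value in vars(self).items() if value in (None, [])]

    rasmussen = LegalCaseFrame(
        citation="In re Rasmussen, 888 F.2d 703 (6th Cir. 1988)",
        year=1985,
        level="bankruptcy-court",
        judge="David A. Scholl",
    )
    print(rasmussen.unfilled_slots())

Running the sketch lists the slots that remain empty, which is exactly the residual work the passage-retrieval approach is meant to reduce.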

1.1 Current case-based reasoning methods

CBR systems tend to focus on the automatic evaluation, selection, and adaptation of prior cases to solve a current problem or to justify and explain an interpretation of a case. Complex or deep case representations are frequently employed by CBR

systems. These case structures typically represent more than one level of abstraction so that the system can perform multiple levels of reasoning and can do a detailed analysis of the similarities and differences among situations. Most CBR systems require the hand-coding of data into structures that support these processes (e.g., episodes in Kolodner's CYRUS [30] used a conceptual dependency representation and Branting's GREBE [4] used a complex semantic network). Most cases are not automatically generated from text or other sources. Rather, a human makes the decisions about how to represent the information as a case. Consequently, because of the cost of manual case acquisition, most CBR systems have small case knowledge bases. In those systems where there is a large case knowledge base, the cases are typically simpler and shallower. Therefore, there tends to be a tradeoff between the size of the CKB and the size of the cases among current CBR systems: those systems with large cases have small CKBs, and those with smaller cases may have a larger CKB. There are few CBR systems with large CKBs, that is, a thousand or more cases. Because we would like to be able to expand more easily the size of the case knowledge bases associated with CBR systems, as well as have a complex case representation, we review automated techniques for representing a document and its contents and for locating information contained within the documents. We look first at information extraction and then at information retrieval.

1.2 Current information extraction methods

In some domains it is possible to use information extraction (IE) technologies to convert short texts into a frame- or feature-based representation. (Many examples of current message understanding systems and the domains they run in can be found in [42] and [43].) However, these are specialized domains and the techniques employed

rely on the ability to find particular syntactic patterns associated with "trigger" or key words, usually verbs. The association of noun phrases with key verbs found in recognizable syntactic patterns allows for the extraction and symbolic representation of pertinent information. Unfortunately, these techniques currently need significant numbers of manually annotated texts (e.g., on the order of 1000 or more texts¹) to provide training for the extraction system. These training documents are generally one paragraph to two pages in length. All of this learning presupposes that there are enough domain-specific training documents available for learning. If there are insufficient numbers of training instances, then the IE system will be unable to learn the lexicon and associated linguistic patterns. Without this domain knowledge the system will be unable to automatically extract information to fill the case representation. Further, since the training documents for generating the linguistic patterns are domain specific, the patterns learned are not generalizable to other extraction domains. The cost of annotating such large numbers of texts can be exorbitant in most domains. This can be prohibitive if the information need is not recurrent, but sporadic. Similarly, the creation of large numbers of extensive frame-based case representations is a daunting task and prohibitively expensive in domains that require extensive expertise. Another important aspect of these IE domains is that the information extracted for a feature is directly found within the text. That is, the actual wording found in the text is used as the feature's value. This is generally not true for most CBR case representations, and, in particular, for the domain examined in this thesis. Frequently a human must draw some inference from the wording to ascertain the value or must

¹ Although this may be changing. This will be discussed more in Section 2.2.


perform some reasoning on the relevant text passages to derive the value for the feature.

In summary, existing information extraction techniques have four characteristics that make them difficult to use:

1. they require a large, domain-specific training corpus,
2. they require manual annotation of the training corpus,
3. they generate results that are not generalizable to other domains, and
4. they directly extract their feature values, verbatim, from the text.
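To make the contrast concrete, the toy extractor below mimics the trigger-word, pattern-matching style of extraction just described. It is not any actual MUC-era system: the single pattern, the slot name, and the test sentences are invented for illustration. It can only copy a value verbatim when the text happens to match the pattern, which is precisely the fourth characteristic listed above.

    # Toy illustration of trigger-word pattern extraction (not any actual IE system).
    # A pattern anchors on a key verb or phrase and copies the neighboring value
    # verbatim, which does not fit case frames whose values must be inferred.
    import re

    patterns = {
        # hypothetical pattern: a dollar amount followed by "per month",
        # anchored on an earnings trigger
        "monthly_income": re.compile(
            r"(?:earn(?:s|ed|ing)?|take-home pay of)\s+\$([\d,]+(?:\.\d{2})?)\s+per month",
            re.IGNORECASE),
    }

    def extract(text: str) -> dict:
        found = {}
        for slot, pattern in patterns.items():
            match = pattern.search(text)
            if match:
                found[slot] = match.group(1)   # value copied verbatim from the text
        return found

    print(extract("The debtor was earning $1,012.00 per month at the second hearing."))
    # -> {'monthly_income': '1,012.00'}
    print(extract("a weekly take-home pay of $262.30"))   # no match: value must be inferred
    # -> {}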

There were other problems with IE that we found when examining the texts from the legal domains. The texts did not lend themselves to automatic case generation via current information extraction techniques for several reasons: lengthy documents were computationally expensive to process; pertinent text was spread throughout the document, and later references could be just as or more important than earlier references to a feature; and large portions of the text could be about other topics. These problems were particularly apparent when dealing with legal court opinions and their case representations. For the home office deduction domain (specifically §280A of the U.S. Internal Revenue Code, as used in Cabaret [58]) and the personal bankruptcy domain, we found that documents ranged between two and twenty pages. Most opinions addressed just one issue, but sometimes they would address as many as seven different issues. Further, while these documents generally followed the pattern of first giving background information, next the issues under consideration, and then sequentially discussing each of the issues, there was not necessarily a clear break between the sections. A single opinion could also cover multiple factual situations (i.e., one opinion could have been a consolidation of many cases into one) and a dissenting opinion might have been included. These all complicated the information extraction procedure. Therefore, if we were to build a reasonably sized case-base for

these domains, and if the costs of extraction were to be kept reasonable, then some other means would be needed to focus the extraction process.

1.3 Information extraction/location challenges

To illustrate the problem of information extraction, consider the following scenario: an individual is evaluating the possible courses of action relative to claiming bankruptcy and filing a repayment plan for her creditors. Because she is concerned with whether her plan will be considered as being proposed in "good faith", she wants to compare her situation with that of others who have already been to court over the same issue. Assume that we have acquired a set of texts relevant to our domain of interest, the bankruptcy good faith issue. We desire to incorporate these texts as cases into our CBR system. Therefore, we now have the problem of locating within these documents those segments of text that discuss the various features of importance so that we can fill in the case-frame. One of the factors the court must consider in determining good faith is the debtor's amount of monthly income, which, when combined with information about monthly expenses, will determine the surplus available to pay off the debts. All three of these features are likely to be a part of a lawyer's case representation for a good faith case. In fact, such a representation has been used in prior work with the BankXX system [57]. (Appendix A gives the complete case representation used for the good faith domain.) We use the Sellers² opinion to illustrate some of the problems that arise while searching/reading the text for case-frame information. In Figure 1.2 we give locations along with the relevant text portions that discuss the attribute of the debtor's monthly

² In re Sellers, 33 B.R. 854, 857 (Bankr. D. Colo. 1983)


income from within the Sellers court opinion. (The italicized text is that which directly refers to monthly income.)

Page 1: In consideration of Chapter 13 plan, income of debtor's wife was to be included, . . .

Page 4: He listed his monthly take-home pay at $701.80. When the hearing was held on April 20, 1981, Sellers had changed employment and was earning roughly $1,012.00 per month take-home pay for a 4-week period. His wife's income was listed as $525.84 per month. At the time of the last hearing, on June 8, 1983, Seller's take-home pay was approximately $1,150.00 per month. Seller's wife, who has not filed for bankruptcy, has experienced an increase in income from $525.84 per month to a monthly income of $950.

Page 5: This includes consideration for the wife's income, which, according to Kull, supra, should be included. The resultant changes, including overtime, taxes, monthly expenses and salary increase appear from the evidence to indicate the debtor's monthly surplus, . . .

Page 6: Although the debtor has experienced changes in income and expenses, there is no indication such changes were a result of attempts to mislead the court. Indeed, the debtor freely admitted at the original confirmation hearing that he had experienced a substantial increase in income since his original budget.

Page 7: Additionally, Sellers did not make an effort to hide the increase in income which he experienced at the time of the original confirmation hearing.

Figure 1.2. Portions of the Sellers court opinion discussing the feature of monthly income.

First, note that the document is long and the relevant discussion is interspersed throughout the length of the document. In this opinion relevant text is scattered between the first and seventh pages. It turns out that the text found on the fourth page is the most informative. Second, differing terminology is used in the various fragments. The concept of monthly income is described as "take-home pay", "salary", and as "monthly income". Monthly income might also have been called "earnings" or "wages". The source of

income might have been a "stipend", "grant", or even "unemployment". All of these are descriptive of the monthly income feature. Besides the many ways to describe the feature, there are multiple ways to state the actual value: "monthly", "weekly", "yearly", "annually", "a 4-week period", etc. Some of these will require a calculation to determine what the amount is on a "monthly" basis. Therefore, there is a large vocabulary that can be used to describe both the feature and the value. Third, some of the fragments do not explicitly give any indication of the feature's value. On pages 5, 6, and 7, the information given indicates that there has recently been an increase in the debtor's income. Yet, no amount is stated. This text is relevant, although no value can be ascertained. A similar situation occurs when there is information to infer the value, yet the exact value is not given. For example, consider the feature of loan due date. One court opinion states: "the loan was due six months after graduation". Unless the date of graduation was previously mentioned, an exact date value is not possible. A related example for monthly income, taken from another opinion, is the following sentence: "Debtor's amended budget commits $30 to the plan from a weekly take-home pay of $262.30." This sentence does not contain the actual value for the feature, but it does provide enough information for a user to be able to deduce the value: $262.30 × 52 ÷ 12 = $1,136.63. Nowhere in the opinion is the value $1,136 ever given as the debtor's monthly income. Fourth, it may be necessary to combine, in some fashion, partial information to determine the final value. In the above opinion, both the husband and the wife have income. Each of their respective incomes must be merged together to arrive at the total monthly income for these debtors. Page 4 gives both of the required values, although they are not in the same paragraph. While, in this document, there are two values that must be combined into a single one, in other cases and for other features

there may be more. One example of this is the feature of monthly expenses, where an opinion itemizes each of the expenditures. Fifth, some of the stated values may give background information. In this document, the income of both debtors increased between the time of the initial filing and the current proceedings. Two of the stated dollar amounts are the old income level and three are the current level. Another example of this problem arises when the court quotes text from a prior case. The opinion may cite some important feature and its value for comparative purposes. Automatically differentiating between two time periods or recognizing that the text is describing a prior case and not the current situation is quite difficult. Finally, as stated above, the document might cover more than one topic. There may be large chunks of text that have nothing to do with the topic of interest, let alone with the specific aspects or features of the topic. Filtering out these non-topical areas of text would be ideal. To summarize, there are many challenges when trying to locate within a long document those segments of text that speak to a particular feature. We have highlighted six of these issues:

1. the descriptive text can be found throughout the document,
2. differing terminology can be used to describe the feature and its value,
3. there may be no explicit statement of the value,
4. it may be necessary to combine multiple values into a single one,
5. some of the text may be historical in nature, and
6. large segments of text may be about other topics.

We examined the case representations with their features and values in two domains (the home office deduction and personal bankruptcy domains mentioned earlier), and found that many of the values differed in their representation between the text and the case-frame. We found that current information extraction technologies

could not directly extract the relevant information from within these texts for the reasons given above. We were also discouraged from using IE techniques by the fact that we had only a limited number of available documents in each domain: approximately 100 in one, 350 in the other. (We were limited by the number of opinions that have been rendered each year in the particular areas we were researching. Further, we could only draw court opinions from as far back in time as when the current statute was enacted.) There was the additional limitation that the case knowledge bases for our domains were small (one consisted of 55 texts, the other only 25) for use in training. It was infeasible to create training data of a larger magnitude because of the expense associated with generating each new case-frame. These issues precluded our ability to use prevalent information extraction methods. These half-dozen problems highlight the magnitude of the obstacles facing the knowledge engineer who desires to use a complex case representation. If the user must read through the entire document, and must do so multiple times because of the numbers of features or slots that are in the document's representation, then any large-scale or even moderate reduction of the length of the text that must be read can provide a tremendous savings.
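As a small worked illustration of the unit-conversion problem behind the "$262.30 weekly take-home pay" example above, the following sketch converts a stated pay amount to a monthly figure. The helper name and the period table are assumptions made only for this illustration; nothing like this is claimed to be part of SPIRE, which locates the relevant text but leaves such inference to the reader.

    # Sketch of the conversion a reader performs mentally for the
    # "weekly take-home pay of $262.30" example: weekly pay x 52 weeks / 12 months.
    PERIODS_PER_YEAR = {"weekly": 52, "biweekly": 26, "monthly": 12, "annually": 1}

    def to_monthly(amount: float, period: str) -> float:
        """Convert a stated pay amount to an approximate monthly figure."""
        return round(amount * PERIODS_PER_YEAR[period] / 12, 2)

    print(to_monthly(262.30, "weekly"))   # 1136.63, a value never stated in the opinion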

1.4 Sample search problem

The following hypothetical example illustrates the difficulties when using string matching to search through an on-line text. We continue to use the good faith issue from the personal bankruptcy domain and the monthly income feature. The user is searching for the value of the feature:

- First, the user attempts a string search trying to match "monthly income". When this fails, the user proceeds to search for instances of "salary", "earnings", or just "income" alone.
- If this is unfruitful, the next attempts are to look for "per month", "per week", or "per year".
- Other synonymous terms for "monthly income" might come to mind, leading the user to try expressions such as "minimum wage", "stipend", "grant", or even "unemployment" or "unemployed".
- Finally, the user might just give up and either assume that there is no description of the debtor's monthly income, or resort to reading the entire court opinion.

The scenario illustrates several problems with searching through on-line text:

1. String searches may not yield a result due to the requirement for an exact match. The user would have to understand stemming and when it would be appropriate to use it. For example, both "earnings" and "earned" might be appropriate for locating a value for the monthly income, in which case stemming to "earn" would work for both words.
2. All the synonyms for a particular expression may not come to mind. An additional phrase describing monthly income is "take-home pay". In one text we found it as "take home pay", further complicating attempts at string matching.
3. Typographical errors will complicate matching. Misspellings in either the input or within the document will cause matches to not take place.

Clearly, string searches are an inefficient means of locating important text about features. If we are to use automated methods to assist in the building of our case-bases, then we must employ more efficient techniques.
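The sketch below illustrates the first of these problems: an exact string match fails where term-based matching with stemming succeeds. The suffix-stripping stemmer shown is a deliberately crude stand-in written only for this example, not the stemmer used by any particular IR system.

    # Contrast between exact string search and term-based matching with crude stemming.
    def crude_stem(word: str) -> str:
        for suffix in ("ings", "ing", "ed", "ly", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def term_match(query: str, text: str) -> bool:
        q = {crude_stem(w) for w in query.lower().split()}
        t = {crude_stem(w) for w in text.lower().replace(",", " ").replace(".", " ").split()}
        return bool(q & t)

    sentence = "The debtor earned $525.84 per month."
    print("earnings" in sentence)            # False: exact string match fails
    print(term_match("earnings", sentence))  # True: "earnings" and "earned" both stem to "earn"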

1.5 Current information retrieval methods

Information retrieval search engines are able to ameliorate the string search problem by using word-based rather than character-based approaches. They also have available techniques such as stemming to arrive at common base forms and automatic query expansion to increase the number of terms available for matching.

We can use the surface-level features of texts to retrieve documents from a corpus of full-length texts. Each document is indexed based on the terms it contains. (The distinction between a "term" and a "word" is given later, in Section 2.3.) The IR system also gathers statistics on the terms, either as found in individual documents or across the collection, for use in weighting the importance of the terms in their ability to discriminate among the documents. Combining statistics based on both the corpus as a whole and on the individual texts, the system assesses the relevance of each document to the user's query. From these calculations the system decides which documents to retrieve. IR obtains for the user a set of documents statistically related to their query. The query represents an information need and the retrieved documents are believed to satisfy it. However, what is retrieved is a set of documents. What if the user is in search of specific facets of the text's content, or if the document is lengthy and the user is only interested in a small portion(s) of the document? Showing the user where within the document the pertinent text resides is crucial. If the user's information need is more specialized than a global examination of the document, or if there are limitations on the available resources, such as time or money, then a more focused presentation of the document is required. Current technology allows us to display for the user those locations within the document where there is a match between the user's query and the various terms or phrases in the document. It can also highlight the passage(s) that contain the best match for the query. Unfortunately, current IR systems do not signify which text segments relate to which feature. The typical IR system is not designed to allow the user to conduct interactive searches over single documents; there is usually a tight loop between posing a query over an entire document collection and examining the results of that query. Furthermore, there has only been a limited amount of work on retrieving items smaller

than an entire document [46, 16]. When these items (sentences, passages, etc.) have been evaluated, they have been used primarily to augment the score of the document, rather than for independent retrieval [6, 71, 64]. In some cases documents have been broken into smaller elements for evaluation against a query, but the total document is given in the retrieval [34, 15]. (Section 2.4 elaborates on all of these works.) This thesis addresses the situation where the user is interested in multiple facets of a document; that is, the user seeks documents that are generally similar (i.e., they are about the same broad subject), and additionally seeks subportions where specific features are discussed. We have built a system called SPIRE (Selection of Passages for Information REduction) that does a two-stage retrieval:

1. The first stage is retrieval of documents that are relevant to a presented problem situation.
2. The second stage is retrieval of passages that contain information relevant to specific features or aspects of the topic.

Such a two-stage retrieval is useful in a wide variety of domains for such tasks as evaluating related research, arguing a legal case, etc.
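A schematic of the two-stage idea, reduced to its simplest form, is sketched below. Scoring here is plain term overlap, the passage windows are fixed-size overlapping word windows, and the two tiny example documents are invented; SPIRE itself builds its queries differently and relies on the INQUERY engine for both stages, so this shows only the control flow, not the system's actual retrieval model.

    # Schematic of two-stage retrieval: rank whole documents first,
    # then rank fixed-size passages within the selected documents.
    def tokenize(text: str):
        return [w.strip(".,;:()").lower() for w in text.split()]

    def score(query_terms, terms):
        return len(set(query_terms) & set(terms))

    def passages(doc: str, size: int = 20, step: int = 10):
        words = tokenize(doc)
        for start in range(0, max(len(words) - size, 0) + 1, step):
            yield start, words[start:start + size]

    def two_stage(query: str, documents: dict, top_docs: int = 2, top_passages: int = 3):
        q = tokenize(query)
        # Stage 1: retrieve the most relevant documents.
        ranked_docs = sorted(documents, key=lambda d: score(q, tokenize(documents[d])), reverse=True)
        results = []
        # Stage 2: retrieve the best passages within those documents.
        for doc_id in ranked_docs[:top_docs]:
            for start, window in passages(documents[doc_id]):
                results.append((score(q, window), doc_id, start))
        return sorted(results, reverse=True)[:top_passages]

    opinions = {
        "Sellers": "debtor take-home pay listed at 701.80 per month income of wife included",
        "Estus":   "plan duration and sincerity of the debtor were at issue",
    }
    print(two_stage("monthly income take-home pay", opinions))

The overlapping windows echo the style of passage segmentation discussed later (see Figure 2.5, a sample document broken into overlapping 20-word passages).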

1.6 Research goals/Problem being addressed

The goal of this research is to locate within a document those areas of text containing information relevant to filling in a frame- or feature-based representation of the document. These areas can then be given to a human or automated system for extraction of information, thereby saving a user or IE system from reading or processing an entire document. Without these techniques, a user must read through the entire document in order to fill in the template. This research can help avoid a huge expenditure of time, particularly when there are many texts. Similarly, this research could save an automated

information extraction system from processing an entire text by focusing the IE system on those text portions most likely to contain information about the requested features. This thesis explores the formation of queries to retrieve passages containing information about the features found in a CBR case-frame. One of the crucial issues we must address is how best to form the queries that will be used to retrieve the passages. Having a user attempt to form a query for each feature of interest is not ideal. Too often users of text retrieval systems pose short and ambiguous queries and are ultimately dissatisfied with the retrieval. The typical user of a retrieval system will form a query containing an average of only two to three words [12, 69, 29]; legal research queries tend to be longer, averaging around 9 words [75]. Short queries are frequently ambiguous; the words they contain can have multiple senses and meanings. Because of this ambiguity, the items retrieved may not be what the user wants. Therefore, it is desirable to form queries longer than just a few terms in order to reduce possible retrieval ambiguity, to show relative importance among query terms, and thus to increase the probability of locating the relevant passages. We offer a low-cost means of gathering many terms that are expressive of an information need. Our hypothesis was that we would be able to achieve good passage retrieval results with automatic passage query formation using a "case-base" of actual text pieces. For the documents already represented in our case-base, we know that a link exists (although possibly weak) between certain text pieces and the value in the case-frame because a human has made this link. For every feature that has a value, we know that a human was able to transform the information contained in the text into the frame-based representation. Therefore, we know that there must have been some level of association between text segments and a feature.

Using this knowledge, our approach was to form passage queries that find text descriptive of a feature by using a small, selective set of textual "excerpts". Since we already had a set of texts and their frame-based representation available for use by a case-based reasoner, we added a set of excerpts for each feature to the case-based reasoner's CKB. We derived our excerpts from a limited number of the documents whose facts were already present in the CKB as cases. We thus created an excerpt case knowledge base (excerpt-ckb) for each of ten features. The excerpts were stored along with the feature they discuss. SPIRE then used these excerpt-ckbs to automatically generate queries. These were posed against twenty novel texts to retrieve the most highly matching passages in a text. The main hypotheses of this thesis were that these excerpts would provide enough local context to retrieve related areas of text in new documents, and that the passages retrieved would be about the same feature as were the excerpts that derived the query. This research explored some of the various means of creating queries from these excerpt-ckbs: we tried combining the excerpt terms in various ways, including the use of phrases found in them; we tested different weightings of the excerpt terms; and we tried reducing the total number of terms in the query. Finally, we compared all of these means of generating queries from the excerpt-ckbs against a set of expertly crafted manual queries. SPIRE generated the various queries and returned the top ranked passages for further examination. Using relevance judgements we were able to compare retrieval performance. In general, the excerpt-based queries with the least amount of structure did the best. Adding phrase operators helped retrieval for some features, while hurting others. Being able to select phrases that included stop words and words other than noun groups might have done better for some of the features. We also found that reducing the influence of words that were in the excerpts multiple times was detrimental. Being

able to more heavily weight terms that appeared in several locations aided retrieval of relevant passages. Because of the small window size of the passages and the apparently large number of terms in some of the excerpt-ckbs, we tried to selectively reduce the number of terms in the excerpt-ckbs. None of the methods we tried produced queries that were better than the original set of excerpt terms. We learned, while examining the results from the many excerpt-based queries, that one must be a bit careful when constructing the excerpt-ckbs: inclusion of extraneous terms or proper nouns can negatively affect results. While a great deal of care and expertise was not required to build the excerpt-ckbs for these features, some caution is warranted. When we compared the best of the excerpt-based queries to a set of expert manual queries, we found that, in general, the excerpt-based queries performed equally well. There were some features where the expert queries were much better (e.g., when a small set of domain-specific keywords were highly descriptive of the relevant passages), and instances where the excerpt-based queries were superior (e.g., the excerpts provided better coverage of the many ways to describe the relevant passages than the terms included in the manual query). For most of the features, however, the results were fairly close. This means that SPIRE was able to achieve an expert level of performance. Being able to use a small case-base of excerpts gathered directly from a set of relevant texts can reduce the volume of text that has to be either read or processed in order to fill in a case-frame. By using the excerpts, the CKB builder no longer has to understand the complexities associated with forming effective queries for an information retrieval system. Additionally, the excerpts provide a rapid means of enumerating the various ways of expressing complex concepts. Use of these excerpts did, in fact, provide enough local context to retrieve the relevant passages for the

vast majority of the features, and they did so without the use of an expert. Thus our experiments confirmed the hypothesis that we could save a user or IE system from reading or processing an entire text to locate the important information about a feature.

1.7 Guide to the dissertation

The next chapter explains in more detail the technologies we employed and gives a review of related work. Chapter 3 describes the system developed, SPIRE, and illustrates its operation with an extended example. Chapter 4 reviews the experimental methodology employed and Chapter 5 describes the experiments conducted. Chapter 6 gives results and discussion, and Chapter 7 provides conclusions and looks at future extensions to this work.


CHAPTER 2
BACKGROUND

This research touches on the work done in many different methodologies. In this chapter we describe case-based reasoning, information extraction, and information retrieval. We finish with a discussion of related work done in passage retrieval.

2.1 Case-based reasoning

CBR systems focus on the automatic evaluation, selection, and use of prior cases to solve new problems. There are two basic types of CBR systems: those that solve problems and those that are precedent-based. Problem-solving CBR systems retrieve solutions to previous problems or subproblems and merge or adapt these solutions to yield a new, preferably optimal, solution. Precedent-based or interpretive CBR systems explain or justify the possible answers based on comparisons with prior outcomes.

As previously mentioned, most CBR systems require the manual input of cases (although there are systems with automatically generated case-bases [17, 76]). The case and indexing structures are decided upon by a human. Each case's input form is extracted from text or created from events, and data is placed into the appropriate representation. The actual indices for a case may be automatically generated based on the input data. Known cases are stored in a case knowledge base (CKB).

To reason about a new case or problem, a current fact situation (cfs) must be input to the system. It is generally in the same format as the cases in the CKB. Once input, the cfs is compared to the CKB using any one of a variety of similarity metrics.

(See Figure 2.1.) The comparison yields a set of cases deemed to have some level of similarity to the cfs. In some CBR systems it is sufficient to find a "close enough" case and not search the entire case base for the best matches [32]. Many CBR systems use some form of nearest-neighbor metric. Other methods include the claim lattice, as implemented in HYPO [54, 2]; structure mapping, as implemented in GREBE [4]; and Thematic Organization Packets (TOPs), as implemented in CHEF [21].


Figure 2.1. Overview of the CBR process. Optional elements are dashed and in grey.

In SPIRE, we use a HYPO-style CBR module that employs a claim lattice to determine similarity. A claim lattice is a partial ordering of the cases similar to the cfs. The CKB is sorted based on the intersection of each case's "dimensions" (or features) with those applicable to the cfs; cases with no shared dimensions are not considered since they are not deemed relevant. Dimensions address important aspects of cases and are used both to index and compare cases. In this sorting, Case A is considered more on-point than Case B if the set of applicable dimensions it shares with the cfs properly contains those shared by B and

the cfs. Maximal cases in this ordering are called "most on-point" cases (mopc's). The result of sorting the cases is shown in the claim lattice. It contains only the subset of cases from the CKB that are considered relevant to the problem case at hand. Those cases on the top level of the lattice are the mopc's. The cfs is in the root node. With Easley as the cfs, Figure 2.2 shows the top few layers of the claim lattice from a home office deduction legal domain. Sheets, Rasmussen, Dos Passos, and Ali are the mopc's.
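The proper-containment test that defines on-pointness can be made concrete with a small sketch. The following Python fragment is illustrative only (it is not SPIRE's CBR module), and the case names and dimensions in it are invented for the example:

```python
# A minimal sketch of the dimension-based ordering described above.
# Case names and dimensions are hypothetical, not taken from SPIRE's CKB.

def most_on_point(ckb, cfs_dims):
    """Return the maximal cases of the partial ordering (the mopc's).

    ckb maps case names to sets of dimensions; cfs_dims is the set of
    dimensions applicable to the current fact situation.
    """
    # Keep only cases that share at least one dimension with the cfs.
    relevant = {name: dims & cfs_dims
                for name, dims in ckb.items() if dims & cfs_dims}
    # A case is maximal if no other case's shared set properly contains its own.
    return [name for name, shared in relevant.items()
            if not any(shared < other
                       for other_name, other in relevant.items()
                       if other_name != name)]

if __name__ == "__main__":
    cfs = {"duration", "sincerity", "accuracy"}
    ckb = {
        "Case-A": {"duration", "sincerity"},
        "Case-B": {"duration"},
        "Case-C": {"sincerity", "accuracy", "frequency"},
    }
    # Case-B's shared dimensions are properly contained in Case-A's,
    # so only Case-A and Case-C are most on-point.
    print(most_on_point(ckb, cfs))
```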


Figure 2.2. Top layers of the claim lattice for the Easley case.

Evaluation of the set of retrieved cases may or may not involve an evaluation function using weighting to provide discrimination within the set as to their utility in satisfying the current need. For many CBR applications, particularly in planning or designing, it is desirable to work with a single case or at least a small number of cases for adaptation (or possibly a case for each problem component). Therefore, problem-solving CBR systems generally have some means of weighting the relevant cases. (Kolodner's PARADYME [31] uses preference heuristics, Hammond's CHEF [21] uses a discrimination net that hierarchically orders features in terms of their relative importance, and Stanfill and Waltz applied a variety of distance metrics to arrive at the "best case" in MBRtalk [70].)

Interpretive CBR systems do not necessarily assign numerical values to relevant cases, but find other means for generating at least a partial ordering among the retrieved cases. (HYPO and Rissland and Skalak's CABARET [58] generate a claim lattice, and Goodman's Battle Planner [19] uses "case prototypes", or conjuncts of indices, to select the "most on-point" cases for analysis.) Once a solution or evaluation is generated, the CBR system may try to validate the solution. Additionally, the CBR system has the option of storing information about the problem-solving or interpretive episode, such as knowledge about failures, adaptations, and the final outcome.

To summarize, the strengths of CBR include the following:

- CBR is a developing methodology that accords a great deal of flexibility to the design of individual systems. (Available parameters include deciding on case structure, the indices and/or their structure, similarity metrics, adaptation or interpretation techniques, retrieval model, and the storage mechanisms to use.)
- Methodological flexibility allows CBR systems to take advantage of domain knowledge and do a single job well. (Some of the specific tasks that CBR systems have been designed to accomplish include plan generation and development of interpretations.)
- Cases may have multiple levels of, or complex structures for, indexing. (Having multifarious indices enables a system to view a case from differing levels of abstraction and thereby allows for a variety of means of reasoning.)
- Because of the indexing methods used, case retrieval is done with a high level of relevancy.
- Problem-solving CBR systems can adapt a solution from a previous experience to solve a new problem, thereby saving the time of re-solving all or a portion of a current problem.
- Interpretive CBR systems can analyze old experiences to predict an outcome for a current situation or relate/distinguish the current situation to/from previous events.
- CBR systems can learn by storing solutions to, or analyses of, new situations.
- CBR systems can learn from failures by storing these events and the indices that predict them, thereby avoiding repeating failures.

Even though CBR is a strong paradigm for reasoning at multiple levels of abstraction, there are certain weaknesses found in many systems:

- Devising case representations, indexing structures, and extracting cases for use in some domains is too manually intensive.
- Many of the currently used storage and retrieval mechanisms do not easily scale up for larger case-bases.

Overcoming or reducing the first of these weaknesses, so that we may take advantage of the strengths of CBR in reasoning from multiple perspectives and levels of abstraction, is one of the goals of this research. Hence the justification for integrating case-based reasoning with information retrieval, which can rapidly index its "cases".

2.2 Information extraction

The objective of an information extraction system is to locate and extract those pieces of information considered to be significant from within a text. The information is typically placed in a frame-based representation. The slots summarize information about an event and include such items as the actor, date, location, type of event, important objects, outcomes, etc. Using the extracted data one could easily create a summary of the significant events or data items found in the text. Domains used for the Message Understanding Conferences (MUC) over the years include naval operations

reports and terrorist newswires [42, 43]. TIPSTER evaluations, sponsored by DARPA, used business joint venture and micro-electronics stories [48].

Sometimes, information extraction systems are paired with a filtering system. The filtering system tries to separate the relevant from the non-relevant texts so that the extraction system only has to expend effort on apposite texts. SPIRE has a similar stage in that it uses a relevance-feedback-generated query to act as a filter on the document collection to locate relevant texts. Steier, Huffman, and Hamscher built a system that filtered a text stream based on keywords, then used the ODIE information extraction system to do extraction on the remaining texts [72]. Only those texts that met predefined characteristics were considered suitable for examination by the extraction system. Additionally, they further reduced the amount of semantic processing by only examining those sentences in which the keywords appeared [27].

Some extraction systems look for important phrases by applying heuristics. These could be: look for proper nouns as the names of individuals, companies, or countries; look for a word in all capitals surrounded by parentheses (which would denote an acronym); look for the use of italics or underlining to signify important concepts; and find multiply repeated compound noun phrases (with the belief that some of these will be domain specific) [33].

AutoSlog-TS [53] shows great promise in overcoming one of the limitations found with most natural language extraction systems: that of needing large quantities of training data to learn syntactic patterns and to construct dictionaries. Primarily this training data has consisted of domain-specific, manually annotated corpora. These corpora represent a huge investment each time a new domain is to be explored. AutoSlog [51], used for dictionary construction in several information extraction domains, also suffered from this limitation. The new approach of AutoSlog-TS relies


on linguistic rules and statistics collected over the new corpus to generate a dictionary without the need for an annotated corpus. AutoSlog-TS has been tested on two NLP tasks, extraction and text classification. It performed comparably to the original AutoSlog for the classification task. For extraction, AutoSlog-TS was able to produce concepts that would not otherwise have been generated because it examined statistics and patterns throughout the collection, not just the portions of annotated text. There still remains a need for a human to review the generated concepts, but the need for an annotated corpus is potentially reduced or eliminated.

Even though the need for a large training corpus may be removed, IE systems still rely on linguistic patterns or rules associated with keywords to locate slot fills and extractions that come directly out of the text. We have tried to use the natural language processing extraction system at the University of Massachusetts, Amherst on one of our domains. This system did quite well on the MUC domains; unfortunately, it was difficult to use on our domain since it had little training data and the linguistic patterns were sufficiently varied that they did not readily apply. Limited automatic extraction was obtained, and with lower than acceptable precision and recall rates. (See the next section for definitions of precision and recall.)

An additional problem with IE systems is that there appears to be an implicit or even explicit assumption [27] that the information being extracted is about an "event". These systems further assume that the constituent elements being found fill "roles" within an event. This assumption does not hold for the domains of this research; the sentences containing the desired information generally use non-descriptive passive or infinitive verb forms.

Unfortunately, due to 1) the magnitude and availability of the training data generally required, and 2) the nature of the information captured, we have not been able to transfer information extraction technologies to our domains.

2.3 Information retrieval

Finding objects (e.g., abstracts, documents, pictures, video clips, etc.) to satisfy an information need is a long-standing problem for researchers in information retrieval. The goals of an information retrieval system are to automatically index the documents (or other data items), with little or no manual input, and then retrieve those items that best match an information need, stated in the form of a query, as effectively and efficiently as possible. We next cover general IR practices. We then give specifics about the INQUERY retrieval engine as it was used in this work.

2.3.1 General IR methods

Two of the key issues for IR are, first, deciding what database(s) to search and, second, what query to pose. If the information is distributed among various locations or the need has never arisen before, learning where to expend effort will take some time and effort. After deciding on what collection(s) to expend effort, there is still the issue of translating an information need into a query such that the retrieval system will be able to act upon it.

Consider the need to find current articles on a particular topic. The query could take the form of a manual effort, such as picking up a newspaper and searching through it to find a specific section, or getting a copy of the latest conference proceedings in a research area (or a new journal issue) and searching through the index. It could mean going to the library and searching a card catalog. It may be a more automated search using a retrieval engine and interface provided to a commercial database, such as WestLaw® or Lexis-Nexis. The search may be a scan through corporate records or a collection of photographs. It might be a broad-based search such as that done when using one of the search engines on the World Wide Web. In this last search, the user has no control over where in the Web the search will

be conducted. In all cases, and especially the last, the reliability of the information that is retrieved must be considered. Figure 2.3 illustrates the use of an IR system, where the IR system could be any of the techniques previously described.


Figure 2.3. Overview of an "Information Retrieval" event.

We restrict further discussion to IR systems that work with textual documents. The most common models for representing documents and queries include the vector space model [63] as implemented in SMART [5], the probabilistic model as implemented in Okapi [60], and inference networks as implemented in INQUERY [8]. (While this research uses the inference net model as implemented in INQUERY, there is no apparent reason why this work could not be done within another framework.)

Using any of the above models, forming a query that will be effective, that is, one that describes the information need in such a manner that those and only those documents that meet the user's requirement are retrieved, is a very difficult task. Many systems still use an intermediary, trained in the use of the system, to form good queries [3]. Most users of today's web-based search engines form very short queries, on the order of two words [12, 69, 29], while legal researchers will form longer queries, closer to 9 words [75].

When the user's query is short it is likely that the query will be ambiguous, since many words have multiple senses and meanings. It is also possible that the query terms may appear frequently in the collection. In this case, using only a few terms

hampers the ability to discriminate among the entire set of items in the collection. The user will be overwhelmed with retrieval results, of which only a few are truly relevant. The converse situation may also arise: if the query terms appear in only a few items, yet the collection contains many useful related terms, then only the few documents that match on the few terms will be retrieved and the user will miss out on much relevant information.

Longer queries provide several advantages over short ones. First, expressing the information need with a larger number of terms increases the chances of reducing ambiguity. The additional terms help to refine or express the concepts that define the information need. Using all the available terms increases the chance that the user's information need will be satisfied. Second, when a longer query is given there is a greater chance that some of the important terms will appear in the query more than once. When the query is evaluated, the more important terms will play a larger role in the retrieval, and there is a better chance the retrieval will result in an appropriate ordering. Third, increasing the number of terms in the query increases the probability that there will be any sort of match with the collection. For these reasons, there is extensive research into how to automatically expand queries [1, 68, 77]. By using a set of excerpts as the basis of our queries, we have reduced the burden placed on the user to determine good query terms.

We next examine how documents are stored for matching against these queries. For all IR systems, the internal representation of a document may contain information about the document's words, their locations, their frequencies, whether they have been identified as being part of a concept, and information about adjacent words. Most IR systems will "stop" and "stem" the collection (and likewise the query) to locate the content terms and use these terms as the indices into individual documents. "Stop" words are high-frequency words that do not represent content and add little value for discrimination between documents (e.g., and, but, the, a). The system is

usually given a set of predefined stop words. It then removes these stop words from consideration as document indices. The next step is to "stem" the remaining words, that is, remove suffixes (and possibly prefixes) to get at the root form of each word (e.g., "works" and "worked" both become "work"). What remains of a document's words constitutes the "terms" that are used as the (inverted) indices for it.

Besides indexing the terms found in the document, we may want to store information about the location of each term. Storing location data facilitates searches that are based on pairs, triples, or even larger sets of terms that are within a stated proximity to each other. By keeping this information, when we submit a query, we may designate certain terms as "phrases": sets of words (typically pairs or triples) to be found within close proximity to each other. Phrase terms are typically requested to be within a separation of at most 2 words. When posing queries the user may be able to request terms found in proximities that replicate sentence, paragraph, or larger-sized document elements.

Query formulation in all IR models consists of taking an information need and turning it into a form that can be matched against that of each document. The comparison yields a set of documents deemed to have some high level of "similarity" to the stated query. Similarity in many IR models is calculated using the measures of "term frequency" (tf) and "inverse document frequency" (idf) for each of the terms in the query. TF measures the number of times the term appears in the document. The more frequent the appearance of the term in the document, the more likely it is that the document is "about" the term. Thus, the credit given to a document for containing a query term will vary for each document. IDF measures the frequency with which the term appears in documents across the collection. Terms that appear in nearly every document are given less weight than terms that appear in only a few documents. IDF is frequently normalized across all terms in the collection to yield a value between 0 and 1.
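To make the indexing and weighting steps just described concrete, here is a minimal Python sketch of stopping, crude stemming, and tf·idf-style scoring. The stop list, suffix rules, and idf formula are simplified stand-ins for illustration, not the behavior of INQUERY or any particular IR engine:

```python
import math
import re

# Simplified stand-ins for a real system's stop list and stemmer.
STOP_WORDS = {"and", "but", "the", "a", "of", "to", "in", "for"}

def terms(text):
    """Lowercase, drop stop words, and crudely 'stem' by stripping a few suffixes."""
    out = []
    for w in re.findall(r"[a-z]+", text.lower()):
        if w in STOP_WORDS:
            continue
        for suffix in ("ing", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        out.append(w)
    return out

def score(query, doc, collection):
    """Sum tf * idf over the query terms; idf discounts terms common to many documents."""
    n_docs = len(collection)
    doc_terms = terms(doc)
    total = 0.0
    for t in set(terms(query)):
        tf = doc_terms.count(t)
        df = sum(1 for d in collection if t in terms(d))
        idf = math.log((n_docs + 1) / (df + 1))
        total += tf * idf
    return total

docs = ["The debtor proposed a three year plan.",
        "Payments of 25 dollars weekly for 36 months.",
        "The court denied confirmation of the plan."]
print(score("proposed plan payments", docs[0], docs))
```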

Another possible modification to the tf·idf score for term similarity is to normalize for document length. Longer documents have a greater probability of containing more instances of any given term than do shorter documents. Because of this, adjustments may be made to the score based on the average document length in the collection. The retrieved document set is usually ranked, providing discrimination among the set as to their utility in satisfying the query.

Retrieval quality is typically measured via two metrics: "precision" and "recall". Precision compares the number of relevant items retrieved against the total number of retrieved items. It measures accuracy. Recall compares the number of relevant items retrieved against the total number of relevant items that should have been retrieved. It measures coverage. (Problematically, if we employ a thesaurus or other query expansion technique, it becomes more difficult to explain to the user how each document is relevant to the query and to the stated information need.)

Some IR systems contain a mechanism for the user to judge the results of a query and to request a new query that incorporates information based on these judgments. Using the judgements, the system can automatically generate a modified query in a process known as "relevance feedback" [65].

Because there has been much research and advancement in the IR process over the past 30 years, current IR systems are robust and exhibit the following strengths:

- IR is a well-worked-out methodology that accords flexibility to the design of individual systems. (Parameters that may be set include the decision to use a morphological rule set, stop word list, index set, query model, and the storage mechanisms to use.)
- Language characteristics allow IR systems to entirely automate the traditional index structure creation process, thereby permitting application across domains. (This is true for English, Spanish, French, Japanese, Chinese, Korean, and Finnish.)
- IR systems can access and retrieve data from very large collections or document bases. (Thus far collections have ranged upward to several million documents.)
- Accessing and retrieving from these large collections can be done very quickly. (Most large-scale systems are capable of producing results for users within seconds.)
- IR systems can automatically add new data items to their collections (modulo data entry).
- IR systems can utilize learning in the form of relevance feedback to iteratively improve retrieval during a query session.

Even though there has been a great deal of research, information retrieval does have some weaknesses and there are still many challenging areas within the field:

- Only limited reasoning about the retrieved documents can be done. Similarity assessments may be made, but current indexing techniques do not enable their indices to be used for problem-solving or analysis of documents.
- Formulating good queries that accurately reflect the information need can be very difficult. However, the addition of new query operators beyond the Boolean (such as max, synonym, and proximity operators) and better user interfaces have helped to ameliorate this problem.
- Thus far IR systems have only been able to achieve high recall at the cost of lowered precision and vice versa. However, reasonable results are achievable at intermediate values.
- IR systems do not learn from their failures over time. For example, suppose a user poses a query and then provides relevance feedback. At a later time, that same initial query will produce the same initial results. (Although this is an active area of research.)
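The precision and recall measures defined earlier in this section, and the trade-off noted in the list above, can be illustrated with a small computation; the document identifiers here are invented for the example:

```python
# A toy illustration of precision (accuracy) and recall (coverage).

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["doc1", "doc2", "doc3", "doc4"]   # what the system returned
relevant = ["doc2", "doc4", "doc7"]            # what it should have returned
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...)
```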


2.3.2 INQUERY-specific background

For its retrieval, INQUERY uses an inference network model [73, 74], specifically a Bayesian inference net, to represent texts and queries. It uses a directed acyclic graph with an information need at the root, document nodes at the leaves, and a layer of query nodes, query concept nodes, content representation nodes, and text representation nodes in between. (See Figure 2.4; copied from [73].) Nodes that represent complex query operators can be included between the query and query concept nodes. The INQUERY model allows for the combination of multiple sources of evidence (beliefs) to retrieve relevant documents. (For more detail on the various node types, see [8].)


Figure 2.4. Sample inference network for document retrieval.

The inference network computes the probability that the information need is satisfied given a particular document (see Equation 2.1). The arcs between the various node layers represent conditional probabilities. Based on these conditional probabilities, the set of documents is rank-ordered and returned to the user.


P(\text{Information need satisfied} \mid Doc_i) \qquad (2.1)

For the experiments conducted in this work, we wanted to retrieve "passages": contiguous segments of text, usually of a fixed size, although they may be more arbitrary in length. Passages can cross thematic and semantic boundaries. The passages or windows used in this work were of fixed size and overlapped by half the size of the window. For example, in Figure 2.5, there are five overlapping passages. The first three all contain 20 words, while the fourth contains 17 and the fifth only has 7. Under this scheme, every word in the document is in two passages, except for the first ten, and the last two passages may be less than 20 words long.
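A minimal sketch of this overlapping-window scheme (not INQUERY's internal implementation) reproduces the five passages of the example for a document with word offsets 0 through 46:

```python
# Split a document into fixed-size windows that overlap by half the window size.

def passages(text, window=20):
    words = text.split()
    step = window // 2
    return [(start, words[start:start + window])
            for start in range(0, len(words), step)]

doc = " ".join(f"w{i}" for i in range(47))   # a 47-word toy document (offsets 0-46)
for start, chunk in passages(doc):
    print(start, len(chunk))   # 0 20, 10 20, 20 20, 30 17, 40 7
```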


Figure 2.5. Sample document broken into overlapping 20-word passages.

We used the INQUERY #Passage operator for all of the experiments conducted in this work, specifically #Passage20. (The #Passage operator allows the user to specify the size of the passage window.) All of the queries act similarly to an INQUERY natural language query. Natural language queries currently convert to a #Wsum operator wrapped around all of the terms. The #Sum operator is the simplest query operator within the inference net model. Using this operator, the score for each document is obtained by adding together the tf·idf scores for all the query terms a document contains. The #Wsum operator is a weighted version of the #Sum operator. It allows designated terms to be given higher weights. The #Wsum operator produces a weighted average of the beliefs of the query terms. The initial value in the query represents

the total that the terms should be normalized to at the end. The other numbers represent the weight to be given to the term that follows. That is,

\mathrm{belief}(\#\mathrm{wsum}(1.0\; w_1 t_1\; w_2 t_2 \ldots w_n t_n)) = \frac{\sum_{i=1}^{n} w_i \,\mathrm{belief}(t_i)}{\sum_{i=1}^{n} w_i} \qquad (2.2)

In its implementation, the #Passage operator performs similarly to a #Wsum in that it converts to a #WPassage operator and adds in weights based on the frequencies of the enclosed terms. As an example, below we give a query and its internal representation:

#Passage20(this little piggy went to market this little piggy went home);
#WPassage20(1.0 2.0 this 2.0 little 2.0 piggy 2.0 went 1.0 to 1.0 market 1.0 home);
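The conversion shown above can be mimicked with a few lines of Python that turn raw term frequencies into weights. This reproduces only the example's output format; it is not INQUERY's parser, and it omits the stopping and stemming a real system would apply:

```python
from collections import Counter

def to_wpassage(query, window=20):
    """Convert a natural language query to a frequency-weighted #WPassage query."""
    words = query.lower().split()
    counts = Counter(words)
    seen, parts = set(), []
    for w in words:                      # keep first-occurrence order
        if w not in seen:
            seen.add(w)
            parts.append(f"{float(counts[w]):.1f} {w}")
    return f"#WPassage{window}(1.0 " + " ".join(parts) + ");"

print(to_wpassage("this little piggy went to market this little piggy went home"))
# #WPassage20(1.0 2.0 this 2.0 little 2.0 piggy 2.0 went 1.0 to 1.0 market 1.0 home);
```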

Some work had to be done to INQUERY so that all of its operators (except for the #Passage operator) could be nested inside a #Passage operator. Previously, only a subset of the INQUERY operators could be nested; all others were simply removed.

2.4 Retrieval of passages

Previous work in information retrieval on retrieving items smaller than an entire textual document is limited. Primarily the work falls into three areas, where the second and third areas are more closely aligned to this work than the first:

1. imagery analysis to locate document elements,
2. examining passages or other small document elements (sentence or paragraph) to aid in document retrieval, and
3. retrieval of document elements.

The first line of research into breaking a document into smaller components primarily focuses on imagery analysis at the pixel level to aid in distinguishing document

elements. Partitioning a document into smaller structure-based elements is reported in [61, 62]. Rus' work uses agents to search for "layout-based abstractions" such as tables, figures, and paragraphs, and "content-based abstractions" such as theorems, lemmas, and examples. Once located, smaller objects are further examined to see if they meet the information need. As an example, the technique was applied to segment a newspaper, filter for relevant sections, segment the sections into paragraphs, then detect tables and graphs. The resultant table and graph objects were examined for specific information, such as stock data on a particular company. This refinement of search toward progressively smaller levels of granularity is similar to what we propose, except that we do not rely on locating visually delineated objects for retrieval, nor do we search at the image level. (See [61] for citations of other structure-based approaches that use image analysis techniques to locate document elements.)

In the second category of related work, various types of passages are examined to aid with document retrieval. Generally, the value of the best passage is combined with the similarity score of the document in some fashion. This process may be illustrated by the work of Ro, who compared the results of document retrievals when queries were placed over the full text, paragraphs, abstracts, and a controlled vocabulary field (manually annotated keywords) [59]. Not surprisingly, precision went, from best to worst: controlled vocabulary, abstracts, paragraphs, and full text. Recall was almost the inverse, with the order being: full text, paragraphs, controlled vocabulary, and abstracts. Since one may not have available an abstract or controlled vocabulary list for each document, the best object for improving precision with full-text retrievals may be the paragraph. By combining the score for the best matching paragraph with the scores of the individual terms, it may be possible to achieve a higher level of precision without sacrificing the recall of full-text retrieval alone.

Callan provided a comprehensive study of the relative merits of three types of passages when dealing with different types of collections and documents [6]. This work used the inference net model and, in all cases, the belief value for the best passage within a document was added to the belief for the document as a whole. Comparison was done among discourse, bounded-paragraph, and window passages. Discourse passages were based on user-defined breaks such as sentence and paragraph breaks. Discourse passages have the disadvantage that they may vary significantly in length, thus impacting term frequency statistics. To ameliorate the size problem with discourse passages, bounded paragraphs were tested. In this scenario, small paragraphs were merged and large paragraphs were split. Callan also examined window passages, a third technique for breaking documents into smaller-sized units. Window passages cover a fixed amount of text, say 100 or 200 words. Fixed-size windows have a disadvantage in that they may possibly break related text across windows. This was ameliorated by having passages overlap at intervals of half the window size. Experiments were done over an assortment of collections that range in the size of their documents. The results indicated that, of the various elements, window passages did the best.

Similarly, Stanfill and Waltz discuss a commercial system, CMDRS, that used the scores from fixed-size passages to retrieve documents [71]. CMDRS has been in operation since 1989.

Another approach, by Salton and Buckley [66, 67], incorporated a local and a global evaluation to determine similarity. They started with a global document similarity metric to rank encyclopedia articles. During subsequent queries they used the top-retrieved article as the query. This multi-stage search showed additional improvement when requiring a minimum paragraph-to-paragraph or sentence-to-sentence similarity threshold.

Rather than using sentences or paragraphs, Moffat et al. tried section and "page" divisions [40]. They made page divisions based on the number of bytes in consecutive paragraphs; this approximately normalized the length of each page. Again, though, their task was to retrieve entire documents, and they found no benefit from breaking documents into sections or pages.

The third area of related work includes systems that actually retrieve passages in response to a query. The first is by Salton, Allan, and Buckley. They executed single-stage searches, computed global similarity, and then computed similarity at the sentence level for those documents that exceeded a threshold [64]. This allowed sections and paragraphs, in addition to entire texts, to be retrieved. While this is in the same spirit as our work, in order to compute a sentence-level similarity, they started with queries that are articles themselves. We have much smaller individual elements available as our initial queries; our excerpts are generally a sentence or shorter in length.

Hahn argued for analyzing and indexing the conceptual or thematic structure of a text and implemented his ideas in the TOPIC system [20]. TOPIC was tested over short texts describing computer systems. Hahn asserted that storing a text's conceptual structure makes possible three types of retrieval operations: abstracting or summarization, fact retrieval, and passage retrieval. We are only concerned with his work on passage retrieval, which was made possible by linking the passages to their representative in the concept index. This is a true instance of "passage retrieval", with the limitation that the only way to retrieve a passage is if it was linked into the concept index. Passages were not retrieved in response to general queries, only in response to a concept.

TextTiling is another technique based on a document's thematic structure [24, 25]. With TextTiling, coherent subtopic discussions determine document indices. Subtopic discussion breaks were ascertained by evaluating the frequencies of

the terms found in proximity to one another. To form each tile, small blocks of text (i.e., three to five sentence units) were compared to adjacent blocks to measure similarity. Thematically related blocks were merged. Subtopics or "tiles" could span multiple paragraphs. High-value terms from within each tile formed the subtopic's label. Assembling the labels into a listing effectively represented a table of contents for the document. The query language for this system allowed users to request related documents based on either or both overall document similarity and similarity among tiles. Document retrieval was based on accumulating the scores of the tiles. This type of query allowed for searches where a topic was specifically given as being subordinate to a main topic. This is similar to what we wish to do; however, we do not decide a priori which subtopics are discussed within a text. Since the discussion of our features is frequently on the order of one or two sentences long and may be scattered throughout the text, we are not able to take advantage of the TextTiling methods.

Gauch and Smith investigated the expansion of queries that retrieved passages [16]. Their corpus was a textbook on computer architecture and searchers were to locate sentences that were relevant to a set of five questions/topics. They tested the ability of a domain-specific semantic network used in conjunction with an expert system to refine queries. A third point of comparison was the availability of a thesaurus to assist users. Their experiments focused on the user and ways to assist the user in refining their query. To some extent, the same could be said for SPIRE, except that we do not rely on a semantic hierarchy for the domain. Our initial query is derived from a set of excerpts, rather than user input. We did not attempt any further term expansion or revision, and our units of retrieval were not exact sentences, but twenty-word windows.


Finally, O'Connor experimented with the retrieval of "answer indicative" or "answer reporting" passages [44, 45, 46, 47]. The first experiments were run on a corpus of 82 full-length texts on information retrieval. The second corpus was created by having medical librarians search for documents that answered specific medical questions and subquestions. To be added to the collection, a document had to be an "answer-paper", that is, one that contained either an "answer indicative" or an "answer reporting" passage. Two individuals without biomedical knowledge created sets of search terms for each question. They spent between 2.5 and 3.0 hours, on average, developing the query terms for a single question. There was a 4.0-hour maximum per question. The collections were then manually searched for sets of consecutive sentences containing minimum numbers of matching query words. The results of this simulated retrieval were quite good, with recall being either 72% or 67%, depending on the set of search words.

This work is quite relevant to what we are trying to accomplish. Their simulations have shown that it is quite reasonable to expect that we can retrieve "answer indicative" or "answer reporting" passages with our queries. However, we have the advantage of being able to automatically run our queries and of generating our passage queries as a by-product of another process.

2.5 Summary

Case-based reasoning systems have the ability to do in-depth reasoning, although their case-bases tend to be small and scaling is still an issue. Conversely, information retrieval systems are unable to perform detailed reasoning, yet scale quite well. Attempting to build larger case-bases by taking advantage of information extraction techniques is currently not feasible due to the requirement for large amounts of domain-specific training data. Therefore, to assist in the building of larger case-

bases we turn to information retrieval to locate the passages that discuss the features in our case representation. If we can locate the important passages, we should be able to reduce the total volume of text that must be processed, either manually or automatically, and ideally, the size of the case-bases will be able to grow.


CHAPTER 3
SYSTEM DESCRIPTION

We now provide a broad overview of the SPIRE system, followed by a more detailed explanation of the processing of documents. We then run through an example problem, starting with the input of the problem situation, showing retrieval of relevant new documents, and finishing with an example of the passages retrieved in response to a particular feature query.

3.1 System overview

The SPIRE system is a hybrid case-based reasoning and information retrieval system that (1) from a large text collection, retrieves documents that are relevant to a presented problem case, and (2) highlights within those retrieved documents passages that contain relevant information about specific case features.

We assume that we have available a small set of court opinion documents, their frame-based representation, and a set of excerpts from the documents associating text with features. Using a CBR system in conjunction with an IR engine, we use these cases and their texts to retrieve an additional set of texts believed to be relevant to a current problem situation (one possible technique is described in [55, 14]). The next step is to see how closely the situations in the retrieved documents match our current problem. To be able to do this automatically, we must convert our newly retrieved texts into a representation with which the case-based reasoner can work. Since we are unable to automatically extract the values from the texts, we now examine how we assist in locating those passages that contain information about the features.

In general terms, we gather all the excerpts for a feature from our original case knowledge base. We use these excerpts as the basis of a new query, with which we will retrieve passages from novel, but relevant, texts. The top-ranked passages are presented to and reviewed by a user, who will extract or infer feature values for each novel text. In this way, we have added new cases to our CKB at an expense to the user that is lower than the cost of reading the entire text.


Figure 3.1. Overview of SPIRE.

3.2 Detailed system description

SPIRE operates in two stages. We sometimes call the first stage the outer loop and the second stage the inner loop. Figure 3.1 gives an overview of the entire process.

In the first stage, SPIRE is given a new problem or fact situation. The facts are input into a case-frame representation. We assume that the representation was designed by a domain expert based on their expertise, knowledge of the domain, and


understanding of the task at hand. This representation will be exploited by the case-based reasoner to perform the desired type of reasoning. SPIRE then uses its HYPO-style CBR module [54, 2] to analyze the situation and select a small number of most relevant cases. These cases come from the reasoner's case-base. In a bankruptcy domain the case-base consisted of 55 symbolically represented personal bankruptcy cases. In standard CBR fashion, SPIRE determines the similarity of each known case to the new problem, sorts the relevant known cases according to their degree of on-pointness, and represents the results of this analysis in a standard claim lattice. (Figure 3.4 provides an example claim lattice.) The most relevant cases from this analysis, typically the cases in the top two layers of the claim lattice, are then used to 'prime the pump' of INQUERY's relevance feedback module. This set of "best" cases is called the relevance feedback case-knowledge-base, or RF-CKB.

The original texts of the cases in the RF-CKB (i.e., the opinions) are passed to the INQUERY [8] IR engine. The IR engine then treats these documents as though they had been marked relevant by a user. Using a modified form of relevance feedback (we start with an empty query and generate one, rather than modifying or expanding an existing query), the IR system generates a query by selecting and weighting terms or pairs of terms from within the RF-CKB. This query is then run against the larger corpus of texts, with the result that documents (court opinions) are retrieved and ranked according to INQUERY's belief as to their relevance to the posed query. Figure 3.2 gives an overview of this process. (More details on this process can be found in [55, 14].)

In the second stage, SPIRE locates germane passages within each of the texts retrieved in stage one. There is no real limit on how many of the texts can be examined in stage two; however, in our experiments only the top 10 documents are further processed. If we were able to set a threshold that would distinguish between



texts that are relevant and those that are not, we would process additional documents until that threshold was reached.

Figure 3.2. SPIRE's retrieval process for novel documents.

To now locate the text segments that discuss a particular feature, SPIRE once again uses a hybrid CBR-IR approach, but this time its task is to locate passages (within a document) rather than documents (within a collection). To locate these passages, SPIRE generates passage queries that express the information need associated with the particular feature. To do this, SPIRE uses the information that appears in excerpts from past discussions of a feature. The features are of various sorts, such as Boolean, symbolic, short list, etc. (See Section 4.2 for more information on the feature set.)

For each case feature of interest, SPIRE is endowed with a case-base of specific textual excerpts, called the excerpt-ckb. Each excerpt is an actual piece of text containing relevant information about a case feature. Each comes from an episode of information location/extraction performed on a past case. Excerpts are typically a phrase or a sentence in length. (Examples of excerpts are given in the next section.)

To locate pertinent text about a feature in a new document, SPIRE gathers all the existing excerpts for that feature. Using the excerpt-ckb, SPIRE generates a new query to be run on each document. SPIRE iteratively performs passage retrieval over


the texts retrieved in the first stage. Figure 3.3 gives an overview of the passage retrieval process.

Figure 3.3. SPIRE's passage retrieval subsystem.

There are numerous techniques available for transforming the excerpts into passage retrieval queries. For instance, one can simply amalgamate all the excerpts and submit the result as a "natural language" query. (Section 5.1 describes various techniques for building passage queries.) SPIRE presents the query along with a specified document to the IR engine. The IR engine divides the document into overlapping windows of 20 words each, approximating the length of a sentence. Each word in the opinion will appear in two windows (except for the first 10 words). INQUERY then retrieves the top-ranked passages for presentation to the user (or possibly to an information extraction system).

Thus, the excerpts are used analogously to the RF-CKBs of stage one: their terms are used to generate queries. The difference is that (at this point in our development of SPIRE) there is no selection of excerpts according to some model of relevance, since all are used to generate the query. At some point, when these excerpt

collections become larger, the question of winnowing or selecting excerpts will become an interesting one.

We created these case-bases of excerpts by asking an individual familiar with the representation of the problem domain to highlight example excerpts in the opinions corresponding to a small number of the cases in SPIRE's case-base.[1] Typically this will only be a few case documents, on the order of 10-15. The individual was instructed to select any portion of text that was useful for determining the feature's value. The reader could select as much or as little text as was felt necessary. Excerpts could be a few terms, a few phrases, a full sentence, or even several sentences. It was also permissible to gather pieces from multiple locations throughout the text. The objective was to locate those portions of text that enabled the reader to fill the case-frame with the appropriate value. All of the data (the features, their values, and the associated excerpts from within the original text) was stored in the CKB.

At this point, instead of treating each new document as a single item, the IR system divides the document into smaller elements. The IR system segments a retrieved document into passages and treats each as a separate entity. There are various ways to segment a text: using user-defined boundaries such as sentences, paragraphs, or sections; semantically or thematically; or by indiscriminately dividing the text into windows of a particular (or varying) size. We chose to use windows of twenty words since this approximated the length of a sentence in this domain. (In Section 2.4 we elaborated on the various options available.) Next, SPIRE causes the IR system to generate a query from the excerpt-ckb to locate the relevant passages for a particular feature within an individual document.

[1] This step would normally be done in conjunction with the creation of the representation for the domain and the encoding of the first few cases, thus eliminating the need for a full review of the texts. In this case, the author examined the court opinions and their case-frames and derived the excerpt-ckbs from the opinions.


The passage query is posed against the document's passages and the top ones are presented to the user. For each document and feature in the inner loop, the user (or information extraction system) can examine the presented passages, determine (if possible) the actual value of the feature in the document, and add it to the case representation for the text. The user may also decide to add one or more of the retrieved passages, or selected portions of them, to the appropriate excerpt-ckb along with the feature and value. These new excerpts may be used to form later passage queries. In this way, SPIRE may aid in the acquisition of additional knowledge about the context of each feature. Once the new texts have been converted and added to the CKB, the CBR component may reason about them relative to the current problem or fact situation and the original task.

In summary, given a new problem or topic, SPIRE retrieves documents from a text collection in its outer loop. It then highlights passages relevant to knowledge about specific features from each of these documents in its inner loop.
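The two-stage control flow summarized here can be sketched as follows. The function and parameter names are hypothetical placeholders standing in for SPIRE's CBR module and the INQUERY engine, not actual interfaces:

```python
# A high-level sketch of SPIRE's outer (document) and inner (passage) loops.
# The five callables are assumed stand-ins for the CBR module and the IR engine.

def spire(problem_case, ckb, excerpt_ckbs, collection,
          cbr_select, build_doc_query, search_docs,
          build_passage_query, search_passages,
          top_docs=10, top_passages=5):
    """Orchestrate document retrieval (stage one) and passage retrieval (stage two)."""
    # Stage one: CBR analysis picks the RF-CKB texts, which seed a document query.
    rf_ckb_texts = cbr_select(problem_case, ckb)
    documents = search_docs(build_doc_query(rf_ckb_texts), collection)[:top_docs]

    # Stage two: for each retrieved document and each feature, run a passage query
    # built from that feature's excerpt-ckb; the top passages go to the user.
    results = {}
    for doc_id, text in documents:               # documents assumed as (id, text) pairs
        for feature, excerpts in excerpt_ckbs.items():
            query = build_passage_query(excerpts)
            results[(doc_id, feature)] = search_passages(query, text)[:top_passages]
    return results
```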

3.3 Example problem

To better illustrate the approach we used in SPIRE, we run through the following scenario based on a real personal bankruptcy case under Chapter 13 of United States personal bankruptcy law (11 U.S.C. §1301-1330), the Rasmussen[2] case. Suppose a client, Mr. Rasmussen, approaches a lawyer about his attempt to file a personal bankruptcy plan. The Bankruptcy Court has denied approval of the plan because it failed to meet the "good faith" requirement. However, Mr. Rasmussen believes that he does satisfy the requirement and wants to appeal the court's decision. He tells the lawyer various facts concerning his problem case, including the information that he

[2] In re Rasmussen, 888 F.2d 703 (6th Cir. 1988)


had recently used a different section of the bankruptcy code to discharge some of his other debts. The lawyer inputs these facts to SPIRE.

3.3.1 Document retrieval

Having practiced in this area of law, the lawyer has knowledge of a set of past bankruptcy good faith cases and their outcomes. Assume she has represented these in her own in-house case-base, which is used by the CBR portion of the system. The system begins by performing an analysis of her client's problem case with respect to this in-house case-base. In this instance, the CBR module uses a HYPO-style reasoner that uses a claim lattice to determine similarity. (See Section 2.1 for more details on CBR systems.)

From the CBR analysis, SPIRE next selects a small set of special texts (the "RF-CKB") on which to employ relevance feedback to generate a query. One good choice of texts would be those associated with the top layer or top two layers of the claim lattice. The cases in the top layer are those most highly similar to the problem case, based on the reasoner's particular similarity metric.


Figure 3.4. Top layers of the claim lattice for the Rasmussen case.


With Rasmussen as the problem case and a small corpus containing 43 hand-coded bankruptcy cases as the CKB, Figure 3.4 shows the top portion of the resulting claim lattice. All of the dimensions of the Chura case overlap with the problem case; hence, Chura is the only most on-point case. This is depicted by the single node coming from the root. There are three nodes in the second layer of the lattice, and these nodes encompass the four additional cases that, with Chura, comprise the top two layers.

The court opinions associated with the mopc's, or the top two or top three layers of the claim lattice, are then used as the set of marked, relevant documents on which the IR engine will perform relevance feedback to generate a query. To enable this, the CBR module passes the indices for these documents to the relevance feedback module within the INQUERY system. Relevance feedback is normally given a query to modify with information from the marked documents, but in our case, we are using relevance feedback to generate a new query. Further, in our scenario, we do not pass along any data about non-relevant documents. The relevance feedback module then selects and weights the top terms or pairs of terms from within these CBR-provided texts and forms a query. INQUERY then acts on the query in the usual way to return a set of relevant documents from a larger collection, say the WestLaw® Federal Taxation Case Law collection. The system returns to the user this set of probably relevant documents, some of which she already knows about since they were in her own personal CKB, to use in her research on Mr. Rasmussen's legal problem.

Below is a sample query for this case, using relevance feedback to generate the top 15 terms based on the top three layers of the claim lattice:

#WSUM(1.000000 0.772051 d.c.cir. 1.524504 mislead 1.888424 likelihood 0.957335 marlow 1.330305 sincer 1.523974 liber 1.345248 frequenc 1.426169 accurac 1.744932 minim 1.136436 eighth 1.117896 inaccurac 0.891818 gen 1.441973 colleg 1.247028 inordin 1.248914 preferent)
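A crude stand-in for this query-generation step is sketched below: it selects frequent non-stop terms from the marked-relevant texts and emits a #WSUM-style query. The weights here are raw term frequencies, not INQUERY's relevance feedback weights, and the sample texts are invented:

```python
from collections import Counter
import re

# Hypothetical, simplified term selection; real relevance feedback also stems
# terms and uses collection statistics to weight them.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "that", "is", "was", "for"}

def rf_query(relevant_texts, n_terms=15):
    counts = Counter()
    for text in relevant_texts:
        counts.update(w for w in re.findall(r"[a-z]+", text.lower())
                      if w not in STOP_WORDS and len(w) > 2)
    body = " ".join(f"{float(c):.6f} {t}" for t, c in counts.most_common(n_terms))
    return f"#WSUM(1.000000 {body})"

print(rf_query(["The debtor misled the court about the accuracy of the plan.",
                "The plan was proposed in good faith despite minimal payments."]))
```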

The ten top-rated documents based on the Rasmussen situation are listed in Table 3.1. We note that only Chura and Sellers were already known to SPIRE (i.e.,

represented in its case-base of documents), although none of these opinions have text in the excerpt-ckbs. (Only 13 documents were used to derive the excerpt case-bases. For more information about the derivation of the excerpts, see Section 4.3.) Thus, the other eight of the top ten cases must be "read" in order for their facts to be ascertained in preparation for any use in a legal analysis for Rasmussen.

Table 3.1. The most highly ranked documents for the Rasmussen problem.

Rank  Case Name            Belief Score  Doc-Id
1     In re Sellers        (0.490157)    180
2     In re San Miguel     (0.483656)    289
3     In re Chura          (0.482781)    188
4     In re LeMaire 1990   (0.479262)    860
5     In re LeMaire 1989   (0.479195)    751
6     In re Stewart        (0.479071)    877
7     In re Chase          (0.477976)    260
8     In re Lincoln        (0.475428)    204
9     In re Nittler        (0.474340)    407
10    In re Kazzaz         (0.474268)    472

We also note that both of the LeMaire cases occurred after SPIRE's case-base was created (and after Rasmussen). In a fielded version of SPIRE with a real problem case, of course, only already litigated cases (having a published opinion) would be available for retrieval. However, the ability to retrieve cases from a text corpus allows a symbolic system, like the CBR submodule of SPIRE, to overcome some well-known limitations, like what we have called the "staleness problem", by allowing the system to access cases occurring after its case-base was created.

This completes SPIRE's stage one. The lawyer now has a larger set of relevant documents for her research on Mr. Rasmussen's problem. The system has located new legal cases, previously unknown to the CBR module.

3.3.2 Passage retrieval

Next, suppose the user would like to examine specific facts in these newly retrieved cases, such as finding out how long other repayment plans were. (Other features of

good faith bankruptcy cases are discussed in Section 4.2.) To do this, we direct SPIRE in stage two to locate the passages within each of the top case texts that concern the feature called duration. Duration is the length of the repayment plan proposed by each debtor. SPIRE uses excerpts from its case-base of excerpts on duration to form a query to retrieve passages. Sample excerpts from this CKB are:

- "just over 25 monthly payments"
- "the plan would pay out in less than 36 months."
- "proposed a three-year plan for repayment,"
- "The Court would require the Ali's [sic] to pay $89 per month for 36 months."
- "Debtors propose payments of $25.00 weekly for 33-37 months."
- "would be paid in full after two years. In the four or five months following this two-year period, the unsecured creditors would be paid the proposed amount of 10% of their claims."

Notice that the first three excerpts are only fragments of sentences, and that the third contains the value for the plan's duration, but expressed as a string. The fifth, a complete sentence, yields a range of values for the feature, 33 to 37. Determining the value in the sixth, a sentence fragment plus a complete sentence, requires combining evidence from each portion to determine that the plan would run for a total of 28 or 29 months. SPIRE's case-base for this particular feature contains 14 excerpts collected from 13 opinions. Combined, they contain a total of 212 words, 92 unique terms after stemming, and 59 unique terms when stop words are removed. This happens to be the feature with the largest set of excerpts, although it does not contain the largest number of unique terms. (More information on the characteristics of the excerpt-ckbs can be found in Table 4.2.) The top-rated document for the Rasmussen problem is the In re Sellers case, so we use it to illustrate passage retrieval. The IR engine divides the Sellers opinion

into overlapping windows of 20 words each. Each word in the opinion will appear in two windows (except for the first 10 words). SPIRE then generates a query to be run against the Sellers opinion, divided into these windows. INQUERY carries this out and ranks the passages according to its belief that each is relevant to the query. For this example, we allow SPIRE to use two simple methods to generate queries. The first combines the terms from all the excerpts about a feature into a single "natural language" query. Each word in each excerpt provides a possible match against the words in the window. Regardless of whether two words were in different excerpts, each contributes to the total belief. We refer to this type of query as a bag of words query. The second type of query is more restrictive when matching terms. Rather than amalgamating all the individual terms from the feature's excerpt-ckb together into a single long query, each excerpt is individually compared to the twenty-word window. The value achieved by the best single excerpt will be the score given to the passage. This type of query tests the ability of the individual excerpts to match against each passage. We refer to this type of query as the sum query because it is formed by wrapping an INQUERY #Sum operator around each excerpt. For both types of query we use a #Passage20 operator, which tells the IR engine to retrieve passages with a window size of 20. Part of both of these types of queries for duration are shown below:

#Passage20( just over 25 monthly payments
            the plan would pay out in less than 36 months.
            proposed a three-year plan for repayment, ...)

#Passage20( #Sum( just over 25 monthly payments)
            #Sum( the plan would pay out in less than 36 months.)
            #Sum( proposed a three-year plan for repayment,) ...);
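The difference between the two query styles can be made concrete with a small sketch that scores a single 20-word window both ways. Plain term overlap stands in here for INQUERY's belief computation (which also uses term frequencies, collection statistics, stemming, and a default belief), so the numbers produced are not INQUERY beliefs; the excerpts are abbreviated from the duration examples above.

# Sketch: bag-of-words vs. per-excerpt (sum-style) matching over one 20-word window.
# Term overlap is a stand-in for INQUERY's belief computation.

def tokens(text):
    return [w.strip('.,"()').lower() for w in text.split() if w.strip('.,"()')]

def bag_of_words_score(window, excerpts):
    # every excerpt term can contribute, regardless of which excerpt it came from
    query_terms = {t for e in excerpts for t in tokens(e)}
    return len(query_terms & set(tokens(window)))

def sum_score(window, excerpts):
    # each excerpt is matched against the window on its own; the best excerpt wins
    win = set(tokens(window))
    return max(len(set(tokens(e)) & win) for e in excerpts)

excerpts = ["just over 25 monthly payments",
            "the plan would pay out in less than 36 months.",
            "proposed a three-year plan for repayment,"]
window = ("called for payments of $260.00 per month for a period "
          "of 36 months. Pursuant to the Court Order, the debtor")
print(bag_of_words_score(window, excerpts), sum_score(window, excerpts))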


Posing these two queries over the Sellers opinion causes INQUERY to retrieve many relevant passages. Below in Table 3.2 are the top five passages for each query, annotated with whether or not each is relevant:

Table 3.2. Top passages retrieved from the Sellers opinion for the bag of words and sum queries for duration.

Bag of Words                             Sum of each Excerpt
Rank  Psg Strt  Belief                   Rank  Psg Strt  Belief
1     1430      (0.404378)  REL          1     1440      (0.405236)  REL
2     1440      (0.404199)  REL          2     1430      (0.405234)  REL
3     2650      (0.402939)  REL          3     2650      (0.403057)  REL
4     2660      (0.402002)  REL          4     2460      (0.402278)  not REL
5     1420      (0.401956)  REL          5     1420      (0.402145)  REL

Figure 3.5 gives the text of the 1430 and 1440 passages, the top two passages in both retrievals. Content terms that match those found in the excerpts are highlighted in the figure, and word-count offsets are shown along with the text. (We have included and highlighted terms from the passage beginning at 1420, as it is ranked fifth by both queries.)

1420  . . . spirit and purpose of Chapter 13. The debtor's proposed Amended Plan
1430  called for payments of $260.00 per month for a period
1440  of 36 months. Pursuant to [the] Court Order, the debtor has
1450  made 24 monthly payments without a default. Of course, at the time of the original hearing. . .

Figure 3.5. Passages 1420, 1430, and 1440.

From either the 1430 or the 1440 passage we can determine that the debtor proposed a 36-month plan. From the 1440 passage we can also learn that 24 monthly payments had already been made at the time of the hearing. The third-ranked passage for both queries is 2650. We display it in Figure 3.6. (We include enough text to cover passage 2660, as it ranked fourth with the bag of words query and ninth with the sum query.) These passages speak to the duration of a plan that the judge is summarizing.

2650  . . . The debtor's plan is scheduled to run for only fifteen months
2660  instead of the more common period of three years. This proposal
2670  to pay for only a limited time seems to relate with particularity to repaying only. . .

Figure 3.6. Passages 2650 and 2660.

The fifth-ranked passage for both queries, 1420, provides introductory text about the length of the plan. By looking at the next several words following the 1420 passage, the reader can determine the duration of the plan. (Passage 1420 is given in Figure 3.5.) For the sum query, the fourth-ranked passage, 2460 (given in Figure 3.7), is not relevant although it contains many terms in common with the excerpts for duration. It discusses the amount of the monthly payments and the monthly surplus, rather than the duration.

      . . . the debtors'
2460  proposed monthly payment under the Amended Plan is $260.00, and
2470  the monthly surplus of income is now over $1,000. The . . .

Figure 3.7. Passage 2460.

In stage two, SPIRE has thus located passages relevant to the duration feature without requiring a user to pose a query. SPIRE can do this for any feature for which there is an excerpt-ckb, and on any individual document. Unlike other approaches, which merely retrieve entire documents, SPIRE is able to retrieve documents and then present a significantly reduced amount of text about features contained within the document. This greatly decreases the amount of text a user must inspect for information.


3.3.3 Manual passage retrieval

For comparison, suppose we intervene after SPIRE's first stage and manually generate a query for the topic of duration. A sophisticated query might look like:

#Passage20( duration
            #Phrase(per month)
            #Phrase(monthly payments)
            #3(propose to pay) );

Just like the excerpt-based queries, the manual queries examine twenty-word windows using the #Passage20 operator. The other query operators, the #Phrase and #3 operators, add even more belief when the enclosed words are within 3 words of each other, order dependent. (The #Phrase operator allows for a slight bit of flexibility in its actual execution, based on the frequency of the enclosed terms within the entire collection.) Posing this expert manual query against the Sellers opinion yields the ranking found in Table 3.3.

Table 3.3. Top-ranked passages from the Sellers opinion for the manual query on duration.

Rank  Psg Strt  Belief
1     2620      (0.415226)  REL
2     2610      (0.415226)  REL
3     2100      (0.410598)  not REL
4     2090      (0.410598)  not REL
5     2080      (0.410598)  REL
6     1990      (0.410148)  not REL
7     1980      (0.410148)  not REL
8     1940      (0.410148)  not REL
9     1930      (0.410148)  not REL
10    1430      (0.410148)  not REL

We display the top two passages in Figure 3.8, again highlighting matching terms. They do, in fact, contain information about the duration of the plan. However, the next relevant passages are not found until ranks 5, 13, and 19. Of the top ten passages

seven are not at all pertinent. In general, one would like to achieve a higher percentage of hits in the top-ranked passages.

2610  . . . history and likelihood of continued future advances was good even in 1981.
2620  Third, the duration of the Plan is three years. The
2630  court in In re Estus, supra, sheds some light on this factor. . .

Figure 3.8. Passages 2610 and 2620.

By using the case-base of excerpts, SPIRE was able to generate a query with which to locate a good passage about the feature. Once the user has read the first passage, she can easily calculate the value for the duration of the proposed plan. If the user so chooses, she may add text from this retrieval to the duration-ckb for use in future queries. This may prove particularly fruitful if an additional means of expressing the feature becomes apparent.

3.4 Summary

This example shows how we can leverage the strengths of two different systems for mutual benefit. First, the IR system uses its knowledge of word distributions and the CBR-provided relevant documents to create suitable queries for document retrieval. Second, this same style of interaction is used to retrieve germane passages, over a variety of types of features and values, from within the individual documents.


CHAPTER 4

EXPERIMENTAL METHODOLOGY

Given that information retrieval systems locate entire documents and information extraction systems are expensive to train, this research aims at being able to automatically locate the relevant portions of a text and associate them with a feature in a frame, without the use of a large, annotated training corpus. We next describe SPIRE's feature set and the excerpt-ckbs associated with them. We then explain how we acquired our test collection and documents, how retrieval was performed, and the process we used to generate answer keys. We conclude by defining the evaluation metrics used to measure SPIRE's performance.

4.1 Features

For any particular frame-based representation of a domain, there may be various types of features, and various types of values that a feature may take. By examining the representations found in two legal domains, we identified several different types of features and values.

4.1.1 Types of features

The features found in the two legal domains (personal bankruptcy and income tax home office deduction) can be divided into nine general type classes: Boolean, category, numeric, range, set, date, proper name, free text, and formatted text. Here we give the definition of each feature type along with example features and some illustrative values. (This is not intended to be an exclusive list of all the possible types

of features in a case-frame or template, merely an enumeration of those encountered in these two case representations. See Appendix A for the complete case-frame in the bankruptcy domain.)

Boolean:

  - Values: A yes or no value.
  - Examples: special circumstances occurred, substantiality of repayment.

Category:

  - Values: One and only one from a given class of items.
  - Examples: decision for (plaintiff, defendant), employment history (poor, neutral, good), earnings potential ((small poor) (medium neutral) (large good)).

Set:

  - Values: One or more values from a given class of items.
  - Examples: furniture in home office (desk, chair, telephone, etc.), debt type (educational, taxes, judgment-debt, fraud, other).

Numeric:

  - Values: Single numeric value, either integer or real.
  - Examples: monthly income, percent surplus income, amount unsecured claims.

Range:

  - Values: Numeric values with an upper and lower bound.
  - Examples: hours per week in home office, room temperature.

Date:

  - Values: A calendar date.

  - Examples: plan filing date, loan due date.

Proper Name:

  - Values: Name of an individual or company.
  - Examples: judge, plaintiff, defendant.

Formatted Text:

  - Values: Stylized text or symbols.
  - Examples: case citation (In re Rasmussen, 888 F.2d 703 (10th Cir. 1989)).

Free Text:

  - Values: Free text.
  - Examples: case summary.

4.1.2 Finding values

Because there is such a diversity of feature types, there is an assortment of available means (manual and automatic) to locate the values. We found no direct correlation between features and a particular means of locating a value. The various methods may work well for features of differing types. Below are some of the ways of finding values. (Again, this is not meant to be an exclusive list, merely representative of those techniques currently in practice.)

1. Locating a word by string-matching from a set that we might be able to enumerate. Examples: (ruling: affirmed, overturned), (furniture in home office: chair, desk, telephone).

2. Keywords/phrases that closely link a feature to its value. These might be similar to the "triggers" for "concept nodes" [51, 53]. When one sees the word/phrase, one expects the next or preceding text to be the value. Examples: (occupation: employed by, works for), (ruling: judgment for, judgment against), (monthly amount: proposes(ed) to pay, excess available).

3. Locating words/phrases that one might expect to be found in proximity to the value. Examples: (plan duration: week(s), month(s), year(s)), (payments made: paid back, regular payments).

4. Inferencing based on a segment of text. The reader or system must know some background information in order to make the association between the feature and the value. Examples: (monthly income: given a weekly salary, convert the weekly amount into a monthly amount), (decision for: overturned; to ascertain the outcome of this decision, one must know the ruling given in the original decision and any other intermediate-level decisions).

5. "Concept recognizers." These are methods for identifying closely related ideas or patterns such as proper names, dates, monetary values, and foreign countries [7, 49, 39]. Examples: (monthly surplus: monetary value), (loan-due-date: date).

6. Synonym expansion through use of a thesaurus, an association thesaurus [28], or a co-occurrence thesaurus [68].

We observed that for most features, there would be multiple ways to describe the value. In Section 1.3 we provided an example of this with the monthly income feature. These different descriptors vary both within and across texts. For example, when trying to find the value for the surplus (income after expenses that is available for the repayment of debts), one may directly locate the phrase "surplus of", which one would expect to be followed by the amount of the surplus. Alternatively, consider: "did not have more than $50 per month for debt repayment" and "leaving him with approximately $100 per month to finance his plan". While these are quite descriptive of a surplus, it is likely that the average user would not be able to create a query capable of matching these expressions.

As another example, consider the feature describing the number of hours spent in the home office. One might find "per week" and "spent" in proximity to an actual

value. However, the expressions "on weekends" and "in the evenings" do not directly derive a value and, again, require inferencing on the part of the user. Thus, the usual extraction strategy of locating important syntactic or linguistic patterns in conjunction with key terms would not suffice here.
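As a toy illustration of the second technique above (trigger phrases) and of why it does not suffice on its own, the hypothetical extractor below pulls a dollar amount that directly follows a trigger, but returns nothing for indirect phrasings that require inference; the trigger list and the feature it serves are invented for the example.

# Toy trigger-phrase extractor for a hypothetical "monthly amount" feature.
# It finds values that directly follow a trigger and misses indirect phrasings.
import re

TRIGGERS = ["proposes to pay", "proposed to pay", "excess available"]

def find_monthly_amount(sentence):
    for trigger in TRIGGERS:
        match = re.search(re.escape(trigger) + r"\D*(\$[\d,]+(?:\.\d{2})?)",
                          sentence, re.IGNORECASE)
        if match:
            return match.group(1)
    return None

print(find_monthly_amount("The debtor proposes to pay $100 per month."))          # $100
print(find_monthly_amount("leaving him with approximately $100 per month "
                          "to finance his plan"))                                 # None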

4.2 Features examined

We selected ten features from an existing personal bankruptcy good faith case representation [57]. There were five types of values that these features could have:

1. Boolean,
2. date,
3. category,
4. set, or
5. numeric.

For our set of ten test features, we included two of each type. We give the complete listing along with a description in Table 4.1. (Appendix B provides additional information about each feature and Appendix C gives the respective excerpt-ckbs.)

Table 4.1. Test features from the good faith personal bankruptcy domain.

Feature                Type      Description
Duration               Numeric   Length of the proposed plan in months
Monthly Income         Numeric   Earnings in terms of dollars
Sincerity              Boolean   Was the debtor sincere in proposing the plan
Special Circumstances  Boolean   Were there any extenuating conditions affecting the debtor
Loan Due Date          Date      When the first loan payment was due
Plan Filing Date       Date      When the repayment plan was filed
Debt Type              Set       Such as "educational" or "consumer"
Profession             Set       Such as "dentist" or "secretary"
Procedural Status      Category  Such as "appeal" or "remanded"
Future Income          Category  Likelihood that there will be an increase in the debtor's income


4.3 Excerpts

For the set of ten features given in Section 4.2 we gathered excerpts from 13 bankruptcy court opinions. These 13 opinions were a set designated as "meaty" in that each covered a number of aspects and arguments associated with the good faith issue [56, 57]. The case-base of excerpts becomes crucial in the process of retrieving passages. Care must be taken that excerpts are of reasonably good quality to ensure that they will enable the retrieval of similar or related passages. This is especially important when we consider that the case-base is initially derived from a small set of texts (fewer than 15, although it could be larger). To allow for the best possible case-base, excerpts must meet the following criteria: 1) the excerpt explicitly gives a value for the feature, or 2) the text provides sufficient information such that by reading it, a human could infer the value. Since the excerpt-ckbs were manually created, we needed to provide guidelines on the amount and type of information that was to be included in an excerpt. Pertinent issues were:

- how long should the excerpts be, and
- how much information should be annotated to denote that some text provides information about a feature?

We return to our example of monthly income to illustrate these problems. Below are several example sentences from the thirteen meaty opinions:

1. "In an amended family budget filed June 27, 1983 the debtors list total income of $1,468.00 per month and total expenses of $1,350.00."

2. "The bankruptcy court found that the Kitchens' net disposable monthly income for 1979 averaged $1,624.82, $1,800 when federal and state income taxes are included, and that their estimated future monthly income was $1,479."

3. "Her monthly salary is $1,068.00."

4. "He is currently employed by a firm in a sales capacity, earning approximately $15,642.62 per year."

5. "Debtors' sworn Chapter 13 statement shows a combined monthly spendable income of $1,010.50, and monthly expenses of $900.00."

When deciding on the length of an excerpt, we needed to consider whether to store: complete sentences, just the value found, the smallest set of words that delineated the value, or some intermediate level of text. One advantage of using complete sentences is that in the future we could try to take advantage of linguistic context. However, we did not attempt to do so in this research. One disadvantage of using entire sentences was that if the information is only a small portion of a lengthy sentence, then there would be many terms in the excerpt-ckb that would have no bearing on finding passages relevant to the feature. (In the bankruptcy domain the average sentence length was approximately 24 words, with a standard deviation of around 10 words.)

In the first two examples, it seems likely that we will want only a portion of the text to be included in the excerpt-ckb. For the first example, we actually stored: "the debtors list total income of $1,468.00 per month" and in the second: "net disposable monthly income for 1979 averaged $1,624.82". In the first example we did not include the text mentioning the expenses, as this related to a separate feature. Similarly, for the second example, we did not include "and that their estimated future monthly income was $1,479", as this is text about future income and not the debtors' current monthly income.

The converse of keeping an excess of terms in the excerpt-ckb is storing too little useful information from the sentences. This problem arises if we choose to use only the value we find as the excerpt. This is especially true when the value in and of itself is meaningless for future queries. For example, if we only select the value from the third sentence, "$1,068", then we will be searching future

texts for the number "1,068", which is not likely to be a fruitful exploration. A more meaningful excerpt would include longer phrases or even the entire sentence. In this case, we would include the entire third example sentence in the excerpt-ckb. It is less obvious how much of the fourth and fifth examples to use. We would probably enter the entire fourth sentence. For the fifth example, we would chop off text from both the beginning and the end to leave "statement shows a combined monthly spendable income of $1,010.50". We remove "Debtors' sworn Chapter 13" since every document in the corpus is a Chapter 13 case and, hence, this information would be useless for retrieval about monthly income. Similar to the first example, we also remove the final clause, as it pertains to monthly expenses and not monthly income.

The next examples illustrate when the information necessary for a value spans multiple sentences, or even paragraphs or pages. The following was excerpted from the Sellers opinion and provided the information necessary to determine the monthly income:

    "When the hearing was held on April 20, 1981, Sellers had changed employment and was earning roughly $1,012.00 per month take-home pay for a 4-week period. His wife's income was listed as $525.84 per month."

Here, we needed to combine two values found in different sentences to determine the total amount. Another example of where the information spans a region was for the feature of special circumstances. The following comes from our excerpt-ckb:

    "She thereupon encountered some difficulties in her personal life. A medical condition forced her to leave work for a two-week stay in the hospital and her marital relationship began to deteriorate. She also claims to have suffered from a nervous condition during that time. She claims that she was unable to pay back these debts because her husband deserted her, leaving bills for her to pay, and that she was confused and upset."

We included all of this text in the excerpt-ckb as it was all informative. Below is a shorter example where the information was not contained in a single sentence. These excerpts come from the feature for the amount of unsecured debt. In

this opinion, the first segment provides the background to affiliate the values in the second with the feature. Interestingly, the first segment actually appears well after the second:

1. "her two unsecured student loan creditors"

2. ". . . the NYSHESC and SUNY obligations were Debtor's only scheduled debts, with SUNY due approximately $501.58 as the result of a National Direct Student Loan, and NYSHESC due approximately $7,600.80 as the result of two Guaranteed Student Loans. . . "

These examples show that we wanted our excerpts to be descriptive of the feature, yet not contain excessive verbiage, nor be too terse. Our first attempt to satisfy these criteria resulted in excerpt case-bases that contained both partial and complete sentences. In summary, the following were considerations when we constructed the excerpt-ckbs:

- How long should an excerpt be? Need it be a complete sentence, multiple sentences, or will smaller segments such as phrases be sufficient?
- What text needs to be included?
- How much expertise is needed, and how much care must be taken, when creating the excerpt-ckbs?

Based on the above considerations, we initially assumed that not much expertise was necessary when gathering these excerpts. We selected excerpts that were as long as was necessary to include some information descriptive of the feature. Further, the excerpt text was to contain the actual value or to be indicative of the value. Excerpts came from multiple locations throughout the text and varied considerably in length. Appendix C lists the complete set of excerpts for each feature. The number of excerpts for each feature ranged from 3 to 14. Interestingly, the number of excerpts did not directly correlate to the number of unique content words for a feature. For example, one of the features with the most unique content terms,

special circumstances, tied for seventh in number of excerpts. Conversely, monthly income had one of the highest numbers of excerpts yet one of the fewest unique content terms. Table 4.2 contains information about the set of excerpts for each of the test features. The reduction in size from the total words down to the unique terms ranged from 53 to 71% of the original size, with the average number of unique terms being 62%, or approximately two-thirds, of the number of total words. The reduction from the number of total words down to the unique content terms ranged from 28 to 50%, with an average of 36%, or approximately one-third of the number of total words. The average number of unique content words from the excerpts for the ten features was 46.7.

Table 4.2. Number of terms contained in the original excerpt-ckbs.

Feature                Number of Excerpts  Total Words  Unique Terms  Unique Content Terms
Duration               14                  212          92            59
Monthly Income         13                  110          52            34
Sincerity              9                   123          89            52
Special Circumstances  8                   188          117           71
Loan Due Date          4                   47           32            18
Plan Filing Date       10                  145          66            45
Debt Type              10                  164          102           63
Profession             3                   36           29            18
Future Income          8                   88           68            36
Procedural Status      13                  194          100           71
Average:               9.2                 130.7        74.7          46.7
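The counts in Table 4.2 can be approximated with a short sketch; the stoplist and the crude suffix-stripping below are stand-ins for the stoplist and stemmer actually used by the IR engine, so exact figures will differ.

# Sketch of the excerpt-ckb statistics; the stoplist and "stemmer" are placeholders.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "would", "be", "is", "was"}

def crude_stem(word):
    for suffix in ("ments", "ment", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def excerpt_ckb_stats(excerpts):
    words = [w.strip('.,;"()').lower() for e in excerpts for w in e.split()]
    words = [w for w in words if w]
    stems = {crude_stem(w) for w in words}
    content = {s for s in stems if s not in STOPWORDS}
    return len(words), len(stems), len(content)   # total, unique, unique content

duration_excerpts = ["just over 25 monthly payments",
                     "the plan would pay out in less than 36 months."]
print(excerpt_ckb_stats(duration_excerpts))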

For a number of features a second excerpt-ckb was created. This second set of excerpt-ckbs differed from the first in that some of the excerpts were reduced in length. We give information about these ckbs in Table 4.3. (The derivation of these excerpt-ckbs is explained in Section 5.3.)

Table 4.3. Number of terms contained in the excerpt-ckbs, where features labeled with an * give values for the second excerpt-ckb.

Feature                Number of Excerpts  Total Words  Unique Terms  Unique Content Terms
Duration*              14                  192          88            55
Monthly Income*        13                  109          51            33
Sincerity              9                   123          89            52
Special Circumstances  8                   188          117           71
Loan Due Date          4                   47           32            18
Plan Filing Date*      10                  115          56            37
Debt Type*             10                  109          70            40
Profession             3                   36           29            18
Future Income*         8                   84           65            34
Procedural Status*     13                  169          84            55
Average:               9.2                 117.2        68.1          41.3

4.4 Collection and test documents

The test collection for both the retrieval of documents and passages consisted of 956 legal case texts addressing the issue of approval of a debtor's plan, as specified under Chapter 13 of United States personal bankruptcy law (11 U.S.C. 1301-1330). About one third of the court opinions address the sub-issue of good faith from Section 1325(a)(3). It is a homogeneous corpus, since all of the opinions deal only with the specific issue of debtor plan approval, as specified in Section 1325(a). We built this corpus by downloading all the court opinions that were found by posing the query 1325(a) to the WestLaw Federal Bankruptcy Case Law database. We restricted the query to include only those cases decided between 1982 and 1990, inclusive. The corpus contains all but the 10 earliest cases from the original 55-case BankXX CKB [56]. In this corpus, about 40% (385 documents) make specific reference to the narrower "good faith" issue. Thus, this corpus is very focused.

We ran SPIRE using three problem cases: Rasmussen (In re Rasmussen, 888 F.2d 703 (10th Cir. 1989)), Makarchuk (In re Makarchuk, 76 B.R. 919 (Bkrtcy. N.D.N.Y. 1987)), and Easley (In re Easley, 72 B.R. 948 (Bkrtcy. M.D. Tenn. 1987)). Based on previous research on generating document queries [55, 14], for this research we used the top two layers of each claim lattice to generate a document query. Each query consisted of 25 pairs of terms, where the pairs were to be found within a


window of 3 words. From these three queries we collected the top documents for each problem case. The top-ranked documents for each problem case are given in Tables 4.4, 4.5, and 4.6. Removing duplicates and documents that had been used to derive the excerpt-ckbs, we made a test collection of 20 documents from among the top 10 retrievals for each problem case.

Table 4.4. The most highly ranked documents for the Rasmussen problem.

Rank  Case Name           Belief Score  Doc-Id
1     In re Sellers       (0.490157)    180
2     In re San Miguel    (0.483656)    289
3     In re Chura         (0.482781)    188
4     In re LeMaire 1990  (0.479262)    860
5     In re LeMaire 1989  (0.479195)    751
6     In re Stewart       (0.479071)    877
7     In re Chase         (0.477976)    260
8     In re Lincoln       (0.475428)    204
9     In re Nittler       (0.474340)    407
10    In re Kazzaz        (0.474268)    472

Table 4.5. The most highly ranked documents for the Makarchuk problem.

Rank  Case Name          Belief Score  Doc-Id
1     In re Stewart      (0.518380)    877
2     In re Ali          (0.508805)    178
3     In re Makarchuk    (0.504300)    565
4     Matter of Akin     (0.503330)    353
5     In re Gathright    (0.500740)    427
6     Matter of Hawkins  (0.498640)    177
7     In re Ellenburg    (0.496048)    693
8     In re Carpico      (0.493452)    915
9     In re Newberry     (0.492010)    733
10    In re Porter       (0.491779)    764

The first set of ten documents was taken from the top ten of the Rasmussen and Makarchuk cases. All of the first five from Rasmussen were included. However, the top five from Makarchuk were problematic. This top five included two documents that we could not use for testing: the Makarchuk problem case itself, and the Ali court opinion, from which excerpts were gathered. Since excerpts also came from the document ranked sixth (Hawkins), we went further down the list. From the top-ranked documents for the Makarchuk case, we included the Ellenburg and Carpico opinions, as well as the Stewart, Akin, and Gathright opinions, in the first set of ten test documents.

Table 4.6. The most highly ranked documents for the Easley problem.

Rank  Case Name           Belief Score  Doc-Id
1     In re Estus         (0.478711)    001
2     In re Sellers       (0.475997)    180
3     In re Lincoln       (0.475093)    204
4     In re LeMaire       (0.472234)    860
5     In re LeMaire       (0.472186)    751
6     In re Kazzaz        (0.470862)    472
7     In re Stewart       (0.469057)    877
8     Flygare v. Boulden  (0.468791)    133
9     In re McMonagle     (0.468579)    206
10    In re Todd          (0.467763)    442

The next set of five test documents came from the third problem case, Easley. Similar problems of selection arose, as some of the top-ranked opinions were those used for gathering the excerpt-ckbs or were duplicates from the previous two problem cases. After selecting a unique set of five for this problem, an additional five were gathered from the top set from all three of the problem cases to create a set of 20 test documents.

Table 4.7. Test documents and their length in passages and words. The first ten documents are listed in the upper group and the second ten in the lower group.

Doc ID  Case Name         Num Psgs  Num Words
180     In re Sellers     437       4365
188     In re Chura       207       2064
289     In re San Miguel  279       2785
353     Matter of Akin    255       2547
427     In re Gathright   517       5164
693     In re Ellenburg   376       3755
751     In re LeMaire     676       6755
860     In re LeMaire     1096      10957
877     In re Stewart     523       5228
915     In re Carpico     214       2137

001     In re Estus       448       4476
204     In re Lincoln     398       3975
206     In re McMonagle   417       4161
260     In re Chase       470       4694
407     In re Nittler     688       6873
442     In re Todd        455       4546
472     In re Kazzaz      455       4541
733     In re Newberry    207       2068
764     In re Porter      324       3230
961     In re Gibson      412       4117

AVERAGE:                  442.7     4421.95

Table 4.7 lists the twenty test documents, along with their length in terms of number of passages and words. The average document length was 442.7 passages and approximately 4422 words. Document 860, In re LeMaire, was quite a bit longer than the others. It was an appeals court decision and contained both a majority and a dissenting opinion. We examined approximately 140 random sentences drawn from the first ten documents in Table 4.7. The average sentence length was 24.01 words, with a low of 6 and a high of 62. The standard deviation was 10.95 words.

4.5 Retrieval

To run our experiments we needed to decide what type and size of item we would retrieve as a "passage". This raised many questions:

- What type of element did we want to retrieve:
  - writer-defined structures: sentence, paragraph, section, etc. [59],
  - subject-delineated segments (a la TextTiling [25, 24]),
  - bounded-size windows [6], or
  - fixed-size windows [6, 71].

- How long should a passage be:
  - If we used bounded-size windows or discourse-based elements, how large a segment should each be?
  - Should the retrieval size be the same for every feature?
  - Should passage length depend on the size of the user-provided excerpts?

Since the best results for retrieving documents when incorporating information about passages had come when using fixed-size passages, we decided to use the fixed-size method. While our task was not the same, we would still like to retrieve the

\best" passage instead of the best document. Also, using a single xed-size window simpli ed the judging process (described in more detail in Section 4.6.2). We ran the risk of using an incorrect window size for retrieval. Problems could have arisen if there had been too many query terms for too small a passage size. If this had been the situation, there would have been be too many positive hits within many passages and we would have been unable to distinguish relevant from non-relevant passages. This possibility argued for either ensuring that the retrieved passage size stayed proportional to the number of query terms, posing multiple independent queries and merging the results, or selectively removing query terms. (We discuss these options in more detail in Section 5.1.) After some preliminary experimentation, we decided to use a single window of 20 words, approximating the length of a sentence. We did not vary the passage size based on either the feature, its type, or the length of the excerpt-ckbs. Doing so would have greatly complicated the judgement process. During retrieval, the IR engine divided each document into overlapping, xedsized windows of 20 words each. Each word in the text appeared in two windows (except for the rst 10 words). The nal consideration we addressed was whether we would require a minimum number of matching terms within the passage before allowing it to be retrieved. This was the situation in other work where minimum numbers of terms had to be matched at the sentence level [46, 64]. Only in a loose sense did we require a minimum number of matching terms: unless otherwise stated, a passage was only a part of the ranking if at least one term matched the query. This di ered from the standard INQUERY procedure of giving every document a default score of 0.4. By only retrieving passages with this minimal match requirement we were able to observe those instances where the queries retrieved


The final consideration we addressed was whether we would require a minimum number of matching terms within the passage before allowing it to be retrieved. This was the situation in other work, where minimum numbers of terms had to be matched at the sentence level [46, 64]. Only in a loose sense did we require a minimum number of matching terms: unless otherwise stated, a passage was only part of the ranking if at least one term matched the query. This differed from the standard INQUERY procedure of giving every document a default score of 0.4. By only retrieving passages with this minimal match requirement we were able to observe those instances where the queries retrieved fewer than the requested number of relevant passages. (See Section 4.7.2 for more on this.) If it had been the case that passages with fewer matching terms ranked higher than those with greater numbers of matching terms, and that the rank ordering was poor, we could have imposed thresholds on minimum numbers of matching terms as a means of boosting performance. Alternatively, we could have considered other ranking functions and means of forming queries.

4.6 Answer keys

To test SPIRE's approach, we required a definition of "relevant" as well as knowledge about which portions of a document contained text relevant to the feature being tested. We next describe our definition of "relevant" and how we acquired relevance judgements.

4.6.1 Defining relevance

In general, we believed that the relevant passages could be more lenient in their description of the feature than the text found in the excerpt-ckbs. Therefore, to determine which passages would be judged relevant we considered the following questions:

1. How much information must the passage contain?
2. Would there be a single "best" passage or would there be multiple "good" passages?
3. If a feature requires that information be drawn from several sources within the text, must all of these passages be retrieved?

The answer to the first question somewhat answered the second question of whether there was a single best solution. These simple questions were, in fact, much more

complex than they appeared and had to be resolved in order to evaluate the success of the system. There were several types of passages which could be considered relevant. Below are the various levels of information the passages could provide:

- explicit statement of a value for the feature,
- inferential statement of a value,
- historical information, and
- discussion about the feature without giving a value.

The first type was any passage that expressly contained the value of the feature for the opinion at hand. This type of passage was, obviously, relevant. The second type of relevant passage was one which included enough relevant information for a human to infer the value. For example, from the Easley opinion comes the sentence: "Debtor's amended budget commits $30 to the plan from a weekly take-home pay of $262.30." It is easy to compute the value of the debtor's monthly income from the information given.

What was not so obvious was what to do with passages that provide a value, but do so for another court opinion that was being used for illustrative purposes or was being analogized to assist with a particular point of discussion. Consider the following text, which gives the amount of debt incurred through a student loan:

    In Nkanang, this Court found that the debtor did not demonstrate good faith in that there was no showing made by the debtor that he sincerely wished to repay his unsecured creditors, including a creditor owed $3,977.01 for a student loan. [court opinion 961]

While this text gives the amount of debt and the type of debt, these were not the correct values for the court opinion at hand. Making a distinction between a passage that provided a value for this particular court opinion and one that did so for another was, at this point, not reasonable. The state of the art is not sufficiently advanced to be able to provide enough context to

discriminate between passages that discuss a historical situation and those discussing the current one. What about the fourth type of passage, those that discussed the feature but did not contain enough information to provide a value? For example, there were several references to the "income of the debtor's wife" or the "wife's income" within the Sellers opinion (given previously in Figure 1.2). These passages spoke to the feature of monthly income, yet did not provide any value for the case-frame. Automatically discerning differences between passages that discussed a feature and passages that had enough information to derive a value would be a very difficult task for an IR system. Consider the following monthly income segments from the Sellers opinion (also previously given in Figure 1.2):

- "The debtor's original budget reflected slightly higher income. . . ",
- ". . . income was slightly reduced by lost overtime,"
- ". . . amendments were explained by changes in income and expenses including the loss of overtime," and
- ". . . given the debtor's small income, . . . ".

These all provide insight to the reader that the value of the monthly income feature had been previously discussed and was important to this case, without ever stating any amounts. These last four segments were spread throughout the opinion. The first two appeared in footnote two on the second page, while the third was on the fourth page, and the last did not appear until the sixth page. Fortunately for the reader of this opinion, the segment containing enough information to provide a value for the feature appears first. Unfortunately, the characterization of the income as being "small" did not appear until the sixth page and was buried in a discussion about the debtor's pre-plan conduct. The characterization of the debtor's level of income as "small" is useful in the sense that the lawyer would now know how at least one judge viewed this level of income. While not a consideration for this opinion, this characterization

may be important knowledge to the lawyer when comparing this situation to a new one. This is especially true if the opinion did not state any monetary amount, but merely referred to the level of income. At the present time, for many features, the state of the art does not give us the ability to differentiate between these passages and those that actually contain enough information. If we had an available means of directly extracting a value or of recognizing the differences between these types of passages, then we would do so. For most of the features about which we were concerned, this was not the situation. For this work, these passages were considered relevant. In summary, to be considered relevant, a passage must:

- explicitly state a value for the feature,
- inferentially provide a value, or
- provide discussion about the feature without giving a value.

Regardless of which court case the text was discussing (the current one or a prior one), if the passage met one of the above criteria, it was relevant.

In answering the second question, that of whether there was a "best" passage, it is important to recognize that there may be multiple passages within a document that each, independently, provided enough information to determine the value of the feature. When there were multiple "best" passages we wanted to give our system credit for finding any one of the solutions and not receive reduced credit for not finding all of them. We defer discussion of how we did this until Section 4.7, which discusses the evaluation metric.

We turn now to the third question, that of the need for multiple passages to discern a value. This type of solution was beyond the scope of this research. Future work is needed to examine such sets of passages and their associated features for possible means of generating multiple queries and retrievals, and for other methods to notify the user that the solution spans more than one passage. It may be that the passages are

in close proximity to one another. If that were the situation, then it might be a simple matter of either expanding the passage size or increasing the amount of presented material.

4.6.2 Relevance judgements

We hired two undergraduates to read the court opinions and underline any text that they perceived as being descriptive of a given feature. Both were given the same set of texts. They were given a set of written instructions that described each feature and were provided with samples of the sort of text they should mark. (See Appendix B for the complete instruction set.) While a reader worked through a court opinion, the name of the feature was displayed across the top of every page. We used these markings to create the files of relevance judgements. If a marked segment crossed passage boundaries, then all of the passages it touched were considered to be relevant.

When we compared the two sets of markings we noted that the second reader was finding one or two additional relevant segments of text that the first reader had not noticed. However, the second reader was failing to mark much of the relevant text. Additionally, this reader was being very liberal when underlining and was including large portions of text that were adjacent to relevant passages but contained no information about the feature. Because of the extreme liberalness of the second reader's markings, we sometimes pared down the amount of text judged relevant. In addition, the marking process was quite lengthy. Readers were taking quite a bit of time to go through each document, even after having read it more than once. Further, the readers complained that the process was quite tedious. To summarize, there were several problems with this procedure:

- one reader marked as relevant large portions of non-relevant text that happened to be adjacent to relevant passages,
- this same reader missed many relevant passages (resulting in a significant difference between the judgements), and
- the process was very time consuming and tedious.

In most IR settings where there are competing systems or techniques, there is usually one, or perhaps two, assessors making relevance judgements. The set of items to be judged is frequently "pooled" from across: different systems running the same query, different constructions of an information need run on one system, or a combination of a variety of queries being run on multiple systems [22, 79]. In all situations, the assessor's judgements are said to be the "truth" for this particular information need, and any unjudged items are assumed to be non-relevant. All future systems and queries, in their subsequent attempts at retrieval, will be attempting to mimic the assessor's definition of relevance in order to score well on this retrieval task.

Because of the acceptance of the above protocol within the IR community and the problems stated above, we modified the task of our readers. Rather than having complete judgements for every feature and document, we opted to use "pooled" judgements. For a single feature we gathered the top twenty passages from each query, merged them into one large set, placed them in ascending order by passage start location, and removed duplicates. The readers then went through each court opinion and, for each passage in the set, marked whether or not they believed it was relevant to the feature. They were told that there would be non-relevant passages within the set. As long as at least one reader judged a passage as being relevant, it was included in the final set.
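The pool construction itself reduces to a small merge-and-sort; a sketch, with hypothetical variable names, of how the set of passages to be judged for one feature and one document could be assembled:

# Sketch of the pooling step; judgement collection itself was manual.
def build_pool(rankings, depth=20):
    """rankings: ranked lists of passage-start offsets, one list per query type."""
    pooled = {start for ranking in rankings for start in ranking[:depth]}
    return sorted(pooled)

# usage: build_pool([bag_of_words_ranking, sum_ranking, ...]) -> [1420, 1430, 2650, ...]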


In Table 4.8 we give the final numbers of passages judged relevant in each of the documents, along with the average number of relevant passages for each feature. These may not be the only relevant passages for the second ten documents listed in Table 4.7, as the readers only judged the top twenty passages from the retrievals done on these documents.

Table 4.8. Number of passages judged relevant for each feature.

Doc ID  Debt Type  Duration  Future Income  Loan Due  Mthly Income  Plan Filed  Proc. Status  Profession  Sincere  Special Circ
001     28         12        10             2         15            4           27            4           24       12
180     37         17        20             10        26            19          53            4           40       31
188     45         7         7              3         8             18          20            9           28       23
204     33         12        2              4         6             10          36            4           18       6
206     16         14        14             2         38            29          34            24          6        5
260     25         17        7              0         3             14          18            2           28       14
289     36         44        23             0         23            4           12            19          36       10
353     57         17        25             7         14            12          18            11          8        4
407     30         2         0              0         6             21          23            17          55       12
427     51         33        15             0         8             14          35            0           38       6
442     30         34        2              2         8             21          29            26          17       3
472     34         7         30             0         14            21          36            13          24       15
693     94         23        24             16        43            11          35            25          22       7
733     43         4         0              15        0             14          28            0           8        0
764     45         45        18             5         7             5           22            10          24       2
751     51         19        28             13        31            9           41            20          22       32
860     49         13        30             6         20            15          52            23          109      23
877     86         19        26             13        25            8           56            28          28       22
915     40         16        19             0         11            9           22            10          11       5
961     42         11        10             0         22            12          31            28          17       16
avg     43.6       18.3      15.5           4.9       16.4          13.5        31.4          13.8        28.1     12.4

The least likely feature to be discussed was the loan due date: seven of the texts had no mention of this feature. On the opposite side of the spectrum, both debt type and procedural status were always discussed, and usually at some length. The sincerity of the debtor placed a close third, primarily due to the abundance of discussion in document 860. There was no difference among the types of features, with the exception that the two date features, loan due date and plan filing date, generally had less discussion than the other types (only special circumstances had less text than plan filing date).

Document 733 contained the fewest relevant passages, with four of the features not being mentioned at all. One other document failed to discuss profession, and one failed to give any indication of future income. Overall, there were few documents in which any of these ten features were not discussed.

4.7 Evaluation metrics

We did not evaluate SPIRE on the basis of the time spent in computing the top passages. System retrieval time was negligible when compared with a user's expenditure of time in reading an entire document. Further, we assumed no monetary cost for using the IR system. Therefore, the primary evaluation concern was that of actual retrieval effectiveness. Since our driving task was that of filling in a case-frame with information derived from the retrieved passages, we considered the cost of evaluating the retrieval results in terms of the amount of effort expended by the user.

4.7.1 Standard measures

The standard binary approach to relevance judgements rates items as either relevant or non-relevant. For many collections there will be non-rated items relevant to a query. These non-rated texts can be treated as either always relevant or always non-relevant, to produce best-case and worst-case scenarios. We were unable to get relevant/non-relevant judgements for all of the passages in all of the test documents. However, since we pooled the top twenty retrievals from a variety of query types, we assumed that non-judged passages were non-relevant. Based on a binary assessment of relevance, we considered what metrics should be used to evaluate passage retrieval. Within the IR community there are a variety of available metrics, primarily used to evaluate document retrieval. These include:


1. Recall: compares the number of items that should have been retrieved by a query to the number that actually were.

2. Precision: compares the number of relevant items retrieved against the total number of retrieved items.

3. Average precision: precision averaged over some set number of recall points, typically 3 or 11, either interpolated or non-interpolated.

4. The E measure: combines recall and precision values at a given retrieval cutoff, ignoring rank. The user provides the proportional value to give to each of recall and precision [50].

5. Total number of relevant items retrieved by a given cutoff.

6. Total number of queries with no relevant items among the top n [11].

Our task does not require that every applicable passage be retrieved. If there were multiple good passages, then merely locating one or two of them may be sufficient for the user to be able to ascertain the value of the feature and fill in the case-frame representation. Therefore, in this environment, it is acceptable for recall to be low while requiring precision to be high. Hence, if we were to use the E measure, we would weight precision much more heavily than recall. It is also not unreasonable that we might not use recall at all and rely on other measures of assessment. Similarly, not using recall would rule out using average precision to evaluate system effectiveness.

We can take advantage of the fact that our retrieval engine generates a rank ordering among the retrieved passages. This allows the option of generating values for metrics 5 and 6. This requires determining a reasonable value for n: How far down the ranked list should we look? Do we look at the top 3, 5, 10, or more passages to see how many of the relevant items we have retrieved? Should this depth vary based on the importance of a feature? Since we strongly desired to reduce the total amount of text presented to a user (whether human or machine), we evaluated our system with the most stringent

possible metric. We felt that a user would generally be willing to read the top five passages, and occasionally as deep as the top ten. Reading beyond the top ten passages would be unreasonable to expect unless done rarely or if the feature were particularly important. Therefore, for comparative purposes, we provide results from SPIRE over the top one, three, five, and ten passages. We also give results at fifteen and twenty passages for informational purposes. We altered the TREC [22] code to generate precision at these levels rather than the default set of cutoffs, which are much deeper. We did not make any distinctions of importance among the features; all features were evaluated to the same cutoff points.

In order to evaluate the retrievals using the metrics of recall and precision, we had to ensure that all passages received a default belief. This value had to be greater than 0.0, so that a total ordering could be generated. Additionally, so that this default could be generated, the length of every document, in terms of number of passages, needed to be known. This required multiple changes to the existing code. Once completed, we could generate the set of TREC evaluation metrics. We should bear in mind that we did not have a complete set of relevance judgements for every passage in every document for every feature. This means that it is possible that there were relevant passages that were counted as being not relevant. Not having total judgements could affect both recall and precision scores. However, since half of the documents for each feature did have complete judgements, scores should be affected equally.

Since we were interested in a metric that heavily credits those rankings where good passages are found early, we needed a stronger measure than simply counting the number of good passages in the top n. We wanted to have a better idea of where the relevant passages appeared in the ordering; using cutoffs loses some of this information.
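For reference, precision at these cutoffs is straightforward to compute from a binary ranking in which unjudged passages are treated as non-relevant, as we did; the sketch below (with a function name of our choosing) shows the calculation.

# Sketch of precision at the cutoffs used in our evaluation.
def precision_at_cutoffs(relevance_flags, cutoffs=(1, 3, 5, 10, 15, 20)):
    """relevance_flags: booleans in rank order; unjudged passages appear as False."""
    return {k: sum(relevance_flags[:k]) / k for k in cutoffs}

# e.g., the manual duration query of Table 3.3 (relevant at ranks 1, 2, and 5):
print(precision_at_cutoffs([True, True, False, False, True,
                            False, False, False, False, False],
                           cutoffs=(1, 3, 5, 10)))
# -> precision 1.0 at 1, about 0.67 at 3, 0.6 at 5, 0.3 at 10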

4.7.2 Expected search length

In our evaluation, we take a user-oriented perspective in which the user is hoping to find some number of relevant items quickly. We are not concerned with locating every relevant item, and we want to credit those queries in which the relevant passages appear high in the rankings. Further, we want a single value so that comparison is easily done. Therefore, our primary concern is with how much non-relevant data a user must go through before finding some number of relevant items. This value can be measured by what is called the expected search length (esl) [10]. ESL measures the number of false hits a user (or system) would have to read through (or process) before finding a specified number of relevant items, q. It measures the amount of wasted effort. Using this measure allows that there may be multiple items with the same probability of retrieval (in our system, the same belief scores). By using esl we assume that these items may be retrieved in any order.

To calculate esl we locate the qth relevant item. We must then look at the set of items with the same score as the qth relevant one, as well as those items preceding it. Let j be the number of non-relevant items viewed prior to reaching this set. We need s relevant items from this set, where s <= q. Of the set of equally scored items, n are non-relevant and r are relevant to our query. If the r relevant items are evenly distributed, there will be r + 1 partitions, each containing n/(r + 1) non-relevant items. To locate the sth relevant item, on average, we must examine ns/(r + 1) non-relevant items from this set. Therefore, to find q relevant items, we must, on average, examine:

esl(q) = j + ns / (r + 1)

If, for example, we had requested two relevant passages, and the relevance assessments for the top passages are those given in Table 4.9, then the esl is 1.0. If three

relevant passages were requested, then the esl is 3.67. (The passages ranked 1 through 5 would have to have been processed as well as an average of .67 from 6 through 9.)

Table 4.9. Relevance of the top passages from a sample query. Rows that share a belief score are tied in the ranking.

Rank  Score     Judgement
1     0.424378  REL
2     0.422939  REL
3     0.422939  not REL
4     0.422939  not REL
5     0.419623  not REL
6     0.415667  not REL
7     0.415667  REL
8     0.415667  REL
9     0.415667  not REL
10    0.411673  REL

An esl score of -1 is given if either of two conditions is met:

- the document did not contain the requested number of relevant passages, or
- the query was unable to locate the requested number of relevant passages.

We could calculate a value in the second situation: we determine how many passages were not retrieved and then calculate how long on average it would be until the requested number of relevant passages appear within this set. We did not calculate the score in this second case because we wanted to know those instances where there were limited numbers of relevant passages that had matching query terms. If we were to give a default score to every passage, then the following scenario is possible: assume there are two queries for a feature: A and B. In retrieval A, a relevant passage, p, with no matching query terms may provide A with a better esl score than when p has terms in common with query B. This is possible when p occurs low in the ranking of query B and there are few passages in the document that match the query terms of A. For most experiments the esl score for the second condition was not calculated and so both conditions yielded the same value. For the results under the second

condition, we can calculate the esl score and determine whether a random retrieval would have been better than posing the query.
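The esl computation can be summarized in a short sketch that walks the ranking in tie groups; the example data is that of Table 4.9, and the -1 convention follows the description above. This is our reconstruction of the measure, not the code used in the experiments.

# Sketch of Cooper's expected search length over a ranking with tied belief scores.
from itertools import groupby

def expected_search_length(ranking, q):
    """ranking: list of (score, is_relevant) in rank order; q: relevant items wanted.
    Returns the expected number of non-relevant items examined, or -1 if the
    ranking does not contain q relevant items."""
    j = 0          # non-relevant items seen before the tie group holding the qth relevant item
    found = 0      # relevant items seen in earlier tie groups
    for _, group in groupby(ranking, key=lambda item: item[0]):
        group = list(group)
        r = sum(1 for _, rel in group if rel)   # relevant items in this tie group
        n = len(group) - r                      # non-relevant items in this tie group
        if found + r >= q:                      # the qth relevant item lies in this group
            s = q - found
            return j + n * s / (r + 1)
        found += r
        j += n
    return -1                                   # fewer than q relevant passages retrieved

# Table 4.9 example: esl(2) == 1.0 and esl(3) is about 3.67
table_4_9 = [(0.424378, True), (0.422939, True), (0.422939, False), (0.422939, False),
             (0.419623, False), (0.415667, False), (0.415667, True), (0.415667, True),
             (0.415667, False), (0.411673, True)]
print(expected_search_length(table_4_9, 2), expected_search_length(table_4_9, 3))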

4.7.3 Summary

For our task, that of a user reading and evaluating a small number of passages to determine the value of a feature, we did not necessarily care if every relevant item appeared in the retrieval; our primary consideration was precision and not recall. But, even more important than overall precision, we cared about precision within the top set of retrievals. We can use the measure of average precision at various cutoffs, but we lose information about where within that set the relevant items appeared. By looking at several cutoffs we can more accurately tell when the relevant items appeared. On the other hand, even by looking at multiple cutoff levels, we still lose some of the locational information. Further, for this task, we are user oriented. We would like to know how much effort was expended by the user to find the amount of information needed to fill in the value of a feature in a frame-based representation. Expected search length allowed us to measure how much extra work had to be done before the user was able to find the information. This more accurately depicted the nature of the task.


CHAPTER 5

EXPERIMENTS

Assuming a set of excerpts, all believed to be relevant to a given feature, how could we take advantage of this information? We wanted to automatically transform this information into a query to locate "good" passages from within a novel text. Our primary focus was whether we could automatically form good queries based on a set of excerpts. For all the queries that we posed, there was a #Passage20 operator at the outermost level. This indicated to the IR engine that we wanted retrieval of passages and that the passage length was to be 20 words.

5.1 Excerpt-based queries

From the case-base of excerpts we generated a variety of query types to test our hypothesis that terms selected from the excerpts could represent the information need associated with a feature. We built three different types of queries from the excerpts: the "base" set, "Kwok-based", and "semi-random". We also manually crafted a set of queries to use for comparison. We next describe each of the techniques we explored for generating queries based on the excerpts.

5.1.1 Base queries

While there were any number of specialized query operators that we could wrap around the text in the excerpts, we wanted to take advantage of the ones that made the most sense and, if possible, were the least complex. The operators we chose to investigate were included in the set of queries we termed the "base" queries. We tested the weighted sum operator, both wrapped around each individual excerpt and (indirectly) around the excerpt-ckb as a whole. (The #Passage20 operator acts as a weighted sum.) We also tested the use of a proximity operator, #Phrase, nested within the queries.

The set of base queries consisted of five queries in which we tested the utility of various simple query operators and added increasing amounts of structure. The final query removed information to test the usefulness of redundancy in the excerpt case-base. The base set consists of the following query types:

- bag of words,
- sum,
- bag of words plus phrases,
- sum plus phrases, and
- set of words.

As previously mentioned, we started with two simple queries: the first being the bag of words query and the second being the sum query. The first is representative of a very long natural language query. By wrapping a #Passage20 operator around the entire excerpt-ckb, the entire set was treated as one long query. As stated above, the #Passage20 operator actually translates into a weighted sum based on the frequencies of the contained terms. This query may seem a bit simplistic, but it is for that reason that we wanted to test it. The simplest query might work perfectly well in our scenario.

The score for a passage when using the #Sum operator was based on the best match between the window terms and each excerpt. The use of a #Sum operator surrounding each excerpt restricted the number of terms that could match against the twenty-word passage window to those in a single excerpt. With the bag of words query, the passage score was based on matches found with terms from any of the excerpts. To illustrate, below are portions of the bag of words and sum queries for the feature of loan due date.

#Passage20( Repayment on the loan after the expiration of the grace period, began on January 1, 1983, but Mr. Ali was given an additional deferment until July 1, 1983. became due one year after graduation . . . );

#Passage20( #Sum( Repayment on the loan after the expiration of the grace period, began on January 1, 1983, but Mr. Ali was given an additional deferment until July 1, 1983.) #Sum( became due one year after graduation) . . . );
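As a minimal sketch of how these two simplest query strings could be assembled from an excerpt case-base (illustrative only; "excerpts" is any list of excerpt strings, and the operator syntax follows the examples above):

    def bag_of_words_query(excerpts, window=20):
        # One #Passage window over the whole excerpt-ckb, treated as a single long query.
        return "#Passage%d( %s );" % (window, " ".join(excerpts))

    def sum_query(excerpts, window=20):
        # Each excerpt wrapped in its own #Sum operator inside the passage window.
        wrapped = " ".join("#Sum( %s)" % e for e in excerpts)
        return "#Passage%d( %s );" % (window, wrapped)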

It was also reasonable to collect information about the phrases in the excerpts and to add this data to the queries. There existed an INQUERY tool that made this possible, even when our queries were not complete sentences. The third and fourth base queries were built by passing the excerpts through the INQUERY phrase generation tool. The generated phrases replaced the original excerpt terms. The use of the #Phrase operator further restricted the proximity constraints for certain word combinations. The #Phrase operator added belief when the enclosed words appeared within three terms of each other, order dependent. Depending on the relative belief of the given terms and their co-occurrence within the document collection, the #Phrase operator may split into one of several other operators. Phrases were added to the bag of words and sum queries to produce the third and fourth base queries, respectively. Below we show the phrases that were added to those portions of the previously given bag of words and sum queries for loan due date.

#Passage20( Repayment on the loan after the expiration of the #Phrase( grace period), began on January 1, 1983, but #Phrase( Mr. Ali) was given an #Phrase( additional deferment) until July 1, 1983. became due #Phrase( one year) after graduation . . . );


#Passage20( #Sum( Repayment on the loan after the expiration of the #Phrase( grace period), began on January 1, 1983, but #Phrase( Mr. Ali) was given an #Phrase( additional deferment) until July 1, 1983.) #Sum( became due #Phrase( one year) after graduation) . . . );
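A small illustration of the wrapping step, assuming the phrase list has already been produced by the phrase-generation tool (this is not the tool itself, only the substitution of #Phrase operators into an excerpt):

    def add_phrase_operators(excerpt, phrases):
        # Wrap each already-identified phrase in a #Phrase operator.
        for phrase in phrases:
            excerpt = excerpt.replace(phrase, "#Phrase( %s)" % phrase)
        return excerpt

    print(add_phrase_operators("became due one year after graduation", ["one year"]))
    # -> became due #Phrase( one year) after graduation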

The final query type dealt with the issue of redundant parses (i.e., two excerpts may reduce to the same stopped and stemmed set of terms, either in the same order or not). If we had decided that the fact that an item existed in more than one excerpt was valuable, how were we to incorporate this information? How would we handle terms, or phrases, that were identical across excerpts, or even entire excerpts that were identical? Operationally, keeping redundant items was simple; multiple instances of an item could be reflected in the query term weights. This is what is currently done with natural language queries by INQUERY and what we did with the bag of words queries. Testing the removal of redundant items was slightly more difficult: we had INQUERY merge the items under the #Wsum operator. The system then went back and reweighted all the query nodes so that they shared equal weights.

This fifth type of base query examined whether words found in more than one excerpt should be given higher weights based on their frequency within the excerpt corpus. These queries were built by taking the bag of words query and removing duplicate words, thus removing the frequency information. We refer to these as the set of words queries. Below, we first give part of the bag of words query and then the resulting set of words query for the feature of debt type. Words that were removed to make the set of words query are given in boldface in the bag of words query.

#Passage20( nature of the debts as student loans two unsecured and otherwise nondischargeable student loan debts sixty-five percent of his total debt is student loan obligations . . . );


#Passage20( nature of the debts as student loans two unsecured and otherwise nondischargeable sixty-five percent his total is obligations . . . );
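A sketch of the duplicate-removal step, under the assumption that it simply keeps the first occurrence of each stopped and stemmed term while preserving order:

    def set_of_words_query(terms, window=20):
        # The bag of words query with duplicate terms removed, which discards
        # the frequency information that the bag of words query keeps.
        seen, unique_terms = set(), []
        for term in terms:
            if term not in seen:
                seen.add(term)
                unique_terms.append(term)
        return "#Passage%d( %s );" % (window, " ".join(unique_terms))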

These five queries make up the base set. They are all simple queries that required only a limited amount of processing to create.

5.1.2 Kwok-based weighting

The second set of queries was framed loosely on a weighting scheme suggested by Kwok [34]. Kwok developed his weighting based on the idea that queries with only a few terms did not convey enough information about the relative importance of each of the various terms to accurately depict the information need. To increase the probability of finding useful documents, he chose to automatically "boost" the scores of some of the terms to reflect their importance.

The weighting scheme was built on the notion of "average term frequency" (or "average tf"). This was defined as the average number of times a term appeared within any document in the collection. By counting the number of documents where the term appeared and then, knowing the total number of times the term was in the collection, the average tf was calculated. The average tf was then balanced with either the inverse document frequency or by a cutoff. The average tf may be given more or less importance by the use of an exponent, typically ranging between 1.0 and 2.0. This is shown in Equation 5.1. All scores were then normalized so that the sum of the term weights added to 1.0.

    avg_tf_i / log( max( cutoff, Doc_Freq_i ) )                          (5.1)

Experiments were done with the INQUERY system and the formula yielding the best variation on this approach is given in Equation 5.2. The query count was retained to reflect any user-given relative importance and was used in conjunction with the average tf. Max_idf was the largest idf value found in the collection. If the average tf for a term was less than 1.2 then the term was given a weight of 0.0.

    ( Query_Cnt_i * avg_tf_i * log( Coll_Size / Doc_Freq_i ) ) / max_idf        (5.2)

For this set of queries, SPIRE gathered all the excerpt terms for a feature and reweighted them using the formula given in Equation 5.2. The terms were placed in descending order of perceived importance using the new weights. From this initial ordering, SPIRE generated several new queries:

1. simple reweighting of all query terms,
2. selection of the top 10 terms with their new weights,
3. selection of the top 10 terms, but reweighted evenly (all terms received weights of 1.0),
4. selection of the top 20 terms with their new weights, and
5. selection of the top 20 terms, but the terms reweighted equally.

Table 5.1 gives the top 20 terms and their weights for the Kwok-based queries for special circumstances and sincerity. The excerpt terms were stopped and stemmed prior to reweighting. The first term for sincerity, "bny", is an abbreviation for the Bank of New York, a creditor in one of the cases.
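The reweighting of Equation 5.2 and the five variants above can be sketched as follows. This is an illustration rather than SPIRE's code: coll_freq, doc_freq, and coll_size are hypothetical collection statistics (total occurrences of a term, number of documents containing it, and collection size), and max_idf here is taken over the supplied terms rather than the full collection vocabulary.

    import math

    def kwok_weights(query_counts, coll_freq, doc_freq, coll_size):
        # Largest idf among the supplied statistics (stand-in for the collection-wide max idf).
        max_idf = max(math.log(coll_size / df) for df in doc_freq.values())
        weights = {}
        for term, query_cnt in query_counts.items():
            avg_tf = coll_freq[term] / doc_freq[term]   # average term frequency
            if avg_tf < 1.2:                            # low-avg-tf terms receive weight 0.0
                weights[term] = 0.0
            else:
                idf = math.log(coll_size / doc_freq[term])
                weights[term] = query_cnt * avg_tf * idf / max_idf
        return weights

    def select_top_terms(weights, k, even=False):
        # Variants 2-5: keep the k best terms, optionally resetting their weights to 1.0.
        best = sorted(weights, key=weights.get, reverse=True)[:k]
        return {t: (1.0 if even else weights[t]) for t in best}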

Table 5.1. The top 20 terms and their weights for the Kwok-based queries for special circumstances and sincerity.

    Special Circumstances               Sincerity
    0.826753  victim                    6.182001  bny
    0.818761  desert                    1.139891  sincer
    0.732213  incarcer                  1.107249  maine
    0.622541  ill                       0.638365  earnest
    0.616444  penitentiar               0.636542  repay
    0.607779  hospit                    0.627096  sensibl
    0.597349  week                      0.521929  concert
    0.555805  medic                     0.499889  wipe
    0.554433  marit                     0.463457  genuin
    0.529763  bellgraph                 0.407922  desir
    0.504582  condit                    0.396330  effort
    0.489071  mexico                    0.357844  believ
    0.470094  extraordinar              0.304697  coupl
    0.461580  deterior                  0.272390  testimon
    0.459047  disabl                    0.272328  avoid
    0.444904  fraudul                   0.268532  regular
    0.399679  special                   0.247390  live
    0.390049  contribut                 0.243003  motiv
    0.376766  forc                      0.235132  permiss
    0.374285  bkrtcy.w.d.n.y.           0.230153  60

5.1.3 Semi-random sets of terms

The last type of SPIRE-generated query we investigated was a set of "semi-random" queries. To create these, SPIRE randomly selected either one-half or one-third of the available query terms. Each query term was considered equally likely (i.e., excerpt frequency was not used) and available only once (i.e., selection without replacement). We produced twenty queries with one-half of the terms, and ten with one-third. Table 5.2 gives a sample of the terms found in the semi-random queries for the feature of sincerity. These queries addressed the concern that there might have been too many terms available for matching, especially since some of the excerpt-ckbs had fairly large numbers of terms. This in turn might have allowed too many passages to receive high belief scores.
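A minimal sketch of this selection, assuming it amounts to sampling without replacement from the unique excerpt terms (the term list shown is a made-up fragment, not an actual excerpt-ckb):

    import random

    def semi_random_queries(excerpt_terms, fraction, n_queries, seed=13):
        rng = random.Random(seed)
        unique_terms = sorted(set(excerpt_terms))      # frequency information is discarded
        k = max(1, int(len(unique_terms) * fraction))
        return [rng.sample(unique_terms, k) for _ in range(n_queries)]

    terms = ["sincer", "repay", "earnest", "genuin", "desir", "effort", "motiv"]
    half_queries  = semi_random_queries(terms, 1/2, 20)   # twenty queries with 1/2 of the terms
    third_queries = semi_random_queries(terms, 1/3, 10)   # ten queries with 1/3 of the terms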

5.2 Manual queries

To provide another point of comparison, we also had a human expert, familiar with the domain, the INQUERY IR engine, and the IR engine's query operators, create queries. (The author was the expert.) The expert could use any of the available INQUERY operators and was able to refine the queries as many times as desired. There was no time limit imposed. Some additional query refinement was allowed

Table 5.2. Sample Semi-random queries for sincerity. 1/3 of Query Terms extend possibl believ e ort believ extend complet sought complet repay complet regular desir continu expect permiss attempt testimon sincer regular three month intend desir make testimon motiv claim desir debt temper make petit concert debt coupl earnest coupl genuin plan wipe uncontrovert return creditor temper court 13 maine bny return concert

1/2 of Query Terms possibl possibl possibl extend deal believ e ort repres deal complet consist e ort continu repay complet regular 60 repay intend expect continu testimon intend regular three testimon live mean motiv desir sincer period period made made sincer month make month make debt debt debt unsecur unsecur coupl petit coupl genuin uncontrovert uncontrovert uncontrovert wipe wipe claim temper claim maine sensibl maine sensibl 13 concert creditor plan plan earnest return return plan avoid avoid chapter debtor chapter debtor bny debtor


after judgments for the 20 test documents were available. The goal was to create the best possible query, thus setting a high standard for SPIRE to attempt to achieve. The best query for each feature was retained for use as a baseline. We refer to this set as the "manual" queries. The complete set of manual queries can be found in Appendix D.

Besides this set of one query per feature, we also generated several queries that consisted of lists of words that were relevant to a "category" type of feature. Similar to keywords, these were lists of possible values for that feature. For example, for the feature of profession we enumerated job titles. There were three additional "list" queries of job titles for profession. (These queries are described in more detail in Section 6.2.9.) In addition to profession, a list query was created for debt type (see Section 6.2.6). We created these list queries to see how well a query built on the basis of keywords would perform. We were interested in using keywords that were the values of the feature, rather than keywords that described the feature. We also wanted to test how long the list would need to be in order to be effective: if a user has to produce a long list of keywords, then it might not be worth the user's effort to do so. On the other hand, if a few keywords would suffice, then using the excerpts would be more costly.

5.3 Second set of excerpts

We observed during our initial experiments that many of the non-relevant passages for some of the features included the specific names of debtors from a previously decided case. In our excerpt case-base, such names had sometimes been included, for instance, "Debtors-Appellants, Mr. and Mrs. Okoreeh-Baah, filed for bankruptcy on November 19, 1985." Since the case name, "Okoreeh-Baah", was included in the excerpt, it caused SPIRE to rate passages that included it very highly, even though the presence of this specific name does not make a passage relevant to the feature under discussion, plan filing date.

Based on this realization, we reexamined SPIRE's excerpt case-bases. Within the excerpts for several features, proper names were frequently included. Where reasonable (i.e., at the beginning or end of an excerpt), we subsequently removed any proper names. We also noted any text extraneous to describing the feature and removed it from the excerpt-ckbs. (For examples of excerpts where text was removed, see Section 6.2.5.) We created a second excerpt-ckb for the features of debt type, duration, future income, monthly income, plan filing date, and procedural status. (The original and the second set of excerpt-ckbs are given in Appendix C.) We used these second excerpt-ckbs to create another set of base queries.

5.4 Summary

We created three sets of excerpt-based queries (the base set, Kwok-based, and semi-random) and tested them against a set of manually crafted queries. We also created a second set of base queries using excerpt-ckbs that had been trimmed of extraneous information. In Table 5.3 we summarize the queries that were formed and the experiments that were conducted.


Table 5.3. Definitions of the different query types used in the experiments.

    Base - Simplest use of excerpts and query operators
        Bag of Words              Least restrictive, natural language query
        Sum                       Groups each excerpt together
        Bag of Words + Phrases    Use of #Phrase operator
        Sum + Phrases             Use of #Phrase operator
        Set of Words              Removed frequency information
    Kwok-based - Different term weighting and selection scheme
        All terms                 All reweighted
        Top 10 reweighted         Reweight + select 10 best terms
        Top 10 even               Reweight + select 10 best terms + reweight to 1.0
        Top 20 reweighted         Reweight + select 20 best terms
        Top 20 even               Reweight + select 20 best terms + reweight to 1.0
    Semi-random - Reduced query length
        1/2 Terms                 (20 queries)
        1/3 Terms                 (10 queries)
    Second Base - Refined excerpt-ckbs
    Expert Manual - Upper bound retrievals


CHAPTER 6

RESULTS

We ran SPIRE using three problem cases and collected the top documents for each. After removing duplicates and documents that had been used to derive the excerpt-ckbs, we selected documents from among the top 10 retrievals for each problem to form a test set of 20 documents. Using various methods for passage query generation, we tested SPIRE on these 20 documents with 10 different case features: duration, monthly income, sincerity, special circumstances, loan due date, plan filing date, debt type, profession, procedural status, and future income.

We report esl values when one, three, and five relevant passages were requested. We denote these as esl1, esl3, and esl5. When we report esl scores, a score of -1 indicates that the query did not retrieve the requested number of relevant passages. For significance testing we used a t-test at the .05 level on average non-interpolated precision and precision at six cutoff levels. The cutoff levels were 1, 3, 5, 10, 15, and 20 passages.

We first compare results within the set of base queries and then compare the best of these types to the Kwok-based and semi-random queries. We also give an overview of the differences when using the second set of excerpts. We then give an extended discussion of the best of the excerpt-based queries as compared to the expert manual queries.
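The significance test mentioned above can be sketched as follows, assuming a paired t-test over per-document scores; the precision values shown are hypothetical and only illustrate the call.

    from scipy import stats

    def significantly_better(scores_a, scores_b, alpha=0.05):
        # Paired t-test on per-document scores; True when a is better and p < alpha.
        t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
        return p_value < alpha and t_stat > 0

    manual_p10 = [0.8, 0.7, 0.9, 0.6, 0.8]   # hypothetical precision-at-10 values per document
    spire_p10  = [0.5, 0.6, 0.4, 0.5, 0.3]
    print(significantly_better(manual_p10, spire_p10))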


6.1 Comparison of excerpt-based queries

Across all the excerpt-based queries, the base set did the best. The Kwok-based weighting and selection of terms was not beneficial and was significantly worse than using the best of the base set. Using only some of the terms from the excerpts to make the semi-random queries also gave much worse results than the queries in the base set. Conversely, refining the excerpt-ckb was definitely worth doing, as many of the retrievals improved and some did so significantly.

6.1.1 Base queries

The first two base query types, the bag of words and sum queries, were fairly comparable in their performance. Notable differences were in the results for the special circumstances feature, where the bag of words query did better, and in the procedural status feature, where the sum query did better. These respective queries were better across all three esl levels. For the other features, the results for the bag of words queries were a bit better for duration and loan due date, the sum query was better for sincerity, and the others were mixed. Table 6.1 gives the average esl scores for both query types at the three esl levels for all features. Average precisions at the cutoffs are given within the discussion of each feature.

In general, the sum queries performed slightly better at esl1 and esl5, and the bag of words queries performed slightly better at esl3. Examination of the passages retrieved revealed no clear reason why one query would perform better than the other.

These first two types of query generally outperformed their equivalent query that included phrases (base query types three and four). This indicates that, in our application, phrases were detrimental rather than beneficial. This was surprising since phrases typically increase retrieval performance [13, 15, 34].


Table 6.1. Average esl values for the bag of words and sum queries for all features. Features with a * list values from the second excerpt-ckb.

                    1 psg requested      3 psgs requested     5 psgs requested
    Feature         Bag of    Sum        Bag of    Sum        Bag of    Sum
                    Words                Words                Words
    Debt Type*      0.47      0.30       1.37      1.34       2.17      2.34
    Duration*       1.45      1.70       19/3.0    19/4.0     18/5.6    18/6.7
    Future Inc*     18/0.6    18/0.2     16/1.8    16/2.0     15/3.0    15/3.3
    Loan Due        13/9.3    13/11.2    9/16.1    9/18.3     7/38.6    7/35.1
    Monthly Inc*    19/0.1    19/0.1     19/3.6    19/5.2     17/12.7   17/11.4
    Plan Filing*    2.55      2.55       6.85      8.25       18/22.3   18/22.8
    Proc Stat*      2.20      1.30       5.50      4.50       14.30     11.20
    Profession      18/1.8    18/1.3     15/7.5    15/8.2     13/9.0    13/11.3
    Sincerity       0.25      0.15       3.25      2.72       5.62      5.40
    Spec Circ       19/1.3    19/2.8     18/3.1    18/10.0    16/41.4   16/51.4

Closer examination of the SPIRE-generated phrases offered an explanation for this. In situations where phrases have proved beneficial, they were generated for use in full-document retrieval. For those retrievals the useful phrases were typically groups of nouns or noun phrases. The INQUERY tool used by SPIRE to generate phrases primarily created phrases from nouns found in close proximity. Some of these were not at all indicative of the feature. For example, "two unsecured" and "2000 bank" were generated for duration, "net disposable" and "sales capacity" were generated for monthly income, and "percentage repayment" and "student loan" were generated for procedural status. These all led to reduced performance by the queries.

The phrase generation tool did come up with some good phrases: "unsecured debts", "student loan", and "medical bills" for debt type, "nervous condition", "fraudulent practices" and "special circumstances" for special circumstances, and "veterinarian position" and "private practice" for profession. So there were features where the phrases helped as well as features where they hurt. Overall, though, the effect was negative.

The phrases that a user might cite as being descriptive for our features were not necessarily noun groups: "payments of" and "proposes to pay" for duration and "became due" for loan due date. These phrases all contain important words that stopping will remove. Therefore, for our features, the useful phrases were not all noun groups, although the phrases generated for our features were. Many of the generated phrases were not at all indicative of the feature, although for some features good phrases were generated and were beneficial. Further, many of the phrases that a user would specify contain stop words. If we were to retain all terms (i.e., not apply stopping) and statistically examine those terms that occurred near each other, then it is possible that a better set of phrases could be automatically generated.

We validated that term frequency information was important. The differences between the bag of words and set of words queries were significant. For all features, the bag of words queries outperformed the set of words queries at all esl levels. Table 6.2 gives the average esl results for both the bag of words and set of words queries for all features, and Table 6.3 shows those instances where average precision and average precision at the six cutoffs were significantly better for the bag of words queries.

Table 6.2. Average esl scores for the bag of words and set of words queries. Features with an * used the second excerpt-ckb for comparisons.

                    1 psg requested      3 psgs requested     5 psgs requested
    Feature         Bag of    Set of     Bag of    Set of     Bag of    Set of
                    Words     Words      Words     Words      Words     Words
    Debt Type*      0.47      0.73       1.37      2.70       2.17      4.17
    Duration*       1.45      10.85      19/3.0    19/19.6    18/5.6    18/30.7
    Future Inc*     18/0.6    18/1.1     16/1.8    16/6.7     15/3.0    15/13.2
    Loan Due        13/9.3    13/29.2    9/16.1    9/23.0     7/38.6    7/60.1
    Monthly Inc*    19/0.1    19/1.4     19/3.6    19/12.2    17/12.7   17/29.5
    Plan Filing*    2.55      15.60      6.85      37.19      18/22.3   18/51.4
    Proc Stat*      2.20      5.22       5.50      13.60      14.30     23.90
    Profession      18/1.8    18/5.0     15/7.5    15/13.9    13/9.0    13/18.8
    Sincerity       0.25      4.15       3.25      14.60      5.62      21.10
    Spec Circ       19/1.3    19/10.4    18/3.1    18/22.3    16/41.4   16/67.1

There were a few individual documents where the set of words query did better, but the number of times was very small. Overall, giving additional weight to multiply

Table 6.3. Instances when the bag of words query was significantly better than the set of words query using average precision at six cutoff points. Features with a * used the second excerpt-ckb for comparisons.

Cutoff

Debt Dura- Future Type* tion* Inc* 1 * 3 * * 5 * * 10 * * 15 * * * 20 * * * Avg Prec * * *

Loan Mthly Plan Proc. Profes- Sincere Spec Due Inc* File* Status* sion Circ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

stated terms resulted in better retrieval results. That is, giving additional weight to a term when it occurred multiple times in the excerpt-ckb was better than weighting each term evenly.

Within the base set, the bag of words and sum queries did the best. We go into more detail on these two types of queries and how they compared to the manually crafted queries, giving a feature-by-feature comparison in Section 6.2. Because the bag of words and sum queries were fairly close in performance, for simplicity, we compared the other excerpt-based query types to the bag of words query set.

6.1.2 Kwok-based weighting

The bag of words queries outperformed all of the Kwok-based versions. The best of the Kwok-based queries was the first type, where all excerpt terms were given new weights. Comparing this query, along with the bag of words query, to the manual queries showed that the Kwok-based weighting generally followed the performance of the bag of words across all features, yet at a reduced level.

Within the Kwok-based set, the inclusion of all terms beat those with just 20, which in turn beat those with just 10. Weighting the top terms equally versus using their new weights caused no noticeable difference in retrieval performance. Table 6.4

gives those features and precisions where the bag of words queries did significantly better than the best Kwok-based query.

Table 6.4. Instances where the bag of words query was significantly better than the best Kwok-based query.

Cuto Debt Dura- Future Loan Mthly Type tion Inc Due Inc 1 * * 3 * * * 5 * * * 10 * * * 15 * * * * 20 * * * * Avg Prec * * * * *

Plan Proc. Profes- Sincere Spec File Status sion Circ * * * * * * * * * * * * * * * * * *

The Kwok weighting scheme was intended to "boost" the weight of at most three terms within a single query. Under our scheme, every term had the possibility of receiving an altered weight. Further, it was possible for some terms to have their weights reduced to zero, thus eliminating them from the query altogether. Another difference between our system and Kwok's was that Kwok had a set of predefined phrases that were indexed as single terms. When these phrases appeared in a query, they were treated as though they were individual terms and received an appropriate new weight. Since these phrases were predefined and were calculated into the overall weighting of the query terms, they might have proved useful in this domain (where the noun groups collected by the INQUERY tool were not). Further, since these items were predefined and treated as atomic units, their tf and idf scores would have differed from those we used. This also may have affected the results.

6.1.3 Semi-random sets of terms

On the whole, the semi-random queries with one-half of the excerpt terms did better than those with only one-third, but only slightly so, and there were exceptions. The features of debt type, duration, and monthly income all did better when only

one-third of the terms were selected. Both sets performed decidedly worse than the bag of words queries. Table 6.5 displays those instances when the bag of words queries were significantly better than the semi-random queries containing 1/2 of the available terms when examining average precision at the cutoffs. The results displayed are for one set of the semi-random queries; the other semi-random queries performed approximately the same. The table showing significance for the semi-random queries with only 1/3 of the terms was very similar to Table 6.5.

Table 6.5. Instances where the bag of words query was significantly better than the semi-random query with 1/2 of the excerpt terms.

Cutoff Debt Dura- Future Loan Mthly Type tion Inc Due Inc 1 * * * 3 * * * * 5 * * * * * 10 * * * * * 15 * * * * * 20 * * * * * Avg Prec * * * * *

Plan Proc. Profes- Sincere Spec File Status sion Circ * * * * * * * * * * * * * * * * * * * * * * * * * * *

We did not expect that these queries would do better than those using the entire set of excerpt terms. The semi-random queries were somewhat biased in their creation in that words that were in the excerpt set more than once could only be selected once. As was seen from the comparison between the bag of words and set of words, repetition in the excerpt set was a useful piece of information. The semi-random queries were potentially hindered by not being able to select a term based on its statistical use within the excerpt set.

6.1.4 Second set of excerpts

We examined a set of base queries generated from the second excerpt-ckbs for the features of debt type, duration, future income, monthly income, plan filing date,

and procedural status. In general these excerpts created better queries than their counterparts because of the removal of misleading names and terms. Matches based on the inclusion of proper names had caused many incorrect passages to be retrieved, because the importance given to the names was generally quite high. In particular, the name "Flygare" appeared in several different excerpt-ckbs and belonged to a well-known and often-cited court opinion. However, the name of the opinion was not relevant for locating text about the individual features in our set.

Along with the removal of proper names, we also removed some extraneous text. This caused most retrievals to improve and a few to suffer. More in-depth analysis of how each feature was affected by the new excerpt-ckb can be found in the discussion of each feature in the next section. (Scores for these queries are also found within each feature's discussion.) The new set was an improvement over the old queries; frequently the relevant passages moved up in the ranking by as many as ten to twenty positions. From this experience, we concluded that one must be a bit more careful when creating the excerpt-ckb, particularly regarding the inclusion of proper names.

6.2 Comparison to the manual queries

Overall, the first two types of base query, the bag of words and sum query types, did the best of all the excerpt-based queries. For comparison purposes, we review how these two query types did against the set of expert-built queries. The expert queries were revised as relevance judgements were made available and they reflect a high level of knowledge engineering.

The excerpt-based queries compared quite favorably against the manual set. For most of the features they did as well as or even better than the manual set. At esl1, the SPIRE queries were slightly better. At esl3, they were close, with the manual queries being slightly better. When we increased to esl5, the difference was even less

and they were all about the same. Table 6.6 lists all ten features and the twenty test documents associated with three problem cases, and compares the esl3 scores. The table shows that the SPIRE-generated queries performed about as well as the manual queries at this level.

Table 6.6. Comparison at esl3 between manual and the SPIRE-generated bag of words and sum queries. An \SP" indicates that both SPIRE queries performed better than the manual. An \M" indicates that the manual query performed better. If the manual fell between the two, the SPIRE query performing the best is given: \b" for bag of words and \s" for sum. If all three queries performed equally well, an \=" is shown. Doc ID 001 180 188 204 206 260 289 353 407 427 442 472 693 733 751 764 860 877 915 961

Debt Duration Future Loan Mthly Plan Proc. Profes- Sincere Special Type Income Due Income Filed Status sion Circ = M = = = SP M SP M s M SP = M = M M = = s s M M SP SP M M = SP s SP SP = SP SP M SP M SP SP M M M = SP SP = SP SP = M SP SP = SP M M = = b = M = = = M M M SP SP = M = SP = = s SP = = SP = = = s M b b = SP = M M = SP SP M = = s SP = = = M M M M SP M s SP = = M b M SP = b = M M M s SP M M SP b = b = b = M M = SP = SP s SP SP = SP M s = b SP SP M SP M SP M SP SP = = b = = = M M M = M = SP = SP = M M = SP = b SP = = SP M M M = SP = M SP = = SP M SP = =

Table 6.7 gives the average score at each esl level for the manual, bag of words, and sum queries for all features. Features with an asterisk give values for the second excerpt-ckb. The number before the slash is the number documents included in the average, otherwise the average is over 20 documents. Note that for many of the

104

Table 6.7. Average esl values for the manual and SPIRE-generated queries for all features. Features with a * list values from the second excerpt-ckb.

                    1 psg requested              3 psgs requested             5 psgs requested
    Feature         Manual   Bag of   Sum        Manual   Bag of   Sum        Manual   Bag of   Sum
                             Words                        Words                        Words
    Debt Type*      0.10     0.47     0.30       1.27     1.37     1.34       2.45     2.17     2.34
    Duration*       19/0.6   1.45     1.70       18/2.2   19/3.0   19/4.0     17/3.8   18/5.6   18/6.7
    Future Inc*     18/0.0   18/0.6   18/0.2     16/1.1   16/1.8   16/2.0     15/2.0   15/3.0   15/3.3
    Loan Due        10/8.9   13/9.3   13/11.2    7/16.3   9/16.1   9/18.3     4/26.4   7/38.6   7/35.1
    Monthly Inc*    19/0.23  19/0.08  19/0.08    17/0.82  19/3.6   19/5.2     16/2.9   17/12.7  17/11.4
    Plan Filing*    1.77     2.55     2.55       19/7.8   6.85     8.25       17/8.0   18/22.3  18/22.8
    Proc Stat*      0.63     2.20     1.30       1.17     5.50     4.50       1.24     14.30    11.20
    Profession      16/1.1   18/1.8   18/1.3     12/1.7   15/7.5   15/8.2     8/2.7    13/9.0   13/11.3
    Sincerity       16/0.0   0.25     0.15       12/0.05  3.25     2.72       7/0.04   5.62     5.40
    Spec Circ       16/0.6   19/1.3   19/2.8     14/0.9   18/3.1   18/10.0    9/4.6    16/41.4  16/51.4

retrievals the manual query was unable to retrieve the requested number of passages while the SPIRE queries were able to do so. Table 6.8 gives averages similar to those in Table 6.7, but only includes in the average those documents where all three queries were able to retrieve the requested number of passages. This lowers more than half of the averages for each of the SPIRE queries (only one value increased). Thus, on many of the documents where the manual query was unable to retrieve the requested number of relevant passages, SPIRE's retrievals were deeper than the average.

In Table 6.9 we illustrate the differences in the esl averages when we allow all passages to have a default value during retrieval. To calculate this value, all non-retrieved passages were given the same belief score and therefore the same low probability of retrieval. For example, notice that the averages for special circumstances for the SPIRE queries are now much lower than those for the manual query and that the averages for sincerity are now closer to parity. Since many more relevant passages were in the total ranking for the manual query than previously, and these passages were all very deep in the rankings, its averages go up quite a bit.
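A sketch of this default-belief calculation, under the assumption that the non-retrieved passages are simply appended as one tied group at the bottom of the ranking; it reuses the expected_search_length sketch from Section 4.7.2, which is assumed to be in scope.

    def esl_with_default_belief(ranking, total_relevant, total_passages, q,
                                default_score=0.0):
        # Append every passage the query did not retrieve as a single tied group
        # with one shared (low) default belief score, then compute esl as before.
        retrieved_relevant = sum(rel for _, rel in ranking)
        missing_relevant = total_relevant - retrieved_relevant
        missing_nonrelevant = (total_passages - len(ranking)) - missing_relevant
        tail = ([(default_score, 1)] * missing_relevant +
                [(default_score, 0)] * missing_nonrelevant)
        return expected_search_length(ranking + tail, q)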


Table 6.8. Average esl values, only including those documents where all three queries

were able to retrieve the requested number of passages. Features with a * list values from the second excerpt-ckb. 1 psg requested Manual Bag of Sum Words Debt Type* 0.10 0.47 0.30 Duration* 19/0.6 19/0.42 19/0.63 Future Inc* 18/0.0 18/0.6 18/0.2 Loan Due 10/8.9 10/5.8 10/7.8 Monthly Inc* 19/0.23 19/0.08 19/0.08 Plan Filing* 1.77 2.55 2.55 Proc Stat* 0.63 2.20 1.30 Profession 16/1.1 16/1.4 16/0.8 Sincerity 16/0.0 16/0.0 16/0.0 Spec Circ 16/0.6 16/0.8 16/1.1 Feature

3 psgs requested Manual Bag of Sum Words 1.27 1.37 1.34 18/1.9 18/1.9 18/2.7 16/1.1 16/1.8 16/2.0 7/16.3 7/15.8 7/17.4 17/0.82 17/1.0 17/0.71 19/7.8 19/6.15 19/7.79 1.17 5.50 4.50 12/1.7 12/6.5 12/6.5 12/0.05 12/0.05 12/0.05 14/0.9 14/2.1 14/2.4

5 psgs requested Manual Bag of Sum Words 2.45 2.17 2.34 17/3.8 17/3.75 17/4.21 15/2.0 15/3.0 15/3.3 4/26.4 4/36.8 4/37.6 16/2.9 16/3.7 16/2.3 17/8.0 17/12.2 17/12.7 1.24 14.30 11.20 8/2.7 8/5.5 8/8.1 7/0.04 7/0.35 7/0.30 9/4.6 9/13.3 9/13.9

Table 6.9. Average esl values when all passages have a default belief. Features with a * list values from the second excerpt-ckb. 1 psg requested Feature Manual Bag of Sum Words Debt Type* 0.10 0.47 0.30 Duration* 4.94 1.45 1.70 Future Inc* 18/0.0 18/0.6 18/0.2 Loan Due 13/39.4 13/9.3 13/11.2 Monthly Inc* 19/0.2 19/0.1 19/0.1 Plan Filing* 1.77 2.55 2.55 Proc Stat* 0.63 2.20 1.30 Profession 18/4.0 18/1.8 18/1.3 Sincerity 4.09 0.25 0.15 Spec Circ 19/12.9 19/1.3 19/2.8

3 psgs requested Manual Bag of Sum Words 1.27 1.37 1.34 19/12.1 19/3.0 19/4.0 16/1.1 16/1.8 16/2.0 10/83.5 10/56.7 10/58.7 19/16.1 19/3.6 19/5.2 17.58 6.85 8.25 1.17 5.50 4.50 17/29.4 17/24.9 17/25.4 17.77 3.25 2.72 18/24.8 18/3.1 18/10.0


5 psgs requested Manual Bag of Sum Words 2.45 2.17 2.34 18/19.9 18/5.6 18/6.7 16/5.0 16/9.4 16/9.6 8/173.2 8/128.3 8/125.2 18/21.9 18/19.3 18/18.1 18/21.2 18/22.3 18/22.8 1.24 14.30 11.20 14/29.2 14/17.4 14/19.5 46.87 5.62 5.40 16/82.8 16/41.4 16/51.4

Another way to evaluate performance would have been to consider all passages below a certain point in the ranking, say 30 passages deep, as requiring too much effort to retrieve. Under this scenario, the scores would then reflect essentially a random retrieval past this threshold.

Table 6.10. Average precision (non-interpolated) for the manual and excerpt-based queries. Features with a * list values from the second excerpt-ckb. Values with a * are significant.

                         Average Precision (Non-Interpolated)
    Feature           Manual   Bag of   % Chg    t-test    Sum      % Chg    t-test
                               Words
    Debt Type*        0.5691   0.6191   8.77     0.0543    0.6131   7.72     0.0436*
    Duration*         0.4534   0.5307   17.05    0.1475    0.5021   10.75    0.3084
    Future Income*    0.6792   0.6320   -6.95    0.1531    0.6246   -8.03    0.0949
    Loan Due Date     0.1693   0.3127   84.75    0.1149    0.3404   101.10   0.1158
    Monthly Income*   0.5837   0.5863   0.45     0.9261    0.6137   5.15     0.2414
    Plan Filing*      0.3880   0.4170   7.47     0.4508    0.3900   0.50     0.9638
    Proc Stat*        0.6339   0.2906   -54.15   0.0000*   0.3176   -49.90   0.0000*
    Profession        0.3271   0.3968   21.31    0.2633    0.3803   16.26    0.3812
    Sincerity         0.3004   0.4689   56.09    0.0015*   0.4776   58.99    0.0009*
    Special Circ      0.3871   0.4603   18.89    0.2077    0.3748   -3.19    0.7037

We also include Table 6.10 with average precisions for comparison purposes. Again, features with an asterisk give values for the second excerpt-ckb. Where there is an asterisk next to a t-test level, the value was significantly different from the manual query. Because we did not have a total set of judgments, average precision should not be used exclusively to judge the experiments, but only as a rough measure of the outcome. Based on our use of pooled judgements, it is likely that the majority of relevant passages were judged. (The exception to this was with procedural status, where there were a plethora of citations within some of the documents and therefore some of the relevant passages might not have been judged.) Table 6.11 gives the average precisions at a cutoff of ten passages. This is a reasonable level to expect a human reader to process in search of relevant information.


This table reflects that, overall, the SPIRE queries did better than the best manual queries.

Table 6.11. Average precision at 10 passages for the manual and excerpt-based queries. Features with a * list values from the second excerpt-ckb. Values with a * are significant.

                         Average Precision at 10 Passages
    Feature           Manual   Bag of   % Chg   t-test    Sum      % Chg   t-test
                               Words
    Debt Type*        0.6950   0.7450   7.2     0.2981    0.7550   8.6     0.1625
    Duration*         0.5050   0.5200   3.0     0.6513    0.5200   3.0     0.6663
    Future Income*    0.7056   0.6167   -12.6   0.0722    0.6167   -12.6   0.0453*
    Loan Due Date     0.1769   0.2308   30.4    0.4531    0.1923   8.7     0.8243
    Monthly Income*   0.6000   0.6211   3.5     0.5311    0.6316   5.3     0.3012
    Plan Filing*      0.4150   0.4150   0.0     1.0000    0.4050   -2.4    0.8098
    Proc Stat*        0.7850   0.3650   -53.5   0.0000*   0.3650   -53.5   0.0000*
    Profession        0.4111   0.4222   2.7     0.8704    0.3556   -13.5   0.4136
    Sincerity         0.4300   0.5950   38.4    0.0095*   0.6100   41.9    0.0040*
    Special Circ      0.3789   0.4211   11.1    0.4241    0.3368   -11.1   0.3227

These various tables all provide different ways of examining the results. In general, they give the same overall picture, but from different perspectives: the two excerpt-based queries did comparably well on most of the features, but there were some noticeable differences. There were two features where the manual queries did better, procedural status and future income, and two features where the SPIRE-based queries did distinctly better, sincerity and loan due date. With the other features, the results were closer.

Below we analyze performance across each feature. We discuss results from the original case-base and those instances where we removed proper names and other proper nouns, as well as unnecessary text, to form a second excerpt-ckb. We begin by covering the two features where SPIRE did not do as well as the manual queries: procedural status and future income. We then analyze the two where SPIRE did better: sincerity and loan due date. Finally, we review the rest of the features: debt type, duration, monthly income, plan filing date, profession, and special circumstances.

6.2.1 Procedural status

Within the test documents there were many references to the procedural status of a case, although the discussion was not necessarily about the status of the current situation. In particular, when giving a citation, the opinion would frequently state the outcome of the cited case, such as "affirmed" or "confirmation denied", which lent information as to the procedural status of the previously decided case. Therefore, there tended to be many passages within each document that discussed the procedural status of a case and were relevant to the feature, but did not give any information about the value of this particular case's procedural status. This led to procedural status having the largest average number of relevant passages in a document, 31.4. (Refer back to Table 4.8 for statistics on individual documents.)

The manual query was consistently able to retrieve and highly rank passages that mention the procedural status of a case, while the queries from the excerpts did not do as well. Table 6.12 gives the values for the manual and the two SPIRE queries at all three esl levels. The reason for the success of the manual query is easily explained: discussion about the feature will normally include at least one of a small set of easily enumerated keywords, such as "confirmation" or "appeal". Not all of these terms were present in SPIRE's excerpt-ckb, but all were included in the manual query. For example, "affirmation" was never given as the status of any of the cases found in our small corpus. This is an instance where a domain-specific vocabulary, particularly one with a limited set of terms, is easily enumerated and should be used as the basis for forming queries on a feature. The final manual query used for procedural status was:

!c! Procedural Status
#q1= #passage20( #phrase( confirmation of her plan)
     #phrase( confirmation of his plan)
     #phrase( confirmation of their plan)


Table 6.12. ESL values for the manual and SPIRE-generated queries for procedural status.

PROCEDURAL STATUS 1 psg requested 3 psgs requested 5 psgs requested Doc-ID Manual Bag of Sum Manual Bag of Sum Manual Bag of Sum Words Words Words 001 0.00 0.00 3.00 3.00 5.00 8.00 3.00 9.00 10.00 180 2.00 0.00 2.00 2.00 10.00 8.00 2.00 16.00 8.00 188 0.00 1.00 0.00 0.00 9.00 4.00 0.00 13.00 17.00 204 0.00 0.00 0.00 4.00 0.00 0.00 4.00 7.00 4.00 206 0.00 0.00 0.00 0.00 0.00 0.00 0.00 14.00 7.00 260 0.00 5.00 5.00 0.00 10.00 10.00 0.67 12.00 12.00 289 6.67 2.00 1.00 8.00 19.00 14.00 8.00 68.00 30.00 353 0.00 1.00 0.00 0.00 4.00 0.00 0.00 18.00 3.00 407 0.00 0.00 0.00 0.33 0.00 2.00 0.67 3.00 4.00 427 0.00 19.00 8.00 1.00 19.00 11.00 1.00 22.00 17.00 442 0.00 0.00 0.00 1.00 3.00 3.00 1.00 8.00 8.00 472 0.00 0.00 4.00 0.00 6.00 10.00 0.00 6.00 14.00 693 0.00 2.00 3.00 0.00 6.00 6.00 0.00 14.00 8.00 733 0.00 3.00 1.00 0.00 4.00 1.00 0.00 5.00 1.00 751 0.00 9.00 1.00 0.00 17.00 4.00 0.00 23.00 9.00 764 2.00 5.00 2.00 2.00 6.00 9.00 2.00 20.00 20.00 860 0.00 9.00 1.00 0.00 16.00 2.00 0.00 16.00 10.00 877 0.00 3.00 3.00 0.00 5.00 3.00 0.00 7.00 3.00 915 2.00 11.00 0.00 2.00 14.00 9.00 2.00 50.00 15.00 961 0.00 2.00 1.00 0.00 11.00 17.00 0.40 25.00 25.00 Avg 0.63 3.60 1.75 1.17 8.20 6.05 1.24 17.80 11.25


#phrase( objects to confirmation) objects objections appeal affirmation confirmation denial vacation );

Upon examination of the excerpt-ckb, we noted that the excerpts for procedural status had quite a bit of leading text that either contained proper names or was superfluous. Four of the excerpts were shortened when we created the second case-base. Below are the original excerpts and the shortened versions that resulted:

- Excerpt: "debtor, Dr. Deborah Hawkins, seeks this Court's confirmation of her Chapter 13 repayment plan"
  Became: "seeks this Court's confirmation of her Chapter 13 repayment plan"

- Excerpt: "New York State Higher Education Services Corporation (`NYSHESC'), and the State University of New York (`SUNY'), have timely filed [sic] objections to confirmation of Laura Makarchuk's (`Debtor') [sic] proposed Chapter 13 plan."
  Became: "have timely filed objections to confirmation of Laura Makarchuk's (`Debtor') proposed Chapter 13 plan"

- Excerpt: "Gordon and Sharon Flygare appeal the denial of confirmation of their Chapter 13 bankruptcy plan."
  Became: "appeal the denial of confirmation of their Chapter 13 bankruptcy plan."

- Excerpt: "BNY's motion relates to dismissal or conversion, and not confirmation"
  Became: "motion relates to dismissal or conversion, and not confirmation"

Removal of these terms was generally useful, but occasionally detrimental. Table 6.13 shows the reduction in the number of passages that must be read in order

to reach the various numbers of requested relevant passages. The table shows that the bag of words query certainly benefited more from removing these terms than did the sum query. In fact, the changes had basically little effect on the sum query.

Table 6.13. Average reduction in esl for the two SPIRE queries for procedural status when using the shortened excerpt set.

    ESL level    Bag of Words    Sum
    1            1.45            0.45
    3            2.70            1.55
    5            3.60            0.05

For both queries the improvement in average precision between the two excerpt-ckbs was significant. (For the bag of words query the improvement was 18.55% with a t-level of 0.0076; for the sum query it was 11.68% with a t-level of 0.0141.) This was primarily due to improvements at mid-range levels of recall (30 to 70 percent). Precision at the top-ranked passage went up by 28.6% for both queries. Further, the bag of words query improved over the top 5 passages retrieved by 50.0%.

Two documents were particularly affected in a negative fashion by the new queries, 001 and 353. The esl scores for both of these documents increased by between 8 and 24 non-relevant passages at the esl5 level. The difference was because of the removal of the term "Education" from the excerpt-ckb. Some of the relevant passages coincidentally contained "educational", which stemmed to "education", and thus these passages now had fewer matching terms. Table 6.14 provides the esl scores for the second excerpt-ckb.

Other documents were negatively affected by having the shortened excerpts. In particular, with the sum query at esl5, document 915 increased by 7, document 733 increased by 6, and there were several other minor increases. On the other hand, document 961 decreased by 14, document 188 decreased by 9, two documents decreased by 7, and several others had decreases of 5 or less. The net result was a change of only 1 fewer passage to be read among the 20 documents for the sum query at this level. Even though

Table 6.14. ESL values for the manual and SPIRE-generated queries for the second

excerpt-ckb for procedural status.

PROCEDURAL STATUS 1 psg requested 3 psgs requested 5 psgs requested Doc-ID Manual Bag of Sum Manual Bag of Sum Manual Bag of Sum Words Words Words 001 0.00 5.00 2.00 3.00 7.00 10.00 3.00 17.00 29.00 180 2.00 0.00 2.00 2.00 3.00 5.00 2.00 11.00 6.00 188 0.00 0.00 0.00 0.00 0.00 1.00 0.00 4.00 8.00 204 0.00 0.00 0.00 4.00 0.00 1.00 4.00 2.00 4.00 206 0.00 0.00 0.00 0.00 0.00 0.00 0.00 13.00 8.00 260 0.00 2.00 1.00 0.00 6.00 9.00 0.67 8.00 9.00 289 6.67 4.00 1.00 8.00 23.00 11.00 8.00 54.00 28.00 353 0.00 3.00 0.00 0.00 4.00 1.00 0.00 42.00 23.00 407 0.00 0.00 0.00 0.33 0.00 0.00 0.67 0.00 0.00 427 0.00 14.00 8.00 1.00 14.00 8.00 1.00 17.00 10.00 442 0.00 0.00 0.00 1.00 2.00 2.00 1.00 7.00 6.00 472 0.00 0.00 3.00 0.00 4.00 8.00 0.00 4.00 9.00 693 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 4.00 733 0.00 2.00 1.00 0.00 3.00 2.00 0.00 5.00 7.00 751 0.00 3.00 0.00 0.00 3.00 2.00 0.00 9.00 2.00 764 2.00 2.00 3.00 2.00 3.00 4.00 2.00 16.00 24.00 860 0.00 1.00 1.00 0.00 7.00 3.00 0.00 7.00 8.00 877 0.00 3.00 3.00 0.00 5.00 3.00 0.00 7.00 6.00 915 2.00 5.00 0.00 2.00 16.00 10.00 2.00 47.00 22.00 961 0.00 0.00 1.00 0.00 10.00 10.00 0.40 12.00 11.00 Avg 0.63 2.20 1.30 1.17 5.50 4.50 1.24 14.30 11.20


there were documents where the new bag of words query did worse, overall there was a net decrease of 62 non-relevant passages across the 20 documents, as there were 5 documents for which the esl improved by 10 or more passages.

Reviewing the non-relevant retrieved passages shows which terms from the excerpts were the most problematic. A single term caused a large number of non-relevant passages to be retrieved: "repayment". It was given in the following excerpt: "creditors objected to confirmation of debtor's Chapter 13 repayment plan". The other primary match for false hits was the combination of "student" with "loan", as given in this excerpt: "In this chapter 13 proceeding, one student loan creditor, . . . objected to confirmation of debtors' plan." Since many of the cases involved student loans as one of the sources of debt, this phrase appeared quite often.

Even though we removed several proper names from the excerpts for the second case-base, we were unable to remove one: "Makarchuk". This name was in the midst of an excerpt and we opted not to remove any text from within an excerpt, only from the ends. This left intact the following excerpt: "objections to confirmation of Laura Makarchuk's (`Debtor') proposed Chapter 13 plan." This caused a few non-relevant passages to be ranked highly and might cause problems with a corpus of cases that occur chronologically later and refer back to this opinion.

The excerpts did not perform nearly as well as the manual query when examining esl values. Additionally, the excerpts were significantly worse on average precision and at every cutoff level we measured. Table 6.15 gives both the average precisions at the six cutoffs as well as the average non-interpolated precision.

Procedural status is a feature for which it is relatively easy to enumerate a short list of keywords that one would expect to see given as the value for each court case. Because there is this list of keywords, other terms, such as those found in the excerpt-ckbs, proved to be noisy and obfuscated retrieval. If one were to be more restrictive in judging what text was relevant (e.g., only allow text that described the procedural

Table 6.15. Average precision at cutoffs and non-interpolated precision for procedural status. Values with a * are significant.

    Cutoff   Manual   Bag of   Pct Chg   t-test    Sum      Pct Chg   t-test
                      Words
    1        0.8000   0.4500   -43.8     0.0153*   0.4500   -43.8     0.0153*
    3        0.7667   0.4167   -45.7     0.0021*   0.4500   -41.3     0.0015*
    5        0.8000   0.3900   -51.3     0.0000*   0.4300   -46.3     0.0000*
    10       0.7850   0.3650   -53.5     0.0000*   0.3650   -53.5     0.0000*
    15       0.7267   0.3200   -56.0     0.0000*   0.3867   -46.8     0.0000*
    20       0.6725   0.3325   -50.6     0.0000*   0.3325   -50.6     0.0000*
    Avg      0.6339   0.2906   -54.15    0.0000*   0.3176   -49.90    0.0000*

status of the case at hand to be relevant), then it is possible that the excerpts might provide benefit in giving context.

6.2.2 Future income

Not all of the test documents contained text that described the likelihood of an increase in the debtor's income, or future income. Two of the twenty had no relevant text and two others had only two relevant passages. The rest of the documents had seven or more relevant passages, with an average of 15.4 relevant passages per text. (Refer back to Table 4.8 for the number of relevant passages found in each document.)

When reviewing the excerpts for extraneous text and proper nouns, only one small section of text was removed: "Mr. Severs testified that". This caused only minor changes in the rankings; none of the scores varied by more than two passages. Overall, the effect was positive, while there were two documents that had worse scores by a matter of one passage at either esl1 or esl3. Neither of these showed any degradation at later request levels. Table 6.16 shows the esl values for all of the test documents using the second excerpt-ckb.

The SPIRE queries did not do as well as the manual query for future income, but, in general, the differences were not large when we examined esl scores. However,


Table 6.16. ESL values for the manual and SPIRE-generated queries for the second

excerpt-ckb for future income.

FUTURE INCOME 1 psg requested 3 psgs requested 5 psgs requested Doc-ID Manual Bag of Sum Manual Bag of Sum Manual Bag of Sum Words Words Words 001 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 180 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 188 0.00 0.00 0.00 0.00 4.00 16.00 -1.00 -1.00 -1.00 204 0.00 0.00 0.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 206 0.00 0.00 0.00 1.00 3.50 3.50 3.50 11.40 15.40 260 0.00 0.00 0.00 0.50 0.00 0.00 1.00 0.00 1.00 289 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00 353 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 407 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 427 0.00 0.00 0.00 0.00 1.60 3.60 0.00 2.80 4.80 442 0.00 0.00 0.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 472 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 693 0.00 8.00 2.00 0.00 12.00 2.00 0.33 15.00 4.00 733 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 -1.00 751 0.00 0.00 0.00 5.00 2.00 1.00 10.00 4.00 14.67 764 0.00 2.00 1.00 0.00 2.00 2.00 0.00 2.00 2.00 860 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 877 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 915 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 961 0.00 1.00 0.00 10.50 4.00 4.00 13.50 9.40 6.00 Avg 18/0.0 18/0.6 18/0.2 16/1.1 16/1.8 16/2.0 15/2.0 15/3.0 15/3.3


because all of the retrievals are so good, the average precisions at cutoffs are extremely high and the excerpts suffer by comparison. Table 6.17 shows these values.

Table 6.17. Average precision at cutoffs and non-interpolated precision for future income. Values with a * are significant.

    Cutoff   Manual   Bag of   Pct Chg   t-test    Sum      Pct Chg   t-test
                      Words
    1        1.0000   0.8333   -16.7     0.0827    0.8889   -11.1     0.1631
    3        0.8889   0.7593   -14.6     0.0896    0.7593   -14.6     0.0302*
    5        0.8111   0.7222   -11.0     0.2151    0.7111   -12.3     0.0702
    10       0.7056   0.6167   -12.6     0.0722    0.6167   -12.6     0.0453*
    15       0.6037   0.5407   -10.4     0.1352    0.5296   -12.3     0.0962
    20       0.5083   0.4833   -4.9      0.3997    0.4917   -3.3      0.5469
    Avg      0.6792   0.6320   -6.95     0.1531    0.6246   -8.03     0.0949

The primary cause for retrieval of non-relevant passages with the SPIRE queries was the combination of the terms "regular" and "income". Discussion of the meaning of the concept of "regular income" occurred with some frequency when discussing individuals that were self-employed, unemployed, receiving student stipends, etc. Since the debtor in court opinion 206 was unemployed, the debtor in court opinion 693 worked part-time for her husband while a student, and the debtor in court opinion 751 received a research fellowship stipend, all of these opinions had extensive discussion of whether the debtor had "regular income". To make matters worse, the two terms were found in the same excerpt: "prospect of a regular job with substantially increased income is not great", causing both the bag of words and the sum queries to suffer.

The primary cause for retrieval of non-relevant passages with the manual query was the variety of ways discussion about "income" occurred. (The manual query is below.) Besides being regular, "income" was found with "gross", "monthly", "disposable", "net", as well as being mentioned with "increases in", "is sufficiently stable", and many other words and phrases. This was less of an issue with the SPIRE queries, as other terms had higher belief scores.

!c! Future Income
#q1= #passage20( #phrase( increase in income)
     #phrase( future increases)
     #phrase( continued employment) );

Both types of query, manual and SPIRE-generated, performed extremely well: the top five retrievals were all relevant for five of the documents. Even when there were only a small number of passages to be found, both query types did well. Overall, the manual query was better than those generated by SPIRE, but the differences were generally not significant.

6.2.3 Sincerity

The SPIRE queries did rather well at highly ranking at least one passage pertinent to discussion about the sincerity of the debtor. Additionally, both SPIRE queries ranked relevant passages as the first three and five retrievals for many of the documents. The top three passages for the SPIRE queries were relevant for fifteen of the test documents and the top five passages were all relevant for six of the set. Table 6.18 displays the esl values for all of the test documents.

The best manual query did not do as well as the SPIRE queries as it did not retrieve any relevant passages for four of the opinions. It also continued to lose ground as larger numbers of relevant passages were requested. The manual query was unable to retrieve five relevant passages for more than half of the documents; thirteen of the twenty retrievals received a score of -1.00 at esl5. The SPIRE queries outperformed the best manual query across all three esl scoring levels. For comparison, we give Table 6.19 with average precisions and precisions at cutoffs. These reflect that the SPIRE queries were better, with many of the differences being significant.

Frequently, many opinions would discuss both the sincerity and motivation of the debtor at the same time.

Table 6.18. ESL values for manual and SPIRE-generated queries for sincerity.

SINCERITY
         1 psg requested            3 psgs requested           5 psgs requested
Doc-ID   Manual  Bag of  Sum        Manual  Bag of  Sum        Manual  Bag of  Sum
                 Words                      Words                      Words
001       0.00    0.00    0.00       0.67    1.00    1.00      -1.00    2.00    2.00
180       0.00    0.00    0.00       0.00    0.00    0.00      -1.00    0.00    0.00
188       0.00    0.00    0.00       0.25    0.00    0.00       0.75    6.00    5.00
204      -1.00    2.00    2.00      -1.00    4.00    4.00      -1.00    4.00    4.00
206       0.00    0.00    0.00      -1.00    0.00    0.00      -1.00    1.00    1.00
260       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
289       0.00    0.00    0.00      -1.00    3.00    3.00      -1.00   25.00   27.00
353       0.00    0.00    0.00       0.00    0.00    0.00      -1.00    3.00    3.00
407       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
427       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
442       0.00    0.00    0.00      -1.00    0.00    0.00      -1.00    2.00    2.00
472       0.00    0.00    0.00       0.00    0.00    0.00       0.00    1.00    1.00
693      -1.00    3.00    1.00      -1.00   50.00   41.50      -1.00   53.50   47.00
733      -1.00    0.00    0.00      -1.00    0.00    0.00      -1.00    1.00    1.00
751       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
764      -1.00    0.00    0.00      -1.00    0.00    0.00      -1.00    4.00    4.00
860       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
877       0.00    0.00    0.00      -1.00    7.00    5.00      -1.00    7.00    7.00
915       0.00    0.00    0.00       0.00    0.00    0.00      -1.00    1.00    1.00
961       0.00    0.00    0.00       0.00    0.00    0.00      -1.00    2.00    3.00
Avg      16/0.0   0.25    0.15      12/0.1   3.25    2.72       7/0.1   5.62    5.40

Table 6.19. Average precision at cutoffs and non-interpolated precision for sincerity. Values with a * are significant.

Cutoff   Manual   Bag of    Pct     t-test    Sum      Pct     t-test
                  Words     Chg                        Chg
1        0.8000   0.9000    12.5    0.1625    0.9000    12.5   0.1625
3        0.7333   0.8667    18.2    0.0880    0.8833    20.5   0.0583
5        0.6100   0.7300    19.7    0.0358*   0.7500    23.0   0.0153*
10       0.4300   0.5950    38.4    0.0095*   0.6100    41.9   0.0040*
15       0.3533   0.4900    38.7    0.0239*   0.4900    38.7   0.0194*
20       0.3025   0.4300    42.1    0.0234*   0.4375    44.6   0.0172*
Avg      0.3004   0.4689   56.09    0.0015*   0.4776   58.99   0.0009*

However, when making judgments for the feature of sincerity, the readers had to be careful not to include text that talked exclusively about the motivation of the debtor, as that was a separate feature. Conversely, the term "motivation" was included in the manual query since the motives of the debtor were frequently such that one could ascertain the debtor's sincerity from their motives. We give one such example from court opinion 289:

    In the matters under consideration there were no malevolent motives in the selection of 16 month plans. However, there is no demonstrated desire for repayment of creditors in any of these plans either.

The most common cause for the SPIRE queries to have highly ranked non-relevant passages was combinations of "repayment", "effort", "substantial", and "creditors". While this set of terms led to relevant passages, there were many combinations that did not assist in allowing a reader to ascertain the sincerity of the debtor. Below we give examples of both cases:

Not Relevant:

  - "However, the amount of the proposed repayment to unsecured creditors is only one of the many factors which the courts must consider in determining whether the plan meets the statutory good faith requirement." [court opinion 289]

  - "Chapter 13 plan need not provide for substantial repayment of unsecured debts in order to be filed in 'good faith';" [court opinion 764]

Relevant:

  - "payments should be commensurate with debtor's ability to pay and payment is insufficient if debtor is permitted to discharge a guaranteed student loan while maintaining a comfortable middle class lifestyle without making substantial efforts to repay unsecured debts);" [court opinion 764]


  - "Failure to provide substantial repayment is certainly evidence that a debtor is attempting to manipulate the statute rather than attempting to honestly repay his debts. . . " [court opinion 877]

The relevant text describing sincerity varied considerably from one document to the next. There were myriad means of expressing whether the debtor was being open and honest when proposing a repayment plan. One document stated that the debtor's testimony was "very credible", yet went on to indicate that the debtor may not have been sincere in proposing the plan by stating:

    The objecting creditors also argue that the repayment to unsecured creditors of only 15% of their claims is insubstantial and that the debtor has not accounted for income from all sources. [court opinion 693]

Another opinion analyzed the sincerity of the debtor by examining the resultant effects of the repayment plan:

    In Nkanang, this Court found that the debtor did not demonstrate good faith in that there was no showing made by the debtor that he sincerely wished to repay his unsecured creditors, including a creditor owed $3,977.01 for a student loan. The debtor's 1% composition plan was almost entirely directed at refinancing the debtor's expensive automobile. [court opinion 961]

Yet another opinion, 180, evaluated the debtor by looking at whether the debtor "demonstrates a realistic and diligent effort at both repayment and rehabilitation". Document 860 spent a great amount of text covering the feature of sincerity. In this particular opinion the issues of motivation and sincerity were extensively dealt with and there were over 100 passages that discussed the topic. The Court used a large number of different ways of describing these two features and the resultant values, including: "petition appears to be tainted with a questionable purpose", "a basically dishonest scheme", "wholehearted attempt to pay", "there was no collusion", "honest mistake", and "frank and open testimony". Within other opinions we found: "serious attempt to repay", "malevolent motives", "demonstrate a better effort", "niggardly payments", and "credible testimony".

Due to the variance in the ways that this feature was discussed, having a case-base of excerpts upon which to draw terms worked particularly well in locating similar portions of text in new documents. The SPIRE queries were able to demonstrate that reuse of context in the form of excerpts worked well for finding text that remarked on the sincerity of the debtor.
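The comparisons in these sections repeatedly pit hand-crafted queries against the two queries that SPIRE generates from the excerpt case-base. Purely as an illustration of how a bag-of-words style passage query could be assembled from a set of excerpts, the sketch below stops and stems the excerpt text and wraps the surviving terms in a #passage20 operator; the tokenizer, stopword list, crude suffix-stripping stemmer, and flat operator structure are stand-ins rather than SPIRE's actual components.

    # Illustrative sketch only: build a bag-of-words passage query from excerpts.
    # The tokenizer, stopword list, and suffix-stripping "stemmer" are stand-ins
    # for the stopping and stemming SPIRE actually applies.
    import re

    STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "is", "was",
                 "his", "her", "their", "that", "this", "with", "on", "at"}

    def stem(term):
        for suffix in ("ments", "ment", "ing", "ed", "es", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[: -len(suffix)]
        return term

    def excerpt_terms(excerpts):
        terms = []
        for excerpt in excerpts:
            for token in re.findall(r"\$?\d+(?:\.\d+)?|[a-z]+", excerpt.lower()):
                if token not in STOPWORDS:
                    terms.append(stem(token))
        return terms

    def bag_of_words_query(excerpts):
        # Duplicate terms are kept: a term appearing in several excerpts should
        # count for more than one that appears only once.
        return "#passage20( " + " ".join(excerpt_terms(excerpts)) + " )"

    print(bag_of_words_query(
        ["the plan proposed payments of $200 per month for 36 months"]))
    # -> #passage20( plan propos pay $200 per month 36 month )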

6.2.4 Loan due date

There were only a very limited number of relevant passages for the feature of loan due date, if there was any discussion at all: seven of the twenty opinions did not make any references to when any of the loans were due and three documents only had one small piece of relevant discussion (two total passages). Additionally, there was one text with each of three, four, five, six, or seven relevant passages. The largest number of relevant passages within a single document was only 16. Of those texts with discussion of the loan due date, the average number of relevant passages was only 7.4 per text; across all twenty documents the average was only 6.9 passages. This was the least discussed of all the features. (Refer to Table 4.8 for the number of relevant passages contained in each test document.)

Interestingly, the esl scores for the test documents varied considerably. For comparison, in Table 6.20, we give the scores for the documents where there were between one and four relevant passages within a test document. Table 6.21 shows the results for all documents. Due to the sparsity of relevant passages, we might expect that both the manual and SPIRE-generated queries would have trouble locating them. The results provide some interesting comparisons. When a query did well, it did quite well. When it did poorly, it did quite poorly.

Overall, the SPIRE queries did quite well as compared to the manually crafted query.

Table 6.20. ESL values for the manual and SPIRE-generated queries for loan due date where there were between one and four relevant passages.

LOAN DUE DATE
                      1 psg requested           3 psgs requested
Doc-ID   Num Rel   Manual  Bag of  Sum       Manual  Bag of  Sum
                           Words                     Words
001      2         -1.00   17.00   20.00
188      3         -1.00    0.00    0.00     -1.00    2.00    2.00
204      4          6.00    3.00    0.00      6.00    3.00    0.00
206      2          6.50    2.00   10.00
442      2         20.67    5.67    1.33

Table 6.21. ESL values for the manual and SPIRE-generated queries for loan due date.

LOAN DUE DATE
         1 psg requested            3 psgs requested           5 psgs requested
Doc-ID   Manual  Bag of  Sum        Manual  Bag of  Sum        Manual  Bag of  Sum
                 Words                      Words                      Words
001      -1.00   17.00   20.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
180       7.00    1.00    8.50      19.00   38.50   38.50      33.00   69.50   69.50
188      -1.00    0.00    0.00      -1.00    2.00    2.00      -1.00   -1.00   -1.00
204       6.00    3.00    0.00       6.00    3.00    0.00      -1.00   -1.00   -1.00
206       6.50    2.00   10.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
260      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
289      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
353       0.00    0.00    0.00      16.00    0.33    5.00      -1.00   32.50   21.00
407      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
427      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
442      20.67    5.67    1.33      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
472      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
693       0.00   30.00   30.00       0.00   68.00   68.00       0.00   68.00   68.00
733       0.00    0.00    0.00       0.67    0.00    4.00       1.67    1.00    5.00
751       0.00   16.67   22.67      -1.00   32.00   41.00      -1.00   47.50   47.50
764      47.25    0.00    0.00      47.75    0.00    0.00      -1.00   45.50   27.00
860      -1.00   45.50   47.50      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
877       2.00    0.00    6.00      24.67    1.00    6.00      71.00    6.50    8.00
915      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
961      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
Avg      10/8.9  13/9.3  13/11.2     7/16.3  9/16.1  9/18.3     4/26.4  7/38.6  7/35.1


They were able to retrieve the requested number of passages for almost every document that had relevant text. Conversely, the best manual query did not retrieve any of the relevant passages in three of the documents, did not find three relevant passages for one other document, and similarly found fewer than five for two others, while the SPIRE queries were able to retrieve the requested number of relevant passages in all these documents. There was only one document where both types of query were unable to retrieve five relevant passages.

Not all of the retrievals were at levels such that a human user might be expected to read that deeply into the retrieved set. For instance, at esl1 the best SPIRE query achieved scores of 17.00 and 45.50 (documents 001 and 860), at esl3 a score of 32.00 (document 751), and at esl5 scores of 21.00, 27.00 and 47.50 (documents 353, 764, and 751). Similarly, there were documents where the manual query had difficulty in achieving good retrievals: at esl3, documents 877 and 764 had scores of 24.67 and 47.75.

There were two documents where the manual query far exceeded the SPIRE queries, documents 180 and 693. For one of these, the six relevant passages in the document all contained the term "deferred". The manual query retrieved these six and no other passages, while the SPIRE queries did not rate "deferred" as highly as other terms such as "graduating" or the combination of "repayment" and "loan". For the second document, there were three relevant passages that all contained "repayment". Again, the manual query ranked these passages higher than the SPIRE queries did because the SPIRE queries gave more belief to other excerpt terms. The manual query also matched and highly ranked two relevant passages containing "period". The SPIRE queries had these two passages lower in the retrieval because the belief values given to some of the other excerpt terms were higher. Below is the manual query:

!c! Loan Due Date
#q1= #passage20( #phrase(became due)
                 #phrase(repayment began)
                 #phrase(grace period)
                 deferment );

The SPIRE queries did quite well on document 764 as the top three passages were all relevant. Figure 6.1 gives these passages with matching terms boldfaced. Because these passages match on a combination of excerpt terms, they placed at the top of the rankings. The manual query had few terms that were found in combination, and, as a result, most passages matched on a single term. Frequently these terms had other usages, and hence the poor performance by the manual query. In particular, for this opinion, there were many passages that matched on only the term "repayment". Thus, the relevant passages were lost in the midst of this group.

corporation under contract with the U.S. Department of Education, guaranteed payment of these loans. the notes became due in October 1983, ten months after Mr. Porter graduated from Law School. Mr. and Mrs. Porter defaulted on the loans and Old National Bank . . .

Figure 6.1. Passages 720, 730, and 740 [court opinion 764].

Similarly, the manual query was able to retrieve two relevant passages at ranks 3 and 4 for document 877 with the term "deferred". It was not until ranks 38 and 39 that the two relevant passages with "repayment" appeared. The SPIRE queries more highly ranked several passages that contained a combination of excerpt terms and achieved much better results.

While there was a shortage of text that discussed the loan due date, the SPIRE queries were usually able to locate at least five relevant passages, while the manual queries had much more trouble. In both cases, though, neither method did particularly well. These queries had the lowest average precisions among the set of ten features. The average precisions for the excerpt-based queries were much higher than that of the manual query (see Table 6.22).

Table 6.22. Average precision at cutoffs and non-interpolated precision for loan due date.

Cutoff   Manual   Bag of    Pct     t-test    Sum      Pct     t-test
                  Words     Chg                        Chg
1        0.3077   0.3846    25.0    0.6727    0.3846    25.0   0.6727
3        0.2821   0.3846    36.4    0.4874    0.3077     9.1   0.8831
5        0.2308   0.3385    46.7    0.3915    0.2462     6.7   0.9135
10       0.1769   0.2308    30.4    0.4531    0.1923     8.7   0.8243
15       0.1333   0.1641    23.1    0.5735    0.1538    15.4   0.7058
20       0.1038   0.1308    25.9    0.5124    0.1346    29.6   0.4759
Avg      0.1693   0.3127   84.75    0.1149    0.3404  101.10   0.1158

Despite the magnitude of these differences, due to the high variance in scores and the low number of documents with relevant passages, the differences were not considered significant. Further, even though both SPIRE queries had better precisions at all measured cutoff points, sometimes by more than forty percent, these differences were also not measured as significant.

In summary, the SPIRE queries worked better than the manual query in finding relevant passages for loan due date. The combination of terms found in the excerpts helped these queries not only retrieve relevant passages, but also rank them better. When an opinion only used the terms from the manual query with a single or small number of meanings, then the manual query would do better than the excerpts. Otherwise, the rankings were not nearly as good.

6.2.5 Debt type

Overall, the SPIRE-generated queries were comparable to the best of the manually crafted queries. The manual query did better at esl1 and esl3, while the SPIRE queries were better at esl5. All the queries did quite well at not retrieving non-relevant passages. For eight of the twenty documents, the first 5 passages were all relevant for both of the SPIRE queries. The same was true for six documents with the best manual query. With either type of query, there were only two documents


where the user would have to read more than ten non-relevant passages to find five relevant ones.

When we reexamined the excerpt-ckb, debt type was one of the features where we removed some of the text to eliminate as many proper names as possible. When we shortened the excerpts, the case-base for this feature was significantly reduced in length. One excerpt went from three sentences down to one and two other excerpts were altered considerably. Below are the changes made to the excerpt-ckb (each original excerpt is followed by its shortened form):











  Excerpt: "Debtor was arrested in 1984 and while in custody, attacked and injured a guard, Marc Nelson ("Nelson"). Debtor was prosecuted criminally for aggravated assault. Nelson sued debtor for damages in state court."
  Became:  "Nelson sued debtor for damages in state court."

  Excerpt: "at the time the Debtor obtained this loan from Aetna, he [had] already discussed his possible intention to file a Petition for Relief under the Bankruptcy Code"
  Became:  "at the time the Debtor obtained this loan from Aetna"

  Excerpt: "Debtors did file their Petition on March 30, 1984 or two weeks after Mr. Myers obtained the loan from Aetna."
  Became:  "obtained the loan from Aetna"
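The trimming shown above was done by hand. Purely as an illustration of the kind of check involved, and not something SPIRE itself provides, a first pass at spotting candidate proper names in an excerpt case-base could flag capitalized tokens that do not start a sentence and present them for review:

    # Illustrative helper (not part of SPIRE): flag capitalized, non-sentence-initial
    # tokens in an excerpt as candidate proper names worth reviewing before the
    # excerpt is used for query generation.
    def candidate_proper_names(excerpt):
        tokens = excerpt.split()
        flagged = []
        for i, token in enumerate(tokens):
            word = token.strip('.,;:"()')
            sentence_initial = i == 0 or tokens[i - 1].endswith(('.', '?', '!'))
            if word[:1].isupper() and not sentence_initial:
                flagged.append(word)
        return flagged

    print(candidate_proper_names(
        "at the time the Debtor obtained this loan from Aetna, he had already "
        "discussed his possible intention to file a Petition for Relief"))
    # -> ['Debtor', 'Aetna', 'Petition', 'Relief']  (candidates for a human to review)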

Comparing results between the two excerpt-ckbs, there was a slight overall improvement for the bag of words query, while the sum query declined. Table 6.23 shows the average change for both SPIRE queries at the three esl levels and Table 6.24 lists the esl scores for the second excerpt-ckb. Even though the average improvement for the bag of words query was larger than the decline for the sum query (in absolute terms), the relative change in the average precisions was considered significant for the

sum query and not the bag of words. (There was a 1.75% average precision improvement for the bag of words query resulting in a t-level of 0.40, while the sum query declined 2.87% for a t-level of 0.04.)

Table 6.23. Average reduction in esl for the bag of words and sum for debt type when using the second excerpt-ckb.

              Average reduction in esl
ESL level    Bag of Words    Sum
1            0.35            0.08
3            0.23            (0.08)
5            0.60            (0.19)

Examining the second excerpt-ckb results, the sum query did slightly better than the bag of words one, but not significantly so at any esl value. When we compared average precisions, the bag of words query was mildly better. However, the sum query did better at cutoffs of 3, 5, and 10 passages, although the differences were all two percent or less. The esl scores for the SPIRE queries were slightly worse at esl1 and esl3, and slightly better at esl5.

The majority of the differences can be attributed to one document, court opinion 206. The reason for the many false hits by the SPIRE queries could be traced to the combination of the terms "nondischargeable", "discharge[d,able]", and "debt". These were frequently found in close proximity to one another (many times all three terms would appear in the same passage), but often did not refer to either a dischargeable or a nondischargeable debt. Despite the poor retrieval done on document 206, both SPIRE queries had higher average precisions than the manual query (0.569 for the manual versus 0.619 and 0.613 for the bag of words and sum queries respectively), and the sum query was significantly better with a t-level of 0.04. (See Table 6.25.) Looking at precision at various cutoffs, the SPIRE queries were much worse at 1 passage (both were 15.8% worse), but recovered quickly and were better at all of 3, 5, 10, 15, and 20 passages.

Table 6.24. ESL values for the manual and SPIRE-generated queries for the second excerpt-ckb for debt type.

DEBT TYPE
         1 psg requested            3 psgs requested           5 psgs requested
Doc-ID   Manual  Bag of  Sum        Manual  Bag of  Sum        Manual  Bag of  Sum
                 Words                      Words                      Words
001       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
180       0.00    2.00    2.00       2.00    2.33    2.33       4.60    6.00    5.00
188       0.00    0.00    0.00       0.00    1.00    1.00       2.00    2.00    1.00
204       2.00    0.00    0.00       2.29    0.50    0.00       2.86    1.00    1.00
206       0.00    5.50    2.00       3.29   13.00   14.50       5.57   15.50   19.00
260       0.00    1.00    1.00       0.57    5.00    4.00       1.14    5.00    4.00
289       0.00    0.00    0.00       0.00    0.00    0.00       4.89    0.00    6.00
353       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
407       0.00    0.00    0.00       4.60    0.00    0.00       6.20    0.00    0.00
427       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
442       0.00    1.00    1.00       7.00    4.00    3.00       7.00    6.00    4.50
472       0.00    0.00    0.00       0.00    0.00    0.00       6.50    0.50    1.00
693       0.00    0.00    0.00       1.00    1.00    1.00       2.00    1.00    1.00
733       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
751       0.00    0.00    0.00       2.67    0.50    1.00       4.00    5.00    3.00
764       0.00    0.00    0.00       1.00    0.00    0.00       1.00    0.00    0.00
860       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
877       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.50
915       0.00    0.00    0.00       1.00    0.00    0.00       1.00    1.40    0.80
961       0.00    0.00    0.00       0.00    0.00    0.00       0.29    0.00    0.00
Avg       0.10    0.47    0.30       1.27    1.37    1.34       2.45    2.17    2.34

Table 6.25. Average precision at cutoffs and non-interpolated precision for debt type. Values with a * are significant.

Cutoff   Manual   Bag of    Pct     t-test    Sum      Pct     t-test
                  Words     Chg                        Chg
1        0.9500   0.8000   -15.8    0.1864    0.8000   -15.8   0.1864
3        0.7833   0.8167     4.3    0.6810    0.8333     6.4   0.4810
5        0.7700   0.7900     2.6    0.7250    0.8000     3.9   0.5906
10       0.6950   0.7450     7.2    0.2981    0.7550     8.6   0.1625
15       0.7067   0.7400     4.7    0.3622    0.7300     3.3   0.5278
20       0.6900   0.7175     4.0    0.2698    0.7100     2.9   0.4283
Avg      0.5691   0.6191    8.77    0.0543    0.6131    7.72   0.0436*

Original Manual Query:
#q1= #passage20( loan debt fraud
     #or( #and( #or( student educational consumer) debt)
          #and( #or( student educational consumer) loan))
     #phrase(student loan) #phrase(civil judgement) );

List Query:
#q2= #passage20( loan debt fraud
     #or( #and( #or( student educational consumer) debt)
          #and( #or( student educational consumer) loan))
     #phrase(student loan) #phrase(civil judgement)
     #phrase(educational debt) #phrase(irs debt) tax
     #phrase(bank loan) #phrase(auto loan) #phrase(farm loan)
     #phrase(consumer debt) #phrase(real property debt)
     #phrase(student loan and medical) #phrase(civil damages)
     #phrase(civil judgement) #phrase(criminal judgment)
     #phrase(judgement debt) fraud
     #phrase(fraudulently obtained bank loan) #phrase(fraud judgement)
     secured #phrase(other unsecured) );

Figure 6.2. The original and list query for debt type.


Debt type was a feature for which we could enumerate a set of feature values. Therefore, we generated a manual query that was basically a long list of the feature's possible values. We gathered the various values for debt type that were found in our collection of 55 court opinions. We added this set of 18 types of debt to the best manual query. Figure 6.2 gives the manual query as well as this list query of debt types.

The list query was slightly worse than the original manual query. At esl1 and esl5 the list did worse; at esl3 it did slightly better. While the same number of documents were both positively and negatively affected by using the list query at esl5, there was an overall degradation in scores. Table 6.26 displays esl scores for the manual and list queries for all the documents.

Table 6.26. Manual and list query esl scores for debt type.

DEBT TYPE
         1 psg requested     3 psgs requested     5 psgs requested
Doc-ID   Manual   List       Manual   List        Manual   List
001       0.00     0.00       0.00     0.00        0.00     0.00
180       0.00     0.00       2.00     2.00        4.60     4.00
188       0.00     0.00       0.00     0.00        2.00     2.00
204       2.00     2.00       2.29     2.00        2.86     2.40
206       0.00     0.00       3.29     4.00        5.57    11.00
260       0.00     2.25       0.57     2.75        1.14     3.25
289       0.00     0.00       0.00     0.00        4.89     4.89
353       0.00     0.00       0.00     0.00        0.00     0.00
407       0.00     0.00       4.60     1.25        6.20     1.75
427       0.00     0.00       0.00     0.00        0.00     0.00
442       0.00     0.00       7.00     7.00        7.00     7.00
472       0.00     0.00       0.00     0.00        6.50    11.00
693       0.00     1.00       1.00     1.00        2.00     1.50
733       0.00     0.00       0.00     0.00        0.00     0.00
751       0.00     0.00       2.67     2.00        4.00     5.33
764       0.00     1.00       1.00     1.00        1.00     1.00
860       0.00     0.00       0.00     0.00        0.00     0.00
877       0.00     0.00       0.00     0.00        0.00     1.00
915       0.00     0.00       1.00     1.00        1.00     1.00
961       0.00     0.00       0.00     0.00        0.29     0.00
Avg       0.1      0.3        1.3      1.2         2.5      2.9


The decreases in performance for documents 206 and 472 were not related. The additional non-relevant passages high in the ranking for 206 were due to the co-occurrence of "judgment" and "bank". For 472 it was because of either "fraud" or "criminal". Interestingly, the top ten passages for 472 did not change, indicating that the original terms were more valuable than the additional ones the list provided.

In summary, creating the long list of debt types was not worth doing: the effort required to build the list did not improve the retrievals. Removing extraneous text from the excerpts helped the bag of words query, while it slightly reduced the scores for the sum query. Overall, though, the SPIRE queries did extremely well. There was only one document where, on average, more than ten non-relevant passages preceded five relevant ones, and for eight of the documents the top five passages were all relevant. For another five documents, an average of two or fewer non-relevant passages would have to be read to find five relevant ones. Use of the excerpt-ckb proved to be quite good for retrieving relevant text about debt type.
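For a feature with enumerable values, such as debt type, the mechanical part of turning a list of values into a query is trivial; the sketch below mirrors the structure of the list query shown in Figure 6.2, wrapping multi-word values in the #phrase operator. It is illustrative only: the actual list queries were written by hand and, as noted above, compiling them was not worth the effort.

    # Illustrative sketch: turn an enumerated set of feature values into an
    # INQUERY-style list query of the kind shown in Figure 6.2. Multi-word
    # values become #phrase clauses; single terms are left bare.
    def list_query(feature_values, passage_window=20):
        clauses = []
        for value in feature_values:
            words = value.split()
            clauses.append(words[0] if len(words) == 1
                           else "#phrase(" + " ".join(words) + ")")
        return "#passage%d( %s )" % (passage_window, " ".join(clauses))

    print(list_query(["student loan", "irs debt", "tax", "consumer debt",
                      "civil judgement", "fraud"]))
    # prints a single #passage20( ... ) clause containing the six values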

6.2.6 Duration

The bag of words and sum queries did comparably to the expert manual query for duration. Table 6.27 shows results for duration at all three esl levels for the manual and two best SPIRE-generated queries for the test documents. There were some interesting differences between the types of queries. The SPIRE queries were much better at esl5 on documents 180 and 860. Also, there was one document, 472, where the manual query did not retrieve enough relevant passages at any esl level, while the SPIRE queries both did. Conversely, the manual query was much better for documents 188 and 206. (We defer discussion about matching terms for the moment.)

Duration was one of the features where we removed some of the proper nouns and extraneous text from the excerpt-ckb and then had SPIRE regenerate its queries.

Table 6.27. ESL values for manual and SPIRE-generated queries for duration.

DURATION
         1 psg requested            3 psgs requested           5 psgs requested
Doc-ID   Manual  Bag of  Sum        Manual  Bag of  Sum        Manual  Bag of  Sum
                 Words                      Words                      Words
001       0.00    0.00    0.00       0.67    2.00    1.00       4.00    3.50    2.00
180       0.00    0.00    0.00       1.00    0.00    0.00      14.00    0.00    4.00
188       0.00    1.00    1.00       1.25    2.00    2.00       1.75   15.00   15.00
204       0.00    0.00    0.00       5.00    0.00    0.00       5.00    1.00    2.00
206       0.00    2.00    2.00       0.00    2.00    2.00       0.00   21.00   17.00
260       4.00    0.00    0.00       4.40    0.00    2.00       4.80    2.00    4.00
289       0.00    1.00    1.00       0.33    1.00    1.00       1.00    1.00    1.00
353       0.00    0.00    0.00       0.00    4.00    6.00       2.33    4.00    6.00
407       1.00    0.00    0.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
427       0.00    0.00    0.00       0.67    2.00    7.00       8.00   12.00   12.00
442       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
472      -1.00   27.00   29.00      -1.00   29.00   33.00      -1.00   42.00   61.00
693       0.00    0.00    0.00       0.00    2.00    2.00       0.00    2.00    3.00
733       5.00    0.00    0.00      13.67   10.00   15.00      -1.00   -1.00   -1.00
751       1.33    3.00    3.00       3.50    4.00    3.00       5.00    5.00    5.00
764       0.00    0.00    0.00       0.22    0.00    0.00       0.44    0.50    0.50
860       0.00    2.00    4.00       4.00    2.00    5.00      13.60    5.00    5.00
877       0.00    0.00    0.00       1.00    0.00    0.00       1.50    4.00    2.00
915       0.00    0.00    0.00       4.00    0.00    0.00       4.00    3.00    2.00
961       0.00    0.00    0.00       0.00    4.00    2.00       0.00    6.00    7.00
Avg      19/0.6   1.80    2.00      18/2.2  19/3.4  19/4.3     17/3.8  18/7.1  18/8.2


Three of the excerpts had extraneous text at their ends and one contained the name "Flygare". Below are the changes to the excerpts that had unnecessary text removed from their ends (each original excerpt is followed by its shortened form):











  Excerpt: "he would pay his two unsecured creditors, BNY and the other $2000 bank loan, $100 per month for 36 months, for a total of $3,600"
  Became:  "he would pay his two unsecured creditors, BNY and the other $2000 bank loan, $100 per month for 36 months,"

  Excerpt: "On November 19, 1982, the debtor filed a modified plan providing for payments of $200 per month for 36 months, for a total of $7,200."
  Became:  "On November 19, 1982, the debtor filed a modified plan providing for payments of $200 per month for 36 months"

  Excerpt: "His plan proposed to pay $50 per month of 36 months, less than 1.5% of the value of the debt"
  Became:  "His plan proposed to pay $50 per month of 36 months,"

Removing these unnecessary pieces of text helped with retrieval on almost half of the documents. While the average reduction in scores was small, the effects on several individual documents were much greater. Table 6.28 shows the average reduction in passages needing to be processed between the two excerpt case-bases.

Using the second excerpt-ckb made a large difference for document 188 at esl5: the number of non-relevant passages for both the bag of words and sum queries dropped from 15.00 down to 2.00, primarily due to the removal of "Flygare" as a query term. Reducing the excerpt-ckb positively affected other sets of scores, in particular, those for document 472. There were only three instances where the esl was adversely affected, and these changes were minor. Document 427 was hurt at esl3 but recovered by esl5.

Table 6.28. Average reduction in esl for the bag of words and sum for duration when using the shortened excerpt set.

              Average reduction in esl
ESL level    Bag of Words    Sum
1            0.35            0.35
3            0.35            0.20
5            1.35            1.35

The changes in the rankings for document 427 can be attributed to several differences between the two queries. One passage was lower in the ranking because there no longer was a coincidental match with the term "1.5" ("1.5" was a percentage of a loan and it matched with a portion of a WestLaw Keynote). Some non-relevant passages moved up in the ranking because of their inclusion of the term "36" within a citation (e.g., "Johnson 36 B.R. 67" and "Rushton 58 B.R. 36"). Table 6.29 gives a complete listing of all scores using the second excerpt-ckb.

We examined the retrieved passages to see why certain documents did worse than others. In most of the SPIRE-based cases, the terms that would cause non-relevant passages to be retrieved and highly rated were "unsecured" and "creditors". Frequently these terms co-occurred, thus causing many non-relevant passages to be ranked highly. For both the manual and the excerpt-based queries, there were highly ranked non-relevant passages that contained the phrase "proposed to pay" (or a similar variant), which was frequently followed by either a dollar amount or the phrase "to the [secured/unsecured] creditors". Other non-relevant passages discussed a duration of time, but not the duration of the repayment plan, such as the length of a prison term or the length of a fellowship (e.g., "his three-year postdoctoral research fellowship").

The large number of false hits for documents 180 and 860 by the manual query was due to a variety of causes. The phrase "monthly payments" was found with multiple instances of "continued", "timely", and "proposed". There were occasional appearances of the phrases "monthly rental payments" and "full payment of a debt".


Table 6.29. ESL values for the manual and SPIRE-generated queries for the second excerpt-ckb for duration.

DURATION
         1 psg requested            3 psgs requested           5 psgs requested
Doc-ID   Manual  Bag of  Sum        Manual  Bag of  Sum        Manual  Bag of  Sum
                 Words                      Words                      Words
001       0.00    0.00    0.00       0.67    1.00    0.00       4.00    3.00    2.00
180       0.00    0.00    0.00       1.00    0.00    0.00      14.00    0.00    5.00
188       0.00    0.00    1.00       1.25    2.00    2.00       1.75    2.00    2.00
204       0.00    0.00    0.00       5.00    0.00    0.00       5.00    1.00    2.00
206       0.00    2.00    2.00       0.00    2.00    2.00       0.00   18.00   15.00
260       4.00    0.00    1.00       4.40    0.00    1.00       4.80    1.00    4.00
289       0.00    1.00    1.00       0.33    1.00    1.00       1.00    1.00    1.00
353       0.00    0.00    0.00       0.00    3.00    6.00       2.33    4.00    6.00
407       1.00    0.00    0.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
427       0.00    0.00    0.00       0.67    7.00   12.00       8.00   12.00   12.00
442       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00    0.00
472      -1.00   21.00   22.00      -1.00   22.00   27.00      -1.00   37.00   49.00
693       0.00    0.00    0.00       0.00    2.00    2.00       0.00    2.00    2.00
733       5.00    0.00    0.00      13.67    7.00   14.00      -1.00   -1.00   -1.00
751       1.33    3.00    3.00       3.50    4.00    3.00       5.00    5.00    5.00
764       0.00    0.00    0.00       0.22    0.00    0.00       0.44    0.50    0.50
860       0.00    2.00    4.00       4.00    2.00    4.00      13.60    4.00    4.00
877       0.00    0.00    0.00       1.00    0.00    0.00       1.50    2.00    2.00
915       0.00    0.00    0.00       4.00    0.00    0.00       4.00    3.00    2.00
961       0.00    0.00    0.00       0.00    4.00    2.00       0.00    6.00    7.00
Avg      19/0.6   1.45    1.70      18/2.2  19/3.0  19/4.0     17/3.8  18/5.6  18/6.7


All of these were misleading. These phrases, along with the "proposed to pay" phrases cited above, led to the high esl scores. Below is the hand-crafted manual query:

!c! Duration
#q1= #passage20( #SUM( duration
                       #phrase(monthly payments)
                       #phrase(per month)
                       #3(propose to pay) ));

Overall, the SPIRE queries did just about the same as the manual query at retrieving relevant passages. The bag of words query was a bit better and the sum query was comparable or slightly worse. There was one document where the manual query did not retrieve enough relevant passages to satisfy the request while both SPIRE queries did, although the relevant passages were at depths greater than twenty. (Table 6.30 provides average precision at cutoffs and non-interpolated precision for reference.)

Table 6.30. Average precision at six cutoffs and non-interpolated precision for duration.

Cutoff   Manual   Bag of    Pct     t-test    Sum      Pct     t-test
                  Words     Chg                        Chg
1        0.7500   0.7500     0.0    1.0000    0.6500   -13.3   0.4283
3        0.6667   0.6667     0.0    1.0000    0.6500    -2.5   0.8582
5        0.6100   0.6200     1.6    0.8859    0.5700    -6.6   0.5190
10       0.5050   0.5200     3.0    0.6513    0.5200     3.0   0.6663
15       0.4267   0.4933    15.6    0.0563    0.4600     7.8   0.3299
20       0.3950   0.4375    10.8    0.1283    0.4125     4.4   0.4723

Making a second excerpt-ckb with a reduced amount of text helped about half of the documents, many of them only slightly, but two were improved by more than ten passages. The improvements between the two ckbs were considered significant: the bag of words query at a t-level of 0.008 and the sum query at a t-level of 0.025 for average precision. This emphasizes the point that taking a bit of care to ensure that excerpts are appropriate to the feature aids in the quality of the retrieval.


6.2.7 Monthly income

The feature of monthly income was mentioned in all but one of the test documents. Text that referred to or discussed "regular income" was considered to be relevant, even though, frequently, no dollar amount was given. Additionally, text with the phrase "ability to earn" was judged relevant. If the text was "future income" or "future increases in income", the containing passage(s) were not judged to be relevant. Further, text discussing "disposable income" was judged to be not relevant.

SPIRE did quite well on this feature. It exceeded the performance of the best manual query in many instances. There were two documents where the manual query did not retrieve three relevant passages while the SPIRE queries did. Additionally, on one of these documents, the SPIRE queries retrieved at least five relevant passages. Table 6.31 gives the esl scores for the manual and two SPIRE-generated queries for the test collection.

Inspection of the excerpt-ckb for the potential removal of proper names or extraneous text resulted in the removal of the term "Flygare" from the beginning of one excerpt. This improved the scores of three of the documents and did not hurt any. Table 6.32 gives the esl values for those documents that showed improvement when "Flygare" was removed from the case-base.

Examination of the average precision at cutoffs showed an interesting trend: the SPIRE queries did better for the first retrieved passage, worse for the top 3 and 5, and then better again at lower depths. Table 6.33 gives these values.

Poor retrievals for the SPIRE queries can be attributed to the many times that "disposable income" was mentioned. While "disposable income" has a technical meaning as being that amount of income left over after expenses have been met, unfortunately, one of the excerpts for monthly income included "net disposable monthly income". This allowed the many non-relevant passages that discuss disposable income to be retrieved and generally place high in the rankings.

Table 6.31. ESL values for the manual and SPIRE-generated queries for monthly income.

MONTHLY INCOME
         1 psg requested            3 psgs requested           5 psgs requested
Doc-ID   Manual  Bag of  Sum        Manual  Bag of  Sum        Manual  Bag of   Sum
                 Words                      Words                      Words
001       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00     0.00
180       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00     0.00
188       3.00    2.50    2.50      -1.00   33.50   45.50      -1.00   -1.00    -1.00
204       0.00    0.00    0.00       6.50    0.00    0.00       7.00    7.00    11.00
206       0.00    0.00    0.00       3.00    0.00    0.00       3.00    0.00     1.00
260       0.00    0.00    0.00      -1.00   38.50   60.50      -1.00   -1.00    -1.00
289       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00     0.00
353       0.00    0.00    0.00       0.00    0.00    0.00       0.00   17.00     1.00
407       0.00    0.00    0.00       0.00    2.00    0.00      -1.00  171.33   171.33
427       0.00    0.00    0.00       0.50    0.00    0.00       1.00    0.00     0.00
442       0.00    0.00    0.00       1.50    5.00    4.00      14.00   13.50     4.00
472       0.00    0.00    0.00       0.00    1.00    1.00       1.00    1.00     1.00
693       0.00    0.00    0.00       0.00    2.00    0.00       0.00    2.00     0.00
733      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00    -1.00
751       0.00    0.00    0.00       0.00    0.00    0.00       0.00    1.00     0.00
764       0.00    0.00    0.00       0.50    7.00    7.00       8.50    9.00     7.00
860       0.00    0.00    0.00       0.00    0.00    0.00       8.67    8.33    10.33
877       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00     0.00
915       1.33    0.00    0.00       2.00    0.00    0.00       2.67    2.00     2.00
961       0.00    0.00    0.00       0.00    0.00    0.00       0.00    0.00     0.00
Avg      19/0.2  19/0.1  19/0.1     17/0.8  19/4.7  19/6.2     16/2.9  17/13.7  17/12.3

Table 6.32. Changes between the two excerpt case-bases for monthly income.

MONTHLY INCOME
         Num Psgs     Bag of    New       Change     Sum       New       Change
Doc-ID   Requested    Words     Query                          Query
188      1              2.50      1.50     1.00        2.50      1.50     1.00
188      3             33.50     20.50    13.00       45.50     32.50    13.00
260      3             38.50     30.50     8.00       60.50     52.50     8.00
407      5            171.33    156.67    14.66      171.33    156.67    14.66
442      5             13.50     11.50     2.00        4.00      4.00      --

Table 6.33. Average precision at cutoffs and non-interpolated precision for monthly income.

Cutoff   Manual   Bag of    Pct     t-test    Sum      Pct     t-test
                  Words     Chg                        Chg
1        0.8947   0.9474     5.9    0.3306    0.9474     5.9   0.3306
3        0.8596   0.8070    -6.1    0.4535    0.8421    -2.0   0.7486
5        0.7474   0.7053    -5.6    0.4477    0.7474     0.0   1.0000
10       0.6000   0.6211     3.5    0.5311    0.6316     5.3   0.3012
15       0.5123   0.5158     0.7    0.9254    0.5298     3.4   0.6248
20       0.4263   0.4447     4.3    0.4876    0.4526     6.2   0.2987
Avg      0.5837   0.5863    0.45    0.9261    0.6137    5.15   0.2414

As a side note, if we removed "net disposable" from the excerpt-ckb, further improvements took place. Four documents benefited, either at esl3 or esl5. There was only one degradation in performance, a change from 0.00 to 1.00 at esl3 for one of the bag of words retrievals. Table 6.34 shows the average esl scores for the first and second excerpt-ckbs as well as this test case. The second line for each ckb gives the average when including only those documents where all three queries were able to retrieve the requested number of passages.

Table 6.34. Average esl values for the manual and three excerpt-ckbs. The second line lists the average for only the documents where all three queries retrieved the requested number of passages.

                                    All Features
        1 psg requested              3 psgs requested             5 psgs requested
CKB     Manual   Bag of   Sum        Manual   Bag of   Sum        Manual   Bag of    Sum
                 Words                        Words                        Words
1st     19/0.23  19/0.08  19/0.08    17/0.82  19/4.7   19/6.2     16/2.9   17/13.7   17/12.3
                                              17/1.0   17/0.71             16/3.8    16/2.3
2nd     19/0.23  19/0.08  19/0.08    17/0.82  19/3.6   19/5.2     16/2.9   17/12.7   17/11.4
                                              17/1.0   17/0.71             16/3.7    16/2.3
3rd     19/0.23  19/0.08  19/0.08    17/0.82  19/3.4   19/4.8     16/2.9   17/11.1   17/10.7
                                              17/0.76  17/0.41             16/2.0    16/1.5

Besides "disposable income", another cause for false hits was discussion about monthly payments, which had co-occurrences of "per" and "month" with "payments".

This second combination also affected the manual queries, as "payments" stemmed to "pay" and all three terms were in the manual query:

!c! Monthly Income
#q1= #passage20( #phrase(take home pay)
                 #phrase(per month)
                 #phrase(total income)
                 #phrase(per week)
                 #phrase(monthly salary)
                 #phrase(monthly income)
                 earning ) );

Below we give the text from some of the relevant passages that discuss monthly income. These passages all give some indication of the level of income (or perhaps indicate the absence of income) without ever explicitly stating an amount:







  - "Stephens has two children and has subsisted on welfare and social security . . . " [court opinion 188]

  - "The debtors also stated that they were earning approximately $175 a month in excess of their expenses." [court opinion 260]

  - "the debtor would not have her degree and, presumably, would not enjoy her current income. . . " [court opinion 764]

  - "The Debtor's husband was unemployed for the first couple of years of their marriage. . . " [court opinion 915]

Finding a good set of terms that would match with these passages and not retrieve many non-relevant passages would be quite difficult. Use of the excerpt-ckb allowed for retrieval of three of these passages. Most of the retrievals were at ranks such that a user might be expected to search that deeply. (The containing passages in document 188 were not retrieved; from 260 they were ranked 1 and 2; from 764 they were ranked 14/15 by the bag of words query and 11/12 by the sum; and from 915 they were ranked 22/23 by the bag of words query and 28/29 by the sum.)

In summary, both the SPIRE and manual queries did very well. For all of the documents that mention the monthly income of the debtor, there were only two

documents where the top passage for either type of query was not relevant. When three passages were requested, there were nine documents where both types of query had the top three passages being relevant. Similarly, when requesting five relevant passages, both types had no false hits for five of the documents. Overall, though, the SPIRE queries were slightly better at all esl levels. Further, they were able to provide the requested number of passages in three documents where the manual query did not.

6.2.8 Plan filing date

Initially, the manual query outperformed the SPIRE queries, although there were cases where the SPIRE queries did do better. Once we identified that there were a large number of misleading terms in the excerpt-ckb and we removed them, the excerpt-based queries improved dramatically.

Overall, the manual and second excerpt-ckb queries were comparable for plan filing date. The manual query did a bit better, on average, at esl1, the SPIRE queries were better at esl3, and the picture at esl5 is less clear: the SPIRE queries were able to find the requested number of passages for one more document than the manual; however, if we discount that document, then the manual query had a better average. If we examine the average (non-interpolated) precisions, then the SPIRE queries did better. The SPIRE queries also had better precision values early in the rankings: the SPIRE query scores were quite a bit better at cutoffs of 1 and 3 passages. Table 6.36 displays average precision at cutoffs and non-interpolated precision. Each query type had documents where it did better than the other types. Table 6.35 gives the esl scores for the manual and two SPIRE queries using the second excerpt-ckb.

The difficulty experienced by all the queries in finding the plan filing date was partially due to the way in which the bankruptcy opinions express the date. Below are some relevant passages from the opinions:

Table 6.35. ESL values for the manual and SPIRE-generated queries for the second excerpt-ckb for plan filing date.

PLAN FILING DATE
         1 psg requested            3 psgs requested           5 psgs requested
Doc-ID   Manual  Bag of  Sum        Manual  Bag of  Sum        Manual  Bag of   Sum
                 Words                      Words                      Words
001       0.00    0.00    0.00      -1.00   13.00   17.00      -1.00   -1.00    -1.00
180       0.00    0.00    0.00       1.00    6.00    6.00       1.00    6.00     6.00
188       2.33    3.00    3.00       4.00    4.00    4.00       6.00    8.00     9.00
204       0.00    0.00    0.00       0.50    1.00    8.00       4.80   47.00    49.50
206       3.15    0.00    0.00       3.45    0.00    0.00       3.75    0.00     0.00
260       1.60    0.00    0.00       2.80    3.00    3.00       4.00    5.00     4.00
289       0.00    0.00    2.00       2.50   22.33   22.33      -1.00   -1.00    -1.00
353       0.00    0.00    0.00       0.00    0.00    0.00       0.67    1.00     0.00
407       8.00    1.00    1.00       8.00    7.00    7.00       8.00    7.00     7.00
427       5.00    0.00    0.00      14.71    3.00    5.00      20.14    8.00     9.00
442       0.00    0.00    0.00       1.33    1.00    0.00       6.00    2.75     1.75
472       0.00    3.00    2.00       4.67    3.00    3.00       8.33    3.00     5.00
693       0.00    0.00    0.00       1.33    0.67    0.67      12.67   11.00    18.00
733       0.00    0.00    0.00       0.00    1.00    2.00       2.00   12.00    15.00
751       3.33    0.00    0.00      13.17    0.00    9.00      17.50   14.00    15.00
764       8.00    8.00    8.00      74.50   13.00   19.00      -1.00  194.00   194.00
860       0.00    0.00    2.00       0.00    3.00    5.00       0.50    5.00     6.00
877       2.00   36.00   33.00      13.50   52.00   49.00      24.50   72.00    64.00
915       2.00    0.00    0.00       3.00    4.00    5.00      15.50    5.00     7.00
961       0.00    0.00    0.00       0.50    0.00    0.00       1.00    0.00     0.00
Avg       1.77    2.55    2.55      19/7.8   6.85    8.25      17/8.0  18/22.3  18/22.8

Table 6.36. Average precision at cutoffs and non-interpolated precision for plan filing date.

Cutoff   Manual   Bag of    Pct     t-test    Sum      Pct     t-test
                  Words     Chg                        Chg
1        0.5500   0.7500    36.4    0.1036    0.6500    18.2   0.4936
3        0.5000   0.5500    10.0    0.6142    0.5167     3.3   0.8628
5        0.4900   0.4900     0.0    1.0000    0.4700    -4.1   0.7547
10       0.4150   0.4150     0.0    1.0000    0.4050    -2.4   0.8098
15       0.3633   0.3500    -3.7    0.6739    0.3467    -4.6   0.5142
20       0.3200   0.3250     1.6    0.8866    0.3200     0.0   1.0000
Avg      0.3880   0.4170    7.47    0.4508    0.3900    0.50   0.9638

 

  - "At the time of filing the Chapter 13 proceeding," [case opinion 289]

  - "LeMaire signed a promissory note evidencing a debt to his parents of $12,722 only one day prior to filing his bankruptcy petition. Prior to this filing, LeMaire had . . . " [court opinion 860]

In neither case was a calendar date given, and additionally, the first text fragment was the only relevant text within one of the documents. We note that pattern matching techniques, or use of concept recognizers [7, 39, 49], would also be unable to locate these passages.

Plan filing date was the feature where we first noticed that many of the passages retrieved by the SPIRE queries contained the name of another court case. In particular, "Flygare" and "Okoreeh-Baah" appeared frequently among the retrieved passages. Consequently, we removed the text containing these terms, as well as "The debtors, Majid and Hasiba Ali", from the excerpt-ckb. The excerpt "Debtors-Appellants, Mr. and Mrs. Okoreeh-Baah, filed for bankruptcy on November 19, 1985." became simply "filed for bankruptcy on November 19, 1985." Since these names were at the beginning of the excerpt, the integrity of the rest of the excerpt was retained.

Besides the proper names at the beginning of three of the excerpts, several ended with "under Chapter 7" or similar text. Since the particular provision or statute under which the plan was being filed is not relevant to the date when the plan was filed, we also removed these portions of the excerpts. Altogether, we shortened 7 of the 10 excerpts. (The complete excerpt-ckb can be found in Appendix C.) Below are some of the reductions to the excerpt-ckb (each original excerpt is followed by its shortened form):



  Excerpt: "if the debtor were to seek dismissal of his chapter 13 case and file a petition under chapter 7"
  Became:  "if the debtor were to seek dismissal of his chapter 13 case and file a petition"









  Excerpt: "On January 21, 1986, Debtor filed her voluntary petition for relief under Chapter 7 of the Code."
  Became:  "On January 21, 1986, Debtor filed her voluntary petition for relief"

  Excerpt: "On September 24, 1982, the debtor filed with this court his petition for relief under Chapter 13"
  Became:  "On September 24, 1982, the debtor filed with this court his petition for relief"

The new queries showed a dramatic improvement in retrieving relevant passages. Four of the documents had esl values that dropped by twenty or more passages. At esl5, the average number of passages to be processed dropped by more than 10. Table 6.37 shows the average reduction across the twenty documents when using the second excerpt-ckb as compared to the original one. For both queries the change was significant. (Average precision for the bag of words query improved by 32% and had a t-level of .0015, and the sum query improved by 29% for a t-level of .001.)

Table 6.37. Average reduction in esl for the bag of words and sum for plan filing date when using the second excerpt-ckb.

              Average reduction in esl
ESL level    Bag of Words    Sum
1             2.30            3.95
3             9.85            9.70
5            10.53           10.94

Table 6.38 shows the original and new scores at esl3 along with the amount of change. For only five of the documents, at all three esl levels, was there any sort of negative effect on either one or both of the SPIRE queries. In most cases this was slight. For one document, 764, the scores at esl3 were worse with the reduced excerpts, but then were improved at esl5. The second excerpt-ckb caused this document to experience a large increase at esl3: the bag of words query rose by 5 passages and the sum query by 11.

Table 6.38. Differences between the base and second set of excerpts at esl3 for plan filing date.

PLAN FILING DATE
               Bag of Words                    Sum
Doc-ID   Original   New      Drop        Original   New      Drop
001       18.00     13.00     5.00        31.00     17.00    14.00
180       27.00      6.00    21.00        27.00      6.00    21.00
188       25.00      4.00    21.00        27.00      4.00    23.00
204       16.00      1.00    15.00        25.00      8.00    17.00
206        0.00      0.00     --           0.00      0.00     --
260        4.00      3.00     1.00        11.00      3.00     8.00
289       71.33     22.33    49.00        69.33     22.33    47.00
353        0.00      0.00     --           0.00      0.00     --
407       28.00      7.00    21.00        28.00      7.00    21.00
427        9.00      3.00     6.00         9.00      5.00     4.00
442        4.00      1.00     3.00         2.00      0.00     2.00
472        3.00      3.00     --           5.00      3.00     2.00
693        0.67      0.67     --           0.67      0.67     --
733        1.00      1.00     --           3.00      2.00     1.00
751        5.00      0.00     5.00         5.00      9.00    (4.00)
764        8.00     13.00    (5.00)        8.00     19.00   (11.00)
860       18.00      3.00    15.00        19.00      5.00    14.00
877       93.00     52.00    41.00        86.00     49.00    37.00
915        5.00      4.00     1.00         4.00      5.00    (1.00)
961        0.00      0.00     --           0.00      0.00     --

These changes were due to the removal of the term "appellants" from the excerpt-ckb. This caused one of the relevant passages to move from 11 to 17 with the bag of words query and from 9 to 23 with the sum query. The other cause for minor negative fluctuations in the rankings was the decreased number of times the term "chapter" appeared in the excerpt corpus. The second excerpt-ckb had four fewer instances of this term.

When we contrasted the esl scores from the SPIRE and manual queries, we noticed that there were instances where each does better than the other. The manual query ranked passages that contained both "filed" and "bankruptcy" quite high, while many times these passages talked about "the filing of the petition" without making any reference to the timing of when the petition was submitted. Many other highly ranked non-relevant passages discussed the filing of an objection or motion.

The SPIRE queries did both well and poorly when a date was mentioned, depending on the document. The excerpts contained many different month and year terms, and these could be mentioned throughout an opinion whenever a citation was given. Within some documents these calendar values provided many non-relevant retrievals; within others, they provided good matches. Typically, when SPIRE did better than the manual query it was due to a calendar match in conjunction with other matching excerpt terms. When we tried removing all references to dates from the excerpt-ckb (all day, month, and year terms), the retrievals were considerably worse. There were huge jumps for documents 764 and 001 at esl3; the scores went up by 144 and 275, respectively. At esl5, there were many scores that went up, some by 20 to 35 points. So, date information was definitely useful for retrieval, but it was not necessarily in the best format.

Consistently being able to rank highly those passages that were relevant was difficult for both types of query. Overall, the SPIRE queries generated from the second excerpt-ckb were about the same as the best manual query.

There were a number of documents where each type of query did much better than the other, and by a large margin (the excerpt-based queries did better on documents 427 and 915, and the manual query did much better on documents 204 and 877). In addition, there was one document in which the SPIRE queries were able to find the requested number of relevant passages and the manual query was not. Overall, the queries formed from the second excerpt-ckb performed comparably to the expert manual query. They were slightly better than the manual query when examining the top one and three passages and when requesting three relevant passages.

Plan filing date is a feature where the use of concept-based matching together with a set of terms would likely enhance the retrievals. Being able to specify within the query that belief should be given to passages that contain a calendar date would be useful. However, this would not retrieve all of the passages that were relevant to a date-type query, as shown above.
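The preceding paragraph suggests that a query operator giving belief to any passage containing a calendar date would help for a feature like plan filing date. Purely as an illustration of the kind of recognizer such an operator could be built on (nothing of the sort existed in the system used here), a small regular-expression date spotter handles the common explicit forms in these opinions while, as noted above, still missing the date-free wordings quoted earlier:

    # Illustrative calendar-date spotter of the kind a date "concept" operator
    # could be built on; it catches forms such as "November 19, 1985" and
    # "March 30, 1984" as well as bare four-digit years.
    import re

    MONTHS = ("January|February|March|April|May|June|July|August|"
              "September|October|November|December")
    DATE_PATTERN = re.compile(
        r"\b(?:%s)\s+\d{1,2},\s+(?:19|20)\d{2}\b|\b(?:19|20)\d{2}\b" % MONTHS)

    def contains_date(passage):
        return DATE_PATTERN.search(passage) is not None

    print(contains_date("Debtors did file their Petition on March 30, 1984"))  # True
    print(contains_date("At the time of filing the Chapter 13 proceeding"))    # False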

6.2.9 Profession

There were only three excerpts within our 13-document case-base that mentioned the profession of the debtor. Once the system had stopped and stemmed these excerpts, there were only 18 unique content words left. This was one of the smallest sets of terms that SPIRE had for use in matching. Even so, SPIRE still did quite well: the bag of words and sum queries were better than the manual at esl1 and esl5 and were comparable at esl3. Table 6.39 gives the esl values for the manual and SPIRE queries.

There was quite a fluctuation in performance when we examined average precision scores at cutoffs. The SPIRE queries were much better than the expert manual query for the first three passages, 33 and 44 percent better. Then, the manual query caught up and was better at five passages, but was worse again by 20 passages. (See Table 6.40.)

Table 6.39. ESL values for manual and SPIRE-generated queries for profession.

PROFESSION
         1 psg requested            3 psgs requested           5 psgs requested
Doc-ID   Manual  Bag of  Sum        Manual  Bag of  Sum        Manual  Bag of  Sum
                 Words                      Words                      Words
001       2.67    0.00    0.00      -1.00   30.00   30.00      -1.00   -1.00   -1.00
180       3.33    2.67    2.67      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
188      -1.00   10.00   10.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
204       0.00    0.00    0.00       1.00    2.00   10.00      -1.00   -1.00   -1.00
206       0.00    0.00    0.00       1.42    0.00    0.00       1.84    0.50    0.50
260       0.00    0.00    0.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
289       0.00    4.00    0.00       0.00    4.00    8.00       0.00    4.00    8.00
353       2.00    0.00    0.00      -1.00    4.00    8.50      -1.00    4.00   15.00
407       4.00    1.00    0.00       4.00    4.00    5.67       4.00    4.00   15.00
427      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
442       0.00    9.50    2.00       1.00   31.00   31.00      -1.00   51.14   51.14
472       2.00    0.00    0.00       4.00    0.00    0.00       6.33    3.00    3.00
693       1.60    0.00    4.50       2.80   12.50    5.00      -1.00   13.00    9.00
733      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00      -1.00   -1.00   -1.00
751       1.33    0.00    0.00       2.67   16.00    2.00       4.00   16.00    2.00
764      -1.00    0.00    0.00      -1.00    0.50    6.00      -1.00    5.00    6.50
860       0.00    6.00    4.40       0.00    6.00    7.20       0.40   12.40   25.00
877       0.00    0.00    0.00       0.00    0.00    0.00       0.67    0.00    0.50
915       0.00    0.00    0.00       2.00    3.20    9.00       4.00    4.40   11.20
961       0.00    0.00    0.00       2.00    0.00    0.00      -1.00    0.00    0.00
Avg      16/1.1  18/1.8  18/1.3     12/1.7  15/7.5  15/8.2      8/2.7  13/9.0  13/11.3


It would seem that the excerpts were able to provide excellent initial and final matches, but some of the intermediate values were not as good as would be hoped.

Table 6.40. Average precision at cutoffs and non-interpolated precision for profession.

Cutoff   Manual   Bag of    Pct     t-test    Sum      Pct     t-test
                  Words     Chg                        Chg
1        0.5000   0.6667    33.3    0.3313    0.7222    44.4   0.1631
3        0.4444   0.5000    12.5    0.7029    0.5370    20.8   0.4616
5        0.4667   0.4444    -4.8    0.8324    0.4333    -7.1   0.7430
10       0.4111   0.4222     2.7    0.8704    0.3556   -13.5   0.4136
15       0.3370   0.3519     4.4    0.7883    0.3333    -1.1   0.9387
20       0.2917   0.3250    11.4    0.4953    0.3250    11.4   0.4859
Avg      0.3271   0.3968   21.31    0.2633    0.3803   16.26   0.3812

Within the test set, two of the documents made no mention of the debtor's profession, while others devoted a significant amount of text to describing the current and previous types of employment held by the debtor(s). When judging, text that merely mentioned "employment history" without elaborating was not considered relevant. If these passages had been judged relevant, all three queries would have improved. There were many instances where the court opinion cited a job title, but the enclosing passages were not judged relevant. For example, the name of the judge presiding over the court would typically include the honorific "Judge" before giving the name. Passages that mentioned the lawyers and gave their names, or text that stated whether this particular plan would place a burden on the "trustee", were not considered relevant. Only text that spoke to the profession of a debtor was judged relevant.

The largest contributor to false positives for all of the queries was the phrase "employment history". Another contributor for the SPIRE queries was "position". It was sometimes used in the sense of stating an opinion. Court opinion 860 was particularly affected by this use of "position", as six of the ten passages containing

\position" were not relevant. The manual query retrievals were less a ected as the other query terms had higher belief scores. Interestingly, the use of \old" (found in the excerpt-ckb) sometimes appeared in close proximity to the profession of the debtor. This was because the opinions would give a short description of the debtor to include both the debtor's age and profession, for example: \forty-four year old dairy farmer" [court opinion 877] and \28-year old dental student" [court opinion 693]. While there were other uses of \old", only three of the bag of words retrievals were adversely a ected, and not by very much. The sum retrievals were not hurt by these multiple usages. Overall, it was better to have the term than not, showing that the terms surrounding a feature's value could prove useful. Based on our experience with giving a list of keywords for the procedural status manual query, we decided to try to improve the manual results for profession. While it was relatively easy to recognize the values for this feature it was much more timeconsuming to compile a comprehensive listing of possible values. We took the best manual query, which turned out to be quite short, and added in the titles of various jobs. The original query was only three terms: !c! Profession #q1 = #passage20( employed position worked);

We created three additional queries to test the effectiveness of enumerating job titles. Table 6.41 lists these three queries. We collected a set of 16 titles given for the debtors in our collection of 55 court opinions. These, plus the terms "employed", "position", and "worked", formed the first "list" query. Overall, this list query did quite poorly. There was one document that now had 107 non-relevant passages before reaching three relevant ones. This same document had had no non-relevant passages with the original query. There were a few documents


Table 6.41. List queries for profession.

List 1

secretary receptionist-secretary systems-analyst manpower budget-analyst federal-employee doctor dentist chiropractor psychotherapist mental health counselor veterinarian law-student instructor teacher unemployed

List 2

secretary #2(receptionist secretary) #2(systems analyst) manpower #2(budget analyst) #2(federal employee) doctor dentist chiropractor psychotherapist #3(mental health counselor) veterinarian #2(law student) instructor teacher unemployed


List 3

secretary #2(receptionist secretary) #2(systems analyst) manpower #2(budget analyst) #2(federal employee) doctor dentist chiropractor psychotherapist #3(mental health counselor) veterinarian #2(law student) instructor teacher unemployed student engineer #2(chemical engineer) director #2(export manager) photographer salesman #2(self employed) #2(branch manager)

where this query improved results, but the overall effect was very negative. Table 6.42 shows the results of all the list queries as well as the manual query. Generally, the cause for this degradation was the inclusion of the term "student". Because many debtors had outstanding student loans, passages containing "student" were unduly rewarded. Many opinions had discussion about the percentage of debt that was due to student loans, and others talked of plans that were or were not able to discharge a student loan debt.

For the next list query we placed restrictions on the proximity of the titles that were more than a single term, such as "mental health counselor" and "budget analyst". This considerably improved retrievals over the first list query. Now, where there were decreases in performance, they were much smaller.

The third list query was even longer: it included an additional 9 job titles that were found in one of the court opinions. In general, this query did worse than either the original or the second list query. Most of the decreases in performance can be attributed to the inclusion of the title "student".

Comparing the best of these queries, the second list, to the SPIRE queries showed them all to be about equal. The SPIRE queries were still a bit better, particularly at esl5. There were still documents where the manual query was unable to retrieve the requested number of passages while the SPIRE queries were able to do so. There were two reasons for this: 1) some good terms that were associated with the feature and not the value, such as "practice", were in the excerpts and not in the manual query, and 2) there was an occasional random matching of a term from the excerpts with a relevant passage.

Creation of the list queries was not worth the effort. They did not do better than the shorter, manual query, and merely raised the issue of knowing when to stop adding terms.


Table 6.42. ESL values for the original and three list queries for profession.

PROFESSION, 1 Passage Requested
Doc-ID   Orig Manual   List 1   List 2   List 3
001          2.67       25.00     6.00     4.00
180          3.33        0.00     0.00     0.00
188         -1.00        0.00     0.00     0.00
204          0.00        0.00     0.00     0.00
206          0.00        0.00     0.00     0.00
260          0.00        0.00     0.00     0.00
289          0.00        1.00     0.00     0.00
353          2.00        0.00     0.00     0.00
407          4.00        7.00     4.00     4.00
427         -1.00       -1.00    -1.00    -1.00
442          0.00        6.40     0.00     2.40
472          2.00        2.50     2.00     3.25
693          1.60       21.00     1.60    17.00
733         -1.00       -1.00    -1.00    -1.00
751          1.33        4.67     1.33     0.00
764         -1.00        2.00    -1.00    -1.00
860          0.00        0.00     0.00     0.00
877          0.00        0.00     0.00     0.00
915          0.00        2.00     0.00     1.00
961          0.00        0.00     0.00     0.00

3 Passages Requested
Doc-ID   Orig Manual   List 1   List 2   List 3
001         -1.00       -1.00    -1.00    -1.00
180         -1.00       -1.00    -1.00    -1.00
188         -1.00       -1.00    -1.00    -1.00
204          1.00        1.00     1.00     1.00
206          1.42        0.67     0.67     0.67
260         -1.00       -1.00    -1.00    -1.00
289          0.00        1.00     0.00     0.00
353         -1.00       22.50    -1.00    22.00
407          4.00        7.00     4.00     4.00
427         -1.00       -1.00    -1.00    -1.00
442          1.00        7.20     1.00     3.20
472          4.00        5.50     4.00     5.75
693          2.80       45.80     2.80    41.80
733         -1.00       -1.00    -1.00    -1.00
751          2.67        6.67     2.67     2.67
764         -1.00        2.00    -1.00    -1.00
860          0.00       19.00     3.00     9.00
877          0.00      107.00     0.00   107.00
915          2.00       51.67     0.83    42.67
961          2.00        0.00     0.00     0.00

5 Passages Requested
Doc-ID   Orig Manual   List 1   List 2   List 3
001         -1.00       -1.00    -1.00    -1.00
180         -1.00       -1.00    -1.00    -1.00
188         -1.00       -1.00    -1.00    -1.00
204         -1.00       -1.00    -1.00    -1.00
206          1.84        3.33     3.33     2.00
260         -1.00       -1.00    -1.00    -1.00
289          0.00       12.00     0.00     2.00
353         -1.00       -1.00    -1.00    -1.00
407          4.00        7.00     4.00     4.00
427         -1.00       -1.00    -1.00    -1.00
442         -1.00       -1.00    -1.00    -1.00
472          6.33        8.67     6.33     8.33
693         -1.00       47.40    -1.00    43.40
733         -1.00       -1.00    -1.00    -1.00
751          4.00        8.00     4.00     6.00
764         -1.00       53.33    -1.00    -1.00
860          0.40       19.00     3.00     9.00
877          0.67      107.00     0.67   107.00
915          4.00       53.00     2.50    44.00
961         -1.00        0.00     0.00     0.00


6.2.10 Special circumstances

Similar to the feature of sincerity, there was a high variance in the number of relevant passages on special circumstances found within each of our test documents. Some had no discussion at all, while others had paragraphs detailing mitigating circumstances. The average number of relevant passages was 12.4, but nine documents had 7 or fewer relevant passages, while two documents had more than 30. (Refer back to Table 4.8 for the number of relevant passages for each document in the test set.)

On the whole, the SPIRE queries did well. They met or exceeded the performance of the manual query on every document at esl1, were worse on only two documents at esl3, and were only worse on a total of four documents at esl5. Conversely, the manual query did not retrieve any relevant passages for three documents, did not retrieve at least three relevant passages for two additional documents, and did not retrieve at least five relevant passages for three more documents, while the SPIRE queries were able to do so. Table 6.43 gives the esl values for the test documents. (For reference, Table 6.44 gives average precision at cutoffs and non-interpolated precision.)

While the averages at esl5 look as though the SPIRE queries did poorly, there was one document that accounts for the majority of the difference, document 427. Reasons for this poor performance will be given below. If we exclude this document from the averages, the bag of words average becomes much lower than that of the manual query, and the sum average becomes a bit higher than the manual.

The sum query did much better than the bag of words query for special circumstances. This was due to the combination of "special" and "circumstances" appearing together in a single excerpt. This combination always indicated a relevant passage, and thus, for the sum query these passages were always the top rated ones.

Both the SPIRE and manual queries suffered from bad matches on either of "special" or "circumstances". "Circumstances" provided the greater number of false hits, as it was not unusual for the court to address the issue of good faith from the

Table 6.43. ESL values for the manual and SPIRE-generated queries for special circumstances.

SPECIAL CIRCUMSTANCES
            1 psg requested             3 psgs requested             5 psgs requested
Doc-ID   Manual  Bag of   Sum      Manual  Bag of   Sum       Manual  Bag of    Sum
                 Words                     Words                      Words
001       0.00    1.00    0.00      0.00    1.00     0.00      0.00    1.00      0.00
180       0.00    0.00    0.00      0.00    1.00     0.00     -1.00   12.50      7.50
188       0.00    1.00    0.00      0.00    1.00     0.00      1.50    4.00      4.00
204      -1.00   12.00   22.00     -1.00   13.00    23.00     -1.00   50.33     69.33
206       0.00    0.00    0.00      0.00    0.00     0.00     -1.00   26.00    103.00
260       0.00    0.00    0.00      0.00    0.00     2.00      2.00    0.00      2.00
289       9.33    9.00    6.00     -1.00   11.00    36.00     -1.00  197.00    192.00
353       0.00    0.00    0.00      0.00    0.00     0.00     -1.00   -1.00     -1.00
407      -1.00    0.00    2.00     -1.00    0.00    31.00     -1.00    1.00     49.50
427       0.00    0.00    3.00      6.00    9.00     4.00      6.00  105.00     82.00
442       0.00    1.00    0.00      7.00   16.00     8.00     -1.00   -1.00     -1.00
472       0.00    0.00    0.00      0.00    0.00     2.00      0.00    2.00     17.00
693       0.00    0.00    0.00      0.00    0.00     2.00     -1.00   88.00    111.00
733      -1.00   -1.00   -1.00     -1.00   -1.00    -1.00     -1.00   -1.00     -1.00
751       0.00    0.00    8.00      0.00    0.00    12.00      1.00    5.00     12.00
764      -1.00    0.00   12.00     -1.00   -1.00    -1.00     -1.00   -1.00     -1.00
860       0.00    0.00    1.00      0.00    2.00     3.50     30.50    3.00      8.00
877       0.00    0.00    0.00      0.00    0.00     0.00      0.00    0.00      0.00
915       0.00    0.00    0.00     -1.00    2.00    57.00     -1.00  167.50    165.50
961       0.00    0.00    0.00      0.00    0.00     0.00      0.00    0.00      0.00
Avg     16/0.6  19/1.3  19/2.8    14/0.9  18/3.1  18/10.0     9/4.6 16/41.4   16/51.4

Table 6.44. Average precision at cutoffs and non-interpolated precision for special circumstances.

Cutoff   Manual   Bag of Words   Pct Chg   t-test    Sum      Pct Chg   t-test
1        0.7895   0.7368          -6.7     0.6669    0.6316   -20.0     0.0828
3        0.7193   0.7193           0.0     1.0000    0.5789   -19.5     0.0880
5        0.5789   0.6211           7.3     0.4647    0.5053   -12.7     0.2326
10       0.3789   0.4211          11.1     0.4241    0.3368   -11.1     0.3227
15       0.2842   0.3404          19.8     0.1864    0.2667    -6.2     0.6352
20       0.2211   0.2763          25.0     0.0714    0.2184    -1.2     0.9105
Avg      0.3871   0.4603          18.89    0.2077    0.3748    -3.19    0.7037

perspective of "the totality of the circumstances". The SPIRE queries also had a large number of false hits due to "fraudulent" and "victim". These two terms accounted for a large number of highly-ranked non-relevant passages with the sum query. The term "victim" showed up in many bankruptcy opinions because one of the types of debt a debtor may be trying to eliminate was due to a civil or criminal judgment. The following sentence from court opinion 260 illustrates:

    Whether or not the judgment is considered to be a form of restitution, to allow the Chases to avoid the thrust of the judgment would be to undermine the criminal justice system and to profit by their false promise to pay the victim of a crime.

The excerpt-based queries retrieved the passages below, even though neither "special" nor "circumstances" was used to describe the situation. The retrieval was made because of matches with "surrounding" and "debts":

    This Court is concerned not only with the nature of the debt, but also with the link between the events surrounding the incurring of the debt and the subsequent petition for Chapter 13 reorganization.

The main reason SPIRE was able to retrieve many more relevant passages than the manual query was that it matched on either of "debtor" or "debt". Both of these were in the excerpt-ckb and provided matches deep in the rankings. Conversely, "problems", as in "unusual problems" or "exceptional problems", was not in the excerpt-ckb. "Problem" was in the manual query, and this caused the large difference in scores on document 427.

There were no obvious extraneous phrases or names that could be removed from the special circumstances case-base. There was one citation embedded within an excerpt, but we left excerpts intact. However, we did an experiment in which we removed this citation: In re Bellgraph, 4 B.R. 421 (Bkrtcy.W.D.N.Y.1980). The resultant queries performed slightly better at all esl levels. The largest amount of improvement was shown deep in the rankings: at esl5 one document improved by 9.33 and two others by 8.00.

In brief, the SPIRE special circumstances queries did well as compared to the manual query. They were able to retrieve relevant passages in many documents where the manual query was unable to do so. In total, there were seven documents where the manual query was unable to retrieve five relevant passages, while both SPIRE queries did so. However, when the manual query did retrieve enough relevant passages, they were generally quite early in the rankings. So, we have a trade-off: the SPIRE queries were able to retrieve the requested number of passages, but sometimes those passages were deep in the rankings (although for two of the test documents the top five passages were all relevant), while the manual query was frequently unable to retrieve the requested number of relevant passages, but when it could, those passages were at the top of the rankings.

6.2.11 Summary

SPIRE did well on most features. For only two of the features did the SPIRE queries do worse than the manual. For the other eight, SPIRE outperformed or scored as well as the carefully crafted manual queries. The SPIRE queries significantly outperformed the manual for sincerity, while the manual query significantly outperformed SPIRE on procedural status. Further, the SPIRE queries did much better on loan due date, although the differences were not statistically significant by a t-test at the 0.05 level (on average precision or the cutoff levels). The manual query was better than the SPIRE queries for future income, but not significantly so. For the other six features, the results were closer; mostly the SPIRE queries performed about the same as or slightly better than the expert manual queries. Table 6.45 gives a quick review of the differences found between the SPIRE (SP) and manual (M) queries.

The obvious reason for missing relevant passages was the lack of inclusion of important terms in the excerpt-ckb; the wording used to describe the feature may not have been in the small set of court opinions that derived the excerpts. For

Table 6.45. Review of results for the two best SPIRE query types and the manual query set. SP denotes the SPIRE queries and M denotes the manual.

Feature                Type                  Description
Duration               Numeric               All did well; bag of words better than M better than sum.
Monthly Income         Numeric               All very good; SP mildly better.
Sincerity              Boolean               SP significantly better. SP terms provided good context, occasional mismatches. M hurt by text on motivation being mixed with that on sincerity.
Special Circumstances  Boolean               Tradeoff: SP retrieved, but deep; M retrievals better, but missed many.
Loan Due Date          Date                  SP better. Few relevant passages. Term combinations helped SP.
Plan Filing Date       Date                  Excerpt date terms both helpful and harmful; need concept matching.
Debt-Type              Set                   Best results of all features; only one bad SP retrieval. List did not help M.
Profession             Set                   SP better. List did not help M.
Procedural Status      Category in keywords  M significantly better. Excerpt terms interfere with the ranking.
Future Income          Category              M better, but SP notably good too.


example, the excerpt-ckb for procedural status did not contain "affirmation", that for profession did not contain "worked", and the excerpt-ckb for special circumstances did not contain "problem". All of these terms would have helped during retrieval. While it is obvious that there were useful terms missing from the excerpt-ckbs, what is less obvious is what terms should be included to assure retrieval of some of the other relevant passages (while not bringing in many non-relevant ones). There were several relevant passages for monthly income (discussed in Section 6.2.7) where selection of a good set of query terms would be quite difficult.

Calendar dates pose another question for term selection. Queries that contained the set of day, month, and year terms performed horribly. There were too many passages where these individual terms could match, and the value given to the terms was not proportional to their value when compared to the other query terms. Use of a concept recognizer, matching patterns or regular expressions, that assigns credit to passages that contain a date, rather than credit for each individual term, would be better.

We found several reasons why relevant passages would end up low in the rankings. First, the weighting of the query terms was not quite correct for a few of the retrievals. Some of the more descriptive terms might not be given enough value to place the passages containing them high enough in the rankings. One item for future work would be to compare these results to those obtained when our homogeneous document collection is a part of a larger, related, but more heterogeneous collection. The second reason for relevant passages to end up low in the rankings is that there might not have been enough context terms to distinguish between good word usages and bad ones. This could be ameliorated by using some syntactic knowledge of term usages when processing the query terms. The third reason is that the passages would coincidentally contain a matching term. This was occasionally the case and allowed several relevant passages for special circumstances to be retrieved; their matches were

not really based on any terms that were descriptive of the feature.

In summary, the following are the possible reasons for low-ranked relevant passages:

- weighting of query terms was not quite correct,
- bad combinations of good terms, or
- a coincidental match retrieved a relevant passage.

On the other hand, there were two reasons we could discern for false positives:

1. bad combinations of good terms, and
2. good words or phrases that had additional contexts or usages.

The first situation, bad combinations of good terms, can be illustrated with the feature of future income: "regular" and "income" both occurred in the excerpt-ckb but retrieved passages about a distinctly different feature. Similarly, for duration, "unsecured" and "creditors" were misleading, particularly when they occurred together. Duration also suffered from the second problem, that of having good phrases used with an alternative meaning. The phrase "proposed to pay" was sometimes followed by an amount or the type of creditor, rather than a time frame. Profession had bad matches using the word "old" because the text was not descriptive of the characteristics of the debtor.

Besides the obvious reason for finding good matches, that of having good terms in the excerpt-ckb, there were the occasional coincidental hits. Profession benefited from this in one document where "thirty-two year old dentist" from the excerpt-ckb matched "two children and has subsisted on welfare and social security" [court opinion 188]. While there were some instances of coincidental matches, these did not make up the bulk of the retrievals.

In general, the excerpt terms provided good matches, and in some situations they were excellent. For the feature where there was a short list of keywords, the excerpt terms merely added noise to the retrievals. The use of local context, as given in the form of an excerpt-ckb, for generating queries succeeded in retrieving relevant passages, and usually quite high in the rankings.
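To make the query-generation step concrete, the sketch below shows one plausible way to turn an excerpt case-base into the two best-performing query types, a bag of words query and a sum query. This is only an illustration, not SPIRE's actual code: the tokenizer, the stop-word list, and the choice to wrap each excerpt's terms in an INQUERY #sum operator inside a #passage20 operator are assumptions made for the example.

# A minimal sketch of excerpt-based query generation (illustrative only).

STOPWORDS = {"the", "of", "a", "an", "to", "and", "in", "on", "for", "was", "is"}

def tokenize(text):
    # Lowercase and strip simple punctuation; stemming is assumed to happen elsewhere.
    return [t.strip(".,;:()\"'").lower() for t in text.split()]

def content_terms(excerpt):
    return [t for t in tokenize(excerpt) if t and t not in STOPWORDS]

def bag_of_words_query(excerpts, window=20):
    # Pool every term from every excerpt, keeping duplicates so that terms
    # appearing in several excerpts contribute more weight.
    terms = [t for ex in excerpts for t in content_terms(ex)]
    return "#passage%d( %s )" % (window, " ".join(terms))

def sum_query(excerpts, window=20):
    # Treat each excerpt as a unit of evidence by wrapping its terms in #sum.
    groups = ["#sum( %s )" % " ".join(ts)
              for ts in (content_terms(ex) for ex in excerpts) if ts]
    return "#passage%d( %s )" % (window, " ".join(groups))

excerpt_ckb = [
    "His take home pay hovers around $400 per month.",
    "Her monthly salary is $1,068.00.",
]
print(bag_of_words_query(excerpt_ckb))
print(sum_query(excerpt_ckb))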

6.3 Summary

Two of the base queries performed the best of all the query types created from the excerpts. The bag of words and sum queries employed minimal structure and usually achieved quite good results.

The phrases we added were mixed in their utility. If the relevant text for the feature was frequently indicated by a few noun groups, then the phrases would work well. Unfortunately, there were features where this was not the situation and the inclusion of the noun groups was detrimental.

Results from the set of words queries clearly indicated the usefulness of employing frequency information when forming queries (or the harm from removing this information). If a term appeared multiple times in a query, that information should be captured and used to boost the score of matches based on this term.

The use of the Kwok-based weighting scheme for selecting good terms did not work well with our excerpts. There were several reasons for this. The scheme was devised as a means of reducing query ambiguity. The idea was to boost the values of a limited number of terms, generally less than three, and to do so in situations when there were only a few query terms available. Our implementation altered the scores of a much larger set of terms. Further, we did not treat pairs of terms or phrases as atomic elements when assigning weights. Being able to do so might have given this scheme better retrieval results.

The selection of only a fraction of the excerpt terms to create a query also did not perform well. Perhaps if we had allowed for selection based on the frequency with which the terms appeared in the case-base, and allowed for selection with replacement, the results might have improved. For now, use of the entire excerpt-ckb seems to be the better method.

Manually creating queries composed of a long generic list of value keywords did not perform as well as might be expected. Use of the excerpt-ckbs outperformed these lists. These lists pose a complementary problem to the human, who must decide when to stop adding items to them. Conversely, when there exists a short, domain-specific list, particularly one of technical keywords, it is best to manually compose the query. This was the situation with the procedural status feature. There are only a small number of possible outcomes that can occur when deciding a court case. These outcomes are easily enumerated and work quite well to find other instances of the feature.

Finding passages that discuss relevant dates without explicitly mentioning a calendar date was hard to do with either SPIRE's or a manual approach. Use of a concept recognizer would help with these features. Additionally, concept recognizers could be used to help retrievals for other features, specifically those that mention monetary amounts.

The number of relevant passages in a document was not necessarily indicative of how easy or hard it would be to retrieve them. A typical example comes from the duration feature. Document 206 had 14 relevant passages, yet the SPIRE queries were unable to locate 5 relevant ones until an average of 18 non-relevant passages had been read. Conversely, for document 188, there were 7 relevant passages and only an average of 2 non-relevant ones had to be processed. On a larger scale, future income had only the sixth largest average number of relevant passages, yet scores were among the highest across all of the features (for all of the bag of words, sum, and manual queries).

We did not require minimum numbers of matching terms in order to retrieve a passage. If we had required some minimum number of such terms, then some of the manual queries would have suffered: many of the retrievals matched on only a single term. Requiring additional matches would place an even tougher burden on the user: instead of creating a query of many terms, the user must now focus on coming up with two or more words that will be found together. This is a much harder task. Because this task would be more strenuous on the user, and our objective is to reduce the burden placed on the user, we refrained from requiring minimum numbers of matching terms.

Removal of proper names from the excerpt-ckbs proved to be advantageous. Removing some extraneous text also benefited the excerpt-based queries. This shows that some level of caution should be taken when building the case-bases. The question of retaining excerpt integrity still remains: is it reasonable to break an excerpt into multiple parts to remove a name (a case citation) or other text from it? This would keep the case-base more on-point, particularly if the removed text tends to appear within the discussion of many different features and is not tremendously indicative of the feature at hand, or for that matter may not be indicative of any feature at all.

Creating the case-bases of excerpts did not require extensive knowledge of the domain, as was typically necessary to generate the manual queries. Further, it was reasonable to use non-domain experts to create the excerpt-ckbs. However, some care should be taken to minimize the inclusion of extraneous text. Being a bit careful to ensure that only those portions of the text that address the feature and its resultant value are retained can make a significant difference in the results. Even more importantly, the excerpts should be checked to exclude as many proper names as feasible. For these features, they provided no added information and, in fact, were detrimental to locating relevant text.


CHAPTER 7

CONCLUSIONS AND FUTURE WORK

7.1 Conclusions

If CBR systems are to succeed in commercial environments, then they will have to have case-bases that are comprehensive. Depending on the domain, this may mean hundreds or even thousands of cases. If the cases have to be manually constructed, then it is unlikely that the case-base will ever be mature enough to be deployed since the labor costs will be prohibitively expensive. In order to field these systems, the overhead for case construction must be reduced, the amount of labor must drop, and the expense of expert labor must be minimized.

We have shown one means for reducing this cost: rather than reading or processing an entire document to create a case, we focus a user or IE system on those passages most likely to contain information about a requested case feature. To do this we use a case-base of excerpts to derive passage-based queries, which are processed by an IR system over a single document at a time.

The approach is novel in two respects. First, it specifically retrieves passages and not documents. In particular, this approach retrieves very small elements, twenty-word windows. These are much smaller elements than those previously tested. Second, we used a set of excerpts to automatically construct our queries. In some sense we have made the first attempt at relevance feedback at the passage level.

We have demonstrated that these case-bases of excerpts work quite well and that they even do as well as or better than manually constructed expert queries. A small set of excerpts is sufficient to retrieve relevant passages for a feature of interest.

The two simplest queries, the bag of words and sum queries, met or exceeded the performance of their manual counterpart for the majority of features. Similar to the use of relevance feedback to generate a query that will retrieve related documents, use of these excerpts demonstrates that we can reuse prior context at a lower level to retrieve relevant passages. The cost of generating the excerpt-ckbs is not great since they can be built simultaneously with the case representation. We have thus greatly reduced the burden placed on the user to generate an ideal set of terms to describe an information need. Further, the user need not be concerned with how to optimally assign and combine words and phrases using the set of available query operators to form a query. This can be done automatically. By retrieving relevant passages with the excerpt-based queries we have reduced the burden placed on the user who would have had to read the entire document. When the documents are lengthy and the cost of the processing prohibitive, this method provides a low-cost alternative.

7.2 Future work

The approach, as executed by SPIRE, locates relevant documents and then passages, but these passages are consecutive windows and are not based on thematic or discourse units. This method is fine for indexing and retrieval, but presentation to a user may be confusing. Additionally, we have yet to take advantage of information based on the structure of the excerpt during retrieval, nor have we incorporated any domain-specific knowledge into the process. Techniques that increase access to information will enhance its value. Below are several short-term projects that would help reduce the amount of text presented and enhance retrieval accuracy while minimizing user effort:

- refine the query formulation process,
- construct a theory of relevance feedback that addresses elements smaller than full-length texts,
- design interfaces to display results, and
- classify documents that do not contain information about the feature.

We elaborate on each of these areas below.

7.2.1 Query construction

7.2.1.1 Refining the definition of relevant

For most of our features, we have done well at constructing queries that locate text about the feature. The next step is to distinguish text that speaks to the value of a feature from text that merely discusses the feature. This will require effort to determine how these two sets of text can be automatically distinguished from each other. Finding the means to do so would assist not only in achieving better retrievals, but in aiding an information extraction system. We also want to distinguish text that discusses a prior event from a current one. If we can manually segregate the text into these various categories, then we can see what distinguishing characteristics each set of text may have so that we may automate the process.

We have not directly addressed the situation where multiple passages are needed to discern a value. Future work will examine these sets of passages and their associated features for methods of generating multiple queries and retrievals, or other methods to notify the user that the solution spans more than one passage. It may be that the passages are in close proximity to one another. Were that the case, it would be a simple matter of either expanding the passage size or increasing the amount of the presented material.


7.2.1.2 Query expansion

We currently only use information gleaned from the excerpts to form our queries. We have not yet explored how to use information from across the entire collection to find terms related to those in the excerpts. One technique is to apply query expansion. We could try using a thesaurus to add new terms. Alternatively, we could further refine our expansion operation to take advantage of domain knowledge by building an association thesaurus [28].

We have several choices of database from which to build our association thesaurus. We could use either the entire collection or just those documents related to the problem case as found in the case knowledge base. Another possibility would be to build the thesaurus from an intermediate collection, such as the one comprised of the documents retrieved by SPIRE with the document-level query. Thesauri built from the second and third collections would be highly domain- and problem-specific.
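As a rough illustration of the kind of statistic an association thesaurus could be built from, the sketch below counts term co-occurrences within a fixed window and expands a query with each term's strongest associates. The window size, the use of raw counts rather than a normalized association measure, and all function names are assumptions for this example; it is not the construction described in [28].

# Illustrative co-occurrence "thesaurus" (not the method of [28]).
from collections import Counter, defaultdict

def cooccurrence_counts(documents, window=20):
    # Count how often two terms appear within the same window of text.
    counts = defaultdict(Counter)
    for doc in documents:
        tokens = doc.lower().split()
        for i, term in enumerate(tokens):
            for other in tokens[max(0, i - window):i + window + 1]:
                if other != term:
                    counts[term][other] += 1
    return counts

def expand_query(query_terms, counts, per_term=3):
    # Add the few terms that most often co-occur with each query term.
    expanded = list(query_terms)
    for term in query_terms:
        expanded.extend(t for t, _ in counts[term].most_common(per_term))
    return expanded

counts = cooccurrence_counts(["the debtor proposes to pay the trustee each month"])
print(expand_query(["trustee"], counts))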

7.2.1.3 Learning

Learning techniques have been successfully applied to IR tasks such as routing and filtering [22, 23]. Learning techniques could be used to decide which excerpts, or even portions of excerpts, should be included in a passage query. There are several drawbacks to this approach. The most prominent is the need for additional training data. We do not want to burden the user with the task of generating large amounts of training data, and we do not want to "over-train" our queries such that they work exceptionally well on the training texts, yet are too tailored to do well over other documents.

We have not yet learned when it is appropriate to add new excerpts to the case-base. Our results have already shown the need to keep information about terms that appear multiple times in the excerpt case-base. However, we need to be careful that we do not add too many copies of closely related excerpts so that some terms dominate and overwhelm the retrievals. We want to ensure that we have coverage of the most typical means of expressing the feature, yet allow for the retrieval of possible exceptions.

7.2.1.4 Stopping and stemming

We used a stopped and stemmed collection for our experiments. This was based on a set of experiments that compared collections built by varying whether or not we stopped and/or stemmed. The results were sufficiently ambiguous that we chose to use the stopped and stemmed version. This does not preclude us from changing to an unstopped or unstemmed version when we attempt to make greater use of prepositions and their proximity relationships to important terms. We have noted many instances where the phrase we would like to match includes a preposition or, in some situations, a small set of related prepositions. Other work has shown the benefit of using stop words [52]. Future work should examine methods of automatically identifying these good phrases rather than the noun groups that are more typically designated by phrase identification tools.

7.2.1.5 Concept recognizers

Concept recognizers that might be useful are monetary amounts, dates, and time periods. Specific dollar amounts show up in many of the features in our set, and actual values are frequently found in the text. Example features with monetary amounts are monthly payments (toward either a bankruptcy plan or a loan), monthly income, monthly expenses, amount of the debt, income from the home office, etc. A monetary concept recognizer would simply scan for instances of a money symbol followed by a numeric value, or the terms associated with currency, such as "dollar", "pound", etc., along with a value.

The concept of time periods, such as weekly, monthly, annually, and yearly, likewise shows up in our domains with some regularity. Besides the descriptors just mentioned, the phrases "per week", "per month", and "per year" would additionally be a part of the time period concept.

In a related vein, posing queries that request matches to an open-ended concept, such as a date, monetary value, or time period, could provide some assistance in better locating text that contains the value of a feature. Being able to request a match that includes any instance of a specified concept would be better than trying to enumerate the myriad ways of giving the value in the query. Requesting a match on any instance of the concept could be a big gain.
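As an illustration of how lightweight such recognizers could be, the sketch below matches the three concepts with regular expressions and credits a passage once per concept rather than once per matching token. The patterns are hypothetical and far from complete coverage; they are a plausible starting point, not the recognizers proposed above.

# Hypothetical concept recognizers (illustrative patterns only).
import re

MONEY = re.compile(r"\$\s?\d[\d,]*(?:\.\d{2})?|\b\d[\d,]*\s+dollars?\b", re.I)
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+\d{4}\b"
    r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b")
PERIOD = re.compile(
    r"\b(?:per\s+(?:week|month|year)|weekly|monthly|annually|yearly)\b", re.I)

def concept_credit(passage):
    # One unit of credit per concept present, regardless of how many
    # individual tokens matched.
    return {
        "money": bool(MONEY.search(passage)),
        "date": bool(DATE.search(passage)),
        "time-period": bool(PERIOD.search(passage)),
    }

print(concept_credit("Debtor proposes to pay $144.00 per month for 36 months"))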

7.2.2 Relevance feedback

In traditional relevance feedback we indicate to the IR system a set of documents that is relevant to our information need and, possibly, a set that is not. Using these tagged documents, the system decides which "features" (terms or pairs of terms found in varying proximities) are the most likely to benefit retrieval if they were added to the query. The system uses information about the totality of each document and the document set as a whole as the basis for the statistics used in feature selection.

Now that we have available passages that are being marked as relevant, how can we use this information to adjust our query? What does it mean to say that a "passage" is relevant? How should we treat a relevant passage when gathering statistics for relevance feedback?

To enhance passage retrieval we should examine the various levels from which to draw our passage statistics: whether a document and all of its passages are to be considered as a single collection, generating its own statistics, or whether we use statistics from the total collection of documents. A third option is to treat the collection as a set of passages, rather than as a set of documents. This is similar to what is done in PIRCS [34] and Clarit [15], which break large documents into smaller, more uniform-length pieces for processing.


Future work should explore in more depth the question of how the statistics gathered at each of these levels affect retrieval of passages. Using this information we can determine what an ideal value should be for an inverse "document" frequency under various conditions. It may be that "one size fits all" and that using document-level statistics is ideal regardless of the size of the element being retrieved or fed back.
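A hedged sketch of what such an experiment might look like follows: candidate feedback terms from relevant passages are scored against an inverse "document" frequency whose scope (whole documents, the passages of one document, or all passages in the collection) is simply a parameter. The scoring formula and the names are illustrative assumptions, not a worked-out theory of passage-level feedback.

# Illustrative passage-level feedback with a pluggable idf scope.
import math
from collections import Counter

def idf(term, units):
    # 'units' can be token sets for documents, for the passages of one
    # document, or for every passage in the collection.
    n = sum(1 for u in units if term in u)
    return math.log((len(units) + 1.0) / (n + 1.0))

def feedback_terms(relevant_passages, units, top_k=10):
    # Score candidates by frequency in the relevant passages times idf,
    # a crude Rocchio-like selection applied to passages instead of documents.
    tf = Counter(t for p in relevant_passages for t in p)
    scored = {t: tf[t] * idf(t, units) for t in tf}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

docs = [{"debtor", "plan", "income"}, {"court", "plan"}, {"income", "salary"}]
print(feedback_terms([["monthly", "income", "salary"]], docs))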

7.2.3 Display of context

Were we to present passages to the user for consideration, the user would need to see some amount of context. Therefore, we must learn how to display both individual and overlapping passages (regions of retrieved text) so that a user can easily identify what is important (i.e., the feature's value or the text alluding to the value). Additionally, we should provide visualization of the document and the related passages to show where they appear throughout, thus aiding in discerning relationships between the feature and the document.

There are numerous interface issues when dealing with the presentation of results to a user. For example:

- How many retrieved passages should be shown to a user?
- In what format should they be shown?
- Do we present a passage, a sentence, a paragraph, or some kind of listing with the highest ranked results?

Additionally, we have the ability to selectively highlight portions of a text within a display. What portion of the text should be displayed and what amount should be highlighted? Do we display only the passage itself, or do we include additional text to possibly provide additional context? Do we expand the passage in each direction so that complete sentences are shown? Do we further expand so that the user sees only complete paragraphs? Or, finally, do we just display the entire document and then highlight selected portions of it?

Assuming that we use highlighting to focus the user's attention, what text do we emphasize? From less to more highlighting, some of our options are: emphasizing starting at the first matching word within a passage, the entire retrieved passage, the sentence(s) surrounding and containing the passage, or the paragraph(s) that do the same. One final consideration is the possibility of only emphasizing those words that matched query terms. If we use any sort of query expansion technique, we should distinguish between those items that were in the original query and those added by expansion. All of these questions need to be addressed if we are to have a viable system that real users will adopt.

7.2.4 Classification of documents

One limitation of our approach is that we are not able to automatically determine when there is no passage within a new document that will satisfy our information need. We cannot currently segregate those retrieval instances where there are no relevant passages to be found from those where there are. This is currently true of most information retrieval systems and is related to the task of text classification [38, 78, 35, 41].

Because we rank order the passages, we are able to state that the top-most passage is the one most likely to contain information relative to the examples or queries we have presented to the system. If the requested information is not present in the top ranked items, and we believe our retrieval algorithm to be sound, then we would like to be able to state that the information is not present.

Contributing to this problem is the inherent ambiguity of language. There are many different ways of expressing or relaying the same meaning. We may categorize a document as not containing any information about our feature, yet we are unable to know whether there was pertinent information present, albeit in an unfamiliar form. It may well be that the information exists in the document in a form with which we are not acquainted; our exemplars do not cover this particular situation. Therefore, unless we have a "perfect" feature identifier that is able to encompass all current and future means of expressing the feature and its value, we will be bound to miss some of the relevant passages. The best we can do at this point is to state that, if the information is in a form we can recognize based on our set of exemplars, it is most probably found within the top set of returned passages. This problem of distinguishing documents that contain no information from those that do is an area for future work.
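One naive placeholder for that future work would be a simple score threshold over the top-ranked passages, as sketched below; the threshold, the cutoff of five passages, and the idea of tuning such a threshold per feature and collection are all assumptions made for illustration.

# Illustrative "feature absent" decision rule (a sketch, not a rule the
# dissertation evaluates).

def feature_probably_present(ranked_scores, threshold, cutoff=5, min_hits=1):
    # Declare the feature present when at least 'min_hits' of the top
    # 'cutoff' passages score above a tuned threshold.
    top = ranked_scores[:cutoff]
    return sum(1 for s in top if s >= threshold) >= min_hits

print(feature_probably_present([0.62, 0.41, 0.40, 0.39, 0.35], threshold=0.5))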

7.3 Long term outlook

We have taken one step forward in reducing the amount of human processing required to convert a document into a symbolic representation. We have profited by combining a small case-base of excerpts with an information retrieval system to winnow down the amount of presented text. Automatically extracting the data found in these passages, converting it into information, and filling in the case-frames without human intervention are the next steps in fully automating the conversion.

One means of doing so would be to integrate SPIRE with a natural language processing system. In this scenario, SPIRE would play two roles. In the first, it would inform the natural language processing system of relevant documents for use in building a domain-specific lexicon. Many NLP systems require a corpus of documents applicable to each new domain. Using the first phase of SPIRE's system, we could retrieve a set of documents and hand these over to the NLP system for constructing its lexicon. In the second role, where feasible, SPIRE would send the retrieved passages, or an appropriate document element, to an information extraction system to have values extracted.

As IE technology advances, there may be many more opportunities where joint efforts will further reduce the amount of manual processing needed to convert a text into a format more easily manipulable by rule-based, knowledge-based, and object-oriented systems.


APPENDIX A

SAMPLE CASE-FRAME

The tables below give each slot in the case-frame from the bankruptcy "good faith" domain. Also given is the type of each slot. Examples are given for the slots of type set and category. The first entry in each table is the name of the object class that contains the slot; it is signified with a pair of parentheses.

Three slots were redundant, as they had the same values but were contained within different objects: Proposed-Payments, Profession, and Surplus. Proposed-Payments can be found in Payments and held the same value as Amount. Profession was in both the Honest-Debtor-Case and Student-Loan-Case, and Attempts-To-Pay was in Bankruptcy-Case and Makarchuk-Factors. There were also three slots that were not implemented: Citation-Links in the Legal-Case, Payments-Already-Made in Plan-Payment, and Judgment-Type in Judgment-Debtor-Case. Overall, there were 67 slots in the representation, of which three were redundant and three were not implemented, for a total of 61.
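For readers who prefer code to tables, a small rendering of a few of these slots as Python classes follows. The slot names mirror the figures below, but the classes, their types, their defaults, and the nesting of the Estus factors inside the bankruptcy case are only an illustrative sketch, not the system's actual frame implementation.

# Illustrative only: a partial rendering of the case-frame as Python classes.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EstusFactors:
    duration_of_plan: Optional[int] = None        # numeric
    special_circumstances: Optional[bool] = None  # boolean
    trustee_burden: Optional[bool] = None         # boolean

@dataclass
class BankruptcyCase:
    chapter: Optional[int] = None                 # numeric
    plan_confirmed: Optional[bool] = None         # boolean
    plan_filing_date: Optional[str] = None        # date
    attempts_to_pay: Optional[bool] = None        # boolean
    estus_factors: EstusFactors = field(default_factory=EstusFactors)

case = BankruptcyCase(chapter=13, plan_confirmed=False)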


Slot Name                            Slot Type

legal-case ( )
citation                             free text
year                                 numeric
level                                category: :bankruptcy-court :appeals-court
judge                                proper name
summary                              free text
procedural-status                    category: 'plan-confirmation 'plan-filing
decision-for                         category: 'debtor 'creditor
citation-links                       not implemented
factual-prototype-link               category: 'student-loan 'farm
alternative-factual-prototype-link   same as above
legal-prototype-link                 category: 'Estus 'Kitchens
legal-theory-link                    set: 'memphis-theory 'estus-theory

Figure A.1. Top-level legal case representation along with object class and sample values.

bankruptcy-case ( )
chapter                              numeric
plan-confirmed                       boolean
past-filings                         boolean
chapter-7-filing-date                date
plan-filing-date                     date
unfair-manipulation-of-code          boolean
attempts-to-pay                      boolean

Figure A.2. Top-level bankruptcy case representation.

estus-factors ( )
duration-of-plan                     numeric
preferential-creditor-treatment      boolean
secured-claims-modified              boolean
special-circumstances                boolean
frequency-relief-sought              numeric
trustee-burden                       boolean

Figure A.3. Estus-factors representation for plan duration.

estus-factor-payments-and-surplus ( )
proposed-payments                    numeric
surplus                              numeric

Figure A.4. Estus-factors representation for payments and surplus.

estus-factor-employment-history-prospects ( )
employment-history                   category: :poor :neutral :good
earnings-potential                   category: :small :medium :large
likelihood-income-increase           category: :unlikely :likely

Figure A.5. Estus-factors representation for employment history.

estus-factor-plan-accuracy ( )
plan-accuracy                        boolean
inaccuracies-to-mislead-court        boolean

Figure A.6. Estus-factors representation for plan accuracy.

estus-factor-debt-type-and-discharge ( )
debt-type                            set: 'auto-loan 'fraud
nondischarge-7                       boolean or category: :after-5-years

Figure A.7. Estus-factors representation for debt type.

estus-factor-motivation-sincerity ( )
motivation                           category: :discharge-educational-debt
sincerity                            boolean

Figure A.8. Estus-factors representation for motivation and sincerity.

makarchuk-factors ( )
relative-timing                      category: :before :after
relative-total-payment-amount        category: :greater :equal :less-than
relative-monthly-payment-amount      category: :greater :equal :less-than
use-of-skills-gained                 boolean
attempts-to-pay                      boolean
relative-educational-loan-debt       numeric
de-minimis-payments                  boolean
other-relevant-consideration         set: :stress :medical-expenses

Figure A.9. The various Makarchuk factors.

plan-payment ( )
substantiality                       boolean
monthly-income                       numeric
monthly-expenses                     numeric
amount                               numeric
surplus                              numeric
percent-of-surplus-income            numeric
percent-repayment-unsecured-debt     numeric
payments-already-made                not implemented

Figure A.10. Plan payment representation.

debt ( )
debt-type-amount                     set of pairs (debt-type amount) (category: numeric)
secured-amount                       numeric
unsecured-amount                     numeric
total-amount                         numeric
percent-secured                      numeric
percent-unsecured                    numeric
percent-educational                  numeric
otherwise-dischargeable              boolean
date-repayment-obligation-commences  date

Figure A.11. Debt type and amount representation.

student-loan-case ( )
profession                           set: 'dentist 'teacher
dropout                              boolean
change-in-field                      boolean
loan-due-date                        date

Figure A.12. Student loan bankruptcy case representation.

honest-debtor-case ( ) profession set: 'dentist 'teacher

Figure A.13. Honest debtor bankruptcy case representation.

judgment-debtor-case ( )
judgment-type                        not implemented

Figure A.14. Judgment debtor bankruptcy case representation.


APPENDIX B

INSTRUCTIONS

Readers were given instructions that described the type of terminology they should underline within each text. Also included were several sample segments of text that illustrated the feature. The sample fragments were drawn directly from the respective excerpt-ckbs. Below are the instructions as given to the undergraduates, excluding the sample fragments. The excerpt-ckbs can be found in Appendix C.

B.1 Sincerity

This is a tricky one; it's a judgement call. This should be text that mentions whether or not the debtor(s) really intended to pay off their debts, is making an honest effort, or isn't trying to deceive the court. Some of the sample segments below don't make much sense unless they are in the context of the rest of the case. That's fine. Mark those text segments that seem to describe the "sincerity" of the debtor, whether indicating that the debtor might or might not have been sincere.

B.2 Special circumstances

Text that describes any unusual events or factors in the debtor's life that may have led to the bankruptcy or may affect their ability to repay debts. These may be things like: medical issues or expenses, moving to some location where the cost of living exceeded the debtor's expectations, being in prison, stress-related problems, an inability to get a job ("thwarted-professional-expectations"), a pending divorce, etc.

B.3 Plan duration

Text that describes the length of the plan - usually it is given in months, but it might also be described in years.

B.4 Monthly income

Please mark text that describes how much money the debtors are earning. This could be in terms of weekly, monthly, or annual income. Terms such as "income", "salary", and "pay" might be given. Additionally, there may be discussion about the income that doesn't give any exact amount of the income. This is acceptable text as it is describing or discussing the debtor's income.

SPECIAL NOTE: Disposable income is not the same thing as monthly income. Disposable income is all the money that is left over after expenses are paid. Please do not mark text that describes disposable income.

B.5 Loan due date

Highlight text that describes when a debtor should start making payments on a loan. The text might not give a date, but instead describe a time period when payment should or did commence, i.e., "payment should begin 2 years after graduation". In all cases, there should be some reference to a date, whether explicit or implicit.

B.6 Plan filing date

Text that describes when the debtor filed for bankruptcy. Notice that the first sample segment of text does not actually include a date, but does talk about filing a plan (petition).


B.7 Debt type

Text that describes the nature of money that the debtor owes. The keywords "secured" and "unsecured" will occur frequently throughout the texts, as they describe the two general classes of debts. To mark a segment with either of these terms, they need to be in sentences where they are descriptive of the debt "type" in order to be relevant. More specific types of debts we have previously assigned to cases include:

Table B.1. Previously encountered debt types.

student-loan            medical
educational-debt        civil-damages
irs-debt                civil-judgement
tax                     criminal-judgment
bank-loan               judgement-debt
auto-loan               fraud
farm-loan               fraudulently-obtained-bank-loan
consumer-debt           fraud-judgement
real-property-debt

B.8 Profession

Please mark text descriptive of the debtor's profession. Professions that we have seen in previous texts include:

Table B.2. Previously encountered professions.

secretary                doctor                     law-student    unemployed
receptionist-secretary   dentist                    instructor
systems-analyst          chiropractor               teacher
manpower                 psychotherapist
budget-analyst           mental health counselor
federal-employee         veterinarian

Note the use of "unemployed". That may describe the debtor's current "profession" if they aren't using the skills they have.

B.9 Procedural status

This is text that describes what level of action is taking place on the case. The various types of actions could be:

Table B.3. Previously encountered values for procedural status.

plan-confirmation
objection-to-plan-confirmation
appeal-of-plan-confirmation
appeal-of-plan-confirmation-denial
appeal-of-plan-confirmation-vacation
appeal-of-plan-confirmation-affirmation
motion-to-dismiss-or-convert
remand-after-appeal-of-confirmation
appeal-of-affirmation-of-plan-confirmation
appeal-of-affirmation-of-confirmation-denial

B.10 Future income

This is the "likelihood that the debtor's income will increase in the future". Text that discusses whether the debtor's income might be raised in the future. The text might be negative or positive on this matter, i.e., "raises not being likely" would be an indication that the future income will not be increasing. Values that we assigned to the cases were: (unlikely likely very-likely).


APPENDIX C EXCERPT-CKBS Below are the sets of excerpts for each feature. Six of the features had a second excerpt-ckb. These are found following the respective excerpt-ckb.

C.1 Debt type Original excerpt-ckb: student loan debts sought to be discharged account for over 2/3 of the debtors' obligations. the nature of these debts as student loans two unsecured and otherwise nondischargeable student loan debts sixty-five per-cent of his total debt is student loan obligations Debtors' unsecured debts total $9,570.58, of which $208.02 represents obligations other than student loans or Mrs. Severs' medical bills Debtor was arrested in 1984 and while in custody, attacked and injured a guard, Marc Nelson ("Nelson"). Debtor was prosecuted criminally for aggrevated assault. Nelson sued debtor for damages in state court. at the time the Debtor obtained this loan from Aetna, he already discussed his possible intention to file a Petition for Relief under the Bankruptcy Code. Debtors did file their Petition on March 30, 1984 or two weeks after Mr. Myers obtained the loan from Aetna. debt to Pioneer had been obtained through fraud The major claim against the debtor was incurred as a result of his criminal conduct.

Second excerpt-ckb: student loan debts sought to be discharged account for over 2/3 of the debtors' obligations. the nature of these debts as student loans two unsecured and otherwise nondischargeable student loan debts


sixty-five per-cent of his total debt is student loan obligations Debtors' unsecured debts total $9,570.58, of which $208.02 represents obligations other than student loans or Mrs. Severs' medical bills Nelson sued debtor for damages in state court. at the time the Debtor obtained this loan from Aetna, obtained the loan from Aetna. debt to Pioneer had been obtained through fraud The major claim against the debtor was incurred as a result of his criminal conduct.

C.2 Duration Original excerpt-ckb: just over 25 monthly payments the plan would pay out in less than 36 months. The Court would require the Ali's to pay $89 per month for 36 months. if payments are made for 36 months. plan proposes to pay $30 per week for 36 months. Debtor proposes to pay to the trustee $144.00 per month for 36 months proposed a three-year plan for repayment, he would pay his two unsecured creditors, BNY and the other $2000 bank loan, $100 per month for 36 months, for a total of $3,600. On November 19, 1982, the debtor filed a modified plan providing for payments of $200 per month for 36 months, for a total of $7,200. Flygares proposed to make payments under the plan of $106 per month for five years. plan proposed a return to Pioneer of aproximately 1.5% of the amount due, over a three-year period. His plan proposed to pay $50 per month for 36 months, less than 1.5% of the value of the debt. Debtors propose payments of $25.00 weekly for 33-37 months would be paid in full after two years. In the four or five months following this two-year period, the unsecured creditors would be paid the proposed amount of 10% of their claims.

Second excerpt-ckb: just over 25 monthly payments the plan would pay out in less than 36 months. The Court would require the Ali's to pay $89 per month for 36 months. if payments are made for 36 months. plan proposes to pay $30 per week for 36 months. Debtor proposes to pay to the trustee $144.00 per month for 36 months


proposed a three-year plan for repayment, he would pay his two unsecured creditors, BNY and the other $2000 bank loan, $100 per month for 36 months, On November 19, 1982, the debtor filed a modified plan providing for payments of $200 per month for 36 months, proposed to make payments under the plan of $106 per month for five years. plan proposed a return to Pioneer of aproximately 1.5% of the amount due, over a three-year period. His plan proposed to pay $50 per month for 36 months, Debtors propose payments of $25.00 weekly for 33-37 months would be paid in full after two years. In the four or five months following this two-year period, the unsecured creditors would be paid the proposed amount of 10% of their claims.

C.3 Future income Original excerpt-ckb: each year she must reapply. the Court cannot see any likelihood of future increases prospect of a regular job with substantially increased income is not great. her health brings into question her future ability to work. he is seeking a second job to supplement their income and expects to find one soon. Mr. Severs testified that he is now free to resume working at two jobs in order to supplement the family's income. most recent financial data reveals no increase in income. no evidence that raises are likely.

Second excerpt-ckb: each year she must reapply. the Court cannot see any likelihood of future increases prospect of a regular job with substantially increased income is not great. her health brings into question her future ability to work. he is seeking a second job to supplement their income and expects to find one soon. he is now free to resume working at two jobs in order to supplement the family's income. most recent financial data reveals no increase in income. no evidence that raises are likely.


C.4 Loan due date Original excerpt-ckb: Repayment on the loan, after the expiration of the grace period, began on January 1, 1983, but Mr. Ali was given an additional deferment until July 1, 1983. The loan became due in 1980. loan which became due in March 1980. became due one year after graduation.

C.5 Monthly income Original excerpt-ckb: His take home pay hovers around $400 per month. Her monthly salary is $1,068.00. The debtors list total monthly income of $1,468.00. the debtors list total income of $1,468.00 per month net disposable monthly income for 1979 averaged $1,624.82, He is paid a salary of $460 per week His current gross income is $24,000 per year. a net take-home pay of $503.00 per month a weekly take-home pay of $262.30. Flygare's take-home pay as $1,840 per month. He is currently employed by a firm in a sales capacity, earning approximately $15,642.62 per year. statement shows a combined monthly spendable income of $1,010.50 an annual salary of $30,000.

Second excerpt-ckb: His take home pay hovers around $400 per month. Her monthly salary is $1,068.00. The debtors list total monthly income of $1,468.00. the debtors list total income of $1,468.00 per month net disposable monthly income for 1979 averaged $1,624.82, He is paid a salary of $460 per week His current gross income is $24,000 per year. a net take-home pay of $503.00 per month a weekly take-home pay of $262.30. take-home pay as $1,840 per month. He is currently employed by a firm in a sales capacity, earning approximately $15,642.62 per year. statement shows a combined monthly spendable income of $1,010.50 an annual salary of $30,000.

C.6 Plan filing date

Original excerpt-ckb: if the debtor were to seek dismissal of his chapter 13 case and file a petition under chapter 7 On April 7, 1981, the debtor herein filed a Voluntary Petition in Bankruptcy under Chapter 7 On January 21, 1986, Debtor filed her voluntary petition for relief under Chapter 7 of the Code. The debtors, Majid and Hasiba Ali, filed a joint chapter 13 petition in bankruptcy on April 8, 1983. filed a voluntary chapter 13 petition on February 27, 1984 Debtors-Appellants, Mr. and Mrs. Okoreeh-Baah, filed for bankruptcy on November 19, 1985. The Flygares filed a second petition in July 1980. On July 30, 1982, debtors filed their Joint Petition under the provisions of Chapter 13 October 18, 1982, the debtor filed a Chapter 13 petition, On September 24, 1982, the debtor filed with this court his petition for relief under Chapter 13

Second excerpt-ckb: if the debtor were to seek dismissal of his chapter 13 case and file a petition On April 7, 1981, the debtor herein filed a Voluntary Petition in Bankruptcy On January 21, 1986, Debtor filed her voluntary petition for relief filed a joint chapter 13 petition in bankruptcy on April 8, 1983. filed a voluntary chapter 13 petition on February 27, 1984 filed for bankruptcy on November 19, 1985. filed a second petition in July 1980. On July 30, 1982, debtors filed their Joint Petition under the provisions of Chapter 13 October 18, 1982, the debtor filed a Chapter 13 petition, On September 24, 1982, the debtor filed with this court his petition for relief

C.7 Procedural status

Original excerpt-ckb: In this chapter 13 proceeding, one student loan creditor, University of Kansas, objects to confirmation of the debtors' plan creditors objected to confirmation of debtor's Chapter 13 repayment plan matter under consideration is an Objection to Confirmation debtor, Dr. Deborah Hawkins, seeks this Court's confirmation of her Chapter 13 percentage repayment plan seeking confirmation of his plan The holder of an unsecured claim declared nondischargeable in debtor's preconversion Chapter 7 case objects to confirmation of this composition Chapter 13 plan This cause came on to be heard upon objections to confirmation New York State Higher Education Services Corporation ("NYSHESC"), and the State University of New York ("SUNY"), have timely filed objections to confirmation of Laura Makarchuk's ("Debtor") proposed Chapter 13 plan. Gordon and Sharon Flygare appeal the denial of confirmation of their Chapter 13 bankruptcy plan. debtors, appeal from the district court's decision to vacate the bankruptcy court's confirmation of the Kitchens' chapter 13 plan. what conduct constitutes bad faith sufficient to deny confirmation of a Chapter 13 plan. decision on order to show cause dismissing Chapter 13 or converting case to Chapter 7 BNY's motion relates to dismissal or conversion, and not confirmation

Second excerpt-ckb: In this chapter 13 proceeding, one student loan creditor, University of Kansas, objects to confirmation of the debtors' plan creditors objected to confirmation of debtor's Chapter 13 repayment plan matter under consideration is an Objection to Confirmation seeks this Court's confirmation of her Chapter 13 percentage repayment plan seeking confirmation of his plan The holder of an unsecured claim declared nondischargeable in debtor's preconversion Chapter 7 case objects to confirmation of this composition Chapter 13 plan This cause came on to be heard upon objections to confirmation have timely filed objections to confirmation of Laura Makarchuk's ("Debtor") proposed Chapter 13 plan. appeal the denial of confirmation of their Chapter 13 bankruptcy plan. debtors, appeal from the district court's decision to vacate the bankruptcy court's confirmation of the Kitchens' chapter 13 plan. what conduct constitutes bad faith sufficient to deny confirmation of a Chapter 13 plan. decision on order to show cause dismissing Chapter 13 or converting case to Chapter 7 motion relates to dismissal or conversion, and not confirmation

C.8 Profession

Original excerpt-ckb: debtor is a thirty-two year old dentist who is in private practice debtor was self-employed in his own practice obtained a veterinarian position Debtor had apparently secured presumably permanent employment as a receptionist-secretary

C.9 Sincerity

Original excerpt-ckb: The Court believes the debtors' motivation and sincerity are genuine. they have thus far made three regular payments and expect to continue so doing this couple makes a concerted effort to live sensibly and substantially within their means. testimony of a sincere desire to repay as much as possible of all of their debts was believable, and was completely uncontroverted. These consist of her sincerity in attempting to deal with her creditors sincerity is tempered by her desire to avoid returning to Maine. represents an earnest effort to repay his unsecured creditors debtor has not sought permission to extend his plan over a period of 60 months, The Chapter 13 petition was intended to wipe out BNY's claims rather than to repay them.

C.10 Special circumstances

Original excerpt-ckb: The Court believes the debtors' medical expenses will increase as time goes on and believes this is a "special circumstance" under factor 8. This debtor has not been the victim of extraordinary "outside" forces. The debtor is now in treatment for the condition that may have contributed to the debtor's need for Chapter 13 relief. She thereupon encountered some difficulties in her personal life. A medical condition forced her to leave work for a two-week stay in the hospital and her marital relationship began to deteriorate. She also claims to have suffered from a nervous condition during that time. She claims that she was unable to pay back these debts because her husband deserted her, leaving bills for her to pay, and that she was confused and upset. She was ill for two weeks in 1981 but that previous illness does not, unlike the debtor in In re Bellgraph, 4 B.R. 421 (Bkrtcy.W.D.N.Y.1980), disable her from paying or infringe her ability to pay her creditors. Debtor was incarcerated in the New Mexico State Penitentiary for fraudulent practices no other special circumstances surrounding Debtor's case are evidenced.
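Viewed schematically, an excerpt-ckb is simply a mapping from a feature (such as those listed above) to the excerpt strings collected for it. The fragment below is a minimal illustrative sketch of such a structure in Python; it is not the representation used by the implementation, and the variable names are hypothetical. The example strings are copied from the listings above.

# Illustrative sketch only (not the system's actual data structure):
# an excerpt-ckb as a mapping from feature name to its excerpt strings.
excerpt_ckb = {
    "Future income": [
        "each year she must reapply.",
        "the Court cannot see any likelihood of future increases",
        "no evidence that raises are likely.",
    ],
    "Loan due date": [
        "The loan became due in 1980.",
        "became due one year after graduation.",
    ],
}

# A feature's excerpts can then be handled as a unit, e.g. printed or
# passed on to query formation.
for excerpt in excerpt_ckb["Loan due date"]:
    print(excerpt)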

APPENDIX D

MANUAL QUERIES

The following are the final manual queries.

!c! Debt Type
#q1= #passage20( loan debt fraud
       #or( #and( #or( student educational consumer) debt)
            #and( #or( student educational consumer) loan))
       #phrase(student loan) #phrase(civil judgement) );

!c! Duration
#q1= #passage20( #SUM( duration #phrase(monthly payments)
       #phrase(per month) #3(propose to pay) ));

!c! Future Income
#q1= #passage20( #phrase(increase in income) #phrase(future increases)
       #phrase(continued employment) );

!c! Loan Due Date
#q1= #passage20( #phrase(became due) #phrase(repayment began)
       #phrase(grace period) deferment );

!c! Monthly Income
#q1= #passage20( #phrase(take home pay) #phrase(per month)
       #phrase(total income) #phrase(per week) #phrase(monthly salary)
       #phrase(monthly income) earning );

!c! Plan Filing Date
#q1= #passage20( #phrase(filed for bankruptcy) filed
       #phrase(voluntary petition) );

!c! Procedural Status
#q1= #passage20( #phrase(confirmation of her plan)
       #phrase(confirmation of his plan) #phrase(confirmation of their plan)
       #phrase(objects to confirmation) objects objections appeal
       affirmation confirmation denial vacation );

!c! Profession
#q1= #passage20( employed position worked );

!c! Sincerity
#q1= #passage20( motivation sincerity genuine sensible earnest );

!c! Special Circumstances
#q1= #passage20( #phrase(special circumstances) difficulties illness
       #phrase(medical problems) incarcerate );
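Each query above follows the same pattern: a #passage20 operator over a mixture of bare terms, #phrase operators, and an occasional proximity (#3) or belief (#SUM, #or, #and) operator. The fragment below is only an illustrative sketch of how query strings of this form could be assembled mechanically; the helper function and its names are hypothetical and are not part of INQUERY.

# Hypothetical helper (not an INQUERY API): build a #passageN query
# string from bare terms and #phrase arguments in the style shown above.
def passage_query(terms, phrases, window=20):
    parts = list(terms) + ["#phrase(%s)" % p for p in phrases]
    return "#q1= #passage%d( %s );" % (window, " ".join(parts))

# Example: reconstruct something close to the Loan Due Date query.
print(passage_query(["deferment"],
                    ["became due", "repayment began", "grace period"]))
# -> #q1= #passage20( deferment #phrase(became due) #phrase(repayment began) #phrase(grace period) );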

BIBLIOGRAPHY

[1] Amba, S., Narasimhamurthi, N., O'Kane, Kevin C., and Turner, Philip M. Automatic Linking of Thesauri. In Proceedings of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Zurich, Switzerland, August 1996), pp. 181-186.
[2] Ashley, Kevin D. Modeling Legal Argument: Reasoning with Cases and Hypotheticals. M.I.T. Press, Cambridge, MA, 1990.
[3] Belkin, Nicholas. Personal communication, May 1993.
[4] Branting, L. Karl. Integrating rules and precedents for classification and explanation: Automating legal analysis. PhD thesis, University of Texas at Austin, Austin, TX, 1991.
[5] Buckley, Chris. Implementation of the SMART Information Retrieval System. Tech. Rep. 85-686, Computer Science Department, Cornell University, Ithaca, NY, May 1985.
[6] Callan, James P. Passage-Level Evidence in Document Retrieval. In Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, July 1994), ACM, pp. 302-310.
[7] Callan, James P., and Croft, W. Bruce. An Approach to Incorporating CBR Concepts in IR Systems. In Working Notes of the AAAI Spring Symposium Series: Case-Based Reasoning and Information Retrieval – Exploring the Opportunities for Technology Sharing (Stanford, CA, March 1993), AAAI, pp. 28-34.
[8] Callan, James P., Croft, W. Bruce, and Harding, Stephen M. The INQUERY Retrieval System. In Database and Expert Systems Applications: Proceedings of the International Conference in Valencia, Spain (Valencia, Spain, 1992), A. M. Tjoa and I. Ramos, Eds., Springer-Verlag, NY, pp. 78-83.
[9] Callan, James P., Fawcett, Tom E., and Rissland, Edwina L. CABOT: An Adaptive Approach to Case-Based Search. In Proceedings, 12th International Joint Conference on Artificial Intelligence (Sydney, Australia, August 1991), vol. 2, IJCAI, pp. 803-808.
[10] Cooper, William S. Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems. American Documentation 19 (1968), 30-41.

[11] Croft, W. B. Experiments with Representation in a Document Retrieval System. Information Technology: Research and Development 2, 1 (January 1983), 1-21.
[12] Croft, W. Bruce, Cook, Robert, and Wilder, Dean. Providing Government Information on the Internet: Experiences with THOMAS. In Proceedings of the Digital Libraries Conference DL'95 (Austin, TX, June 1995), pp. 19-24.
[13] Croft, W. Bruce, Turtle, Howard R., and Lewis, David D. The Use of Phrases and Structured Queries in Information Retrieval. In Proceedings of the 14th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Chicago, IL, October 1991), ACM, pp. 32-45.
[14] Daniels, Jody J., and Rissland, Edwina L. A Case-Based Approach to Intelligent Information Retrieval. In Proceedings of the 18th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Seattle, WA, July 1995), ACM, pp. 238-245.
[15] Evans, David, and Lefferts, R. G. CLARIT-TREC Experiments. Information Processing and Management 31, 3 (1995), 385-395.
[16] Gauch, Susan, and Smith, John B. An Expert System for Automatic Query Reformulation. Journal of the American Society for Information Science 44, 3 (April 1993), 124-136.
[17] Golding, Andrew R. Pronouncing Names by a Combination of Rule-Based and Case-Based Reasoning. PhD thesis, Stanford University, Stanford, CA, October 1991.
[18] Golding, Andrew R., and Rosenbloom, Paul S. Improving Rule-Based Systems Through Case-Based Reasoning. In Proceedings, Ninth National Conference on Artificial Intelligence (Anaheim, CA, July 1991), vol. 1, AAAI, pp. 22-27.
[19] Goodman, Marc. CBR in Battle Planning. In Proceedings, Case-Based Reasoning Workshop (Pensacola Beach, FL, May 1989), DARPA, pp. 264-269.
[20] Hahn, Udo. Topic Parsing: Accounting for Text Macro Structures in Full-Text Analysis. Information Processing and Management 26, 1 (1990), 135-170.
[21] Hammond, Kristian J. Case-Based Planning. Academic Press, Inc., 1989.
[22] Harman, Donna K., Ed. The Second Text REtrieval Conference (TREC-2). National Institute of Standards and Technology, Gaithersburg, MD, 1994. Special Publication 500-215.
[23] Harman, Donna K., Ed. The Third Text REtrieval Conference (TREC-3). National Institute of Standards and Technology, Gaithersburg, MD, 1995. Special Publication 500-225.

[24] Hearst, Marti A. Cases as Structured Indexes for Full-Length Documents. In Working Notes of the AAAI Spring Symposium Series: Case-Based Reasoning and Information Retrieval – Exploring the Opportunities for Technology Sharing (Stanford, CA, March 1993), AAAI, pp. 140-145.
[25] Hearst, Marti A., and Plaunt, Christian. Subtopic Structuring for Full-Length Document Access. In Proceedings of the 16th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Pittsburgh, PA, June 1993), ACM, pp. 59-68.
[26] Hennessy, Daniel, and Hinkle, David. Initial Results from Clavier: A Case-Based Autoclave Loading Assistant. In Proceedings of the Case-Based Reasoning Workshop (Washington D.C., May 1991), DARPA, pp. 225-232.
[27] Huffman, Scott B. Learning information extraction patterns from examples. In Working Notes of the IJCAI Workshop on New Approaches to Learning for Natural Language Processing (Montreal, Canada, August 1995), AAAI, pp. 127-134.
[28] Jing, Yufeng, and Croft, W. Bruce. An Association Thesaurus for Information Retrieval. In Intelligent Multimedia Information Retrieval Systems and Management, RIAO '94 (New York, NY, October 1994), pp. 146-160.
[29] Kahle, Brewster. Personal communication, June 1997.
[30] Kolodner, Janet L. Retrieval and Organizational Strategies in Conceptual Memory: A Computer Model. Erlbaum Associates, 1984.
[31] Kolodner, Janet L. Judging which is the "Best" Case for a Case-Based Reasoner. In Proceedings, Case-Based Reasoning Workshop (Pensacola Beach, FL, May 1989), DARPA, pp. 77-81.
[32] Koton, Phyllis. Using Experience in Learning and Problem Solving. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, May 1988.
[33] Krulwich, Bruce, and Burkley, Chad. ContactFinder: Extracting indications of expertise and answering questions with referrals. In Working Notes of the AAAI Fall Symposium Series: AI Applications in Knowledge Navigation and Retrieval (Cambridge, MA, November 1995), AAAI, pp. 85-91.
[34] Kwok, K. L. A New Method of Weighting Query Terms for Ad-Hoc Retrieval. In Proceedings of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Zurich, Switzerland, August 1996), ACM, pp. 187-195.
[35] Larkey, Leah S., and Croft, W. Bruce. Automatic Assignment of ICD9 Codes to Discharge Summaries. Tech. rep., University of Massachusetts at Amherst, Amherst, MA, 1995.

[36] Lehnert, Wendy G. Case-Based Problem Solving with a Large Knowledge Base of Learned Cases. In Proceedings, Sixth National Conference on Artificial Intelligence (Seattle, WA, July 1987), vol. 1, AAAI, pp. 301-306.
[37] Lehnert, Wendy G. Case-Based Reasoning as a Paradigm for Heuristic Search. Tech. rep., University of Massachusetts at Amherst, Amherst, MA, October 1987.
[38] Lewis, David D., and Gale, William A. A Sequential Algorithm for Training Text Classifiers. In Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, July 1994), ACM, pp. 3-12.
[39] Mauldin, Michael. Information Retrieval by Text Skimming. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, August 1989.
[40] Moffat, Alistair, Sacks-Davis, Ron, Wilkinson, Ross, and Zobel, Justin. Retrieval of Partial Documents. In Proceedings of the Second Text REtrieval Conference (TREC-2) (Pittsburgh, PA, 1994), D. Harman, Ed., National Institute of Standards and Technology, pp. 181-190.
[41] Moulinier, Isabelle, and Ganascia, Jean-Gabriel. Confronting an existing Machine Learning Algorithm to the Text Categorization Task. In Working Notes of the IJCAI Workshop on New Approaches to Learning for Natural Language Processing (Montreal, Canada, August 1995), AAAI, pp. 176-181.
[42] MUC-4. Proceedings of the Fourth Message Understanding Conference. Morgan Kaufmann, San Mateo, CA, 1992.
[43] MUC-5. Proceedings of the Fifth Message Understanding Conference. Morgan Kaufmann, San Mateo, CA, 1993.
[44] O'Connor, John. Answer-Providing Documents: Some Inference Descriptions and Text-Searching Retrieval Results. Journal of the American Society for Information Science 21, 6 (1970), 406-414.
[45] O'Connor, John. Text-Searching Retrieval of Answer-Sentences and Other Answer-Passages. Journal of the American Society for Information Science 24, 4 (1973), 445-460.
[46] O'Connor, John. Retrieval of Answer-Sentences and Answer-Figures from Papers by Text Searching. Information Processing and Management 11 (1975), 155-164.
[47] O'Connor, John. Answer Passage Retrieval by Text Searching. Journal of the American Society for Information Science 31, 4 (1980), 73-78.
[48] TIPSTER Text Program. Proceedings of the TIPSTER Text Program (Phase I). Morgan Kaufmann, San Francisco, CA, September 1993.

[49] Rau, Lisa F., and Jacobs, Paul S. Creating Segmented Databases from Free Text for Text Retrieval. In Proceedings of the 14th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Chicago, IL, October 1991), ACM, pp. 337-346.
[50] Rijsbergen, C. J. Van. Information Retrieval. Butterworths, 1979.
[51] Riloff, Ellen. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings, The 11th National Conference on Artificial Intelligence (Washington D.C., July 1993), AAAI, AAAI Press/The MIT Press, pp. 811-816.
[52] Riloff, Ellen. Little Words can make a Big Difference for Text Classification. In Proceedings of the 18th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Seattle, WA, July 1995), ACM, pp. 130-136.
[53] Riloff, Ellen, and Shoen, Jay. Automatically Acquiring Conceptual Patterns Without an Annotated Corpus. In Proceedings of the Third Workshop on Very Large Corpora (Boston, MA, July 1995), pp. 148-161.
[54] Rissland, Edwina L., and Ashley, Kevin D. A Case-Based System for Trade Secrets Law. In Proceedings, 1st International Conference on Artificial Intelligence and Law (May 1987), ACM, ACM Press.
[55] Rissland, Edwina L., and Daniels, Jody J. The Synergistic Application of CBR to IR. Artificial Intelligence Review 10 (1996), 441-475.
[56] Rissland, Edwina L., Skalak, D. B., and Friedman, M. Timur. Heuristic Harvesting of Information for Case-Based Argument. In Proceedings, The 12th National Conference on Artificial Intelligence (Seattle, WA, August 1994), AAAI, pp. 36-43.
[57] Rissland, Edwina L., Skalak, D. B., and Friedman, M. Timur. BankXX: Supporting Legal Arguments through Heuristic Retrieval. Artificial Intelligence Review 10 (1996), 1-71.
[58] Rissland, Edwina L., and Skalak, David B. CABARET: Rule Interpretation in a Hybrid Architecture. International Journal of Man-Machine Studies 34 (1991), 839-887.
[59] Ro, Jung Soon. An Evaluation of the Applicability of Ranking Algorithms to Improve the Effectiveness of Full-Text Retrieval. Journal of the American Society for Information Science 39, 2 (1988), 73-78.
[60] Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. Okapi at TREC-2. In Proceedings of the Second Text REtrieval Conference (TREC-2) (Pittsburgh, PA, 1994), D. Harman, Ed., National Institute of Standards and Technology, pp. 21-33.

[61] Rus, Daniela, and Subramanian, Devika. Customizing Information Capture and Access. ACM Transactions on Information Systems (1995). Submitted.
[62] Rus, Daniela, and Summers, Kristen. Using White Space for Automated Document Structuring. In Digital Libraries: Current Issues. Lecture Notes in Computer Science 916, N. Adam, B. Bhargava, and Y. Yesha, Eds. Springer-Verlag, 1995, pp. 129-162.
[63] Salton, Gerard. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[64] Salton, Gerard, Allan, James, and Buckley, Chris. Approaches to Passage Retrieval in Full Text Information Systems. In Proceedings of the 16th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Pittsburgh, PA, June 1993), ACM, pp. 49-58.
[65] Salton, Gerard, and Buckley, Chris. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science 41, 4 (1990), 288-297.
[66] Salton, Gerard, and Buckley, Chris. Automatic Text Structuring and Retrieval – Experiments in Automatic Encyclopedia Searching. In Proceedings of the 14th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Chicago, IL, October 1991), ACM, pp. 21-30.
[67] Salton, Gerard, and Buckley, Chris. Global Text Matching for Information Retrieval. Science 253 (1991), 1012-1015.
[68] Schutze, Hinrich, and Pedersen, Jan O. A Cooccurrence-Based Thesaurus and Applications to Information Retrieval. In Intelligent Multimedia Information Retrieval Systems and Management, RIAO '94 (New York, NY, October 1994), pp. 266-274.
[69] Stanfill, Craig. Personal communication, May 1997.
[70] Stanfill, Craig, and Waltz, David. Toward Memory-Based Reasoning. Communications of the ACM 29, 12 (December 1986), 1213-1228.
[71] Stanfill, Craig, and Waltz, David L. Text-Based Intelligent Systems. In Statistical Methods, Artificial Intelligence, and Information Retrieval, Paul S. Jacobs, Ed. Addison-Wesley, 1989.
[72] Steier, David, Huffman, Scott B., and Hamscher, Walter C. Meta-Information for Knowledge Navigation and Retrieval: What's in There. In Working Notes of the AAAI Fall Symposium Series: AI Applications in Knowledge Navigation and Retrieval (Cambridge, MA, November 1995), AAAI, pp. 123-126.

[73] Turtle, H. R., and Croft, W. B. Evaluation of an Inference Network-Based Retrieval Model. ACM Transactions on Information Systems 9, 3 (July 1991), 187-222.
[74] Turtle, H. R., and Croft, W. B. A Comparison of Text Retrieval Models. Computer Journal 35, 3 (1992), 279-290.
[75] Turtle, H. R., and Croft, W. B. Query Evaluation: Strategies and Optimizations. Information Processing and Management 31, 6 (July 1995), 831-850.
[76] Veloso, Manuela M. Learning by Analogical Reasoning in General Problem Solving. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, August 1992.
[77] Xu, Jinxi, and Croft, W. Bruce. Query Expansion Using Local and Global Document Analysis. In Proceedings of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Zurich, Switzerland, August 1996), pp. 4-11.
[78] Yang, Yiming. Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. In Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Dublin, Ireland, July 1994), ACM, pp. 13-22.
[79] Zobel, Justin, and Dart, Philip. Phonetic String Matching: Lessons from Information Retrieval. In Proceedings of the 19th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (Zurich, Switzerland, August 1996), ACM, pp. 166-172.
