Experimental Validation of a Semi-Automatic Text Analyzer

KEN BARKER
Department of Computer Science, University of Ottawa, Ottawa, Ontario, Canada K1N 6N5
[email protected]

SYLVAIN DELISLE
Département de mathématiques et d’informatique, Université du Québec à Trois-Rivières, Trois-Rivières, Québec, Canada G9A 5H7
[email protected]
Abstract

The aim of the TANKA project is to acquire knowledge from actual unedited technical text using a minimum of pre-coded semantic information. To make up for this lack of “seed knowledge”, the HAIKU semantic analyzer initially draws on detailed syntactic information provided by the DIPETT parser and on the help of a cooperative user. As more sentences from a text are analyzed, HAIKU builds pattern dictionaries which it uses to make increasingly informed suggestions for semantic analysis. The process is described in detail in Delisle et al. (1996). This document reports on an experiment to process a complete English technical text using the major components of the DIPETT/HAIKU system. The results of the test are also used to forecast the toll on the user over the period required to process a real, complete technical text.
1 Introduction and Overview

1.1 Goals
Earlier work with DIPETT and HAIKU has shown that experiments on small numbers of sentences are insufficient for drawing strong conclusions about the system’s performance and about the level of user interaction over time. The syntactically complex language of the texts used in previous tests of our system (including tax guides and computer manuals) also made evaluation of the whole system difficult. If the parser cannot completely parse a sentence, semantic analysis is often incomplete or even abandoned. These observations suggested that an experiment on a larger, less syntactically complex text was needed to realize certain goals:

1. to complement earlier testing of the DIPETT parser with a systematic evaluation of its syntactic coverage on an unprepared text.
2. to validate the coverage of HAIKU’s set of Cases on a large number of English sentences.
3. to validate the Case Analysis (CA) algorithms and to determine if the accumulation and analysis of patterns from previous inputs allow the CA module in HAIKU to improve its performance significantly over time.
4. to evaluate the performance of the Clause-Level Relationship Analyzer (CLRA) module in HAIKU.
5. to investigate the effects of sentence length and complexity on system performance and user involvement.

There are other issues that will not be addressed in this document, but could be explored using the data gathered and structures generated during this experiment:

1. to investigate the resulting semantic structures for the nature and level of detail of domain knowledge they represent.
2. to examine the resulting HAIKU dictionaries to determine the nature of domain or lexical-semantic knowledge they represent.
3. to determine if the relationships between verb, Case Marker Pattern and Case Pattern are specific to a domain; to determine if these relationships are indicative of differences in verb meaning across domains.
4. to check the value of Case Analysis on common verbs of general meaning (do, have, make, take, use, etc.).

1.2 Background
This report assumes that the reader is familiar with various parts of the DIPETT/HAIKU system and with the TANKA project in general. For further information see Barker et al. (1993), Delisle et al. (1993), Barker (1994), Barker & Szpakowicz (1995), Delisle (1994), Barker et al. (1996) and Delisle et al. (1996).

1.3 Preliminaries
1.3.1 The Text

The DIPETT parser can produce a parse tree without user interaction. All user interaction time is spent on subsequent steps: pronoun explicitization, Clause-Level Relationship Analysis and Case Analysis. If the parser is unable to find a reasonably correct parse tree¹, subsequent processing may be affected, or even abandoned.
¹ We use the terms reasonably correct parse tree and good parse to refer to parses that may be imperfect but do not adversely affect subsequent semantic analysis. Section 2.2 gives examples of such imperfect, yet acceptable parses.
In order to evaluate user interaction over time, we needed a text from which DIPETT could produce a good parse for a decided majority of the sentences. However, to remain true to the goals of the TANKA project, we needed a real, unedited text describing a technical domain and aimed at a human reader. The Junior Science Book of Rain, Hail, Sleet & Snow (Larrick, 1961) satisfied our criteria and provided just over five hundred sentences for the experiment. We refer to this book as the test text in the remainder of this document.

1.3.2 The Dictionary

DIPETT has a default dictionary containing part of speech and morphological information for some common open category words, and is essentially complete for English closed category words. Words encountered in a text that are not in the default dictionary are added as needed from a Collins online wordlist. For the current experiment, we emptied DIPETT’s default dictionary and filled it with the vocabulary of the test text. This was done using DIPETT’s fully automatic part of speech tagger module. The tagger looks up unknown words in Collins and inserts a dictionary entry for each possible part of speech of each word in the text. The resulting dictionary contains extraneous entries, since not every part of speech of every word appears in the text. A concordance program was then used to check each word against all of its occurrences in the text. Extraneous dictionary entries for parts of speech not used in the text were deleted. The goal of this exercise was to construct a minimal dictionary for the test text.
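The seeding-and-pruning procedure can be sketched as follows. This is our own illustrative Python code, not part of DIPETT; the function names, the toy wordlist and the observed usages are all hypothetical, and in the actual experiment the pruning decisions were made by hand from the concordance listing.

    from collections import defaultdict

    def seed_dictionary(text_words, wordlist_pos):
        """One candidate entry per possible part of speech of each word in the text."""
        dictionary = defaultdict(set)
        for word in set(text_words):
            for pos in wordlist_pos.get(word, []):   # Collins-style lookup
                dictionary[word].add(pos)
        return dictionary

    def prune(dictionary, observed_usages):
        """Keep only the (word, part of speech) pairs confirmed by the concordance check."""
        return {word: {pos for pos in poses if (word, pos) in observed_usages}
                for word, poses in dictionary.items()}

    # Hypothetical data: "wide" is listed as adjective, adverb and (questionably) noun,
    # but only the adjective reading actually occurs in the text.
    wordlist_pos = {"wide": ["adj", "adv", "noun"], "rain": ["noun", "verb"]}
    seeded = seed_dictionary(["wide", "rain", "rain"], wordlist_pos)
    minimal = prune(seeded, {("wide", "adj"), ("rain", "noun"), ("rain", "verb")})
    print(minimal)   # e.g. {'wide': {'adj'}, 'rain': {'noun', 'verb'}}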
We should note that this dictionary seeding is not equivalent to pre-tagging the text for part of speech information. Many words appear more than once as different parts of speech, and dictionary entries for all of these multiple uses of a word are kept. The dictionary preparation merely ensures that the parser can proceed autonomously, without encountering any unknown words. Dictionary preparation for the entire text required roughly a day’s work. This effort produced some unexpected results of its own: several of the entries in the Collins wordlist are questionable. For example, the word wide is listed in Collins as a noun.

1.3.3 The Evaluation Grid

In order to keep track of all of the sentence analysis data during the experiment, we built a “grid” in Microsoft Excel. The grid contains information related to the processing of all of the sentences in the text (see Appendix A). It is divided into four main sections: Parsing, Pronoun Explicitization, Clause-Level Relationship Analysis and Case Analysis.
The Parsing section has five columns: Complete? (Y/N), # of fragments, % of string parsed, Errors, and Error severity (0-3).

Figure 1: The Parsing section of the grid

Complete?: Did DIPETT produce a complete parse?
# of fragments: If the parse was not complete, how many parse tree fragments were produced, as reported by the fragmentary parser?
% of string parsed: If the parse was not complete, what percentage of the tokens in the sentence were used in parse fragments (again, as reported by the fragmentary parser)?
Errors: detailed descriptions of parse errors based on human inspection of the parse tree(s).
Error severity: characterizes misparses according to the following scale:
  0: perfect parse.
  1: one or more small parse errors or questionable decisions by the parser; misparses had little or no effect on semantic analysis.
  2: one or more serious errors in the parse tree; semantic analysis was affected by the misparses but was still possible at least in part.
  3: one or more fatal errors; the misparses prevented any semantic analysis.
The Pronoun Explicitization section² has four columns per pronoun: Pronoun, Antecedent (c/p/p2-p5/o), Total # of suggestions, and User action (a/c/s).

Figure 2: The Pronoun section of the grid

² The Pronoun Explicitization section of the grid does not have a separate field to record user onus. The user action in this case corresponds directly to the toll on the user.

Pronoun: each pronoun in a sentence detected by DIPETT.
Antecedent: the location of each pronoun’s antecedent according to the following codes:
  c: antecedent in current sentence.
  p: antecedent in previous sentence.
  p2: antecedent two sentences back.
  p3: antecedent three sentences back.
  p4: antecedent four sentences back.
  p5: antecedent five sentences back.
  o: no known antecedent in the text³.
Total # of suggestions: How many noun phrases did the system suggest as possible antecedents for a given pronoun?
User action: What action was required of the user?
  a: accept the single suggestion made by the system.
  c: choose from two or more suggestions made by the system.
  s: supply an antecedent for the pronoun (none of the system’s suggestions were correct).

³ The case of no known antecedent included the empty it pronoun (as in “It rains”), the generic you and we (as in “You can see the clouds” and “We say the water evaporates”) as well as other pro-forms (e.g., the “pro-clause” this referring to an entire clause).
The Clause-Level Relationship Analysis section has five columns per clausal connective: Marker, # of suggestions, User action (a/c/s/b), Reorder (Y/N), and Onus (0-3).

Figure 3: The CLRA section of the grid
Marker: each conjunction between clauses.
# of suggestions: How many Clause-Level Relationship labels did the CLR Analyzer suggest as probable CLRs for the given pair of clauses?
User action: What action was required of the user?
  a: accept the single suggestion made by the system.
  c: choose from two or more suggestions made by the system.
  s: supply a CLR label for the input (none of the system’s suggestions were acceptable).
  b: abort CLR Analysis.
Reorder: Did the two clauses have to be reversed by the user?
Onus: How heavy a burden on the user was the CLRA interaction, according to the following scale?
  0: one of the suggested CLR labels was the obvious correct one; or none of the suggestions were correct but both experimenters immediately agreed on a single CLR.
  1: one of the suggested CLR labels was correct, but not obvious; or none of the suggestions were correct and determining the correct CLR required some reflection.
  2: deciding on the correct CLR required serious thought by both experimenters, but after discussing several CLRs and referring to the CLR definitions in Barker (1994), one of the existing nine CLR labels was chosen within two or three minutes.
  3: none of the existing nine CLR labels were satisfactory.
The Case Analysis section has seven columns per clause: Verb, CMP correct?, Situation, # of suggestions, User action (a/c/s/d), CMP-CP Repair (Y/N), and Onus (0-3).

Figure 4: The Case Analysis section of the grid

Verb: the main verb in each clause in the sentence as identified by DIPETT.
CMP correct?: Was the Case Marker Pattern found by the system correct?
Situation: the relationship (as identified by the Case Analyzer) between the current verb/CMP and previously encountered verbs/CMPs (see Delisle et al., 1996):
  1: known CMP already associated with known verb; Case Pattern (CP) is known or new.
  2: known CMP not yet associated with known verb; CP is known or new.
  3: entirely new CMP with known verb; CP is known or new.
  4: new verb with known CMP; new CP (relative to this verb).
  5: new verb with (entirely) new CMP; new CP.
# of suggestions: How many Case Patterns did the system suggest for the given CMP?
User action: What action was required of the user?
  a: accept the single suggestion made by the system.
  c: choose from two or more suggestions made by the system.
  s: supply a CP for the input (none of the system’s suggestions were acceptable).
  d: discard the current clause.
CMP-CP Repair: Did the pairing of Cases to Case Markers have to be changed by the user?
Onus: How heavy a burden on the user was the CA interaction?
  0: one of the suggested CPs was chosen immediately by both experimenters; or none of the suggestions were correct but the correct CP was immediately apparent to both experimenters independently.
  1: one of the suggested CPs was correct, but not obvious; or none of the suggestions were correct and determining the correct CP required some reflection.
  2: deciding on the correct CP required serious thought, but after discussing several possible Cases and consulting the Case definitions from Barker et al. (1996), all markers were assigned Cases from the existing set of twenty-eight within two or three minutes.
  3: for at least one of the markers, none of the existing twenty-eight Cases were satisfactory.
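For concreteness, a per-sentence record mirroring these four sections might look like the following. This is a hypothetical Python sketch of ours; the actual grid was simply an Excel spreadsheet and used no such data structure.

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class SentenceRecord:
        # Parsing section
        complete: bool
        fragments: Optional[int]          # only recorded for incomplete parses
        percent_parsed: Optional[float]   # likewise
        errors: str
        error_severity: int               # 0-3
        # Pronoun Explicitization: (pronoun, antecedent code, # suggestions, user action)
        pronouns: List[Tuple[str, str, int, str]] = field(default_factory=list)
        # CLR Analysis: (marker, # suggestions, user action, reorder?, onus)
        clrs: List[Tuple[str, int, str, bool, int]] = field(default_factory=list)
        # Case Analysis: (verb, CMP correct?, situation, # suggestions, user action, repair?, onus)
        cases: List[Tuple[str, bool, int, int, str, bool, int]] = field(default_factory=list)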
1.3.4 The Experiment

The experiment itself took place over four days at l’Université du Québec à Trois-Rivières with one author driving DIPETT/HAIKU while the other author timed user interaction and entered data into the grid. Both authors were involved in all decisions during interaction. In a normal knowledge acquisition session using DIPETT and HAIKU, there would be no need to time interaction or to record session statistics. In that case a single user would have no trouble running the system alone. For this experiment, having two users also helped reduce effects of personal biases during evaluation.

DIPETT/HAIKU was run in Quintus Prolog 3.2 on a Sparc-20 with a 120 second CPU time limit for parsing. For each sentence in the text, the stopwatch was started immediately after parsing finished. Both authors would then examine the parse tree, recording details of the parse completeness and correctness in the Excel grid (see section 1.3.3). Interaction then proceeded through Pronoun Explicitization, CLR Analysis and Case Analysis. The stopwatch was stopped immediately following the last user interaction in Case Analysis (the last step in sentence analysis). Table 1 gives details of CPU time (for parsing and semantic analysis), user time and real time. Since the experiment was conducted by two expert users working together, the numbers will be lower than what might be expected for a single average user.
               CPU time                             user time     real time
               parsing      semantic analysis
  total        0:49:18      0:13:47                 15:32:51      dictionary: 7h
  average      0:00:05.8    0:00:02.7               0:01:49       experiment: 27h

Table 1: Experiment Times
The table is divided into three sections: CPU time, user time and real time. The CPU time gives the total processing times for parsing and semantic analysis as well as the average parsing time per sentence parsed (512 sentences) and the average semantic analysis time for each sentence for which some semantic analysis was performed (309 sentences). CPU times are not included in the figures for user time. The user time column gives the total of the user interaction “stopwatch times” (as described above) for all sentences, along with the average user interaction time per sentence. Finally, the real time column gives the number of hours spent on the experiment. Constructing the minimal dictionary (as described in section 1.3.2) took roughly one day, while the analysis of the text itself took almost four days.
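As a quick check (ours, not output of the system), the per-sentence averages in Table 1 follow from the reported totals and the sentence counts given above:

    def seconds(h, m, s):
        return 3600 * h + 60 * m + s

    parsing_cpu = seconds(0, 49, 18)      # total parsing CPU time, 512 sentences parsed
    semantic_cpu = seconds(0, 13, 47)     # total semantic-analysis CPU time, 309 sentences
    user = seconds(15, 32, 51)            # total user interaction time, 512 sentences

    print(round(parsing_cpu / 512, 1))    # 5.8 seconds per sentence
    print(round(semantic_cpu / 309, 1))   # 2.7 seconds per sentence
    print(round(user / 512))              # 109 seconds, i.e. about 1:49 per sentence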
2 Parsing
This section gives a summary of the data from the experiment grid including totals, percentages and averages of the parsing-related fields. We also attempt to interpret the data and find trends. Sections 3 through 5 follow the same general format.

2.1 Data
Table 2 gives a summary of parsing data. This data is, for the most part, self-explanatory. Fields that may require more explanation are described below.
  tokens:                total 5347    average 10.42
  parses:                total 512     complete 412 (80%)
  error severity (0-3):  average 1.02
  incomplete parses:     average # of fragments 2.95    average % parsed 95.0

Table 2: Summary of parsing data
In DIPETT/HAIKU, the term token refers to words or punctuation marks. Assuming an average of slightly more than one punctuation mark per sentence, we can say that in the test text there were on average about nine words per sentence. Nine words is lower than the average sentence length in texts we have used in previous testing, reflecting the relative simplicity of the test text. There were 513 sentences in the test text, one of which was not parsed. We skipped sentence 475 (“What happens?”) which has no significant semantic content. Roughly 80% of the remaining 512 parses were complete (all tokens accounted for in a single parse tree), though not necessarily perfect. The average error severity is calculated from the 512 individual error severity assessments according to the scale in section 1.3.3.
For incomplete (fragmentary) parses, the % parsed is calculated automatically by DIPETT.

2.2 Interpretation
As mentioned in the previous section, a complete parse does not imply a perfect parse. During the experiment, parse trees were inspected by eye to determine if they were correct, and if not, how severe the errors were. Only perfect parses were assigned a severity of 0. Parse trees with questionable elements having little or no effect on semantic analysis were assigned a severity of 1. Often, these parses are grammatically defensible (according to the DIPETT grammar based on Quirk et al., 1985). The following sentences are examples of parses with an error severity of 1.

23   As they looked down, they saw a sea of white.
76   Warm moist air from the teapot meets cooler air in your kitchen.
162  When a great deal of rain falls in a short time, we sometimes call it a cloudburst.
In sentence 23, “down” was parsed as an adverbial particle instead of as an adverbial. In sentence 76, the prepositional phrase “in your kitchen” was attached to the verb “meet” instead of the more appropriate attachment as a post-modifier to “air”. The parse tree for sentence 162 had “it” and “a cloudburst” as indirect and direct object (respectively) of “call”, instead of parsing “it” as direct object and “a cloudburst” as an appositive.

  error severity              0               1              2              3
  number of sentences         241 (47.07%)    98 (19.14%)    96 (18.75%)    77 (15.04%)
  average number of tokens    9.29            11.29          11.40          11.75

Table 3: Distribution of the degrees of error severity
Since these misparses had little effect on HAIKU semantic analysis, parses with an error severity of 1 were considered “good”, if not perfect. From Table 3, then, we can see that the total number of perfect or good parses was 339 out of 512, or 66%. For another 19% of sentences (those with parse error severity 2), semantic analysis was possible, but required considerably more user assistance. Only error severity 3 indicates sentences whose parse errors prevented semantic analysis, meaning that 85% of the sentences were available for semantic analysis. Table 3 also shows the average number of tokens in all sentences with parses of a given error severity.
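The percentages quoted above follow directly from the counts in Table 3; a small check of our own:

    severity_counts = {0: 241, 1: 98, 2: 96, 3: 77}
    total = sum(severity_counts.values())                  # 512 parsed sentences

    good = severity_counts[0] + severity_counts[1]         # perfect or "good" parses
    usable = total - severity_counts[3]                    # severity 0-2: semantic analysis possible

    print(good, round(100 * good / total))                 # 339 66
    print(usable, round(100 * usable / total))             # 435 85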
3 Pronoun Explicitization
DIPETT/HAIKU’s Pronoun Explicitization module allows pronouns to be associated with noun phrases from DIPETT’s parse trees by automatically selecting potential referents and presenting them to the user. See Delisle (1994) for details on this module.

3.1 Data

A summary of the data captured for DIPETT’s pronoun explicitization module appears in Table 4.
  pronouns: 266

  antecedent (sentences from current):   0: 35    -1: 95    -2: 35    -3: 7    -4: 3    -5: 2    none: 89
  user action:                           accept: 110    choose: 82    supply: 74

Table 4: Summary of pronoun explicitization data
The pronouns field gives the total number of pronoun occurrences picked up by DIPETT in the test text (not just distinct pronouns). The seven antecedent fields show the total number of times pronouns’ antecedents were found in the current sentence, the previous sentence, and so on (up to five sentences prior to the current sentence). Included with the antecedent fields is the number of pronouns that had no antecedent in the test text (see footnote 3).

3.2 Interpretation
For each pronoun in a given sentence, the Pronoun Explicitization module offers the user all of the minimal noun phrases⁴ in the current parse tree as potential antecedents. If the antecedent is not among these noun phrases, the user has the option of supplying his own noun phrase as antecedent, choosing from certain “null” antecedents (such as the empty pronoun or a generic “you” or “we”, for example), or asking to see noun phrases from the parse tree of the previous sentence.

⁴ The term minimal noun phrase refers to a noun phrase as produced by DIPETT that contains no other embedded noun phrases (within post-modifying prepositional phrases, for example).

In developing the Pronoun Explicitization module, we expected most antecedents to appear in the current or previous sentence. Table 4 confirms that expectation. What was unexpected was the number of antecedents occurring two sentences before the sentence containing the pronoun. This may be due to the relatively simple language of the test text: if the text had been more complex, consecutive sentences might have been conjoined into longer multiple-clause sentences. In that case, we might speculate that there would be fewer antecedents in preceding parse trees. Nonetheless, the data from this experiment suggests that we should change the system to be able to recall several previous parse trees for possible pronoun antecedents.

A second interesting result is the number of pronouns labeled as having no antecedent. The majority of these, however, were generic pronouns like the “you” in sentence 315.

315  If you could cut a hailstone in half, you would see many layers of ice.
An observation not captured by the data in the grid is that several pronouns had non-minimal noun phrase antecedents. That is, there were cases in which a pronoun referred to a complex noun phrase (containing another noun phrase as a post-modifier, for example). Since these did not appear among the system’s suggestions, we had to supply them ourselves. This observation suggests a potential extension to the Pronoun Explicitization module. Finally, it is interesting to note that out of 266 pronouns, the system, with no semantic information, was able to find the single correct antecedent 41% of the time and offered the correct antecedent among multiple suggestions 31% of the time; the user had to type in the antecedent in only 28% of the cases. If the system were to suggest non-minimal noun phrases and noun phrases from more than one previous parse tree, these numbers would no doubt have been even better.
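A minimal sketch of the suggestion loop just described, in our own Python and with a hypothetical ParseTree stand-in (DIPETT exposes no such interface); the window parameter reflects the extension suggested above, keeping several previous parse trees instead of only one:

    class ParseTree:
        """Stand-in for a DIPETT parse tree; only what this sketch needs."""
        def __init__(self, minimal_nps):
            self._nps = minimal_nps
        def minimal_noun_phrases(self):
            return self._nps

    def suggest_antecedents(parse_trees, window=2):
        """Yield candidate antecedents: minimal NPs from the current sentence first,
        then from up to `window` previous sentences, then the "null" antecedents."""
        for tree in reversed(parse_trees[-(window + 1):]):
            yield from tree.minimal_noun_phrases()
        yield from ("<empty it>", "<generic you>", "<generic we>")

    # Hypothetical history: the two previous sentences plus the current one.
    history = [ParseTree(["the clouds"]), ParseTree(["the rain", "the ground"]),
               ParseTree(["a sea of white"])]
    print(list(suggest_antecedents(history, window=2)))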
4 Clause-Level Relationship Analysis

The Clause-Level Relationship Analysis module is activated whenever two or more clauses within the same sentence are explicitly connected by a conjunction or subordinator. The CLR Analyzer allows the user to assign semantic labels to the relationships between these clauses. Barker (1994) and Barker & Szpakowicz (1995) provide details of this module.

4.1 Data

  markers: 64
  user action:  accept: 35    choose: 2    supply: 14    abort: 13
  reorder: 7
  user onus:  0: 44    1: 6    2: 1    3: 0

Table 5: Summary of CLR analysis data
The markers field records the total number of CLR connectives found joining finite clauses. Sentence 339 contains an instance of the marker “before”.

339  Both hail and sleet are frozen before they strike the earth.
The four user action fields and four user onus fields are explained in section 1.3.3; their totals will be interpreted in section 4.2. The reorder field records the number of times the user had to reverse the order of CLR arguments. Normally, the system can automatically determine the correct order of CLR arguments based on the connective and the order of the clauses in the parse tree.

4.2 Interpretation
4.2.1 CLRA Invocation

Although CLR Analysis was invoked 64 times for this experiment, processing was aborted 13 times, leaving 51 actual CLR interactions. The 13 aborted CLRs were all the result of inappropriate activation of the CLR Analyzer due to misparses. Sentence 137 was badly parsed⁵: “rain” was parsed as a verb, with “you see the clouds that produce” parsed as a subordinate clause connected by the subordinator “usually”. This unusual labeling of “usually” as a subordinator is due to DIPETT’s treatment of sentence-initial subordinates as adverbial clauses. CLR Analysis was attempted with “usually” as an unknown connective. Obviously, there are no CLRs in this sentence, so CLR Analysis was aborted by the user.

137  Usually you see the clouds that produce rain.

⁵ The parse for this sentence was assigned an error severity of 2, since Case Analysis was able to proceed for the verb “see” and the verb “produce”.

Sentences 265 and 324 were correctly parsed, but the conjunctions “and” and “but” at the beginning of the sentences confused the CLR Analyzer. For sentence 265, CLRA identified the clausal connective as “and-if”; for sentence 324, it came up with “but-when” as a clausal connective.

265  And if everything is just right, you may see a rainbow.
324  But it's no fun for the farmer when hail hits his crops.

This behaviour is due to heuristics that are used to identify complex connectives such as “even-if” and “if-then”, which occur commonly in the following patterns:

  Even if X, Y.
  If X, then Y.

The heuristics combined the “And if” and “But when” as complex connectives, resulting in the following incorrect interpretations:

  *you may see a rainbow and-if everything is just right.
  *it’s no fun for the farmer but-when hail hits his crops.
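The misfire can be illustrated with a minimal sketch of our own (Python, not the actual HAIKU heuristics): a greedy rule that pairs a sentence-initial conjunction with the subordinator that follows it produces connectives like “and-if”, while ignoring sentence-initial coordinators, as proposed below, yields the intended connective.

    COORDINATORS = {"and", "but", "or"}
    SUBORDINATORS = {"if", "when", "before", "after", "because"}

    def clause_connective(tokens, skip_initial_coordinator=False):
        """Pick the connective for a sentence that begins with a conjunction."""
        first, second = tokens[0].lower(), tokens[1].lower()
        if first in COORDINATORS and second in SUBORDINATORS:
            if skip_initial_coordinator:
                return second                  # treat "And"/"But" as inter-sentential
            return first + "-" + second        # the spurious complex connective
        return first

    s265 = "And if everything is just right , you may see a rainbow .".split()
    print(clause_connective(s265))                                  # and-if (wrong)
    print(clause_connective(s265, skip_initial_coordinator=True))   # if (correct)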
In sentences 265 and 324, the leading conjunctions are inter-sentential connectives that should not be used for CLR Analysis. In both cases, if these conjunctions had been ignored, CLRA would have proceeded successfully with the correct connectives “if” and “when”. Obviously, the complex connective heuristics need to be modified to account for sentence-initial occurrences of certain conjunctions.

4.2.2 Performance

Of the 51 appropriate CLR interactions, the system found the single correct CLR 69% of the time, suggested two CLRs (one of which was the correct one) 4% of the time, and suggested the wrong CLR in 27% of the cases. These numbers are in contrast with the experiment described in Barker (1994), whose results from 100 CLR interactions showed 94% correct single suggestions, 4% multiple suggestions and only 2% incorrect suggestions.

The poorer performance of the CLR Analyzer in the current experiment may be due to the relative infrequency of complex verb features in the test text. The CLR heuristics attempt to find the most appropriate CLR for a given input based on verb phrase features of the two clauses. Features such as modal auxiliaries, verb phrase polarity and tense allow CLRA to choose among possible CLRs. Unfortunately, the test text uses simple tenses and few or no modals. To compound the problem, where modality was expressed, it was often expressed using non-auxiliary modal forms, such as the adjective “apt” in sentence 112 and the adverb “usually” in sentence 242.

112  As clouds change, the weather is apt to change.
242  When lightning strikes, it usually hits the high pointed objects.
The lack of overt CLR evidence in the test text, along with the CLRA’s bias for stronger CLRs over weaker ones, would account for the number of times certain CLRs were suggested inappropriately (and rejected by the user), as in sentence 428.

428  The glass is dry when you fill it with ice.
The system suggested the stronger Causation CLR over the weaker (though correct for this sentence) Temporal Co-occurrence CLR. The current experiment suggests that the CLR Analyzer’s bias toward stronger CLRs in the absence of syntactic verb feature clues might be removed.

The next curious result is in the reorder field. All seven interactions where we had to reorder the CLR arguments were for user-supplied CLRs. If the user rejects the system’s suggestions, CLRA does not attempt to determine automatically the correct order of the CLR arguments. Sentence 54 contains a CLR for which the system inappropriately suggested Causation.

54  When you drive into a heavy fog on the road, you are driving into a cloud.
The user rejected the suggestion in favour of the more appropriate Entailment. For this sentence, Entailment requires the ordering

  subordinate clause → main clause

which is the reverse of the default ordering.

There are two important things to note from this example. First, there is no reason why the CLR Analyzer cannot attempt to determine automatically the argument order for user-supplied CLRs. The ordering heuristic depends only on the connective and the CLR. Once the user has entered the desired CLR, the system has enough information to make an intelligent guess at the correct argument order. The second observation is that of the seven CLR interactions requiring reordering, five involved sentences for which we had to supply the Entailment CLR for the connective “when”. Checking the CLR Marker Dictionary for “when” reveals that Entailment is not one of the CLRs listed for that marker. Obviously, Entailment needs to be added to the dictionary for “when”.

4.2.3 User Onus

The CLR interactions were mostly very simple. Even when the system made inappropriate suggestions, the correct interpretation was usually obvious to the experimenters. All six of the user interactions rated with an onus of 1 involved one of the problems discussed in sections 4.2.1 and 4.2.2. Generally, if the CLR session required the user to supply a CLR and reverse the arguments, it was marked as onus 1. The single interaction rated onus 2 was for sentence 506.

506  When water vapor changes to water, we say it condenses.
The problem with this sentence has to do with the verb “say”. If the main clause had been simply “it condenses”, this would be an Entailment relationship (in the absence of an Equivalence CLR). But for the sentence as given, water vapor changing to water does not really entail our saying it condenses. Nor does it really cause us to say it condenses. Yet water vapor changing to water and our saying it condenses are not acts that merely temporally co-occur. Eventually, we agreed to assign the Entailment CLR based on the interpretation: If it is true that some water vapor wv changes to water, then it is also true that we say wv condenses.

4.2.4 CLR System Coverage

Of the nine CLRs in the existing set, eight were needed to cover all of the relationships in the text. The one CLR that never appeared was Disjunction. Given that this CLR has been needed for experiments on other texts, and that there were only 51 CLR interactions in the current experiment, it would not be prudent to remove Disjunction from the set of CLRs.
5 Case Analysis

5.1 Data
For each finite clause in a sentence, the Case Analysis module suggests semantic labels (Cases) to assign to the relationships between the main verb and its syntactic arguments. See Delisle (1994) and Delisle et al. (1996) for in-depth treatments of the various facets of Case Analysis.

  verbs: 468
  correct CMP: 311 (69%)
  user action:  accept: 55    choose: 217    supply: 167    discard: 29
  average # of suggestions: 4.47
  user onus:  0: 384    1: 50    2: 5    3: 0    (mean 0.14)

Table 6: Summary of case analysis data
The verbs field records the total number of non-stative main verb occurrences in finite clauses in the test text for which Case Analysis was attempted. The average # of suggestions field refers to suggested Case Patterns per CMP. All of the other fields in the summary are explained in section 1.3.3.

The only field missing from the summary is the CMP-CP Repair field. The total for this field was 0. There are two situations when the CMP-CP pairing might need to be fixed. The first is if the CA module presents a known CP with Cases in the incorrect order. But if the system presented a Case Pattern with Cases in an order not matching the CMP, we (the experimenters) considered the CP incorrect and did not choose it. The second situation is if the user inputs the Cases in the incorrect order. In this situation, the CMP-CP repair facility merely allows the user to correct mistakes. HAIKU consistently presents CMPs in an order corresponding to the subject-verb-object order of English. The regularity of ordering of CMPs made entering the CP in the correct order natural. In the current experiment we never entered a CP in the incorrect order and therefore never needed the repair facility.

5.2 Interpretation
5.2.1 Case Marker Patterns

For the 512 sentences parsed, we engaged in Case Analysis interactions for 468 clauses. For each of these clauses, the CA module presented a Case Marker Pattern based on the syntactic verb arguments found in the parse tree. This CMP was correct 311 times (69% of the non-discarded CMPs) and incorrect 137 times. That leaves 20 of the 468 clauses unaccounted for. Of these 20, 19 (4% of the clauses) involved improperly detected clauses that we discarded for CA. The last clause (sentence 236) presents an interesting problem.
236  It may strike within a cloud or between two clouds.
Case Analyzer design stipulated that all prepositions in syndetically and asyndetically conjoined prepositional phrases would be considered individual Case Markers. This assumption is valid if the conjunct (implicit or explicit) is “and”. If the conjunct is “or”, that interpretation may not work. Since there is no facility to store disjunctions of Cases, we accepted the CMP psubj-within-between based on the following (slightly inaccurate) interpretation:

  Lightning⁶ may strike within a cloud and between two clouds.

⁶ The pronoun “it” was replaced with its referent “lightning” by the Pronoun Explicitization module.

Since Case Marker Patterns are derived from the parse tree, it would be interesting to look at how the incorrectly identified CMPs distribute relative to the parse error severity. Table 7 compares error severity to CMP correctness. As expected, sentences with poor parses had a high incidence of incorrect CMPs. Although perfect parses account for 47% of the sentences, they account for only 14% of the sentences with incorrect CMPs. Conversely, parse trees of error severity 2 accounted for only 19% of all sentences but 49% of the sentences with incorrect CMPs. Row three of the table shows the percentage of incorrect CMPs for each group of sentences of a given error severity. For example, 8% of perfect parses had incorrect CMPs, while 70% of parses with an error severity of 2 had incorrect CMPs.
  error severity      0             1            2            3
  sentences           241 (47%)     98 (19%)     96 (19%)     77 (15%)
  incorrect CMPs      19 (14%)      48 (35%)     67 (49%)     3 (2%)
  ratio               8%            49%          70%          4%

Table 7: Parse error severity vs. CMP correctness
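The ratio row is simply incorrect CMPs divided by sentences at each severity; our own check:

    sentences = {0: 241, 1: 98, 2: 96, 3: 77}
    incorrect_cmps = {0: 19, 1: 48, 2: 67, 3: 3}

    for severity in sorted(sentences):
        ratio = incorrect_cmps[severity] / sentences[severity]
        print(severity, format(ratio, ".0%"))    # 8%, 49%, 70%, 4%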
It should be noted that even if the user supplies the correct CMP, the system does not attempt to associate the supplied Markers with parse tree constituents. So in the Case structures output by HAIKU, Cases will not be associated with the concepts they label. However, we are currently considering a mechanism that would allow the user to associate Cases from a sentence with noun phrases (for example) manually.

5.2.2 Case Pattern Suggestions

For each CMP, the system suggests zero or more Case Patterns, depending on how many times the current verb has been encountered with the current CMP, how many times the current verb has been encountered with the same number of Case Markers, how many times a different verb has occurred with the same CMP, etc. (see Delisle et al., 1996). So as more verbs and CMPs are encountered, the number of suggested CPs rises. Over all 439 non-discarded CMPs, the system made on average 4.47 CP suggestions per CMP. However, since the number of suggestions typically rises over a given session, it might be more interesting to look at how the numbers of suggestions change over time.

For the current experiment we began with empty Case Pattern and Case Marker Pattern dictionaries⁷. The system had to process a few sentences before it could start making any suggestions. The first two CA interactions yielded no suggestions. On the third CA interaction, the system made its first suggestion, which was rejected. The first suggestion that was accepted was for the fifth CA interaction. After 42 interactions, the system was suggesting one CP for every CMP (on average). Figure 5 shows how the number of CP suggestions changed over the course of the experiment.

⁷ It has been our assumption that for domain-specific knowledge acquisition purposes these dictionaries would be emptied at the beginning of each new text (or each new domain, perhaps). However, someone interested in the construction of general dictionaries may prefer always to have the same dictionary.
[Figure 5 is a line chart plotting the maximum and the average number of CP suggestions (y-axis: # of Suggestions) against successive CA interactions (x-axis).]

Figure 5: Number of CP suggestions over time
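The growth pattern in Figure 5 follows from the way suggestions are drawn from the accumulating dictionaries. The following is a rough Python sketch of ours, not the actual HAIKU algorithm or its ranking; it simply pools CPs previously stored for this verb, for CMPs of the same length, and for this CMP under other verbs, roughly mirroring the situations listed in section 1.3.3. The example verbs, CMPs and CPs are illustrative only.

    from collections import defaultdict

    cp_dict = defaultdict(set)      # (verb, CMP) -> Case Patterns accepted so far

    def record(verb, cmp, cp):
        cp_dict[(verb, cmp)].add(cp)

    def suggest(verb, cmp):
        suggestions = set(cp_dict.get((verb, cmp), set()))
        for (v, m), cps in cp_dict.items():
            if v == verb and len(m.split("-")) == len(cmp.split("-")):
                suggestions |= cps          # same verb, same number of Case Markers
            if m == cmp:
                suggestions |= cps          # same CMP under a different verb
        return sorted(suggestions)

    record("fall", "psubj-in", "obj-loc_at")
    record("see", "psubj-pobj", "agt-obj")
    print(suggest("see", "psubj-pobj"))     # ['agt-obj']
    print(suggest("meet", "psubj-pobj"))    # ['agt-obj'], borrowed from another verb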
After 439 interactions, the maximum number of CP suggestions for any one CMP was 14. From Figure 5 we see that the increase in the maximum number of suggestions is slowing. Consequently, we expect that the maximum may continue to rise as more sentences are processed, but that this number will not leap to a higher order of magnitude. We claim, then, that numbers of suggestions of this magnitude are manageable for the user. Consider sentence 444:

444  You see a film of dew on the inside of the window.
For the CMP psubj-pobj, the Case Analyzer suggested 12 possible Case Patterns in the following order:

  (1) agt-dir
  (2) agt-eff
  (3) agt-obj
  (4) cont-expr
  (5) expr-caus
  (6) expr-manr
  (7) expr-meas
  (8) expr-obj
  (9) expr-tat
  (10) obj-meas
  (11) recp-meas
  (12) recp-obj
Looking at the first suggestion, we accepted Agent as the proper Case for psubj (consistent with our treatment of the verb “see” in this text), meaning that we only had to consider 3 of the 12 suggested patterns. This hierarchical approach to selecting CPs came naturally and suggests a possible modification to the system: once the number of suggestions reaches some threshold, CA interaction could suggest a small number of single Cases for each Marker in turn instead of suggesting complete Case Patterns. This would make CA interactions even more manageable as patterns accumulate.

5.2.3 User Action

Looking back to Table 6, we see that the correct Case Pattern was among the system’s suggestions 58% of the time (272 times out of 468 clauses). We had to supply a new Case Pattern for 36% of the clauses and 6% of the clauses were discarded. Again, however, since the system is supposed to improve over time, it is more interesting to look at how these numbers change over time. Figure 6 shows the numbers of times we accepted a single suggestion or chose from multiple suggestions (a + c) versus the numbers of times we had to supply (s) new patterns or discard (d) clauses.

For the first half of the sentences, the total number of supplied patterns was roughly equal to the number of patterns chosen from the suggested CPs. After 234 (exactly half) of the clauses had been processed, we had manually supplied CPs 103 times and chosen CPs from among the system’s suggestions 105 times, meaning that the correct CP was among the suggested patterns about half the time. However, by the end of the experiment, we had supplied CPs 167 times and chosen from among suggested CPs 272 times, meaning that the correct CP was among the system’s suggestions about 62% of the time. For the 234 clauses in the second half of the experiment, we supplied CPs 64 times and chose CPs 167 times: the correct CP was among the system’s suggestions roughly 72% of the time. This result supports the claim that learning from past patterns improves the ability of the system to make informed Case Pattern suggestions to the user.
[Figure 6 is a line chart plotting cumulative user actions (y-axis: Total Actions) against clauses processed (x-axis: Clauses), with separate curves for a + c (accept or choose), s (supply) and d (discard).]

Figure 6: User action over time
5.2.4 User Onus

In 87% of the clauses that weren’t discarded, the Case Pattern was obvious to us. There were only five clauses for which Case assignment was quite difficult. None of the clauses presented problems that we could not resolve given our existing set of 28 Cases. Sentence 35 is an example of a sentence containing a clause with a user onus of 2.

35  So weather stations began sending up airplanes to get reports on the clouds.

DIPETT correctly parsed “sending up airplanes to get reports on the clouds” as the complement of “began”. But within the “sending up” clause, “airplanes” was parsed as the indirect object with the “to get” infinitive clause parsed as the complement. So CA suggested pobj as the Case Marker corresponding to the “to get” clause and piobj as the Marker for “airplanes”. We replaced that CMP with pobj for “airplanes” and adv for the “to get” clause. Since CA had no suggestions for the verb “send” with the CMP pobj-adv, we had to supply the Case Pattern as well (Object-Purpose).

5.2.5 Case System Coverage

Of the twenty-eight Cases in the existing Case set, four were not used in this experiment. The Opposition Case represents an entity that contrasts with or opposes the act but is insufficient to prevent it from happening. Despite the fact that this Case was not used at all
in the analysis of the test text, no other Case adequately captures the relationship marked by such markers as against, considering, despite, in_spite_of, notwithstanding, versus, nevertheless and nonetheless. Similarly, only our Order Case can represent the relative position of an entity within a structured arrangement of entities, as marked by after, ahead_of, before, below, beneath, beyond, underneath, first, last, lastly, next, primarily, second, secondly, etc.

The other two Cases not used were Time_from and Time_to. These are the natural temporal counterparts of the locative Location_from and Location_to Cases, both of which appeared in the test text. The Time_from and Time_to Cases are also well supported in the literature (for a comparative survey, see Barker et al., 1996). The Time_from Case is marked by as_of, from, since, with, hence, henceforth, henceforward, hereafter, thence and thereafter. Time_to is marked by into, till, to, until, backward and hitherto.

Finally, although HAIKU allows the user to add new Cases to supplement the existing 28, we needed no new Cases to cover the 439 clauses for which Case Analysis was performed.
6 Overall User Onus
We can think of the user time (summarized in section 1.3.4) as a raw indicator of the system’s toll on the user. A more general measure of user onus would have to take into account the number and severity of parse errors, the number of pronoun explicitization interactions and the user actions involved, the number of CLRA interactions and their user onus measures, and the number of CA interactions and their user onus measures. However, all of these indications of sentence complexity also directly influenced the amount of time we spent on each sentence. A sentence with three pronouns, two CLRs and three CA interactions would obviously be more burdensome and time consuming than a simple sentence with a single verb. Furthermore, if we wanted to get a rough idea of the potential user burden of a sentence before analyzing it, we could not use these indications of sentence complexity, since it requires a full analysis just to measure them.

Instead, we looked at the relationship between sentence length (something easily measurable before analysis) and user time for this experiment. Figure 7 shows the average user time spent on sentences of equal length. The figure also shows the number of sentences of each length. The visual impression of Figure 7 is that average user time increases almost monotonically as sentence length increases. This rule begins to deviate as sentence length reaches the high end of the scale. Since the user time is averaged over all sentences of a given length, these numbers are less reliable where there are fewer sentences of any one length. Nonetheless, user time on longer sentences does vary more than it does for shorter sentences. The higher incidence of parse errors for longer sentences means that semantic analysis is more often aborted, resulting in some low user times. However, when semantic analysis is possible, longer sentences often have more clauses and clauses with more verb arguments, resulting in longer user interactions and higher user times.

[Figure 7 is a chart plotting, for each sentence length (x-axis: Number of Tokens), the number of sentences of that length and the average user time in seconds (series: # sentences, average time).]

Figure 7: Average user time vs. sentence length
Calculating the correlation coefficient between average user time and sentence length gives a value of ρxy = 0.934, based on the formula in Equation 1.
  ρxy = Cov(X, Y) / (σx σy),   where   Cov(X, Y) = (1/n) Σ_{i=1}^{n} (xi − µx)(yi − µy)

Equation 1: Standard formula for the correlation coefficient ρxy
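A direct implementation of Equation 1 in Python (ours, for illustration; the length/time pairs below are made up, not the experiment's data):

    from math import sqrt

    def correlation(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
        sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)
        sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
        return cov / (sx * sy)

    lengths = [6, 8, 10, 12, 14, 16]           # tokens per sentence (illustrative)
    avg_times = [45, 70, 95, 120, 150, 160]    # average user time in seconds (illustrative)
    print(round(correlation(lengths, avg_times), 3))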
The very high value for the correlation coefficient ρxy suggests that average user time and sentence length are very unlikely to be independent. We can conclude that sentence length is a good indicator of the potential user onus for a given sentence.

Finally, it would be very useful to look at how user time changes over the course of the experiment. But since user time depends on sentence length, number of clauses, number of pronouns, etc., it is not very helpful simply to look at the changing user times for all sentences. On the other hand, since the number of sentences with the same numbers of tokens, clauses, pronouns, etc. is small, we cannot completely isolate user time from dependent variables and still have enough sentences to look for trends. However, we can at least eliminate sentence length by looking at how user time changes for sentences of the same length. Figure 8 shows how user time changed for four of the most common sentence lengths.
[Figure 8 contains four panels (sentences with 9, 10, 11 and 12 tokens), each plotting user time in seconds for successive sentences of that length, with a superimposed moving average.]

Figure 8: User times for common sentence lengths
Unfortunately, the curves themselves are too erratic to see any clear trend. The superimposed moving averages give local averages of the previous ten data points for each point in the user time curves. For all four sentence lengths shown (and indeed for all sentence lengths occurring more than 20 times) average user time decreased as more sentences were analyzed. This result is encouraging, given the concern that increasing numbers of Case Pattern suggestions would make the system unusably burdensome on the user over time.
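The smoothing used in Figure 8 can be sketched as follows (our code; each smoothed point averages up to the ten most recent observations, which may differ slightly from the exact windowing used for the figure):

    def moving_average(values, window=10):
        smoothed = []
        for i in range(len(values)):
            recent = values[max(0, i - window + 1): i + 1]
            smoothed.append(sum(recent) / len(recent))
        return smoothed

    # Illustrative user times (seconds) for successive sentences of one length.
    times = [300, 120, 200, 90, 150, 80, 110, 60, 95, 70, 65, 50]
    print([round(v) for v in moving_average(times)])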
7 Conclusions
In section 1.1 we identified several important goals of a DIPETT/HAIKU experiment in general and this report in particular. We revisit these goals here.
Evaluation of DIPETT on an unprepared technical text. The parsing data gathered during the experiment are listed in appendix A and summarized in section 2. From that section, we saw that 80% of the sentences were completely parsed and that 85% were parsed well enough for HAIKU semantic analysis to proceed. The numbers recorded provide insight into the types and frequencies of different kinds of parse errors. Although the statistics presented in this report give some idea of the overall performance of DIPETT on a given kind of text, the exact errors themselves may be more useful for improving the system. The experience of using the parser interactively for a large number of sentences, along with the error descriptions in the grid, reveals biases that cause DIPETT to repeatedly commit certain errors.

Examination of the coverage of HAIKU’s Case set. Section 5.2.5 explicitly examined the coverage of the existing Cases. Despite the fact that four of HAIKU’s Cases were not used for the current text, it would be imprudent to discard them, given their definitions and the fact that each of the four Cases is marked by several Case Markers (as listed in section 5.2.5). That section also revealed that no new Cases were needed to capture the semantics of clauses in this particular text. These facts suggest that the Case system has good coverage.

Evaluation of Case Analyzer performance over time. In section 5.2, we showed that the system was able to make Case Pattern suggestions based on past processing, and that the number of suggestions did not explode unmanageably over time. We also showed that suggestions made by the Case Analyzer more often than not included the most appropriate Case Pattern (by a margin of 272 to 167).

Evaluation of Clause-Level Relationship Analyzer performance. Although the success rate of 73% for CLRA was lower than for some earlier experiments (for the possible reasons given in section 4.2), this experiment confirms the CLRA module as a simple and fairly accurate system for assigning semantic labels to inter-clausal relationships. More important, the experiment revealed areas where the system can be improved. In particular, the CLRA module’s biases for sentences containing fewer syntactic clues may need to be adjusted.

Effects of sentence length on system performance and user burden. The relationship between sentence length (number of tokens) and parse errors was found to be inconclusive, as investigated in section 2.2. We subsequently looked at the effect of parse errors on various other parts of the system and found that CLRA invocation and CMP correctness were directly related to parse error severity. Section 6 showed that sentence length and average user time were directly related. That section also showed that on average user time decreased over the course of the experiment.
8 References
BARKER, KEN (1994). “Clause-Level Relationship Analysis in the TANKA System”. TR-94-07, Department of Computer Science, University of Ottawa.

BARKER, KEN & STAN SZPAKOWICZ (1995). “Interactive Semantic Analysis of Clause-Level Relationships”. Proceedings of the Second Conference of the Pacific Association for Computational Linguistics, 22-30. Brisbane.

BARKER, KEN, TERRY COPECK, SYLVAIN DELISLE & STAN SZPAKOWICZ (1993). “An Empirically Grounded Case System”. TR-93-08, Department of Computer Science, University of Ottawa.

BARKER, KEN, TERRY COPECK, SYLVAIN DELISLE & STAN SZPAKOWICZ (1996). “A Case System for Interactive Knowledge Acquisition from Text”. (to be submitted to the Journal of Natural Language Engineering).

DELISLE, SYLVAIN (1994). “Text Processing without A-Priori Domain Knowledge: Semi-Automatic Linguistic Analysis for Incremental Knowledge Acquisition”. Ph.D. thesis, TR-94-02, Department of Computer Science, University of Ottawa.

DELISLE, SYLVAIN, TERRY COPECK, STAN SZPAKOWICZ & KEN BARKER (1993). “Pattern Matching for Case Analysis: A Computational Definition of Closeness”. ICCI-93, 310-315.

DELISLE, SYLVAIN, KEN BARKER, TERRY COPECK & STAN SZPAKOWICZ (1996). “Interactive Semantic Analysis of Technical Texts”. Computational Intelligence 12(2), May, 1996 (to appear).

LARRICK, NANCY (1961). Junior Science Book of Rain, Hail, Sleet & Snow. Champaign: Garrard Publishing Company.

QUIRK, RANDOLPH, SIDNEY GREENBAUM, GEOFFREY LEECH & JAN SVARTVIK (1985). A Comprehensive Grammar of the English Language. London: Longman.