Research Feature
Searching for Definitional Answers on the Web Using Surface Patterns

Alejandro Figueroa and Günter Neumann, German Research Center for Artificial Intelligence
John Atkinson, University of Concepcion, Chile
A novel question-answering system employs query rewriting techniques to increase the probability of extracting nuggets from various Web snippets by matching surface patterns. Experimental results show the approach’s promise versus existing techniques.
In recent years, researchers developing Web-based question-answering systems (QASs) have shifted their focus from simple factoid questions to complex queries that require processing and drawing inferences from numerous sources. One complex question type that researchers systematically assess at international gatherings such as the Text Retrieval Conference (TREC)1 asks for the definition of a given term—for example, "What is epilepsy?" "Who is Tom Hanks?" or "What is the WWF?"

Definitional QASs typically extract sentences that contain the most descriptive information about the search term from multiple documents and then summarize these sentences into definitions. They usually exploit additional external definition resources such as encyclopedias and dictionaries, as these can provide precise and relevant nuggets that are further projected on the corpus. However, integrating each particular resource involves designing a specialized wrapper. Exploiting the Web as a source of descriptive information is therefore a key issue in searching for definitional answers on the Web. Definitional QASs often use the Web to fetch and process a large number of complete documents, which
requires distinguishing descriptive sentences within all the fetched documents. Definition questions occur relatively frequently in Web search engine logs, suggesting that they are an important type of question. However, evaluating QASs that answer definition questions is much more difficult than evaluating those that answer factoid questions because a system response cannot be judged simply as right or wrong. While Web search engines provide clean Web snippets together with a link to each relevant document, definitional QASs have not succeeded in exploiting Web snippets as a direct source of descriptive phrases. To address this problem, we propose a novel approach that employs query rewriting techniques to increase the probability of extracting the nuggets from Web snippets by matching surface patterns. In addition, our method takes advantage of corpus-based semantic analysis and sense disambiguation strategies for extracting words that describe different aspects of target concepts on the Web.
RELATED WORK
QASs usually recognize definition sentences by aligning surface patterns with sentences in a target corpus and using
these patterns to extract, rank, and filter nuggets. Recent research indicates that the larger the corpus, the higher the probability of matching sentences.2 Consequently, performance improves significantly with corpus size. Most systems rank matched sentences according to three criteria:2-4

• accuracy of the surface pattern that signals the corresponding descriptive nugget,
• frequencies of words within matched sentences (given that highly frequent terms are very likely to belong to descriptions), and
• frequencies of words that co-occur with the definiendum, or term to be defined (given that such words are likely to express its definition facets).

QASs that use these ranking criteria can extract satisfactory definitions from top-ranked full documents retrieved from the Web,2 but they do not filter redundant matched sentences. One way to address this problem is to randomly remove one sentence from every pair that shares more than 60 percent of its terms.3 Unfortunately, this approach discards relevant information accompanying the removed sentences, and it does not account for sentences with three or more overlapping phrases. Researchers used this strategy to answer questions at TREC 2003 by aligning surface patterns at the word and part-of-speech levels, with the assistance of a wrapper for the Merriam-Webster online dictionary, from which the system retrieved 1.5 nuggets per question. Because nuggets often seem odd and out of place without their context, the technique expanded them to surround 100 non-whitespace characters to enhance readability.

To deal with the lack of proper context, one study4 took advantage of WordNet glossaries, online-specific resources such as Wikipedia, and Web snippets to learn frequencies and word correlations, especially with the definiendum. This technique reranked candidate descriptive sentences according to their similarity to a centroid vector based on the learned frequencies. Definitional websites greatly improved performance, resulting in few unanswered questions: Wikipedia covered 34 out of 50 TREC 2003 definition questions and Biography.com covered 23 out of 30 questions regarding people; together they helped provide answers to 42 questions. While Web snippets yielded relevant information about the definiendum, they did not provide effective descriptive sentences.

To address this problem, some methods obtain n-character windows5 from the top 50 documents retrieved by a search engine and rank them with a support-vector-machine-based classifier trained using previously tagged windows and automatically acquired phrasal attributes. Experimental results showed one acceptable definition within the top-five ranked windows for 116 out of 160 TREC 2000 questions and 116 out of 137 TREC 2001 questions, but this technique still requires huge amounts of training data. An unsupervised and less data-dependent approach6 extracts tagged windows from online encyclopedias.

Other robust methods for vector-based definition retrieval use a centroid vector to compute word dependencies learned from the 350 most frequent co-occurring terms extracted from the best Google snippets.7 The system fetches these snippets by expanding the original query using a set of
terms that highly co-occur with the definiendum in sentences obtained by submitting the original query and some task-specific clues such as “biography.” It ranks candidate phrases according to the centroid vector and filters them by ensuring that their cosine similarity to all previously selected sentences is below a threshold.
SEARCHING FOR DEFINITIONS ON THE WEB
Unlike state-of-the-art QASs that expand the query with task-specific cues, our proposed Def-WQA approach boosts the retrieval of descriptive phrases by rewriting the query according to the syntactic structure of surface patterns. This rewriting biases search engines in favor of Web snippets that align with such patterns—that is, snippets extracted from well-known definition sites. Def-WQA avoids specialized wrappers or full documents and relaxes conventional pattern matching by using the Jaccard measure to identify different variations of the definiendum. Another distinctive feature of Def-WQA is its data-driven nature—it only uses language-specific surface patterns and does not rely on training data.
System architecture
Figure 1 shows the overall structure of our definitional Web-based QAS. The input to the system is a query Q provided by the user.5 From Q, the system identifies the definiendum δ, the term to be defined. First, a definitional miner extracts sentences from the Web that are likely to contain a definition of δ. Next, a definitional rule matcher aligns these sentences with a previously defined set of surface patterns. Because names contained
Figure 1. Definitional Web-based QAS system architecture. The query and its definiendum feed the definitional miner, which retrieves candidate sentences from the Web; the definitional rule matcher aligns these sentences with the surface patterns; the context miner derives semantically close terms; the sense disambiguator outputs disambiguated terms; and the definition ranker produces the ranked definitions.
in the matched sentences might have multiple meanings—for example, Tom Hanks the actor and Tom Hanks the seismologist—a context miner extracts the different senses of the definiendum by observing the correlation of their neighbors in a multidimensional semantic space provided by corpus-based semantic analysis. To resolve ambiguity problems arising from multiple senses in the same semantic space, a sense disambiguator discovers a set of uncorrelated words and disambiguates the different senses of δ to detect the correct one. It then groups potential senses into conceptual clusters and sends these to a definition ranker, which produces an ordered sequence of extracted definitions by selecting representative sentences as the answers. Figure 2 shows example search results of our QAS for two questions: "Who is Akbar the Great?" and "What is epilepsy?"
Definitional miner
Extracting sentences from the Web that are likely to contain a definition of δ involves sequentially submitting 10 purpose-built queries in which the first submission corresponds to the initial query (Q1 = "δ") and the remaining queries aim at a surface pattern Π, like those shown in Figure 3. Because Π1 is likely to provide too many descriptive sentences, the definitional miner splits it into the following four queries:

Q2 = "δ is a " ∨ "δ was a " ∨ "δ were a " ∨ "δ are a "
Q3 = "δ is an " ∨ "δ was an " ∨ "δ were an " ∨ "δ are an "
Q4 = "δ is the " ∨ "δ was the " ∨ "δ were the " ∨ "δ are the "
Q5 = "δ has been a " ∨ "δ has been an " ∨ "δ has been the " ∨ "δ have been a " ∨ "δ have been an " ∨ "δ have been the "

The next query attempts to discover snippets matching Π2 and Π3:

Q6 = "δ, a " ∨ "δ, an " ∨ "δ, the " ∨ "δ, or "

Because Π3 has a low occurrence within Web snippets2 and often yields a synonym of δ, such as "myopia or nearsightedness," the definitional miner merges these two kinds of patterns into one query. Alternative names of people, organizations, or abbreviations are seldom expressed in this way, but they are likely to match the other clauses within Q6. Consequently, combining both patterns allows the miner to reduce the number of Web searches. Queries Q7, Q8, and Q9 aim at Π4, Π6, and Π7, respectively:

Q7 = ("δ" ∨ "δ also " ∨ "δ is " ∨ "δ are ") ∧ (called ∨ nicknamed ∨ "known as")
Q8 = "δ became " ∨ "δ become " ∨ "δ becomes "
Q9 = "δ which " ∨ "δ that " ∨ "δ who "

Finally, query Q10 retrieves snippets matching Π5 and Π8. As for Q6, the miner merges both patterns into one query on the grounds that Π8 deals with δ concerning persons and Π5 focuses essentially on acronyms.2
QUESTION: "Who is Akbar the Great?"

The surface patterns used to search for person definitions are:

>>>>> "Akbar the Great is a " OR "Akbar the Great are a " OR "Akbar the Great was a " OR "Akbar the Great were a "
>>>>> "Akbar the Great is an " OR "Akbar the Great are an " OR "Akbar the Great was an " OR "Akbar the Great were an "
>>>>> "Akbar the Great is the " OR "Akbar the Great are the " OR "Akbar the Great was the " OR "Akbar the Great were the "
>>>>> "Akbar the Great has been a " OR "Akbar the Great have been a " OR "Akbar the Great has been an " OR "Akbar the Great have been an " OR "Akbar the Great has been the " OR "Akbar the Great have been the "
>>>>> "Akbar the Great, a " OR "Akbar the Great, an " OR "Akbar the Great, the " OR "Akbar the Great, or "
...

Some clusters representing the extracted answers:

Cluster SMITH: Smith, Akbar, the Great Mogul (1542-1605), Clarendon Press, 1919.
Cluster KING: Akbar the Great was the next king of the Mughals (1556-1605).
Cluster EMPIRE: A royal chronicle tells how Akbar the Great, who ruled India's Mogul Empire in the A.D. 1500s, captured at least 9,000 cheetahs during his 49-year reign to aid him in hunting deer. / Akbar the Great was a 16th Century ruler of the Mogul Empire.
Cluster EMPEROR: 1556 Akbar the Great becomes Mogul emperor of India, conquers Afghanistan (1581), continues wars of conquest (until 1605).
...

QUESTION: "What is epilepsy?"

The surface patterns used to search for "thing" definitions are:

>>>>> "epilepsy is a " OR "epilepsy are a " OR "epilepsy was a " OR "epilepsy were a "
>>>>> "epilepsy is an " OR "epilepsy are an " OR "epilepsy was an " OR "epilepsy were an "
>>>>> "epilepsy is the " OR "epilepsy are the " OR "epilepsy was the " OR "epilepsy were the "
>>>>> "epilepsy has been a " OR "epilepsy have been a " OR "epilepsy has been an " OR "epilepsy have been an " OR "epilepsy has been the " OR "epilepsy have been the "
>>>>> "epilepsy, a " OR "epilepsy, an " OR "epilepsy, the " OR "epilepsy, or "
>>>>> "epilepsy became " OR "epilepsy become " OR "epilepsy becomes "
...

Some clusters representing the extracted answers:

Cluster STRANGE: In epilepsy, the normal pattern of neuronal activity becomes disturbed, causing strange ...
Cluster SEIZURES: Epilepsy, which is found in the Alaskan malamute, is the occurrence of repeated seizures. / Epilepsy is a disorder characterized by recurring seizures, which are caused by electrical disturbances in the nerve cells in a section ... / Temporal lobe epilepsy is a form of epilepsy, a chronic neurological condition characterized by recurrent seizures.
Cluster ORGANIZATION: The Epilepsy Foundation is a national, charitable organization, founded in 1968 as the Epilepsy Foundation of America.
Cluster NERVOUS: Epilepsy is an ongoing disorder of the nervous system that produces sudden, intense bursts of electrical activity in the brain.

Figure 2. Example search results.
This avoids unproductive retrieval without reducing the number of retrieved descriptive sentences:

Q10 = "δ was born " ∨ "(δ)"

A benefit of this query rewriting strategy is that it biases Web snippets in favor of sentences that fulfill Π, so that retrieved Web snippets are likely to provide definitions of δ.7
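To make the rewriting concrete, the following Python sketch shows how the 10 queries could be assembled for a given definiendum. Our prototype used Java APIs; Python and the build_queries helper are used here purely for illustration, and the quoting, OR, and AND syntax assumed for the target search engine is an assumption.

```python
def build_queries(definiendum):
    """Assemble the rewritten queries Q1..Q10 for a definiendum (illustrative sketch)."""
    d = definiendum
    quote = lambda s: '"%s"' % s                         # exact-phrase operator assumed
    disj = lambda phrases: " OR ".join(quote(p) for p in phrases)

    q1 = quote(d)                                        # initial query
    q2 = disj([d + " is a ", d + " was a ", d + " were a ", d + " are a "])
    q3 = disj([d + " is an ", d + " was an ", d + " were an ", d + " are an "])
    q4 = disj([d + " is the ", d + " was the ", d + " were the ", d + " are the "])
    q5 = disj([d + " has been a ", d + " has been an ", d + " has been the ",
               d + " have been a ", d + " have been an ", d + " have been the "])
    q6 = disj([d + ", a ", d + ", an ", d + ", the ", d + ", or "])          # targets P2 and P3
    q7 = "(%s) AND (called OR nicknamed OR %s)" % (
        disj([d, d + " also ", d + " is ", d + " are "]), quote("known as"))  # targets P4
    q8 = disj([d + " became ", d + " become ", d + " becomes "])              # targets P6
    q9 = disj([d + " which ", d + " that ", d + " who "])                     # targets P7
    q10 = disj([d + " was born ", "(" + d + ")"])                             # targets P5 and P8
    return [q1, q2, q3, q4, q5, q6, q7, q8, q9, q10]

for q in build_queries("epilepsy"):
    print(q)
```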
Definitional rule matcher
Surface patterns are useful for distinguishing definition sentences in natural-language texts.2,3,5,8 The definitional rule matcher detects sentences using these patterns by aligning syntactic structures largely based on punctuation and on words that often convey definitions. Our approach defines several surface patterns to identify the definiendum δ′ and its definition nugget η′ within a sentence.
Π1: δ′ [is|are|has been|was|were] [a|the|an] η′
  Example: "Epilepsy is a disorder characterized by recurring seizures, which are caused by electrical disturbances in ..."
Π2: [δ′|η′], [a|an|the] [η′|δ′] [,|.]
  Example: "In epilepsy, the normal pattern of neuronal activity becomes disturbed, causing strange ..."
Π3: [δ′|η′], or [η′|δ′]
  Example: "Thomas Jeffrey Hanks, or Tom Hanks, ..."
Π4: [δ′|η′] [|,] [|also|is|are] [called|named|nicknamed|known as] [η′|δ′]
  Example: "Epilepsy, also known as seizure disorder ..."
Π5: [δ′|η′] ([η′|δ′])
  Example: "World Wildlife Fund (WWF) ..."
Π6: δ′ [become|became|becomes] η′
  Example: "In 1995, Tom Hanks became only the second man to win back to back Best Actor Oscars ..."
Π7: δ′ [which|that|who] η′
  Example: "Epilepsy, which is found in the Alaskan malamute, is the occurrence of repeated seizures ..."
Π8: δ′ [was born] η′
  Example: "Tom Hanks was born in 1956 in California ..."

Figure 3. Example surface patterns.
The set of surface patterns shown in Figure 3 is particularly useful for dealing with Web snippets (the p-th pattern is hereafter referred to as Πp). Note, however, that proper definition identification requires more complex syntactic rules, because δ does not need to match δ′ exactly in the extracted sentences, which can have complex internal structures.8 For example, the definiendum δ* = "Tom Hanks" can also be expressed as δ′1* = "American actor Tom Hanks" or δ′2* = "Thomas Jeffrey Hanks." Complex internal structures can also yield misleading definitions such as "Tom Hanks's wife." State-of-the-art strategies partially deal with this issue by providing extra word addition and ordering rules8 or by taking advantage of sophisticated partial linguistic processes such as chunking.3 Unlike these approaches, the rule matcher employs relaxed string matching based on the Jaccard measure J. For a pair of terms wi,wj, J is the ratio between the number of different unigrams shared by both terms and the total number of different unigrams: J(wi,wj) = |wi ∩ wj|/|wi ∪ wj|. In our example, J(δ*,δ′1*) = 2/4 = 0.5 and J(δ*,δ′2*) = 1/4 = 0.25. Thus, the rule matcher filters reliable pairs {δ′,η′} by computing J for the selected sentences.3
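The following Python fragment sketches this relaxed matching for Π1 only; the regular expression and the 0.25 threshold are illustrative stand-ins, since the actual matcher covers all eight patterns and uses per-pattern thresholds (see the tuning described in the experimental section).

```python
import re

# Illustrative regex for pattern P1: <definiendum'> [is|are|has been|was|were] [a|an|the] <nugget'>
P1 = re.compile(r"^(?P<definiendum>.+?)\s+(?:is|are|has been|was|were)\s+(?:a|an|the)\s+(?P<nugget>.+)$",
                re.IGNORECASE)

def jaccard(term_a, term_b):
    """Jaccard measure over the distinct unigrams of two terms."""
    a, b = set(term_a.lower().split()), set(term_b.lower().split())
    return len(a & b) / len(a | b)

def match(sentence, delta, threshold=0.25):
    """Return the nugget if the sentence aligns with P1 and its definiendum
    is a sufficiently close variation of delta (illustrative threshold)."""
    m = P1.match(sentence)
    if m and jaccard(delta, m.group("definiendum")) >= threshold:
        return m.group("nugget")
    return None

print(jaccard("Tom Hanks", "American actor Tom Hanks"))   # 2/4 = 0.5
print(jaccard("Tom Hanks", "Thomas Jeffrey Hanks"))       # 1/4 = 0.25
print(match("American actor Tom Hanks is an Academy Award-winning star", "Tom Hanks"))
```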
Context miner
There are many-to-many mappings between names and their concepts. The same name or word can refer to several meanings or entities; for instance, places can be named after famous people, and companies after their owners or locations. On the other hand, different names can indicate the same meaning or entity. To address this issue, the context miner extracts semantically related terms from sentences. Consider, for example, the following set S of recognized descriptive sentences:
Tom Hanks is an Academy Award-winning actor.
Thomas Jeffrey Hanks is an actor born in 1959 in California.
Tom Hanks is an American seismologist.
These sentences refer to "Academy Award-winning actor Tom Hanks" as "Thomas Jeffrey Hanks" and "Tom Hanks," whereas "Tom Hanks" also indicates an American seismologist. In this context, a sense is a meaning of a word or one possible reference to a real-world entity.

The context miner extracts the different senses of δ by observing the correlation of their neighbors in the reliable semantic space provided by latent semantic analysis (LSA). This multidimensional semantic space is built from a term-sentence matrix M that considers δ as a pseudosentence weighted according to traditional tf-idf metrics. The context miner distinguishes all possible different n-grams in S and their frequencies. It then reduces the size of M by removing n-grams that are substrings of other equally frequent terms. This removal step allows the system to speed up computation of the semantic vectors when using singular value decomposition (SVD) in LSA.

While running time is not a key issue in state-of-the-art QASs (effectiveness metrics such as F-score are often used to compare methods), we ran several experiments to obtain a flavor of the times required to execute critical tasks in the context-mining process. The experiments involved queries made by four computers running in parallel (the rest of the tasks ran sequentially). The results showed an average retrieval time of 3.7 seconds with a standard deviation of 1.6 seconds. The total running time including snippet retrieval was 10.7 seconds (StdDev = 5.1 s) using Java APIs as the main implementation tools, which make the application slower. Note that LSA cannot separate the SVD task from the computation of semantic vectors; otherwise, we would need a different kind of corpus-based semantic analysis method that is unproven in the context of QASs. Nevertheless, the obtained running times were 0.07 seconds (StdDev = 0.12 s) for SVD and 1.13 seconds (StdDev = 0.86 s) for LSA.
Overall, running time for SVD and dimensional reduction tasks is not significant in our QAS.

In the multidimensional semantic space obtained from LSA, the neighborhood of a particular word wi provides its context9,10 and consequently its correct meaning by pruning, for instance, inappropriate senses.10 Similarly, δ is also a term defined by its neighborhood in this semantic space. Hence, the context miner selects a set of the terms most closely related to δ—that is, terms likely to define its meaning. We conducted different experiments to tune the size of this set, testing values from 5 to 50, and observed the best results in terms of semantic closeness and intercluster distance for groups of 40 terms. With relaxed pattern matching, the system also accounts for all n-grams that are substrings of δ, as some internal n-grams are more likely to occur within descriptive sentences—for example, names or surnames are more frequent than their corresponding full names.
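The Python sketch below (using NumPy; again, our system itself was built with Java APIs) illustrates the idea on the three-sentence example: it builds a tf-idf term-sentence matrix with δ appended as a pseudosentence, applies SVD, and ranks terms by their association with δ in the reduced space. The rank k = 2 and the neighborhood size are illustrative values, not the settings used by Def-WQA.

```python
import numpy as np

def lsa_neighbors(terms, sentences, definiendum, k=2, n_neighbors=5):
    """Sketch: tf-idf term-sentence matrix with the definiendum as a
    pseudosentence, rank-k SVD, and the terms most associated with it."""
    docs = sentences + [definiendum]                       # definiendum appended as a pseudosentence
    tf = np.array([[doc.count(t) for doc in docs] for t in terms], dtype=float)
    df = (tf > 0).sum(axis=1)
    M = tf * np.log(len(docs) / df)[:, None]               # tf-idf weighting

    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    k = min(k, len(s))
    Mk = (U[:, :k] * s[:k]) @ Vt[:k, :]                    # rank-k reconstruction of M
    assoc = Mk[:, -1]                                      # association of every term with the pseudosentence
    order = np.argsort(-assoc)
    return [terms[i] for i in order[:n_neighbors]]

# The three recognized sentences about "Tom Hanks," already tokenized
S = [["tom", "hanks", "academy", "award", "winning", "actor"],
     ["thomas", "jeffrey", "hanks", "actor", "born", "1959", "california"],
     ["tom", "hanks", "american", "seismologist"]]
vocab = sorted({w for s in S for w in s})
print(lsa_neighbors(vocab, S, ["tom", "hanks"]))
```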
Sense disambiguator
A problem with the context miner is that it can extract multiple senses for sentences in the same semantic space. To deal with this issue, our QAS incorporates a sense disambiguator that resolves the λ different senses of δ in S by discovering a set of uncorrelated words. Let Φ be the term-sentence matrix, where a cell Φis = 1 if the term wi occurs within the descriptive phrase Ss and zero otherwise. The correlation among words is given by Φ̂ = Φ ⋅ Φ′. This dot product between two row vectors of Φ reflects the extent to which two terms have a similar pattern of occurrence across the document set. For example, for the words w1 = "Academy," w2 = "actor," and w3 = "seismologist," the computed values of Φ and Φ̂ are

\Phi = \begin{pmatrix} & S_1 & S_2 & S_3 \\ w_1 & 1 & 0 & 0 \\ w_2 & 1 & 1 & 0 \\ w_3 & 0 & 0 & 1 \end{pmatrix}, \qquad \hat{\Phi} = \begin{pmatrix} & w_1 & w_2 & w_3 \\ w_1 & 1 & 1 & 0 \\ w_2 & 1 & 2 & 0 \\ w_3 & 0 & 0 & 1 \end{pmatrix}
The value γ(wi) counts the nonselected words that co-occur with the term wi at the sentence level, including wi itself, regardless of how often they co-occur. In our example, because "Academy" and "actor" co-occur in a sentence and "seismologist" does not, the corresponding values become γ(w1) = γ(w2) = 2 and γ(w3) = 1. Consequently, the disambiguator adds to a candidate senses list the term wi that maximizes γ(wi)—w2 in this case. The term wi must

• not co-occur at the sentence level with any other already selected term wj, and
• have the highest number of co-occurring, nonselected terms wj.
The disambiguator adds terms until no other term wi satisfies these conditions. Since words that indicate the same sense co-occur, the selected term vectors build up an orthonormal basis in which each direction becomes a different potential sense; hence, sentences are clustered by sense. The sense disambiguator associates each sentence with its corresponding term and assigns it to one cluster Cλ. In our example, if the disambiguator randomly selects w2, then γ(wi) is equal to 1 for the three words in the next cycle. It next selects w3 because Φ̂33 > 0 and w3 does not co-occur with the already selected term w2 ∈ Wλ (Φ̂23 = 0); accordingly, Wλ is {"actor", "seismologist"}. Sentences that do not contain a term in Wλ are grouped in a special cluster C0. Accordingly, the clusters for our working example are C0 = ∅, C1 = {S1, S2}, and C2 = {S3}.
Further, the disambiguator will attempt to reassign each sentence in C0 by searching for the strongest correlation between its entities and the entities of a cluster Cλ. For our example, it would assign the sentence “Thomas ‘Tom’ Jeffrey Hanks was a school actor in the Skyline High School in Oakland, California” to C1.
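A minimal Python sketch of the greedy selection and sense clustering follows (the C0 reassignment step just described is omitted). The binary matrix Φ and the term names come from the working example above; ties in γ are broken arbitrarily here, whereas the example in the text happens to pick "actor" first, so this illustrates the procedure rather than reproducing the exact run.

```python
import numpy as np

def disambiguate(phi, terms):
    """Greedily pick mutually uncorrelated sense terms from a binary
    term-sentence matrix, then cluster sentences by the selected terms."""
    phi = np.asarray(phi, dtype=int)
    cooc = phi @ phi.T                                  # co-occurrence counts (Phi hat)
    selected, unselected = [], set(range(len(terms)))
    while True:
        # candidates must not co-occur with any already selected term
        admissible = [i for i in unselected if all(cooc[i, j] == 0 for j in selected)]
        if not admissible:
            break
        # gamma(w_i): number of unselected terms co-occurring with w_i (itself included)
        gamma = {i: int(np.count_nonzero([cooc[i, j] for j in unselected])) for i in admissible}
        best = max(gamma, key=gamma.get)                # ties broken arbitrarily
        selected.append(best)
        unselected.discard(best)
    clusters = {terms[i]: [] for i in selected}
    clusters["C0"] = []                                 # sentences with no selected term
    for s in range(phi.shape[1]):
        owners = [i for i in selected if phi[i, s] == 1]
        (clusters[terms[owners[0]]] if owners else clusters["C0"]).append("S%d" % (s + 1))
    return [terms[i] for i in selected], clusters

# Working example: w1 = "academy", w2 = "actor", w3 = "seismologist" over S1, S2, S3
phi = [[1, 0, 0],
       [1, 1, 0],
       [0, 0, 1]]
print(disambiguate(phi, ["academy", "actor", "seismologist"]))
```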
Definition ranker
A definition ranker produces an ordered sequence of extracted definitions. Let N(Ss) be a function that returns the normalized nuggets associated with Ss, let WN be the set of terms of all normalized nuggets, and let WN(Ss) be the set of words in N(Ss). Pi is the probability of finding a word wi ∈ WN and is arbitrarily set to 0 for all stop words. For our example, the sets of ranked words become

WN(S1) = {ACADEMY, AWARD, WINNING, ACTOR}
WN(S2) = {ACTOR, BORN, IN, 1959, CALIFORNIA}
WN(S3) = {AMERICAN, SEISMOLOGIST}
The values of Pi for each wi are as follows: {ACADEMY, 1/12}, {AWARD, 1/12}, {WINNING, 1/12}, {ACTOR, 2/12}, {BORN, 1/12}, {IN, 0}, {1959, 1/12}, {CALIFORNIA, 1/12}, {AMERICAN, 1/12}, {SEISMOLOGIST, 1/12}. For each cluster Cλ, the definition ranker incrementally computes a set of its sentences Sλ that maximizes the relative novelty based on their nuggets' coverage and content, computed as coverage(Ss) + content(Ss).
Table 1. Evaluations against baseline.

Corpus                   Baseline                                 Def-WQA
Question set   TQ        NAQ   NS              Accuracy           NAQ   NS              Accuracy       AS (percent)
1              133       81    7.35 ± 6.89     0.87 ± 0.20        133   18.98 ± 5.17    0.94 ± 0.07    16 ± 20
2              50        38    7.7 ± 7.0       0.74 ± 0.20        50    14.14 ± 5.30    0.78 ± 0.16    5 ± 9
3              86        67    5.47 ± 4.24     0.83 ± 0.19        78    13.91 ± 6.25    0.85 ± 0.14    5 ± 9
4              185       160   11.08 ± 13.28   0.84 ± 0.20        173   13.86 ± 7.24    0.89 ± 0.15    4 ± 11
5              152       102   5.43 ± 5.85     0.85 ± 0.22        136   13.13 ± 6.56    0.86 ± 0.16    8 ± 14
Coverage determines the probability that terms within N(Ss) belong to a description by considering unseen words in θλ. Thus, sentences with a higher number of novel words are preferable to sentences containing many redundant words. Content discriminates the degree to which N(Ss) expresses definition facets of δ on the grounds of highly close semantic terms and entities. The ranker computes this by summing up the semantic relationship between terms within the corresponding nuggets and the relevance of novel entities. It weighs each novel entity e according to its probability Peλ of being in the normalized nuggets of Cλ. In our example, cluster C1 is the only one that contains more than one sentence and is thus worth computing. The corresponding Peλ is 1/4 for each of the entities "Academy," "Award," "California," and "1959," and 1 for "American." Overall, sentences are ranked according to the order in which they are inserted—that is, higher-ranked sentences are more diverse, less redundant, and likely to contain entities along with terms that describe aspects of δ.
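The Python sketch below illustrates only the coverage-driven part of this ranking: the content term (semantic relatedness and entity weights) is omitted, the stop-word list is a stand-in, and the word probabilities are computed over the cluster alone, so the numbers differ slightly from the 1/12 values above.

```python
from collections import Counter

STOP_WORDS = {"in", "the", "a", "an", "of"}        # illustrative stop-word list

def rank_cluster(nuggets):
    """Simplified sketch: greedily order a cluster's sentences so that each
    pick maximizes coverage, i.e. the probability mass of words not yet seen."""
    counts = Counter(w for words in nuggets.values() for w in words)
    total = sum(counts.values())
    P = {w: (0.0 if w.lower() in STOP_WORDS else c / total) for w, c in counts.items()}

    seen, ranked = set(), []
    remaining = dict(nuggets)
    while remaining:
        def coverage(sid):
            return sum(P[w] for w in set(remaining[sid]) - seen)
        best = max(remaining, key=coverage)
        if coverage(best) == 0:                     # only redundant words left
            break
        ranked.append(best)
        seen.update(remaining.pop(best))
    return ranked

# Cluster C1 from the running example
nuggets = {"S1": ["ACADEMY", "AWARD", "WINNING", "ACTOR"],
           "S2": ["ACTOR", "BORN", "IN", "1959", "CALIFORNIA"]}
print(rank_cluster(nuggets))
```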
EXPERIMENTAL RESULTS
We used two kinds of criteria to assess the performance of our Web-based definitional QAS: a baseline-based comparison that uses results from different TREC and Cross-Language Evaluation Forum (CLEF) evaluations, and a comparison with other existing approaches to definitional question answering in CLEF.9 We set thresholds for the specific surface patterns to 0.25, apart from the parameters corresponding to the first (0.33) and fifth patterns (0.5). The tuning strategy consisted of testing different values on a small subset of questions. Tested values included 0.05 intervals from 0.7 to 0.2. In the case of the parameter for the first pattern (0.33), we also tested values from 0.3 to 0.4. We set the threshold that controls redundancy to 0.01 by checking values lower than 0.1.
Evaluations against baseline
In our first assessment, we used five question sets from TREC 2001 (1), TREC 2003 (2), CLEF 2004 (3), CLEF
2005 (4), and CLEF 2006 (5). Further, we implemented a baseline in which the QAS retrieved 300 snippets for Q1. The baseline splits snippets into sentences and accounts for a strict matching of δ. We randomly discarded one sentence from every pair sharing more than 60 percent of its terms, as well as sentences that are a substring of another sentence.3

Table 1 compares the experimental results of the baseline and our system. NAQ stands for the number of questions for which the answers contained at least one nugget, and TQ is the total number of questions within the question set. Unlike a previous QAS that found nuggets for only 42 questions by using external dictionaries and Web snippets,4 Def-WQA discovered novel entities for all questions in set 2. In addition, our system discovered nuggets within snippets for all 133 questions in set 1, while an earlier QAS found a top-five-ranked snippet that conveyed a definition for only 116 questions within the top 50 retrieved documents.5

In addition, Def-WQA extracted short sentences (125.7 ± 44.21 characters counting white spaces; baseline: 118.168 ± 50.2 characters), whereas two previous QASs handled fixed windows of 250 characters.5,6 On the other hand, Def-WQA found sentences that are 109.74 ± 42.15 characters long without counting white spaces, which is somewhat longer than the 100-character nuggets of another older QAS.3

Overall, our system covered 94 percent of the questions, whereas the baseline covered 74 percent. This difference may be due mainly to the query rewriting step and the flexible matching of δ. For all questions in which Def-WQA and the baseline extracted at least one nugget, we computed the accuracy and average number of sentences (NS). Our system doubled the number of sentences and so obtained slightly better accuracy. The results in Table 1 also show the proportion of sentences within NS for which the relaxed matching shifted δ to another concept that yielded interesting descriptive sentences (answer shift—AS). For example, "neuropathy" was
shifted to "peripheral neuropathy" and "diabetic neuropathy." Conversely, some shifts produced unrelated sentences, such as "G7" shifting to "Powershot G7."
Table 2. Average F-score values for TREC 2003 definition questions.

Definitional QAS                               F-score (β = 5)   Average length
BBN Technologies                               0.555             2,059.20
Def-WQA                                        0.530             1,878.00
National University of Singapore               0.473             1,478.74
University of Southern California              0.461             1,404.78
Language Computer Corp.                        0.442             1,407.82
Baseline                                       0.340             583.00
University of Colorado/Columbia University     0.338             1,685.60
ITC-irst                                       0.318             431.26
Evaluations against gold standard
To compare Def-WQA with a gold standard, we used the assessor's list for the TREC 2003 data.1 Our system achieved 0.61 ± 0.33 recall and 0.18 ± 0.13 precision, whereas baseline performance was 0.35 ± 0.34 and 0.30 ± 0.26, respectively. Def-WQA's higher recall suggests that our method selected additional sentences containing more of the nuggets regarded as key on the assessor's list.

The QASs evaluated at TREC can find valid nuggets that the assessor's list might not judge to be relevant.3 This is even more likely for Web-based systems, as they discover many additional nuggets that a user would regard as relevant but the list excludes. This is a key issue for our system, as it significantly increases the number of selected descriptive sentences per question and thus the answer length. We accordingly computed the F(β)-score1 as (β² + 1)(recall · precision)/(β² · precision + recall) for each answer, with β = 5 as used by the majority of TREC systems.

Table 2 compares the results of our system to those of the best seven definitional QASs at TREC 2003. Def-WQA extracted descriptive information for all definition questions in the TREC 2003 datasets.1 Compared to other definitional QASs, our system achieved a second-place F-score of 0.53, making it competitive with the best systems, which achieved values between 0.33 and 0.55. Although the other systems are not directly comparable to ours, as they often extracted answers from a target corpus whereas Def-WQA did so from the Web, the difference in performance is significant. Another recent QAS finished with an F-score ranging from 0.511 to 0.531 for different configurations.7 This system uses a centroid vector that considers word dependencies learned from the 350 most frequent stemmed co-occurring terms taken from the top 500 snippets retrieved by Google. The system fetched these snippets by expanding the original query using a set of five highly co-occurring terms. These terms co-occur with the definiendum in sentences obtained by submitting the original query plus some task-specific clues such as "biography."

It is also worth noting that for some δ, the baseline obtained a better F-score ("Akbar the Great," "Albert Ghiorso," and "Niels Bohr"), indicating that it extracted an answer closer to the assessor's list than did Def-WQA. Our system's slightly lower F-score might be due to a completeness issue: The assessor's list3 did not judge some nuggets found by TREC systems as relevant despite the fact that users might find them so. Consequently, these nuggets enlarge the response without increasing precision. Even so, the F-score remains the best way to compare the performance of several systems.
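For reference, a small Python helper for this recall-biased F(β) follows; plugging in the average recall and precision reported above gives roughly 0.56, which differs from the reported 0.53 because Table 2 averages per-answer F-scores rather than scoring the averages.

```python
def f_beta(recall, precision, beta=5.0):
    """F(beta) = (beta^2 + 1) * recall * precision / (beta^2 * precision + recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (beta ** 2 + 1) * recall * precision / (beta ** 2 * precision + recall)

# Illustration with Def-WQA's average recall and precision from the gold-standard evaluation
print(round(f_beta(0.61, 0.18), 2))   # -> 0.56
```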
A novel feature of Def-WQA is its incorporation of a sense disambiguation strategy capable of distinguishing and resolving different potential senses for some δ—for example, the particle sense and format sense of "atom." Our system built the answers to questions like "Who is Akbar the Great?" from surface patterns for person definitions such as "Akbar the Great was a" or "Akbar the Great were a." On the other hand, our system split some senses into two separate ones—for example, "emperor" and "empire" for "Akbar the Great." This misinterpretation could be due to "emperor" and "empire" co-occurring with δ independently of each other. External sources of knowledge might resolve this problem, but doing so could become very difficult7 as some δ can be extremely ambiguous. For example, "Jim Clark" refers to several real-world entities; our sense disambiguator differentiated the photographer, the racing driver, and the Netscape creator but grouped many executives named "Jim Clark" in the same cluster. In addition, entities and the correlation of close terms in the semantic space provided by LSA can be two building blocks of a more sophisticated strategy to disambiguate δ.
A key finding of our research is that Web snippets do not provide the necessary information for complete disambiguation. Hence, using external resources such as WordNet and additional queries for fetching extra information from the Web should be encouraged. In addition, Def-WQA's definition ranker might approximate a real summarizer by properly arranging highly ranked nuggets and modeling the events' temporal ordering. These summaries could also be linked to the original documents, as traditional papers do.
Further, while our system has been tested only for English, adapting it to other languages would be easy, mainly involving writing new surface patterns. Finally, some techniques for detecting morphosyntactic variations would let the system decrease output redundancy. Nevertheless, this redundancy can still be useful for projecting sentences onto a corpus such as the Advanced Question Answering for Intelligence corpus (www-nlpir.nist.gov/projects/aquaint).
References
1. E.M. Voorhees and D.K. Harman, TREC: Experiment and Evaluation in Information Retrieval, MIT Press, 2005.
2. H. Joho, Y.K. Liu, and M. Sanderson, "Large-Scale Testing of a Descriptive Phrase Finder," Proc. 1st Int'l Conf. Human Language Technology Research (HLT 01), ACL, 2001, pp. 219-221.
3. W. Hildebrandt, B. Katz, and J. Lin, "Answering Definition Questions Using Multiple Knowledge Sources," Proc. Human Language Technology Conf. North American Chapter Assoc. Computational Linguistics (HLT-NAACL 04), ACL, 2004, pp. 49-56.
4. H. Cui et al., "A Comparative Study on Sentence Retrieval for Definitional Question Answering," Proc. 27th Int'l ACM SIGIR Workshop Information Retrieval for Question Answering (SIGIR 04), ACM Press, 2004; www.comp.nus.edu.sg/~kanmy/papers/DefStcStudy_Cui_etal.pdf.
5. S. Miliaraki and I. Androutsopoulos, "Learning to Identify Single-Snippet Answers to Definition Questions," Proc. 20th Int'l Conf. Computational Linguistics (COLING 04), ACL, 2004, pp. 1360-1366.
6. I. Androutsopoulos and D. Galanis, "A Practically Unsupervised Learning Method to Identify Single-Snippet Answers to Definition Questions on the Web," Proc. Conf. Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP 05), ACL, 2005, pp. 323-330.
7. Y. Chen, M. Zhou, and S. Wang, "Reranking Answers for Definitional QA Using Language Modeling," Proc. 21st Int'l Conf. Computational Linguistics and 44th Ann. Meeting Assoc. Computational Linguistics (COLING/ACL 06), ACL, 2006, pp. 1081-1088.
8. M.M. Soubbotin, "Patterns of Potential Answer Expressions as Clues to the Right Answers," Proc. 10th Text Retrieval Conf. (TREC 10), NIST, 2001, pp. 293-302.
9. T.K. Landauer et al., eds., Handbook of Latent Semantic Analysis, Lawrence Erlbaum, 2007.
10. W. Kintsch, "Predication," Cognitive Science, vol. 25, no. 2, 1998, pp. 173-202.
Alejandro Figueroa is a PhD student at the German Research Center for Artificial Intelligence. His research interests include question answering and natural-language processing. Figueroa received an MS in language technologies from Saarland University, Germany. Contact him at [email protected].

Günter Neumann is a principal researcher at the German Research Center for Artificial Intelligence. His research interests include natural-language processing and machine learning. Neumann received a PhD in computer science from Saarland University. He is a member of the German Informatics Society and the Association for Computational Linguistics. Contact him at [email protected].

John Atkinson is an associate professor in the Department of Computer Science at the University of Concepcion. His research focuses on natural-language processing and text mining. Atkinson received a PhD in artificial intelligence from the University of Edinburgh. He is a member of the AAAI, the IEEE, and the ACM. Contact him at atkinson@inf.udec.cl.