LETTER
A Definitional Question Answering System Based on Phrase Extraction Using Syntactic Patterns

Kyoung-Soo HAN†, Young-In SONG†, Nonmembers, Sang-Bum KIM†, Student Member, and Hae-Chang RIM†, Nonmember

† The authors are with the Department of Computer Science and Engineering, Korea University, Seoul, Korea.
SUMMARY   We propose a definitional question answering system that extracts phrases using syntactic patterns, which are easy to construct manually and reduce the coverage problem of lexical patterns. Experimental results show that our phrase extraction system outperforms a sentence extraction system in terms of recall and precision, especially when concise answers are required, and indicate that the text unit used for the answer candidates and the final answer has a significant effect on system performance.
key words: definitional question answering, phrase extraction
1. Introduction

Definitional question answering is the task of answering definition questions, such as "What are fractals?" and "Who is Andrew Carnegie?", initiated by the TREC Question Answering Track [1]. The answer to such a question consists of several essential information nuggets about the question target; the nuggets for the question about Andrew Carnegie include 1) philanthropist, 2) built Carnegie Hall in New York, 3) steel magnate, etc. Contrary to factoid or list questions, definition questions do not have an expected answer type and contain only the question target. Usually, all sentences in the passages retrieved with the question target are regarded as answer candidates and are ranked based on several criteria, and the top ranked candidates are selected as the final answer nuggets.

For practical applications such as mobile services, the answer length must inevitably be limited. As sentences in news articles are generally so long that the answer length limit is used up quickly, a text unit shorter than a sentence is necessary for generating a concise answer. The idea of using a text unit shorter than the whole sentence has also been adopted by several researchers [2]–[4]. [2] extracted answer phrases using definition patterns, and [3] extracted linguistic constructs, including relations and propositions, using information extraction tools. [4] used a predicate set defined by semantic categories such as genus, species, cause, and effect. [2] reported that a system based on sentence extraction outperformed one based only on phrase extraction, whereas [3] showed that answers built from linguistic constructs are better than those built from raw sentences. We investigate why these results are inconsistent by empirically analyzing the effect of text units on the performance of a definitional question answering system.
Definition patterns are known to be useful for extracting and ranking answer candidates [5]. Most researchers use manually constructed lexical definition patterns, but the construction task is labor-intensive. Although lexical patterns can be trained and collected automatically [6], they suffer from a lack of coverage. Therefore, we use syntactic definition patterns, which are easy to construct manually and reduce the coverage problem. Our phrase-based definitional question answering using syntactic patterns is explained in detail in Sect. 2, experimental results are presented in Sect. 3, and we conclude in Sect. 4.

2. Phrase-Based Definitional Question Answering

Our definitional question answering system consists of four components: question analysis, passage retrieval, candidate extraction, and answer selection.

2.1 Question Analysis

In the question analysis phase, the question sentence is parsed to extract the head word of the target. Then, the type of the target is identified with a named entity tagger as one of three types: person, organization, or other thing. The target type is used later for calculating the weights of words in definitional phrases.

2.2 Passage Retrieval

As the target tends to be expressed differently in the documents and in the question, a single-phase passage retrieval method may miss much relevant information. Therefore, we first retrieve documents relevant to the target with a relatively strict query, and then extract relevant sentences with a more relaxed one. The query for document retrieval consists of the words and phrases of the target filtered with a stopword list: if there is a sequence of two words starting with capital letters, a phrase query is generated from the two words, and the remaining words are used as single query words. Once the documents are retrieved, we generate passages consisting of each sentence containing the head word of the target.
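As a rough illustration of this two-phase query construction (not the actual implementation, whose retrieval engine and stopword list are not described in detail here), the following sketch builds the strict document query and the relaxed sentence filter. The function names and the tiny stopword list are hypothetical.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "is", "are", "who", "what"}  # illustrative only

def build_document_query(target):
    """Strict first-phase query: a phrase query for each pair of consecutive
    capitalized words, plus the remaining content words as single terms."""
    words = [w for w in target.split() if w.lower() not in STOPWORDS]
    phrases, singles, i = [], [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i][0].isupper() and words[i + 1][0].isupper():
            phrases.append((words[i], words[i + 1]))   # e.g. ("Andrew", "Carnegie")
            i += 2
        else:
            singles.append(words[i])
            i += 1
    return {"phrases": phrases, "terms": singles}

def select_passages(documents, head_word):
    """Relaxed second phase: keep every sentence mentioning the target's head word."""
    passages = []
    for doc in documents:
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            if head_word.lower() in sent.lower():
                passages.append(sent)
    return passages
```

For the target "Andrew Carnegie", build_document_query returns one phrase query ("Andrew", "Carnegie") and no single terms.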
Then, we check whether the passage can be expanded into a multiple-sentence passage using a simple anaphora resolution technique [7].
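The anaphora resolution used for the expansion is described in [7]; the following is only a simplified stand-in that appends the next sentence when it opens with a pronoun, to show where the expansion fits in the pipeline.

```python
PRONOUNS = {"he", "she", "it", "they", "his", "her", "its", "their"}  # crude heuristic

def expand_passage(sentences, idx):
    """Expand the passage at position idx into a two-sentence passage when the
    next sentence appears to refer back to the target via a pronoun."""
    passage = [sentences[idx]]
    if idx + 1 < len(sentences):
        tokens = sentences[idx + 1].split()
        if tokens and tokens[0].lower().strip(",.") in PRONOUNS:
            passage.append(sentences[idx + 1])
    return " ".join(passage)
```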
2.3 Candidate Extraction Using Syntactic Patterns

We extract answer candidates from the retrieved passages using the syntactic patterns shown in Table 1. In this study, we use the syntactic information generated by the Conexor FDG parser [8]. We extract noun phrases and verb phrases as answer candidates: noun phrases are extracted by the patterns noun phrases modifying the question target and copulas, and verb phrases are extracted by the patterns relative pronoun phrases, participle phrases, and general verb phrases. As relative pronoun phrases and copulas have reliable lexical clues, they could also be extracted easily by lexical patterns such as "TARGET (who|which|that) AP" and "TARGET (is|are|was|were) AP", where TARGET is the question target word and AP is the answer phrase. However, noun phrases modifying the question target, participle phrases, and general verb phrases have few reliable lexical clues. While some of them can be extracted by lexical patterns such as "AP such as TARGET" and "AP (known as|called) TARGET", general verb phrases can hardly be extracted by lexical patterns at all. The syntactic patterns are useful for extracting such phrases because they are easy to construct manually and can capture more general descriptions of the question target. Therefore, we extract phrases using the syntactic patterns in order to obtain high answer recall, and rank them using several criteria in order to obtain high precision. As the syntactic parser sometimes produces erroneous results, we complement the syntactic information with POS information.

We eliminate redundant answer candidates using word overlap and semantic class matching of the head word [7]. Although redundancy is a problem for composing a short, novel definition, redundant information is likely to be important, and redundancy has also been used as an effective ranking measure in a factoid question answering system [9]. Therefore, the redundant count of an eliminated candidate is inherited by the surviving one and is used in the candidate ranking phase.

2.4 Answer Selection

We use several criteria to rank answer candidates: head redundancy, term statistics in the relevant passages, external definitions, and definition terminology. The top ranked candidates are selected as the final answer.

2.4.1 Head Redundancy

Important facts or events are usually mentioned repeatedly, and the head word is the core of each answer candidate. We calculate the redundancy of answer candidate C as

  Rdd(C) = \exp(r / n) - 1    (1)

where r is the redundant count of answer candidate C in the candidate set and n is the total number of answer candidates. For most candidates, the fraction r/n is so much smaller than 1 that Rdd(C) rarely exceeds 1.

2.4.2 Local Term Statistics

In addition to the head word, frequent words in the retrieved passages are important. Loc(C) is a local score based on the term statistics in the retrieved sentences (i.e., local sentences) and is calculated as

  Loc(C) = \frac{1}{|C|} \sum_{t_i \in C} \frac{sf_i}{\max sf}    (2)

where sf_i is the number of sentences in which term t_i occurs, \max sf is the maximum value of sf over all terms, and |C| is the number of content words in the answer candidate C.
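A minimal sketch of the two scores in Eqs. (1) and (2). Only the formulas come from the text; the whitespace tokenization and the representation of a candidate as a list of content words are assumptions.

```python
import math
from collections import Counter

def rdd_score(redundant_count, total_candidates):
    """Head redundancy, Eq. (1): Rdd(C) = exp(r / n) - 1."""
    return math.exp(redundant_count / total_candidates) - 1.0

def loc_score(candidate_terms, passage_sentences):
    """Local term statistics, Eq. (2): the average of sf_i / max_sf over the
    candidate's content words, where sf_i counts the retrieved sentences
    containing term t_i."""
    sf = Counter()
    for sent in passage_sentences:
        for term in set(sent.lower().split()):
            sf[term] += 1
    if not candidate_terms or not sf:
        return 0.0
    max_sf = max(sf.values())
    return sum(sf[t.lower()] / max_sf for t in candidate_terms) / len(candidate_terms)
```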
Table 1   Syntactic patterns for extracting answer candidates.

Pattern: Noun phrases modifying the question target
Description: Noun phrases that have a direct syntactic relation to the question target.
Example: "Former world and Olympic champion Alberto Tomba missed out on the chance of his 50th World Cup win when he straddled a gate in the first run."

Pattern: Relative pronoun phrases
Description: Verb phrases in which a nominative or possessive relative pronoun directly modifies the question target.
Example: "Copland, who was born in Brooklyn, would have turned 100 on Nov. 14, 2000."

Pattern: Participle phrases
Description: Present or past participles, without their subject, directly modifying the question target or the main verb directly related to the question target.
Example: "Tomba, known as "La Bomba," (the bomb) for his explosive skiing style, had hinted at retirement for years, but always burst back on the scene to stun his rivals and savor another victory."

Pattern: Copulas
Description: Noun phrases used as a complement of the verb be.
Example: "TB is a bacterial disease caused by the Tuberculosis mycobacterium and transmitted through the air."

Pattern: General verb phrases
Description: Verb phrases directly modified by the question target, which is the subject of the sentence. If the head word of a phrase is a stop verb, the phrase is not extracted; stop verbs are uninformative functional verbs such as be, say, talk, and tell.
Example: "Iqra will initially broadcast eight hours a day of children's programs, game shows, soap operas, economic programs and religious talk shows."
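To make one of the patterns in Table 1 concrete, the sketch below matches the copula pattern on a generic dependency parse represented as (text, relation, head index) triples. The relation labels ("subj", "comp") and the data structure are hypothetical; the system itself works on Conexor FDG parser output, whose format differs.

```python
from typing import List, NamedTuple, Optional

class Token(NamedTuple):
    text: str
    rel: str              # dependency relation to the head (hypothetical label set)
    head: Optional[int]   # index of the head token, None for the root

def copula_complement(tokens: List[Token], target: str) -> Optional[str]:
    """Copula pattern of Table 1: if the target is the subject of a form of 'be',
    return the head word of the complement noun phrase as an answer candidate."""
    be_forms = {"is", "are", "was", "were"}
    for i, tok in enumerate(tokens):
        if tok.text.lower() not in be_forms:
            continue
        target_is_subject = any(
            t.head == i and t.rel == "subj" and t.text == target for t in tokens
        )
        if target_is_subject:
            for t in tokens:
                if t.head == i and t.rel == "comp":
                    return t.text
    return None

# With a parse of "TB is a bacterial disease ..." encoded in this toy format,
# copula_complement(tokens, "TB") returns the complement head "disease".
```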
2.4.3 External Definitions
Definitions of a question target extracted from external resources such as online dictionaries or encyclopedias are called external definitions. We use the external definitions through a scoring formula Ext(C) based on a probability model:

  Ext(C) = \log\left(\frac{P(C \mid A)}{P(C)} + 1\right)    (3)

where P(C | A) is the probability that C is one of the real answer nuggets A. Given that each external definition is also one of the real answer nuggets, we estimate the probability using the following external definition model:

  P(C \mid A) = \left(\prod_{t_i \in C} \frac{freqE_i}{|E|}\right)^{1/|C|}, \quad
  P(C) = \left(\prod_{t_i \in C} \frac{freqB_i}{|B|}\right)^{1/|C|}    (4)

where freqE_i is the number of occurrences of term t_i in the external definitions E, and |E| is the total number of term occurrences in E. The counts freqB_i and |B| in the background collection B correspond to freqE_i and |E| in the external definitions E. The exponent 1/|C| normalizes the probabilities for the candidate length.

2.4.4 Definition Terminology

Although external definitions are useful for ranking candidates, they obviously cannot cover all possible targets. In order to alleviate this problem, we devise a definition terminology score reflecting how definition-like a candidate phrase is. While [10] used a similar approach for ranking answer candidates, we identify the target type and build the definition terminology according to the type. To obtain the definition terminology, we collected external definitions for each of the three target types. We compare the term statistics in these definitions to those in general text, assuming that the difference in term statistics can serve as a measure of definition terminology:

  Tmn(C) = \frac{1}{|C|} \sum_{t_i \in C} \log\left(\frac{P_D(t_i)}{P(t_i)} + 1\right)    (5)

where P_D(t) and P(t) are the probabilities of term t in the definitions and in general text, respectively. The probability P(t) is estimated as in Eq. (4), except for the length normalization factor. The criteria mentioned so far are linearly combined into a score, and the final answer is composed of the top ranked candidates up to the length limit.
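A condensed sketch of the scores in Eqs. (3)–(5), followed by the linear combination mentioned above. The frequency tables are assumed to be plain term-count dictionaries, and the epsilon smoothing for unseen terms as well as the combination weights are added assumptions; the paper does not report either.

```python
import math

def _norm_prob(terms, freqs, total, eps=1e-9):
    """Length-normalized probability of Eq. (4): geometric mean of term probabilities."""
    if not terms:
        return eps
    log_sum = sum(math.log(max(freqs.get(t, 0), eps) / total) for t in terms)
    return math.exp(log_sum / len(terms))

def ext_score(terms, ext_freqs, ext_total, bg_freqs, bg_total):
    """External definition score, Eq. (3): log(P(C|A) / P(C) + 1)."""
    p_def = _norm_prob(terms, ext_freqs, ext_total)
    p_bg = _norm_prob(terms, bg_freqs, bg_total)
    return math.log(p_def / p_bg + 1.0)

def tmn_score(terms, def_probs, bg_probs, eps=1e-9):
    """Definition terminology score, Eq. (5): mean of log(P_D(t) / P(t) + 1)."""
    if not terms:
        return 0.0
    return sum(math.log(def_probs.get(t, eps) / max(bg_probs.get(t, eps), eps) + 1.0)
               for t in terms) / len(terms)

def candidate_score(rdd, loc, ext, tmn, weights=(1.0, 1.0, 1.0, 1.0)):
    """Linear combination of the four ranking criteria (weights are hypothetical)."""
    return weights[0] * rdd + weights[1] * loc + weights[2] * ext + weights[3] * tmn
```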
3. Experimental Results

3.1 Experimental Setup

We experimented with 50 TREC 2003 topics and 64 TREC 2004 topics, and the answers were searched for in the AQUAINT corpus. The TREC answer set for the definitional question answering task consists of several definition nuggets for each target, and each nugget is a short string similar to our phrases. Evaluating a system involves matching up the answer nuggets and the system output. As manual evaluation such as the TREC evaluation is costly, we evaluated the systems automatically. The evaluation of definition answers is very similar to that of summaries, so we used ROUGE [11], a package for the automatic evaluation of summaries. ROUGE has been used for the automatic evaluation of the summarization task in the Document Understanding Conference (DUC), and has been applied successfully to the evaluation of definitional question answering [3]. Among its several measures we used ROUGE-L, which is known to be highly correlated with human judgement:

  R_{lcs} = \frac{LCS(A, S)}{|A|}, \quad P_{lcs} = \frac{LCS(A, S)}{|S|}    (6)

  F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}    (7)

where LCS(A, S) is the length of the longest common subsequence of the reference answer A and the system result S, and |A| and |S| are their lengths, respectively. The LCS-based F-measure F_{lcs} is called ROUGE-L, and β controls the relative importance of recall and precision.

We used external definitions from various online sites: Biography.com, Columbia Encyclopedia, Wikipedia, FOLDOC, the American Heritage Dictionary of the English Language, Online Medical Dictionary, and Web pages returned by the Google search engine. The external definitions are collected at query time by submitting a query consisting of the head words of the question target to each site. In order to extract definition terminology, we also collected definitions according to the target type: 1,174 person, 545 organization, and 696 thing entries. The AQUAINT collection is used as the general text. We used our own document retrieval engine based on OKAPI BM25, and processed the top 200 retrieved documents in all experiments.
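A minimal LCS-based sketch of Eqs. (6) and (7), assuming whitespace tokenization; the actual evaluation uses the ROUGE package [11], which provides further preprocessing options.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, system_output, beta=1.0):
    """ROUGE-L F-measure, Eqs. (6) and (7): LCS-based recall, precision, and F."""
    ref_tokens, sys_tokens = reference.split(), system_output.split()
    lcs = lcs_length(ref_tokens, sys_tokens)
    if lcs == 0:
        return 0.0
    r = lcs / len(ref_tokens)
    p = lcs / len(sys_tokens)
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
```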
3.2 Syntactic Patterns

Figure 1 shows the performance of the system using each syntactic pattern with TREC 2003 topics, and Fig. 2 shows the performance with TREC 2004 topics. The performance is measured by the F-measure (β = 1) in Eq. (7) as a function of the answer length counted in non-white-space characters. As shown in the figures, noun phrases modifying the question target (ModNP) are the best single pattern, general verb phrases (GenVP) are better than the others, and copulas (Copular) are not consistent. When all the patterns are combined (All), the performance increases significantly.

Fig. 1   Comparison of phrase types. (TREC 2003)

Fig. 2   Comparison of phrase types. (TREC 2004)

3.3 Candidate and Answer Text Unit

Table 2 shows the system performance according to the answer length with TREC 2003 topics, and Table 3 shows the performance with TREC 2004 topics. The system name suffixes S.S, S.P, P.S, and P.P denote the text units used for the answer candidates and the final answer, where S and P indicate sentences and phrases, respectively. For example, 500.S.P is the system in which sentences are used as the candidates, phrases are used as the final answer, and the answer is limited to 500 non-white-space characters. All systems use the same question analysis, passage retrieval, and ranking measures, and differ only in the extraction unit for candidates and the final answer. If the total length of the extracted candidates is shorter than the length limit, the answer can be shorter than the limit.

Table 2   System performance according to answer length and text unit: sentences vs. phrases. (TREC 2003)

System      R       P       F (β = 1)  F (β = 3)
100.S.S     0.0505  0.1601  0.0739     0.0538
100.S.P     0.0633  0.1982  0.0900     0.0670
100.P.S     0.0483  0.1451  0.0705     0.0515
100.P.P     0.0795  0.2309  0.1121     0.0842
300.S.S     0.1601  0.1372  0.1379     0.1516
300.S.P     0.1542  0.1672  0.1340     0.1449
300.P.S     0.1231  0.1137  0.1123     0.1192
300.P.P     0.1721  0.1906  0.1553     0.1644
500.S.S     0.2202  0.1180  0.1445     0.1923
500.S.P     0.2009  0.1546  0.1418     0.1779
500.P.S     0.1716  0.0994  0.1198     0.1545
500.P.P     0.2221  0.1713  0.1592     0.1980
1000.S.S    0.3150  0.0866  0.1309     0.2355
1000.S.P    0.2387  0.1323  0.1277     0.1923
1000.P.S    0.2614  0.0853  0.1203     0.2005
1000.P.P    0.2544  0.1413  0.1391     0.2075
100.P+.P+   0.0811  0.2171  0.1128     0.0856
200.P+.P+   0.1354  0.1800  0.1469     0.1365
300.P+.P+   0.1767  0.1576  0.1578     0.1702
500.P+.P+   0.2565  0.1345  0.1661     0.2230
1000.P+.P+  0.3419  0.0948  0.1427     0.2557

Table 3   System performance according to answer length and text unit: sentences vs. phrases. (TREC 2004)

System      R       P       F (β = 1)  F (β = 3)
100.S.S     0.0598  0.2479  0.0947     0.0645
100.S.P     0.0582  0.2319  0.0916     0.0627
100.P.S     0.0404  0.1860  0.0655     0.0438
100.P.P     0.0769  0.2857  0.1185     0.0827
300.S.S     0.1379  0.1911  0.1565     0.1409
300.S.P     0.1231  0.1860  0.1412     0.1258
300.P.S     0.1044  0.1550  0.1215     0.1071
300.P.P     0.1369  0.2113  0.1569     0.1400
500.S.S     0.1916  0.1617  0.1717     0.1865
500.S.P     0.1621  0.1622  0.1495     0.1581
500.P.S     0.1589  0.1431  0.1455     0.1552
500.P.P     0.1726  0.1805  0.1603     0.1687
1000.S.S    0.2772  0.1195  0.1638     0.2414
1000.S.P    0.2179  0.1373  0.1443     0.1929
1000.P.S    0.2577  0.1222  0.1583     0.2263
1000.P.P    0.2187  0.1505  0.1494     0.1953
100.P+.P+   0.0654  0.2511  0.1023     0.0704
200.P+.P+   0.1109  0.2196  0.1447     0.1162
300.P+.P+   0.1444  0.1925  0.1620     0.1474
500.P+.P+   0.2005  0.1592  0.1738     0.1937
1000.P+.P+  0.2847  0.1176  0.1633     0.2457

With a short length limit, the phrase-based system P.P outperforms the sentence-based system S.S. As the length limit grows, the performance difference becomes smaller, and the ranking is reversed at a certain length: over 1000 bytes for TREC 2003 topics and over 300 bytes for TREC 2004 topics. Even when recall is considered three times as important as precision (F, β = 3), the observation is much the same. These results support our claim that phrases, being shorter than sentences, are useful for a concise answer: under a tight answer length limit, it is better to use phrases extracted with syntactic patterns rather than sentences.

The results also show that the phrase-phrase system P.P outperforms the sentence-phrase system S.P in all cases, and that the sentence-sentence system S.S likewise outperforms the phrase-sentence system P.S. That is, if the answer unit is a phrase, it is better to use phrases as
answer candidates, and the same is true for sentences. In other words, it is appropriate to use the same text unit for the answer candidates and the final answer.

The performance of the phrase-based system does not increase for long answers because the extracted phrases are not sufficient. When the set of retrieved sentences is very small, the phrase-based system does not perform well because of the insufficient number of answer candidates. In this case, we can use the extracted phrases together with those sentences from which no phrase was extracted. Indeed, the experimental results show that this combination, P+.P+, outperforms the purely phrase- or sentence-based systems.

3.4 TREC 2004 Evaluation Results

We participated in the TREC 2004 Question Answering Track with a preliminary system [7]. Our TREC 2004 system used the definition terminology only for persons, based on an encyclopedia, and did not use the external definition score. The evaluation result of our system was 0.246 in the F (β = 3) measure, compared to 0.184 for the median of all participating systems; ours was among the top 10 systems.

4. Conclusions

We proposed a definitional question answering system that extracts phrases using syntactic patterns, and studied the effect of the text units used for the answer candidates and the final answer on the performance of definitional question answering. Our findings can be summarized as follows:

• Phrases are likely to be a better processing unit than sentences for concise definitional question answering. For a long answer, it is appropriate to use phrases and sentences together.
• Noun phrases modifying the question target are the best single pattern for extracting phrases.
• It is better to use the same text unit for the answer candidates and the final answer.

Although the syntactic patterns are useful for extracting phrases that have few reliable lexical clues, they do not
have to be used for extracting phrases that have sufficiently reliable lexical clues, such as relative pronoun phrases. As future work, we plan to devise a method for extracting more diverse and accurate phrases by using the syntactic patterns and lexical patterns together.

References

[1] E.M. Voorhees, "Overview of the TREC 2003 question answering track," Proc. 12th Text Retrieval Conference (TREC-2003), pp.54–68, 2003.
[2] S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, J. Williams, and J. Bensley, "Answer mining by combining extraction techniques with abductive reasoning," Proc. 12th Text Retrieval Conference (TREC-2003), pp.375–382, 2003.
[3] J. Xu, R. Weischedel, and A. Licuanan, "Evaluation of an extraction-based approach to answering definitional questions," Proc. 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-2004), pp.418–424, 2004.
[4] S. Blair-Goldensohn, K.R. McKeown, and A.H. Schlaikjer, "A hybrid approach for QA track definitional questions," Proc. 12th Text Retrieval Conference (TREC-2003), pp.185–192, 2003.
[5] H. Cui, M.Y. Kan, T.S. Chua, and J. Xiao, "A comparative study on sentence retrieval for definitional question answering," SIGIR Workshop on Information Retrieval for Question Answering (IR4QA), 2004.
[6] H. Cui, M.Y. Kan, and T.S. Chua, "Unsupervised learning of soft patterns for generating definitions from online news," Proc. 13th International Conference on World Wide Web (WWW-2004), pp.90–99, 2004.
[7] K.S. Han, H. Chung, S.B. Kim, Y.I. Song, J.Y. Lee, and H.C. Rim, "Korea University question answering system at TREC 2004," Proc. 13th Text Retrieval Conference (TREC-2004), 2004.
[8] P. Tapanainen and T. Jarvinen, "A non-projective dependency parser," Proc. 5th Conference on Applied Natural Language Processing, pp.64–71, 1997.
[9] S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng, "Web question answering: Is more always better?," Proc. 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-2002), pp.291–298, 2002.
[10] A. Echihabi, U. Hermjakob, E. Hovy, D. Marcu, E. Melz, and D. Ravichandran, "Multiple-engine question answering in TextMap," Proc. 12th Text Retrieval Conference (TREC-2003), pp.772–781, 2003.
[11] C.Y. Lin, "ROUGE: A package for automatic evaluation of summaries," Proc. Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, 2004.