Mining the web to validate answers to natural language questions Bernardo Magnini, Matteo Negri, Roberto Prevete & Hristo Tanev ITC-Irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy

Abstract

Answer validation is the ability to automatically judge the relevance of a candidate answer with respect to a given question. This paper investigates answer validation following a data-driven approach that considers textual passages (i.e. snippets) retrieved from the Web as the main source of information. Snippets are analyzed in order to maximize the density of relevant keywords belonging both to the question and to the answer. Results obtained on a corpus of human-judged factoid question-answer pairs submitted by participants at TREC-2001 show a satisfactory success rate (86%). In addition, the efficiency of the methodology (documents are not downloaded) makes the approach suitable for integration as a module in the architecture of a question answering system.

1 Introduction

Textual Question Answering (QA) aims at identifying the answer to a question in large collections of documents. QA systems are presented with natural language questions and the expected output is either the actual answer identified in a text or small text fragments containing the answer. A common QA architecture includes three basic components: a question processing component, generally based on some kind of linguistic analysis; a search component, based on information retrieval techniques; an answer processing component, which exploits the similarity between the question and the documents to identify the correct answer.

Answer validation is a recent issue in QA, where open domain systems are often required to rank huge amounts of candidate answers. More precisely, given a question q and a candidate answer a, the answer validation task is defined as the capability to assess the relevance of a with respect to q. Several QA systems apply answer validation techniques with the goal of filtering out improper candidates by checking how relevant a candidate answer is with

respect to a given question [1]. Approaches to answer validation range from knowledge-intensive processing of documents to purely statistical methods based on Web search. Knowledge-based approaches deal mostly with semantic techniques using relations in a knowledge base (e.g. WordNet [2]) and rely on discovering semantic relations between the question and the answer [3]. Although well motivated, the use of semantic techniques on open domain tasks is quite expensive, both in terms of computational complexity and in terms of the linguistic resources involved. Moreover, these approaches are limited to the knowledge stored in a particular knowledge base.

Data-driven approaches, on the other hand, rely on unstructured textual information, whose main source is the Web. The idea of using the Web as a corpus is an emerging topic of interest in the computational linguistics community. The TREC-2001 QA track (http://trec.nist.gov/pubs/trec10/) demonstrated that Web redundancy can be exploited at different levels in the process of finding answers to natural language questions. Several studies (e.g. [4], [5]) suggest that the application of Web search can improve the precision of a QA system by 25-30%. A common feature of these approaches is the use of the Web to introduce data redundancy for more reliable answer extraction from local text collections. [6] suggests a probabilistic algorithm that learns the best query paraphrase of a question when searching the Web. Other approaches suggest training a QA system on the Web [7].

However, for data-driven approaches, a crucial problem is the amount of documents that need to be processed in order to extract significant information for answer validation. From this perspective, content-based approaches tend to download documents from the Web and then apply some passage selection filter to extract relevant text portions. More recently, [8] proposed a purely statistical approach to answer validation based on quantitative counts of retrieved documents, showing significant performance.

Building on those results, this paper investigates a challenging compromise between content-based and statistical methods to data-driven answer validation. The basic idea is to take advantage of the text passages (i.e. snippets) returned by some search engines (e.g. Google) as the result of a Web search. The main advantage is that there is no need to download documents, making the whole process suitable for real-time applications. We propose an answer validation procedure based on the following steps: (i) given a question and a candidate answer for that question, compute the set of representative keywords from both of them; this step is carried out using linguistic techniques, such as answer type identification (from the question) and named entity recognition (from the answer); (ii) from the extracted keywords compute a validation pattern and submit it as a query to Google; (iii) estimate an answer validity score considering the distance between keywords in the text snippets provided with the results.

The paper is structured as follows. Section 2 presents the main points of data-driven answer validation. Section 3 shows the linguistic processing necessary to build validation patterns from a question/answer pair. Section 4 provides explanations on the use of a Web search engine to obtain textual snippets and Section 5 gives details on how to calculate an answer relevance score from snippets. Results and discussion of experiments are reported in Section 6.

2 Data-driven answer validation

Given a question q and a candidate answer a, the answer validation task is defined as the capability to assess the relevance of a with respect to q. We assume open domain questions and that both answers and questions are texts composed of a few tokens (usually fewer than 100). This is compatible with the TREC-2001 data, which will be used as an example throughout this paper. We also assume the availability of the Web, considered to be the largest open domain text corpus, containing information about almost all areas of human knowledge.

As pointed out by [8], the answer validation task can be addressed considering the results of a Web search with validation patterns, i.e. queries containing different combinations of question and answer keywords. As an example, given the question "What is the capital of the USA?", the problem of validating the answer "Washington" can be tackled by searching the Web with the following validation patterns (the former is a simple AND composition, while the latter is built using the NEAR operator provided by AltaVista):

[capital AND USA AND Washington]
[capital NEAR USA NEAR Washington]

These queries result in ranked lists of documents in which the words "capital", "USA", and "Washington" co-occur. The co-occurrence of both question and answer keywords can be considered as a clue to the relevance of the answer with respect to the question. In the framework of a Web-based approach, validation patterns represent a general and effective way of dealing with both the huge quantity and variety of information connecting a question to its possible answers. Answers, in fact, may occur in text passages with low similarity to the question. Passages stating facts may use different syntactic constructions, are sometimes spread over more than one sentence, may reflect opinions and personal attitudes, and often use ellipsis and anaphora. For instance, we can find Web documents containing passages like those reported in Table 1.

Table 1. Text fragments found by Google where capital, USA, and Washington co-occur
1. …Capital Communications Group, LLC International Consulting 1250 24th Street, NW, Suite 350 Washington, DC 20037 USA tel 202-776
2. … way our nation's capital can! Let USA Hosts help create your next great program! Click here to send an immediate message to USA Hosts in Washington DC. Click …
3. Washington, DC and the Capital Region USA Annual Report for FY ...
4. District of Columbia (Washington, DC, the capital of the USA)
5. … Europe Washington DC, USA Washington DC is the capital of USA

These passages convey something different from the explicit information expressed by statements like “The capital of the USA is Washington”. Nevertheless, such implicit information contains a significant amount of knowledge about the relations between the question and the answer. Validation patterns cover a large portion of text fragments, including those lexically similar to the question and the answer (e.g. fragments 4 and 5 in Table 1) and also those that are not similar (e.g. fragment 2 in Table 1). A useful feature of validation patterns is that when we search for them on the Internet, we capture both explicit and implicit information, thus obtaining many hits. In contrast, the straightforward search for explicit statements connecting question and answer keywords generally results in a small number of documents (if any).
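As a simple illustration of how such validation patterns might be composed from question and answer keywords, consider the sketch below. The AND/NEAR query syntax mirrors the examples given above (the NEAR operator was specific to AltaVista), and the keyword lists are assumed to have been extracted already; this is a minimal sketch, not the actual implementation.

```python
def and_pattern(question_keywords, answer_keywords):
    """Compose a plain AND validation pattern from question and answer keywords."""
    return "[" + " AND ".join(question_keywords + answer_keywords) + "]"

def near_pattern(question_keywords, answer_keywords):
    """Compose a proximity validation pattern using the NEAR operator (AltaVista)."""
    return "[" + " NEAR ".join(question_keywords + answer_keywords) + "]"

# Example from the text: "What is the capital of the USA?" / "Washington"
print(and_pattern(["capital", "USA"], ["Washington"]))
# [capital AND USA AND Washington]
print(near_pattern(["capital", "USA"], ["Washington"]))
# [capital NEAR USA NEAR Washington]
```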

3 Extracting validation patterns

In our approach a validation pattern consists of two components: a question sub-pattern (Qsp) and an answer sub-pattern (Asp).

3.1 Building the Qsp

A Qsp is derived from the input question by cutting off non-content words with a stop-words filter. The remaining words are expanded with morphological forms in order to maximize the recall of retrieved documents. Morphological forms are added to the Qsp and placed together in an OR clause. The following example illustrates how the Qsp is constructed. Given the TREC-2001 question "When did Elvis Presley die?", the stop-words filter removes "When" and "did" from the input. Then a limited number of morphological forms (two in this version of the experiment) for the verbs are added to the Qsp. The resulting Qsp will be the following:

[Elvis AND Presley AND (die OR died)]

3.2 Building the Asp

In our experiments with Google we addressed only questions which require a named entity as an answer (i.e. factoid questions). For these types of questions the exact answer candidate can be extracted from the answer string with simple named entity recognition. We excluded from our experiments both definition and generic questions, because for these question types it is not clear which part of the answer string is the exact answer. An Asp is constructed in two steps. First, the answer type of the question is identified considering both morpho-syntactic features (a part of speech tagger is used to process the question) and semantic features (by means of semantic predicates defined on the WordNet taxonomy). Possible named entity answer types are DATE, MEASURE, PERSON, and LOCATION. In the second step, a rule-based named entity recognition module identifies in the answer string all the named entities matching the answer type category. If the

category corresponds to a named entity, an Asp is created for each selected named entity. For instance, given the question "When did Elvis Presley die?" and the candidate answer "though died in 1977 of course some fans maintain", since the answer type category is DATE the named entity recognition module will select [1977] as an answer sub-pattern.
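As a rough illustration of how Qsp and Asp construction might look, consider the following sketch. It is not the system described in this paper: the stop-word list, the tiny verb-form table standing in for a real morphological component, and the regular expression standing in for the rule-based named entity recognizer are all simplifying assumptions.

```python
import re

# Simplifying assumptions: a hand-picked stop-word list and a tiny verb-form
# table instead of a real morphological analyzer.
STOP_WORDS = {"when", "did", "what", "who", "where", "is", "the", "a", "an", "of", "in"}
VERB_FORMS = {"die": ["die", "died"], "become": ["become", "became"]}

def build_qsp(question: str) -> str:
    """Build a question sub-pattern: drop stop-words, OR-expand verb forms."""
    terms = []
    for word in re.findall(r"\w+", question):
        if word.lower() in STOP_WORDS:
            continue
        forms = VERB_FORMS.get(word.lower())
        terms.append("(" + " OR ".join(forms) + ")" if forms else word)
    return "[" + " AND ".join(terms) + "]"

# Regular-expression stand-in for the rule-based named entity recognizer,
# keyed by answer type (only DATE is sketched here).
ANSWER_TYPE_PATTERNS = {"DATE": r"\b1[0-9]{3}\b|\b20[0-9]{2}\b"}

def build_asps(answer_string: str, answer_type: str) -> list:
    """Extract an answer sub-pattern for each named entity matching the answer type."""
    return re.findall(ANSWER_TYPE_PATTERNS[answer_type], answer_string)

print(build_qsp("When did Elvis Presley die?"))
# [Elvis AND Presley AND (die OR died)]
print(build_asps("though died in 1977 of course some fans maintain", "DATE"))
# ['1977']
```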

4 Exploring Web redundancy with Google

For our experiments we have chosen Google (http://www.google.com) as a search engine on the Web because it has a large document index, is very fast, supports boolean expressions and allows for the extraction of co-occurrence snippets. Google was founded in 1997 by Sergey Brin and Larry Page at Stanford University. Its architecture is optimized for high speed performance and very large scale search [9]. Currently its document index includes over two billion pages and represents a significant part of the Web. These pages, mirrored on the Google servers for faster access, include not only HTML documents, but also PDF, Microsoft Office, PostScript, Corel WordPerfect, Lotus 1-2-3, etc.

One of the most outstanding features of Google is its page ranking algorithm PageRank™ [9], which makes intensive use of the hypertext graph structure of the Web. PageRank™ classifies pages according to the amount and authority of the links which refer to them. The hypertext structure is also exploited by considering the text of the links: when a document is indexed, the text of the links in other pages that point to that document is also considered as a part of the document text itself. When searching for documents relevant to a query, the ranking algorithm takes into account not only the page rank, the frequency and the position of the query terms, but also their font and their capitalization. Moreover, pages where the query terms appear close to each other are considered more relevant.

Another feature which facilitates co-occurrence identification is snippet extraction. When a query is submitted, the search engine provides surrounding context for the first occurrences of a query term in the returned pages. When two or more keywords co-occur near each other, they are included in one co-occurrence fragment. Our experiments with the search engine show that Google prefers to extract snippets where close co-occurrences take place, ignoring passages where only one keyword appears. Our validation algorithm makes extensive use of these context snippets to identify co-occurrences between the answer and the question keywords.

4.1 Querying Google

We use a Web-mining algorithm to retrieve text fragments where keywords from a question q and its candidate answer a are found together.

4.1.1 Query composition

Google accepts queries of no more than 10 key terms, therefore we build validation patterns (see Section 3) of a maximum of ten words. If necessary we modify the validation pattern by ignoring morphological forms and question

keywords. Pattern modification is performed using word-ignoring rules applied in a specified order. Such rules, for instance, ignore the focus of the question, words which belong to less significant parts of speech, or words which belong to specific WordNet classes that we believe are less important. Next, we submit to Google a combined list of keywords from the question and the answer. In this way, we aim at obtaining co-occurrence text fragments where the answer appears in proximity to all or to a subset of the question keywords. As an example, consider the question-answer pair:

Question: When did Idaho become a state?
Answer: In 1890, Idaho became the 43rd state of the Union

First, keywords are extracted from the question, which results in the following keyword list: Idaho, become, and state. Next, the past form of the verb become is added to the keyword list and the question pattern is transformed into a Google query:

[Idaho AND (become OR became) AND state]

Finally, we add the candidate answer 1890, building the query:

[Idaho AND (become OR became) AND state AND 1890]

The search for this expression on the Web catches all the pages where the words Idaho, state, 1890 and one of the forms of become are present.

4.1.2 Using text snippets

Google gives highest ranking to documents whose query terms co-occur closely [9]. For the experiment reported in this paper we set the number of results per page to 100, and we analyze only the first result page. Each document is provided with a list of text snippets in which the terms of the query appear. Instead of downloading the documents, we analyze the text snippets. This significantly increases the processing speed, making the total time of analysis for a [q, a] pair about one second. For the above mentioned validation pattern Google returns about 11,100 hits. The snippet lists for the first three hits are presented in Table 2. All the snippets are separated by an ellipsis symbol "..." and each snippet contains at least one query term. As stated before, if the keywords are found in proximity to each other, they are provided with a common context. In this way we obtain a list of contexts where the answer and question keywords appear together. Our hypothesis is that the closer the distance between the candidate answer a and the question keywords, the stronger their relation is. We consider the co-occurrences of a and of the question keywords as a clue to the consistency of the validation pattern and, as a consequence, to the relevance of the candidate answer.

Table 2. Snippets for the top three ranked documents obtained with the query [Idaho AND (become OR became) AND state AND 1890]
1. ...This, too, was controversial and was redrawn several times. Nevertheless, it was used until Idaho became a state in 1890. STATE SEAL NOW IN USE. ...
2. …Idaho State Symbols. ... Entered Union: July 3, 1890, 43rd state to ... the University of Idaho, wrote the verse which became the chorus of the Idaho State...
3. …Legislature on January 11, 1866. This, too, was controversial and was redrawn several times. Nevertheless, it was used until Idaho became a state in 1890. …

For example, the first snippet in Table 2 includes both the answer and the question keywords. The keyword density in this co-occurrence is very high: if we take out the stop words, the distance between the keywords becomes zero. In fact, the text snippet contains an explicit statement: "Idaho became a state in 1890." We carried out some manual experiments and concluded that Google usually ranks pages containing high-density co-occurrences, such as explicit statements, at the top of the document list. Moreover, if there is a close co-occurrence of keywords in a document, Google usually provides at least one snippet for it. Therefore, we do not lose significant information about co-occurrences when substituting document analysis with snippet processing. The validation algorithm extracts the snippets by parsing the HTML code of the first Google results page. The trigger HTML sequence marking the beginning or end of a snippet is the one rendering the ellipsis symbol "...". Identification of the beginning and of the end of the document list on the page is performed using other HTML tag sequences.
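As an illustration only, the two steps just described (composing a query under the ten-term limit, Section 4.1.1, and splitting the returned snippet region on the ellipsis separator, Section 4.1.2) might be sketched as follows. The simplified word-ignoring rule (dropping trailing question clauses), the generic HTML-stripping step, and the assumption that the snippet region has already been isolated from the results page are simplifications that do not reproduce the actual implementation.

```python
import re

MAX_GOOGLE_TERMS = 10  # Google accepted at most ten key terms per query

def compose_query(question_clauses, answer_term):
    """Join question clauses and the candidate answer into one boolean query,
    trimming question clauses if the ten-term limit would be exceeded.
    A clause is either a single word or a tuple of OR-ed morphological forms."""
    clauses = list(question_clauses)

    def n_terms(cs):
        return sum(len(c) if isinstance(c, tuple) else 1 for c in cs)

    # Simplified word-ignoring rule: drop question clauses from the end.
    while clauses and n_terms(clauses) + 1 > MAX_GOOGLE_TERMS:
        clauses.pop()

    parts = ["(" + " OR ".join(c) + ")" if isinstance(c, tuple) else c
             for c in clauses] + [answer_term]
    return "[" + " AND ".join(parts) + "]"

def extract_snippets(snippet_region_html):
    """Split the snippet region of one result entry into individual snippets.
    Assumes the region has already been cut out of the results page; snippets
    are separated by the ellipsis symbol '...'."""
    text = re.sub(r"<[^>]+>", "", snippet_region_html)   # strip residual tags
    return [s.strip() for s in text.split("...") if s.strip()]

# Example from the text: "When did Idaho become a state?" / "1890"
print(compose_query(["Idaho", ("become", "became"), "state"], "1890"))
# [Idaho AND (become OR became) AND state AND 1890]

region = ("...Nevertheless, it was used until <b>Idaho</b> <b>became</b> a "
          "<b>state</b> in <b>1890</b>. STATE SEAL NOW IN USE. ...")
print(extract_snippets(region))
```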

5 Computing answer relevance

The validation procedure identifies the text snippets and evaluates them to obtain an answer relevance score. Every appearance of a candidate answer a in a snippet is evaluated by calculating a co-occurrence weight, based on the number of question keywords present and their distance from a. If the answer a co-occurs with a set of question keywords QK = {qk_1, qk_2, ...}, the co-occurrence weight CW(a, QK) is calculated by means of the following formula:

CW(a, QK) = \sum_i w(qk_i) \, (\|qk_i, a\| + 1)^{-1}

where w(qk_i) is the weight of the question keyword qk_i. In general, w(qk_i) can be calculated from the keyword frequency; however, in our experiment we used equal weights for all the words. In the CW formula we denote by \|qk_i, a\| the distance between the answer a and the closest appearance of qk_i. The distance is measured as the number of non-stop, non-keyword words between a and qk_i. The more question keywords occur in proximity to a, the higher the CW is. The CW also depends on the distance between a and the question terms. If we denote with S_a the set of text snippets where a appears in the first one hundred documents, the Answer Relevance Score ARS(a) is calculated in the following way:

ARS(a) = \sum_{QK \in S_a} CW(a, QK)
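A direct transcription of these two formulas might look like the following sketch. Tokenization, stop-word handling, and keyword matching are all simplified here (exact lowercase matching, a small ad-hoc stop-word list), and equal keyword weights are used, as in the experiment reported below.

```python
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "was", "is", "it", "and", "until"}

def distance(tokens, answer, keyword, all_keywords):
    """Number of non-stop, non-keyword tokens between the answer and the
    closest occurrence of the given question keyword, or None if either
    the answer or the keyword is missing from the snippet."""
    skip = {answer.lower()} | {k.lower() for k in all_keywords}
    a_pos = [i for i, t in enumerate(tokens) if t == answer.lower()]
    k_pos = [i for i, t in enumerate(tokens) if t == keyword.lower()]
    if not a_pos or not k_pos:
        return None
    best = None
    for i in a_pos:
        for j in k_pos:
            lo, hi = sorted((i, j))
            gap = [t for t in tokens[lo + 1:hi]
                   if t not in STOP_WORDS and t not in skip]
            best = len(gap) if best is None else min(best, len(gap))
    return best

def cw(answer, question_keywords, snippet, weight=1.0):
    """Co-occurrence weight: CW(a, QK) = sum_i w(qk_i) / (||qk_i, a|| + 1)."""
    tokens = [t.lower().strip(".,;:!?\"'") for t in snippet.split()]
    total = 0.0
    for qk in question_keywords:
        d = distance(tokens, answer, qk, question_keywords)
        if d is not None:
            total += weight / (d + 1)
    return total

def ars(answer, question_keywords, snippets):
    """Answer Relevance Score: sum of CW over the snippets containing the answer."""
    return sum(cw(answer, question_keywords, s) for s in snippets
               if answer.lower() in s.lower())

snippet = "Nevertheless, it was used until Idaho became a state in 1890."
print(cw("1890", ["Idaho", "became", "state"], snippet))  # 3.0 (all distances zero)
```

Under these simplifications, the first snippet in Table 2 yields a weight of 3.0 for the answer 1890, reflecting the zero keyword distances discussed above.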

By using a sum of the [q, a] co-occurrence weights we stress the number of co-occurrences. In this way we consider more important the patterns which appear with higher frequency in the top 100 most relevant documents retrieved by Google.

Table 3. Performance of the answer validation algorithm on different types of factoid questions

Question type   #q    #qa pairs   Baseline P   Baseline SR   P      R      SR     +%SR
DATE            52    282         60.0         70.2          91.3   92.0   92.6   22.4
MEASURE         61    293         61.7         76.1          82.5   56.6   78.5    2.4
PERSON          56    323         52.5         56.7          91.9   80.0   87.0   30.3
LOCATION        80    447         52.4         57.9          87.3   82.6   86.4   28.5
total           249   1345        55.5         64.2          88.6   79.0   86.1   21.9

(#q and #qa pairs describe the test set; P, R and SR in the last four columns refer to the validation algorithm; +%SR is the success rate improvement over the baseline.)

Answers which result in an ARS lower than a certain threshold are discarded. The remaining answers are sorted by ARS and the answer with the maximal score is judged correct, along with all the answers which have a similar score.
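The final selection step might be sketched as follows; the threshold value and the tolerance used to decide which scores count as "similar" to the maximum are illustrative assumptions, since the paper does not report the actual values.

```python
def select_answers(scored_answers, threshold, tolerance=0.1):
    """Discard answers scoring below the threshold, then judge correct the
    answer with the maximal ARS together with every answer whose score is
    within a relative tolerance of that maximum.

    scored_answers: list of (answer, ars) pairs.
    """
    kept = [(a, s) for a, s in scored_answers if s >= threshold]
    if not kept:
        return []
    best = max(s for _, s in kept)
    return [a for a, s in kept if s >= best * (1.0 - tolerance)]

# Illustrative scores, not taken from the experiments:
candidates = [("1890", 41.7), ("1889", 3.2), ("July 3, 1890", 39.5)]
print(select_answers(candidates, threshold=5.0))
# ['1890', 'July 3, 1890']
```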

6 Experiment and discussion

Test set. We carried out the evaluation of our answer validation algorithm using the question-answer file of the main QA track in TREC-2001. In this file all the questions from the main QA track are listed along with the corresponding correct and incorrect answers provided by the participants. We considered only the questions which our system identifies as named entity questions. Our question processing algorithm identified 249 such questions. For each of them we included in the test set a maximum of three correct and three incorrect answers. This resulted in a corpus of 1345 question-answer pairs (some questions have fewer than three correct answers).

Results. Table 3 reports the results of the automatic answer validation experiments in terms of precision (P), recall (R), success rate (SR) and success rate improvement with respect to the baseline. Success rate best represents the performance of the system, being the percentage of [q, a] pairs where the result given by the system is the same as the TREC judges' opinion. Precision is the percentage of [q, a] pairs estimated by the algorithm as relevant, for which the opinion of TREC

judges was the same. Recall shows the percentage of the relevant answers which the system also evaluates as relevant. Success rate improvement shows the increase of the validation algorithm's success rate with respect to the baseline model.

Baseline. A baseline for the answer validation experiment was defined by judging correct all the answer strings which include at least one named entity corresponding to the question type. Baseline results are also reported in Table 3. The recall of the baseline is 100% because all the correct answers contain at least one named entity of the question type. Therefore, in Table 3 we report only precision and success rate for the baseline algorithm.

Question types. We wanted to check the performance variation across different types of TREC-2001 questions. We carried out five evaluations: for questions which ask for DATE, for MEASURE, for PERSON, for LOCATION, and for the full set of named entity questions. For each question type Table 3 reports the number of questions and the number of the corresponding [q, a] pairs. The overall results in the last row of Table 3 show a success rate of 86.1%, i.e. in 86.1% of the [q, a] pairs the system evaluation corresponds to the human evaluation, which confirms the initial working hypothesis. This is 22% over the overall baseline success rate. The success rate of our content-based algorithm is similar to the success rate of the data-driven statistical approach described in [8]. This might be considered evidence for the supposition that, in general, for such data-driven approaches the data plays a more significant role than the algorithm. Although using different search engines, our approach and [8] use the Web as the main data source, and therefore share a lot of common data. Our algorithm achieves its highest performance for the category DATE (92.6%), for which the baseline is also high (70.2%), with an improvement of 22.4%. For the categories PERSON and LOCATION the system demonstrates good performance (87.0% and 86.4% respectively). Moreover, for these named entities the validation algorithm exceeds the baseline by 30.3% and 28.5% respectively. The MEASURE category shows the lowest success rate (78.5%), only 2.4% above the corresponding baseline success rate. These answers present several difficulties: measures are often given in different units, and, to make things more difficult, different texts report numbers with different precision. These obstacles could be overcome by comparing numbers and measure units with more intelligent algorithms than a simple string match.
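For completeness, the three evaluation measures defined at the beginning of this section can be computed directly from the system's decisions and the TREC judgements. A small sketch, assuming both are given as parallel boolean lists over the [q, a] pairs:

```python
def evaluate(system_relevant, trec_relevant):
    """Precision, recall and success rate as defined in the text.

    system_relevant, trec_relevant: parallel lists of booleans, one per [q, a] pair.
    """
    pairs = list(zip(system_relevant, trec_relevant))
    success_rate = sum(1 for s, t in pairs if s == t) / len(pairs)
    sys_pos = [t for s, t in pairs if s]    # pairs the system judged relevant
    trec_pos = [s for s, t in pairs if t]   # pairs the TREC judges deemed relevant
    precision = sum(sys_pos) / len(sys_pos) if sys_pos else 0.0
    recall = sum(trec_pos) / len(trec_pos) if trec_pos else 0.0
    return precision, recall, success_rate

# Tiny illustrative example (not the TREC data):
print(evaluate([True, True, False, False], [True, False, False, True]))
# (0.5, 0.5, 0.5)
```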

7 Conclusions and Future Work

We have presented a data-driven, content-based approach to answer validation which estimates the degree of correlation between the question and the candidate answer by exploiting the redundancy of textual data on the Web. The approach is based on techniques described earlier in [8]. However, we introduce a new algorithm which considers the textual context of the question-answer co-occurrences. In particular, our approach estimates the distance between the candidate answer and the question keywords. We use Google, which provides

document snippets where the keywords are found together. This allows for effective co-occurrence mining without downloading documents. Results obtained on the TREC-2001 QA corpus correlate well with the human assessment of answer correctness and confirm that a Web-based algorithm provides a workable solution for answer validation.

We plan several activities in the near future. First, our approach is content-based and as a result it considers a small number of documents found on the Web. In contrast, statistical approaches ([10], [8]) consider the number of documents obtained for a query, thus collecting "votes" from the entire Web. To increase the precision of our content-based approach we intend to integrate it with statistical procedures which take into account the number of pages where [q, a] co-occur. The combined use of more than one search engine is another option which we take into consideration. Next, the validation algorithm presented in this paper will be integrated into the architecture of our QA system under development. The system will be designed to maximize the recall of retrieved candidate answers. Instead of performing a deep linguistic analysis of the answer passages, the system will delegate the selection of the right answer to the answer validation component.

References

[1] Harabagiu, S., Pasca, M. & Maiorano, S., Experiments with Open-Domain Textual Question Answering. Proc. of COLING-2000, Saarbrücken, Germany, pp. 292-298, 2000.
[2] Fellbaum, C., WordNet: An Electronic Lexical Database, The MIT Press, 1998.
[3] Harabagiu, S. & Maiorano, S., Finding Answers in Large Collections of Texts: Paragraph Indexing + Abductive Inference. Proc. of the AAAI Fall Symposium on Question Answering Systems, pp. 63-71, 1999.
[4] Clarke, C., Cormack, G., Lynam, T., Li, C. & McLearn, G., Web Reinforced Question Answering (MultiText Experiments for TREC 2001). Proc. of the TREC-10 Conference, Gaithersburg, MD, pp. 620-626, 2001.
[5] Brill, E., Lin, J., Banko, M., Dumais, S. & Ng, A., Data Intensive Question Answering. Proc. of the TREC-10 Conference, Gaithersburg, MD, 2001.
[6] Radev, D. R., Qi, H., Zheng, Z., Blair-Goldensohn, S., Zhang, Z., Fan, W. & Prager, J., Mining the Web for Answers to Natural Language Questions. Proc. of the 2001 ACM CIKM, Atlanta, Georgia, USA, pp. 143-150, 2001.
[7] Mann, G. S., A Statistical Method for Short Answer Extraction. Proc. of the ACL-2001 Workshop on Open-Domain Question Answering, Toulouse, France, 2001.
[8] Magnini, B., Negri, M., Prevete, R. & Tanev, H., Is It the Right Answer? Exploiting Web Redundancy for Answer Validation. To be published in Proc. of ACL-02, 2002.
[9] Brin, S. & Page, L., The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proc. of the 7th International World Wide Web Conference, Brisbane, Australia, 1998.
[10] Turney, P. D., Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. Proc. of ECML-2001, Freiburg, Germany, pp. 491-502, 2001.