Answer-Driven Reasoning as a Framework for Intelligent Information Access

Matteo Negri, Bernardo Magnini and Hristo Tanev

ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Via Sommarive, 38050 Povo (TN), Italy
{negri|magnini|tanev}@irst.itc.it

Abstract. This paper presents a general overview of answer-driven reasoning, which may serve as a common framework for a variety of Information Retrieval related activities. The underlying assumption is that a piece of information in a given document is as relevant as the extent to which such information represents the answer to a possible user’s question. The approach relies on the application of answer procedures to source data (e.g. a text collection) to mine question/answer correlations. The outcome can be exploited for different purposes, ranging from the creation of a Q/A index specifically targeted to intelligent Information Retrieval, Question Answering and Summarization, to the augmentation of existing ontologies in the Semantic Web scenario.

1 Introduction

State of the art information access technologies still fall short of providing a reliable framework for intelligent information search. Though NLP has developed rapidly in the last few years, most information retrieval engines rely on simple bag-of-words techniques, which retrieve a document considering only the presence of the query keywords in the text, their proximity, and a few other non-linguistic characteristics. The extreme paradox inherent to bag-of-words approaches is that, given a question, the top-scored document retrieved is the one that contains the question itself, whether or not the answer is actually provided. Experienced Web users may be able to circumvent the problem by predicting the ways in which the correct answer is likely to appear within a document. For instance, if we want to know “what is an atom?”, the best solution is to query Google with exact strings such as [“an atom is”] or [“an atom is defined as”], instead of searching the Web with the question [What is an atom?]. The challenging goal of automating this Web expert’s trick of searching for possible answer patterns is at the origin of the answer-driven approach to information access described in this paper. The underlying assumption is that a piece of information in a given document is as relevant as the extent to which it represents the answer to a possible user’s question. Within this framework, we investigate the possibility of developing automatic techniques to:

• pinpoint those document fragments which represent answers to possible users’ questions;

• capture the correlation between such fragments and a taxonomy of question types;

• store question/answer correlations in a Q/A conceptual index ready to be accessed in different application scenarios.
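The Web expert’s reformulation trick described above can be sketched in a few lines. This is an illustrative sketch, not from the paper: the function name and the two query templates are assumptions.

```python
# Sketch (illustrative, not the paper's method): turn a definition question
# into the exact-phrase "answer pattern" queries an experienced user would try.
import re

def answer_pattern_queries(question):
    """For a 'What is an X?' question, emit exact-string queries that a
    document containing the correct answer is likely to match."""
    m = re.match(r"what is (an? .+?)\??$", question.strip(), re.IGNORECASE)
    if not m:
        return [question]  # fall back to the raw question
    noun_phrase = m.group(1).lower()
    return [f'"{noun_phrase} is"', f'"{noun_phrase} is defined as"']

print(answer_pattern_queries("What is an atom?"))
# ['"an atom is"', '"an atom is defined as"']
```

The point of the sketch is only that the declarative forms of the answer, not the interrogative form of the question, are what should reach the search engine.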

2 Answer-Driven Reasoning

The use of answer patterns is motivated by the general assumption that, given a particular question, there exists a certain set of schemas in which a correct answer for that question can be formulated and presented. While many Question Answering systems make implicit use of answer patterns as a supporting resource, their application as a core technology has recently been exploited in TREC competitions (http://trec.nist.gov), showing that reliable systems can be developed upon pattern repositories built solely by human effort (Soubbotin and Soubbotin, 2001). However, manual coding of answer patterns is likely to become infeasible when unrestricted question types and application domains are considered. A first attempt to develop an automated surface pattern acquisition process for Question Answering was made by Ravichandran and Hovy (2002), showing that obtaining surface answer patterns is possible, provided enough question/answer pairs are available for proper training. Starting from the Question Answering perspective (i.e. retrieve “answers” rather than “documents”), our aim is to combine the most recent advances in Natural Language Processing, Machine Learning, and Knowledge Representation to develop effective and robust answer-driven technologies for information access. The key element of answer-driven reasoning is represented by answer procedures, which serve as methods for mining question/answer correlations within an input text. Answer procedures can be seen as sets of instructions for identifying answers to possible users’ questions concerning the content of an input document. More concretely, by running answer procedures over source data (e.g. a document collection), text portions that can provide answers to sensible users’ questions will be marked as relevant and associated with a detailed taxonomy of question types by means of explicit XML structural annotations.
For instance, supposing that we find in a text “Mozart (1756 – 1791)”, the application of answer procedures (e.g. the set of answer procedures for birth/death dates) will result in a tagged text, associated with the class of questions asking for birth/death dates, where Mozart has been recognized as the name of a person whose birth and death dates are 1756 and 1791 respectively. Provided the availability of more sophisticated answer procedures, the same text portion will also be associated with other sensible users’ questions such as “How old was X when he died?” or “How old was X in DATE?”. Answer procedures range from the application of superficial patterns to more complex knowledge-based procedures involving advanced linguistic analysis and reasoning mechanisms (coreference resolution, spatial and temporal reasoning, etc.). In the case of the example above, the textual pattern “X (Y1-Y2)” (where X stands for a capitalized word, while Y1 and Y2 stand for sequences of 4 digits) matches the input text, establishing a correlation between “Mozart”, “1756”, and the question “When was Mozart born?”. Advanced linguistic analysis is necessary to infer the time when Mozart appeared on stage (or how old

Mozart was in 1763) from the following text portion: “Wolfgang Amadeus Mozart (1756-1791). Child prodigy Mozart made his first public appearance, at six years of age, in a harpsichord and piano concert tour of Munich and Vienna. One year later (1763) his first published composition was distributed in Paris”. In addition, moving from pure texts to semi-structured source data, such as tables and conceptual hierarchies, will necessitate the design of specialized answer procedures. For instance, specific answer procedures could be applied to gather the same kind of information sought above from a music time line, or from a table of historical data about the arts, rather than from a text. Finally, ad hoc procedures can be automatically acquired to form answers to complex questions by combining multiple information sources.
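A minimal version of the “X (Y1-Y2)” answer procedure can be sketched as follows. This is a hedged illustration, not the paper’s implementation: the function name, the QType labels in the comments, and the use of a plain regular expression instead of the full procedure machinery are all assumptions.

```python
# Sketch of a surface answer procedure for the "X (Y1-Y2)" pattern:
# X is a capitalized word, Y1 and Y2 are 4-digit sequences.
import re

PATTERN = re.compile(r"([A-Z][a-z]+)\s*\(\s*(\d{4})\s*[-–]\s*(\d{4})\s*\)")

def birth_death_procedure(text):
    """Return (name, birth, death) correlations; each one answers the
    question classes 'When was X born?' and 'When did X die?'."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in PATTERN.finditer(text)]

hits = birth_death_procedure("Wolfgang Amadeus Mozart (1756 - 1791). ...")
name, born, died = hits[0]

# A slightly deeper procedure can derive answers the text never states:
age_at_death = died - born   # "How old was X when he died?"
age_in_1763 = 1763 - born    # "How old was X in 1763?"
print(name, born, died, age_at_death, age_in_1763)
# Mozart 1756 1791 35 7
```

The last two lines illustrate the point made above: simple temporal arithmetic over the matched dates yields answers (Mozart was 7 in 1763) that no surface pattern alone could extract.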

3 Answer Procedures Acquisition and Abstraction

Answer procedures are first automatically acquired from the Web in the form of surface patterns, then generalized using robust Natural Language Processing, automated reasoning, and machine learning techniques, and finally stored in a common infrastructure (the answer procedures library). This acquisition/abstraction process is depicted in Figure 1.

Figure 1: Answer procedures acquisition and abstraction. (Diagram: Web documents, a QType taxonomy with a question/answer pairs repository, and Semantic Web ontologies feed an abstraction step, supported by linguistic and semantic processors and by temporal and spatial reasoning; the resulting answer procedures are stored in the answer procedures library.)

The automatic acquisition of surface patterns follows methods such as the one proposed by Ravichandran and Hovy (2002), which assumes that repeated word sequences containing both the question keywords and the answer are evidence for useful answer phrases. The basic idea is to download from the Web, for each QType and the corresponding examples in a question/answer pairs repository, a huge database of sentences containing both the question keywords and the answer. These sentences are then processed by applying efficient data structures for string

matching (e.g. suffix trees) in order to extract the most frequent substrings containing both the question keywords and the answer. Once these substrings have been extracted, the corresponding textual answer patterns are created by replacing the words of the question with the tag <NAME>, and the answer with the tag <ANSWER>. As an example, consider the class of questions asking for birth dates and the question/answer pairs [Mozart; 1756] and [Guglielmo Marconi; 1874]. Since expressions like “Mozart (1756-1791)”, “Mozart was born in 1756”, “Guglielmo Marconi (1874-1937)”, and “Guglielmo Marconi was born in 1874” will probably be among the most frequent substrings retrieved from the Web, the resulting surface text answer patterns will be “<NAME> (<ANSWER>-” and “<NAME> was born in <ANSWER>”. However, answer procedures based on surface text patterns are not general enough to be applied on a large scale. While some of them can be used in the answer-driven reasoning approach as they are (e.g. “<NAME> was born in <ANSWER>”), most are too specific, and hence almost useless. For instance, textual answer patterns such as “<NAME>, the great composer whose legacy in music is widely known, was born in <ANSWER>” clearly have limited applicability. This issue can be addressed by developing abstraction techniques (i.e. pattern induction) to cluster similar patterns and produce semantically motivated generalizations. Answer procedures abstraction relies on semantic processing (e.g. coreference resolution, named entity recognition, parsing, lexical semantics) as well as advanced automated reasoning techniques (i.e. temporal and spatial reasoning) and Machine Learning algorithms (e.g. pattern acquisition/recognition, concept clustering, memory-based learning, etc.).
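The acquisition step can be sketched as follows. To keep the example short it counts token n-grams with a `Counter` instead of building a suffix tree, and the function name and tag strings are illustrative assumptions rather than the paper’s code.

```python
# Hedged sketch of surface pattern acquisition: count fixed-length token
# windows (a simplification of the suffix-tree approach) that contain both
# the question term and the answer, then rewrite them as patterns.
from collections import Counter

def acquire_patterns(sentences, qa_pairs, n=5):
    counts = Counter()
    for sent in sentences:
        toks = sent.split()
        for q, a in qa_pairs:
            for i in range(len(toks) - n + 1):
                window = toks[i:i + n]
                if q in window and a in window:
                    pat = " ".join("<NAME>" if t == q else
                                   "<ANSWER>" if t == a else t
                                   for t in window)
                    counts[pat] += 1
    return counts.most_common()

sents = ["Mozart was born in 1756 .",
         "Guglielmo Marconi was born in 1874 ."]
pairs = [("Mozart", "1756"), ("Marconi", "1874")]
print(acquire_patterns(sents, pairs))
# [('<NAME> was born in <ANSWER>', 2)]
```

Because the same abstracted window recurs for both question/answer pairs, its count is the evidence for keeping “<NAME> was born in <ANSWER>” as a reusable pattern.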
For instance, the surface text pattern in the example above can be generalized by applying syntactic analysis to replace “, the great composer whose legacy in music is widely known,” with the corresponding syntactic function, thus obtaining the generalized answer procedure: “<NAME>, <APPOSITION>, was born in <ANSWER>”. Lexical semantics allows for the substitution of words within a surface text pattern with concepts. This generalization process can be accomplished in two steps. First, Word Sense Disambiguation based on WordNet senses provides the basis for determining the correct sense of a word within a textual pattern. Then, WordNet semantic relations (e.g. the hypernymy relation) provide more generic concepts for substitution. For instance, surface text patterns such as “<NAME>, the composer, was born in <ANSWER>”, “<NAME>, the genius, was born in <ANSWER>”, and “<NAME>, the musician, was born in <ANSWER>” will be generalized into “<NAME>, the <PERSON>, was born in <ANSWER>”. Other answer procedures can be generalized by applying temporal and spatial reasoning according to ontologies already existing in the Semantic Web. For instance, an answer procedure capable of calculating what Mozart’s age was in 1770 would require simple temporal reasoning involving concepts such as “interval” or “duration”. As a result of the acquisition and abstraction process, answer procedures acquired from different data types and with different levels of abstraction are stored in a common database (i.e. the answer procedures library) with a uniform representation

format. Such a database will be the basic resource for the creation of semantic Q/A indexes and for any further application of the answer-driven approach to information access.
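The hypernym-based abstraction step can be illustrated as follows. A small hand-coded taxonomy stands in for WordNet here (the hypernym chains are simplified assumptions, not WordNet’s actual ones), and all names are illustrative.

```python
# Sketch of pattern abstraction via hypernyms, with a toy taxonomy
# standing in for WordNet (chains are simplified assumptions).
HYPERNYMS = {"composer": "musician", "musician": "person",
             "genius": "person", "person": None}

def ancestors(word):
    """The word plus its chain of hypernyms, most specific first."""
    chain = [word]
    while HYPERNYMS.get(chain[-1]):
        chain.append(HYPERNYMS[chain[-1]])
    return chain

def generalize(words):
    """Most specific concept subsuming every word; used to merge patterns
    like '<NAME>, the composer/genius/musician, was born in <ANSWER>'."""
    common = set(ancestors(words[0]))
    for w in words[1:]:
        common &= set(ancestors(w))
    for a in ancestors(words[0]):   # walk from specific to generic
        if a in common:
            return a
    return None

print(generalize(["composer", "genius", "musician"]))
# person
```

Substituting the shared hypernym for the varying word is what collapses the three surface patterns above into the single generalized pattern with the <PERSON> slot.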

4 Creation of a Q/A Index

Figure 2 illustrates how the answer procedures library can be used to mine question/answer correlations from a given source data collection, and to create a Q/A index ready for Information Retrieval related tasks. The discovery of Q/A correlations can be accomplished through two stand-off annotation levels. At the first level, each document is preprocessed by applying linguistic and semantic processors to annotate all the information required to run answer procedures. Most of the processors used during the answer procedures abstraction phase (i.e. tools for coreference resolution, named entity recognition, parsing, word sense disambiguation, temporal and spatial reasoning) are applied to produce these annotations. Once the source data have been annotated, the answer procedures stored in the answer procedures library are applied to highlight possible answers present within the documents. The matching of an answer procedure with a text portion results in the correlation of this text portion with an element of the QType taxonomy. The second annotation level explicitly represents this correlation, relying on XML tags.

Figure 2: Discovery of question/answer correlations and creation of the Q/A index. (Diagram: source data pass through linguistic and semantic processors to produce a linguistic annotation level; the answer procedures library then drives question/answer correlation over the annotated source data; Q/A indexing of the source data yields a Q/A index serving Advanced Search, the Semantic Web, Question Answering, and Summarization.)

The final step is the creation of the Q/A index, which keeps these annotations in a compact format ready to be accessed by search engines. According to the answer-driven reasoning perspective, relevant terms (i.e. terms which are the “focus” of a

possible user’s query) are indexed by the questions (and their respective answers in documents) that users in a given information access situation are likely to pose.
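The two-level process can be condensed into a toy indexing sketch. The data structure, the QType label, and the single regex-based answer procedure are all illustrative assumptions; the paper’s index uses XML stand-off annotation and a full procedure library.

```python
# Minimal Q/A index sketch: map (QType, focus term) pairs to the answer
# spans that answer procedures found in each document.
import re
from collections import defaultdict

def build_qa_index(docs):
    index = defaultdict(list)   # (qtype, focus) -> [(doc_id, answer), ...]
    born = re.compile(r"([A-Z]\w+) was born in (\d{4})")  # one toy procedure
    for doc_id, text in docs.items():
        for m in born.finditer(text):
            index[("BIRTH-DATE", m.group(1))].append((doc_id, m.group(2)))
    return index

docs = {"d1": "Mozart was born in 1756 in Salzburg.",
        "d2": "Marconi was born in 1874."}
idx = build_qa_index(docs)
print(idx[("BIRTH-DATE", "Mozart")])
# [('d1', '1756')]
```

Once built, the index answers a focus-plus-QType lookup directly, without touching the documents again; this is what makes the retrieval step lightweight.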

5 Potential Applications

Potential applications of the answer-driven reasoning approach affect many research areas addressing the problem of information access. Question Answering. Most QA systems today have a more or less standard structure based on transforming the query, locating documents likely to contain an answer, and then selecting the most likely answer passage within the candidate documents (Magnini et al., 2002). The difference between systems lies in the selection of appropriate passages. If the system follows a classical Information Retrieval approach, the challenge is to make segments so small as to fit the answer without redundant information also being returned. If the system follows a Natural Language Processing approach, the challenge is to perform a syntactic and semantic interpretation of the user’s question against the corresponding syntactic and semantic interpretation of each sentence in the candidate documents, and return the best matches. However, parsing, interpretation, and matching should be fast enough to allow the system to return solutions from a large search space without significant delays. A conceptual index will be an invaluable resource for QA. Actually, some systems make use of a Q/A typology but none access a full conceptual index. The index will provide the QA system with the ability to look for previously determined semantic relations, and then identify the answer on the basis of these relations. The conceptual index built with answer-driven reasoning techniques will make a high speed and high precision QA system feasible. The QA system will be lightweight, since it will rely mainly on the information in the semantic index. The system itself will perform question type identification (e.g. definition, location-event, who-inventor, etc.), formulation of the necessary relation to be searched for (e.g. inventor-of, location-of, etc.), search through the Q/A index and candidate ranking based on the number and weights of the relations found. 
The system will not actually perform full document analysis, allowing it to be very fast. Summarization. Automatic (single and multiple) document summarization will benefit from the possibility of obtaining vital information, such as temporal/spatial information and cross-document coreference, from a Q/A conceptual index. Summarization based on answer-driven reasoning will be extract-based, i.e. summaries will be compiled from stretches of text, extracted from one or more documents, that have been considered important enough for inclusion. Documents will be summarized using the semantic relations encoded in the Q/A index. Answer-driven or query-based summaries, i.e. summaries generated by answering a set of questions, will be generated from the Q/A index by considering the importance of the relations. Those relations which are considered more important will represent the most frequently-sought information regarding a particular entity, and therefore this information will be that which is most desirable in a summary.
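The lightweight QA loop described above (question type identification, relation formulation, index lookup) can be sketched as follows. The index contents, the question pattern, and the QType name are illustrative assumptions, not the paper’s system.

```python
# Sketch of the lightweight QA loop over a pre-built Q/A index:
# classify the question, extract its focus, then look up the index
# instead of analyzing documents at query time.
import re

QA_INDEX = {("BIRTH-DATE", "Mozart"): [("d1", "1756")]}  # toy pre-built index

def answer(question):
    m = re.match(r"When was (\w+) born\??", question)
    if m:                                   # question type identification
        qtype, focus = "BIRTH-DATE", m.group(1)
        return QA_INDEX.get((qtype, focus), [])
    return []                               # unrecognized question type

print(answer("When was Mozart born?"))
# [('d1', '1756')]
```

All document analysis happened offline when the index was built; the online step is a dictionary lookup, which is what makes the high-speed claim plausible.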

Advanced Search. A Web search engine could significantly improve its performance if it could access a Q/A conceptual index capturing in a compact format the correlation between text fragments from a source data collection and the QType taxonomy. For instance, in such an index, the word Mozart could be related to text fragments reporting specific facts about his life and his masterpieces. This would allow a user entering the generic keyword “Mozart” to get, as a result of the search, a number of properly ranked links represented by FAQs about Mozart (e.g. “Where was Mozart born?”, “What is Mozart’s birth date?”) for which the search engine was able to find an answer. The user interested in such information would not have to access the content of a long list of documents, saving considerable time. Semantic Web: ontological information enhancement using answer-driven reasoning. Indexing of Web pages by the search engine described above will serve two objectives relevant to the development of the Semantic Web: ontology acquisition and the enhancement of the conceptual markup of Web pages. Currently, ontologies used in various Semantic Web scenarios consist largely of their taxonomic backbone, while richer conceptual information in the form of non-taxonomic relations (such as Located_In or Works_In) is still lacking. Correspondingly, Web pages annotated in terms of such ontologies fail to capture many important aspects of the content of the pages. Answer procedures can be used to automatically augment existing ontologies with non-taxonomic relations and correspondingly enhance the relational metadata present in conceptually annotated documents. The particular objective is to integrate existing ontology development tools and (semi-)automatic annotation tools into one architecture. The overall procedure of introducing newly discovered non-taxonomic relations into an ontology and inserting corresponding conceptual mark-up into Web pages can be described as follows.
First, an ontology, in terms of which the documents are annotated, is loaded into the system. After that, portions of the text that fill slots in answer procedures are located, and the answer procedures mark-up is compared with the conceptual mark-up. This stage constitutes the key problem to be solved: finding the appropriate level of abstraction of the ontology concepts that are matched to the slots in the answer procedures. To address this problem, we can employ methods similar to the generalization techniques used to acquire answer procedures, additionally informed by what is known about the ontology concepts, such as their properties and relations. After plausible correspondences have been identified, the system will add the non-taxonomic relations to the ontology and relational metadata to the Web page. The enhancement of ontologies and annotated documents with non-taxonomic information has direct relevance to many different applications based on Semantic Web technology. In particular, it will allow for wider use of the existing techniques for reasoning about information present in documents and improve semantic indexing of Web pages.
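The final augmentation step can be illustrated with a toy ontology. The data structure, relation name Born_In, and the guard condition are all illustrative assumptions; real Semantic Web ontologies would use RDF/OWL tooling rather than plain dictionaries.

```python
# Hedged sketch of ontology augmentation: a matched answer procedure
# proposes a non-taxonomic relation between concepts already typed
# in the ontology.
ontology = {"concepts": {"Mozart": "Person", "Salzburg": "City"},
            "relations": []}

def add_relation(onto, subj, rel, obj):
    """Add the relation only if both ends are already ontology concepts
    (the 'plausible correspondence' check from the text, simplified)."""
    if subj in onto["concepts"] and obj in onto["concepts"]:
        onto["relations"].append((subj, rel, obj))
        return True
    return False

add_relation(ontology, "Mozart", "Born_In", "Salzburg")   # both ends known
add_relation(ontology, "Mozart", "Visited", "Paris")      # rejected: Paris unknown
print(ontology["relations"])
# [('Mozart', 'Born_In', 'Salzburg')]
```

The guard is a stand-in for the abstraction-matching problem discussed above: a relation is only asserted once its arguments have been mapped to concepts at an appropriate level of the ontology.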

6 Conclusion

We have presented an overview of answer-driven reasoning, a novel approach to information access based on the application of answer procedures to mine correlations between textual source data and a taxonomy of question types. Since future developments of systems based on answer-driven reasoning will resort to huge answer procedure libraries, we are currently working on this core aspect of the approach. In particular, we are investigating the problem from two different directions, i.e. the extraction of surface patterns using suffix trees and their syntactic generalization based on the comparison of parse trees. Preliminary experiments over a set of TREC-2001 question/answer pairs show that automatic acquisition of answer procedures is a feasible task. Furthermore, experimental results confirm the hypothesis that answer procedure coverage significantly increases when the procedures exploit linguistic knowledge rather than surface patterns.

References

1. B. Magnini, M. Negri, R. Prevete, H. Tanev: Mining Knowledge from Repeated Co-Occurrences: DIOGENE at TREC-2002. Proceedings of the TREC-2002 Conference, Gaithersburg, MD, November 2002.
2. D. Ravichandran, E. Hovy: Learning Surface Text Patterns for a Question Answering System. Proceedings of ACL 2002, July 2002.
3. M. M. Soubbotin, S. M. Soubbotin: Patterns of Potential Answer Expressions as Clues to the Right Answer. Proceedings of the TREC-10 Conference, Gaithersburg, MD, November 2001.
