, and the answer with the tag <ANSWER>. As an example, consider the class of questions asking for birth dates and the question/answer pairs [Mozart; 1756] and [Guglielmo Marconi; 1874]. Since expressions like “Mozart (1756-1791)”, “Mozart was born in 1756”, “Guglielmo Marconi (1874-1937)”, and “Guglielmo Marconi was born in 1874” will probably be among the most frequent substrings retrieved from the Web, the resulting surface text answer patterns will be “<NAME> (<ANSWER>-” and “<NAME> was born in <ANSWER>”.

However, answer procedures based on surface text patterns are not general enough to be applied on a large scale. While some of them can be used in the answer-driven reasoning approach as they are (e.g. “<NAME> was born in <ANSWER>”), most are too specific, and hence almost useless. For instance, textual answer patterns such as “<NAME>, the great composer whose legacy in music is widely known, was born in <ANSWER>” clearly have limited applicability. This issue can be addressed by developing abstraction techniques (i.e. pattern induction) to cluster similar patterns and produce semantically motivated generalizations. Answer procedure abstraction relies on semantic processing (e.g. coreference resolution, named entity recognition, parsing, lexical semantics) as well as advanced automated reasoning techniques (i.e. temporal and spatial reasoning) and Machine Learning algorithms (e.g. pattern acquisition/recognition, concept clustering, memory-based learning, etc.). For instance, the surface text pattern in the example above can be generalized by applying syntactic analysis to replace “, the great composer whose legacy in music is widely known,” with the corresponding syntactic function, thus obtaining the generalized answer procedure: “<NAME>, <APPOSITION>, was born in <ANSWER>”. Lexical semantics allows for the substitution of words within a surface text pattern with concepts. This generalization process can be accomplished in two steps.
First, Word Sense Disambiguation based on WordNet senses provides the basis for determining the correct sense of a word within a textual pattern. Then, WordNet semantic relations (e.g. the hypernymy relation) provide more generic concepts for substitution. For instance, surface text patterns such as “<NAME>, the composer, was born in <ANSWER>”, “<NAME>, the genius, was born in <ANSWER>”, and “<NAME>, the musician, was born in <ANSWER>” will be generalized into “<NAME>, the <person>, was born in <ANSWER>”. Other answer procedures can be generalized by applying temporal and spatial reasoning according to ontologies already existing in the Semantic Web. For instance, an answer procedure capable of calculating what Mozart’s age was in 1770 would require simple temporal reasoning involving concepts such as “interval” or “duration”. As a result of the acquisition and abstraction process, answer procedures acquired from different data types and with different levels of abstraction are stored in a common database (i.e. the answer procedures library) with a uniform representation
format. Such a database will be the basic resource for the creation of semantic Q/A indexes and for any further application of the answer-driven approach to information access.
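The two-step generalization described above can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the hand-coded hypernym map stands in for WordNet lookups, and the slot tokens <NAME> and <ANSWER> are assumed placeholder conventions.

```python
# Toy hypernym map standing in for WordNet (hypothetical, hand-coded).
HYPERNYMS = {"composer": "person", "genius": "person", "musician": "person"}

def generalize(pattern):
    """Replace each known noun in a surface pattern with its hypernym concept."""
    out = []
    for tok in pattern.split():
        bare = tok.strip(",")
        if bare in HYPERNYMS:
            tok = tok.replace(bare, "<" + HYPERNYMS[bare] + ">")
        out.append(tok)
    return " ".join(out)

patterns = [
    "<NAME>, the composer, was born in <ANSWER>",
    "<NAME>, the genius, was born in <ANSWER>",
    "<NAME>, the musician, was born in <ANSWER>",
]
# All three patterns collapse to a single generalized answer procedure.
generalized = {generalize(p) for p in patterns}
print(generalized)  # → {'<NAME>, the <person>, was born in <ANSWER>'}
```

In a real system the second step would follow Word Sense Disambiguation, so that only the intended sense of each word is generalized.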
4 Creation of a Q/A Index
Figure 2 illustrates how the answer procedures library can be used to mine question/answer correlations from a given source data collection, and to create a Q/A index ready for Information Retrieval-related tasks. The discovery of Q/A correlations can be accomplished through two stand-off annotation levels. At the first level, each document is preprocessed by applying linguistic and semantic processors to annotate all the information required to run answer procedures. Most of the processors used during the answer procedures abstraction phase (i.e. tools for coreference resolution, named entity recognition, parsing, word sense disambiguation, temporal and spatial reasoning) are applied to produce these annotations. Once the source data have been annotated, the answer procedures stored in the answer procedures library are applied to highlight possible answers present within the documents. The matching of an answer procedure with a text portion results in the correlation of this text portion with an element of the QType taxonomy. The second annotation level will explicitly represent this correlation, relying on XML tags.
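As a minimal sketch of the second annotation level, a stored answer procedure can be applied to a document and each match emitted as a stand-off XML tag correlating the matched span with a QType element. The pattern, slot names, and QType label here are illustrative assumptions, not the paper’s actual library format.

```python
import re

# One illustrative answer procedure: (QType label, surface pattern with slots).
ANSWER_PROCEDURES = [
    ("BIRTH-DATE",
     re.compile(r"(?P<NAME>[A-Z][a-z]+(?: [A-Z][a-z]+)*) was born in (?P<ANSWER>\d{4})")),
]

def annotate(text):
    """Return stand-off XML fragments correlating text spans with QType elements."""
    tags = []
    for qtype, pat in ANSWER_PROCEDURES:
        for m in pat.finditer(text):
            tags.append(
                f'<qa type="{qtype}" name="{m.group("NAME")}" '
                f'answer="{m.group("ANSWER")}" span="{m.start()}-{m.end()}"/>'
            )
    return tags

doc = "Wolfgang Amadeus Mozart was born in 1756. He died in Vienna."
print(annotate(doc))
```

In the full architecture the pattern would not match raw text directly, but text already enriched with the first-level linguistic annotations (named entities, coreference chains, parses).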
[Figure 2 diagram: Source Data → Linguistic & Semantic Processors → Linguistic Annotation; together with the Answer Procedures Library → Question/Answer Correlation → Annotated Source Data → Q/A Indexing of the Source Data → Q/A Index → Advanced Search, Semantic Web, Question Answering, Summarization]
Figure 2: Discovery of question/answer correlation and creation of the Q/A index.

The final step is the creation of the Q/A index, which keeps these annotations in a compact format ready to be accessed by search engines. According to the answer-driven reasoning perspective, relevant terms (i.e. terms which are the “focus” of a possible user’s query) are indexed by the questions (and their respective answers in documents) that users in a given information access situation are likely to pose.
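Concretely, such an index can be seen as an inverted index whose postings are question/answer correlations rather than plain term occurrences. The sketch below is a toy structure under assumed field names (QType label, answer string, document identifier), not the paper’s storage format.

```python
from collections import defaultdict

class QAIndex:
    """Toy Q/A index: each focus term maps to its question/answer correlations."""

    def __init__(self):
        self._index = defaultdict(list)

    def add(self, term, qtype, answer, doc_id):
        # Store one (question type, answer, document) correlation for a term.
        self._index[term.lower()].append((qtype, answer, doc_id))

    def lookup(self, term):
        # Retrieve all correlations for a focus term, case-insensitively.
        return self._index.get(term.lower(), [])

idx = QAIndex()
idx.add("Mozart", "BIRTH-DATE", "1756", "doc-17")
idx.add("Mozart", "BIRTH-PLACE", "Salzburg", "doc-17")

# A keyword query for "Mozart" retrieves FAQ-style correlations directly,
# without scanning the documents themselves.
print(idx.lookup("mozart"))
```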
5 Potential Applications
Potential applications of the answer-driven reasoning approach affect many research areas addressing the problem of information access. Question Answering. Most QA systems today have a more or less standard structure based on transforming the query, locating documents likely to contain an answer, and then selecting the most likely answer passage within the candidate documents (Magnini et al., 2002). The difference between systems lies in the selection of appropriate passages. If the system follows a classical Information Retrieval approach, the challenge is to make segments small enough to fit the answer without redundant information also being returned. If the system follows a Natural Language Processing approach, the challenge is to perform a syntactic and semantic interpretation of the user’s question against the corresponding syntactic and semantic interpretation of each sentence in the candidate documents, and return the best matches. However, parsing, interpretation, and matching should be fast enough to allow the system to return solutions from a large search space without significant delays. A conceptual index will be an invaluable resource for QA. In fact, some systems make use of a Q/A typology, but none accesses a full conceptual index. The index will provide the QA system with the ability to look for previously determined semantic relations, and then identify the answer on the basis of these relations. The conceptual index built with answer-driven reasoning techniques will make a high speed and high precision QA system feasible. The QA system will be lightweight, since it will rely mainly on the information in the semantic index. The system itself will perform question type identification (e.g. definition, location-event, who-inventor, etc.), formulation of the necessary relation to be searched for (e.g. inventor-of, location-of, etc.), search through the Q/A index, and candidate ranking based on the number and weights of the relations found.
The system will not actually perform full document analysis, allowing it to be very fast. Summarization. Automatic (single and multiple) document summarization will benefit from the possibility of obtaining vital information such as temporal/spatial information, and cross-document coreference from a Q/A conceptual index. Summarization based on answer-driven reasoning will be extract-based, i.e. summaries will be compiled from stretches of text, extracted from one or more documents, that have been considered important enough for inclusion. Documents will be summarized using the semantic relations encoded in the Q/A index. Answer-driven or query-based summaries, i.e. summaries generated by answering a set of questions, will be generated from the Q/A index by considering the importance of the relations. Those relations which are considered more important will represent the most frequently-sought information regarding a particular entity, and therefore this information will be that which is most desirable in a summary.
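The relation-weighted extract selection just described can be sketched as follows. The relation names, weights, and sentence annotations are invented for illustration; in the proposed architecture they would come from the Q/A index itself.

```python
# Hypothetical importance weights for indexed relation types.
RELATION_WEIGHTS = {"BIRTH-DATE": 0.9, "BIRTH-PLACE": 0.7, "WORK-OF": 0.4}

# Sentences paired with the relations the Q/A index associates with them.
sentences = [
    ("Mozart was born in 1756 in Salzburg.", ["BIRTH-DATE", "BIRTH-PLACE"]),
    ("He wrote The Magic Flute in 1791.", ["WORK-OF"]),
    ("Little else is recorded about that year.", []),
]

def summarize(sents, k=1):
    """Pick the k sentences whose indexed relations carry the highest total weight."""
    scored = sorted(
        sents,
        key=lambda s: sum(RELATION_WEIGHTS.get(r, 0.0) for r in s[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]

print(summarize(sentences))  # → ['Mozart was born in 1756 in Salzburg.']
```

Because scoring only consults the index annotations, no full document analysis is needed at summarization time, matching the lightweight design argued for above.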
Advanced Search. A Web search engine could significantly improve its performance if it could access a Q/A conceptual index capturing in a compact format the correlation between text fragments from a source data collection and the QType taxonomy. For instance, in such an index, the word Mozart could be related to text fragments reporting specific facts about his life and his masterpieces. This would allow a user asking the generic keyword “Mozart” to get, as a result of the search, a number of properly ranked links represented by FAQs about Mozart (e.g. “Where was Mozart born?”, “What is Mozart’s birth date?”) for which the search engine was able to find an answer. The user interested in such information would not have to access the content of a long list of documents, with an evident saving of time. Semantic Web: ontological information enhancement using answer-driven reasoning. Indexing of Web pages by the search engine described above will serve two objectives relevant to the development of the Semantic Web: ontology acquisition and enhancing the conceptual markup of Web pages. Currently, ontologies used in various Semantic Web scenarios consist largely of their taxonomic backbone, while richer conceptual information in the form of non-taxonomic relations (such as Located_In or Works_In) is still lacking. Correspondingly, Web pages annotated in terms of such ontologies fail to capture many important aspects of the content of the pages. Answer procedures can be used to automatically augment existing ontologies with non-taxonomic relations and correspondingly enhance relational metadata present in conceptually annotated documents. The particular objective is to integrate existing ontology development tools and (semi-)automatic annotation tools into one architecture. The overall procedure of introducing newly discovered non-taxonomic relations into an ontology and inserting corresponding conceptual mark-up into Web pages can be described as follows.
First, the ontology in terms of which the documents are annotated is loaded into the system. After that, portions of the text that fill slots in answer procedures are located and the answer procedures mark-up is compared with the conceptual mark-up. This stage constitutes the key problem that needs to be solved: finding the appropriate level of abstraction of the ontology concepts that are matched to the slots in the answer procedures. To address this problem, we can employ methods similar to the generalization techniques used to acquire answer procedures, which are additionally based on the information about ontology concepts, such as their properties and relations. After plausible correspondences have been identified, the system will add the non-taxonomic relations in the ontology and relational metadata to the Web page. The enhancement of ontologies and annotated documents with non-taxonomic information has direct relevance to many different applications based on Semantic Web technology. In particular, it will allow for wider use of the existing techniques for reasoning about information present in documents and improve semantic indexing of Web pages.
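The final step of this procedure, registering a newly discovered non-taxonomic relation in the ontology, can be sketched as follows. The ontology format and the relation names are hypothetical simplifications for illustration only.

```python
# Toy ontology: a set of concepts, an is-a backbone, and (initially empty)
# non-taxonomic relations. This format is a hypothetical simplification.
ontology = {
    "concepts": {"Person", "City"},
    "taxonomy": {"Composer": "Person"},   # is-a backbone
    "relations": [],                      # non-taxonomic relations
}

def add_relation(onto, name, domain, range_):
    """Register a non-taxonomic relation if both concepts are already known."""
    known = set(onto["concepts"]) | set(onto["taxonomy"])
    if domain in known and range_ in known:
        onto["relations"].append((name, domain, range_))
        return True
    return False

# Discovered from a "<NAME> was born in <ANSWER>" match during annotation:
add_relation(ontology, "Born_In", "Person", "City")
print(ontology["relations"])  # → [('Born_In', 'Person', 'City')]
```

A real implementation would also emit the corresponding relational metadata (e.g. RDF triples) for the annotated Web page; that step is omitted here.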
6 Conclusion
We have presented an overview of answer-driven reasoning, a novel approach to information access based on the application of answer procedures to mine correlations between textual source data and a taxonomy of question types. Since future developments of systems actually based on answer-driven reasoning will resort to huge answer procedure libraries, we are currently working on this core aspect of the approach. In particular, we are investigating the problem from two different directions, i.e. the extraction of surface patterns using suffix trees and their syntactic generalization based on the comparison of parse trees. Preliminary experiments over a set of TREC-2001 question/answer pairs show that automatic acquisition of answer procedures is a feasible task. Furthermore, experimental results confirm the hypothesis that the coverage of answer procedures increases significantly when they exploit linguistic knowledge rather than surface patterns.
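The surface-pattern extraction direction mentioned above can be illustrated with a minimal stand-in: the longest common substring of two answer-bearing snippets (with the question term and the answer already replaced by slot tokens) yields a candidate surface pattern. Real systems use suffix trees for efficiency; here `difflib.SequenceMatcher` is enough for a sketch, and the slot tokens are assumed conventions.

```python
from difflib import SequenceMatcher

def candidate_pattern(a, b):
    """Longest common substring of two slot-normalized snippets."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a : m.a + m.size]

s1 = "<NAME> was born in <ANSWER> and died"
s2 = "the record shows <NAME> was born in <ANSWER>."
print(candidate_pattern(s1, s2))  # → <NAME> was born in <ANSWER>
```

Over many snippet pairs, the recurring common substrings become the candidate answer procedures that the abstraction phase then generalizes.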
References
1. B. Magnini, M. Negri, R. Prevete, H. Tanev: Mining Knowledge from Repeated Co-Occurrences: DIOGENE at TREC-2002. Proceedings of the TREC-2002 Conference, Gaithersburg, MD, November 2002.
2. D. Ravichandran, E. Hovy: Learning Surface Text Patterns for a Question Answering System. Proceedings of ACL 2002, July 6-12, 2002.
3. M. M. Soubbotin, S. M. Soubbotin: Patterns of Potential Answer Expressions as Clues to the Right Answer. Proceedings of the TREC-10 Conference, Gaithersburg, MD, November 2001.