Proc. Indo-US Wkshp on S&SP, Bangalore, Jan 1998
Knowledge Representation and Information Retrieval
R Chandrasekar and S Ramani
National Centre for Software Technology
Gulmohar Cross Road No. 9, Juhu
Bombay 400 049, India
fmickey,
[email protected]
Extended Abstract
1 Introduction - the problem
In today's information society, it is important to study the problem of storing a large body of information and structuring it so that information can be retrieved from it intelligently. As an application, consider bibliographic search: searching through a large corpus of abstracts to locate papers of interest to a research worker.

The standard method of such literature search involves the use of keywords or some classification system. Keywords associated with a paper are usually few in number. Typically, they are a fraction of 1% of the length of the paper. In a keyword scheme, it is easy to miss out words which may be critically relevant. For example, a paper on the side-effects of a rabies vaccine, covering the effects on the foetus (among other topics), might have the following keywords: "rabies, vaccine". A simple keyword search would be ineffective for a researcher interested in the effects on the foetus. Searching merely for the keywords 'rabies' and 'vaccine', the researcher may get a thousand references. It may be argued that the word "foetus" must be included in the keyword list; by induction, this would imply that every (non-function) word needs to be included! Classification systems have similar drawbacks; the person doing the classification may often assign fewer or (worse still) many more classificatory terms than is ideal.

One alternative would be to scan the English abstract in its entirety. This could be time-consuming. Again, in locating articles on the "exports of Japanese cars to the US", a typical system may rake up articles on "automobile research in Japan and the US". In earlier times, such false positives were acceptable. Today they are not, because bibliographic databases have grown so large.

In this paper, we propose a scheme involving a language to describe the contents of given items of information. Using this language, a variety of items may be described in a database, and an intelligent scheme used to retrieve items of interest. We present examples specific to bibliographic search, but the scheme is applicable to information retrieval in general.
2 Requirements for a Representation Language
We wish to define a language for describing the contents of a paper. We expect the description to be (typically) about 5% as long as the paper. This language has to be machine-comprehensible.

The knowledge representation scheme used by this language, and the operations on it, can assume a common context; thus it will not be necessary to say "this paper claims that", or "this paper reports the effect of", etc. Predicative structure is not essential for every expression in the representation language describing the contents of a document. For instance, "alkaloids in coffee beans" is acceptable as a description of a document. The machine-readable abstract need not resemble the English abstract.

We want the stored representation to be unambiguous; that is, all disambiguation has to be done at the time of data entry. Lexical ambiguities, structural ambiguities, pronoun reference problems, etc. should all be handled at that time. At the same time, we wish to avoid complicated syntactic analysis.

While we desire a canonical form to represent all information, we realise that this may not be possible. However, we wish to enforce this as far as possible.

The expressiveness of such a representation language is important. We are willing to forgo the subtle differences between "the atmosphere of Venus has been found to contain..." and "the atmosphere of Venus contains...". However, the language has to be rich enough to cover all ideas in the source text. We are not concerned with reducing the size of such a language, but we do not wish to make it a baroque monster. A basic English vocabulary of 800 words [1] may provide the base for such a language. Technical words, jargon, tradenames, etc. will vastly enlarge this vocabulary. We estimate that each specialized subject will require an operational vocabulary of about 5000 words for communication.

We would prefer to have a natural language interface to such a system. Ideally, the system should assist the user in describing the contents by enforcing easy-to-use but standardised methods. The principle being used here is that it is easier to train a human than to create truly "intelligent" systems. In the next section, we describe a system which can form the basis of such a representation language.
3 The ScreenTalk System
Natural language understanding systems developed so far have usually handled linear strings of words. A lot of 'intelligence' is used to analyse these strings and to puzzle out their structure. One could bypass some problems by using alternative approaches.
We have proposed [2] a scheme for communication of the content and structure of natural language sentences. A system called ScreenTalk has been implemented, incorporating some of the ideas being discussed here, followed by extensions called INFERACT and INFERACT-2, incorporating deduction. These systems allow a database of sentential information to be built up. Users may then ask questions about the stored information.

There are two main ideas in our approach. One is: "When in doubt, ask". The other is the idea of communicating structural information, in addition to content.

Intelligent use of interaction could vastly enhance the quality of user applications. If there is any lexical ambiguity or pronoun reference problem, ScreenTalk comes back to the user and asks, for example, "In which sense are you using the word 'manifest' here: as a noun, a verb or an adjective?". Pronoun references and multiple senses of a word can similarly be disambiguated.

The other idea in ScreenTalk evolves from the problems associated with structural ambiguity. Consider the sentence "Fruit flies like fructose", which has multiple interpretations. If we write it out in a case-grammar-like notation, we can resolve the structural ambiguity. Every predicative element in the lexicon triggers off various expectations. ScreenTalk maintains a skeleton for each predicative element in the lexicon. Each skeleton is just a list of attribute : value-expectation pairs. Skeletons for nouns define the prototypical object, describing a general element of the class being described. The skeleton can be interactively 'instantiated' to form infits. An infit (a portmanteau word made from INFormation unIT) is just a fleshed-out skeleton, with user-specified values conforming to the appropriate value-expectations.
The user starts infit instantiation by specifying the word (predicate) which he wishes to use as a base in his communication, for instance 'decays' as a base for saying "The isotope decays rapidly, leaving behind sulfur". ScreenTalk looks up the skeleton database and retrieves the attribute and value-expectation list. For each attribute, the system prompts the user for a response. The response is then validated against the expectation. If it is valid, this attribute-value pair is stored; otherwise, the user is prompted for a valid response. Nested infits can be built with ease, since values may themselves be infits.

ScreenTalk has an IS-A hierarchy for objects, which uses the same skeleton and infit mechanism to define prototypical objects and their instances. A parser for a small finite state grammar allows some flexibility in handling regular phrases. Interaction is made easier using menu techniques and a simple context mechanism.

The process of querying in ScreenTalk is similar to infit instantiation. A query is seen just as an incomplete infit, with some values yet to be specified. ScreenTalk matches this incomplete infit with the stored infits, and outputs matches one by one.

ScreenTalk has been implemented on the DEC10, in about 4100 lines of SIMULA code. The system has been tested on a database of newspaper headlines. It has been designed to be fairly domain independent. In INFERACT and INFERACT-2 (both implemented on the DEC10 in LOGLISP), facts which are implicit in the database can be made explicit using PROLOG-like rules.
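The instantiation and querying cycle described in this section can be illustrated with a minimal Python sketch. It is not the SIMULA implementation: the attribute names (`what`, `how`, `leaving`), the small IS-A table, and the functions `instantiate`, `matches` and `retrieve` are all assumptions made for illustration. Interactive prompting is replaced by a dictionary of answers, and nested infits are omitted.

```python
# Hypothetical IS-A hierarchy (child -> parent); not taken from ScreenTalk.
IS_A = {
    "isotope": "substance",
    "sulfur": "substance",
    "substance": "object",
    "rapidly": "manner",
}

# A skeleton is a list of attribute : value-expectation pairs.
# The attributes chosen for 'decays' are illustrative assumptions.
SKELETONS = {
    "decays": [("what", "substance"), ("how", "manner"), ("leaving", "substance")],
}


def is_a(value, expected):
    """Return True if `value` is `expected` or descends from it in IS_A."""
    while value is not None:
        if value == expected:
            return True
        value = IS_A.get(value)
    return False


def instantiate(predicate, answers):
    """Flesh out a skeleton into an infit, validating each supplied value.

    `answers` stands in for the user's responses to the prompts. An invalid
    value raises ValueError where an interactive system would re-prompt.
    """
    infit = {"predicate": predicate}
    for attribute, expectation in SKELETONS[predicate]:
        value = answers.get(attribute)
        if value is None:
            continue  # attribute left unspecified
        if not is_a(value, expectation):
            raise ValueError(f"{attribute}: expected a {expectation}, got {value!r}")
        infit[attribute] = value
    return infit


def matches(query, stored):
    """A query is an incomplete infit: each value it gives must agree."""
    return all(stored.get(key) == value for key, value in query.items())


def retrieve(query, database):
    """Return the stored infits that match the incomplete infit `query`."""
    return [infit for infit in database if matches(query, infit)]


database = [
    instantiate("decays", {"what": "isotope", "how": "rapidly", "leaving": "sulfur"})
]
hits = retrieve({"predicate": "decays", "leaving": "sulfur"}, database)
```

Here the query asks, in effect, "what decays, leaving behind sulfur?"; the unspecified attributes of the query are simply absent, so they place no constraint on the stored infit.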
4 An Example
Consider the following segment from an abstract:

"This paper discusses the evolution of the relations between TRANSPAC, the French public packet-switched data network, and the French ISDN. TRANSPAC is now essential for communication in most companies and in Government administration in France. The total investment already made in this network is approximately 2.5 billion Francs..."

The skeleton for the word "essential", for example, could be of this form:

Essential(What:Object1, To-What:Object2, Where:Place1)
Using such skeletons, we can represent these sentences as below:

Evolution(
    Relations_Between((TRANSPAC/(French, public, packet-switched, data-network))
                      (French, ISDN)))                                        (I)
Essential(TRANSPAC, Communication_Between(Government, Business), France)     (II)
Total_Investment(TRANSPAC, 2500000000, French_Francs)                        (III)
The notation of a slash followed by a bracketed expression is used as a noun-description. Thus (I) defines TRANSPAC as well as the relation between TRANSPAC and ISDN. (II) contains a nested infit. Rules can also be defined in this framework:

Total_Cost(X,Y,Z) IF Total_Investment(X,Y,Z)        (IV)
Assassinated(X,Y) IF Killed(X,Y) AND Vip(Y)         (V)
(IV) relates Total Cost to Investment. (V) indicates that if person X killed person Y, and if person Y is a VIP, then X assassinated Y.
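To make the effect of rules (IV) and (V) concrete, here is a small forward-chaining sketch in Python. It is only an illustration under stated assumptions: INFERACT was written in LOGLISP, and its actual inference strategy is not described here. Facts are modelled as tuples, variables as strings beginning with `?`, and the `Killed` and `Vip` facts are invented for the example.

```python
def unify(pattern, fact, bindings):
    """Match one pattern against one ground fact.

    Returns the extended bindings, or None if they do not match.
    """
    if len(pattern) != len(fact):
        return None
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if isinstance(p, str) and p.startswith("?"):
            # A variable: bind it, or check it against its earlier binding.
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


def solve(body, facts, bindings):
    """Yield every set of bindings that satisfies all patterns in `body`."""
    if not body:
        yield bindings
        return
    first, rest = body[0], body[1:]
    for fact in facts:
        extended = unify(first, fact, bindings)
        if extended is not None:
            yield from solve(rest, facts, extended)


def forward_chain(facts, rules):
    """Apply the rules repeatedly until no new fact can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            # Collect all solutions first, so the set is not modified
            # while it is being iterated over.
            for b in list(solve(body, facts, {})):
                derived = tuple(b.get(t, t) if isinstance(t, str) else t for t in head)
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


# Rules (IV) and (V), each written as (head, body).
RULES = [
    (("Total_Cost", "?x", "?y", "?z"), [("Total_Investment", "?x", "?y", "?z")]),
    (("Assassinated", "?x", "?y"), [("Killed", "?x", "?y"), ("Vip", "?y")]),
]

# Fact (III) from the example, plus invented facts to exercise rule (V).
FACTS = [
    ("Total_Investment", "TRANSPAC", 2500000000, "French_Francs"),
    ("Killed", "assassin", "minister"),
    ("Vip", "minister"),
    ("Killed", "hunter", "deer"),
]

closure = forward_chain(FACTS, RULES)
```

After chaining, `closure` contains the implicit facts `Total_Cost(TRANSPAC, 2500000000, French_Francs)` and `Assassinated(assassin, minister)`, but no `Assassinated` fact for the hunter, since the deer is not a VIP.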
5 Features expected in Intelligent Matching
Using a representation language like the one described above, items can be stored in a database. But how can matching items be retrieved in an error-free manner? Can the matching process be guaranteed? Will it be independent of the person who entered the data? All this seems to require an all-knowing system, which knows about all possible queries and all possible representations. We do not propose or expect to build such a system.

What we envisage is a system which will retrieve semantically related items. The items sought and the items retrieved might not be the same, but will be related in a useful sense. Conveying structural information using the representation language will ensure that no false positives are retrieved due to structural ambiguity. The system will offer cooperative responses [3], so that a query such as "How many people were killed in the LIC building fire in Madras?" could be answered with "None were killed, but five were injured".

The system will provide "semantic filtering", where predicative expressions and concepts are matched against stored expressions and concepts in machine-comprehensible form, and only appropriate documents are retrieved. In addition to rules relating concepts, the system should provide a subject-specific thesaurus. Together, these will solve some of the semantic matching problems.

Such a retrieval system will also need to have a lot of "common sense knowledge". For example, the system should know that "pregnant women" and "foetus" are related concepts; thus, a query about the "effect of thalidomide on foetus" should be able to relate to stored concepts about the "hazards of thalidomide usage in pregnancy".

We need to extend the ScreenTalk and INFERACT formalisms to include all these features. Meanwhile, it is useful to have a quick look at one system which delivers most of what we require.
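As a rough illustration of the thesaurus-based matching envisaged in this section, the thalidomide example can be sketched in Python as follows. Everything here is assumed for the sake of the example: the `RELATED` table, the two stored items, and the functions `expand`, `related_match` and `retrieve`. The sketch treats an item as a flat set of concepts and so ignores the structural information that the representation language would carry.

```python
# Hypothetical thesaurus of related concepts (symmetric links).
RELATED = {
    "foetus": {"pregnancy"},
    "pregnancy": {"foetus", "pregnant women"},
    "pregnant women": {"pregnancy"},
    "effect": {"hazard"},
    "hazard": {"effect"},
}

# Hypothetical stored items, each reduced to a set of concepts.
DOCUMENTS = {
    "doc1": {"hazard", "thalidomide", "pregnancy"},
    "doc2": {"effect", "aspirin", "headache"},
}


def expand(concept, depth=2):
    """Concepts reachable from `concept` within `depth` links, itself included."""
    seen = {concept}
    frontier = {concept}
    for _ in range(depth):
        frontier = {n for c in frontier for n in RELATED.get(c, ())} - seen
        seen |= frontier
    return seen


def related_match(query, stored):
    """Every query concept must be related to some concept of the stored item."""
    return all(expand(q) & stored for q in query)


def retrieve(query):
    """Names of the stored items semantically related to the query."""
    return sorted(
        name for name, concepts in DOCUMENTS.items() if related_match(query, concepts)
    )


result = retrieve({"effect", "thalidomide", "foetus"})
```

The query about the "effect of thalidomide on foetus" retrieves only `doc1`, whose description mentions hazards and pregnancy rather than the query's own words; `doc2` is rejected because nothing in it is related to thalidomide.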
6 Why FRUMP/CYRUS is different
Schank et al. [4] describe an implementation of a system called CyFr, a combination of two programs, FRUMP and CYRUS. FRUMP skims UPI news stories and, using a library of 'sketchy scripts', outputs a conceptual representation of the important details in each news story. FRUMP understands stories which fall broadly into one of its sixty scripts. CYRUS takes in this representation, fills in contextual details, adds this new information to its database, and answers queries on the stored information. Extending the system would require additions to FRUMP's script-base.

Abstracts do not have the structure that a news story would have, and hence may not be amenable to a script-like treatment. This is one reason why we allow non-predicative expressions in our descriptions. We are also likely to have many more skeletal items in our database, covering a vast variety of concepts used in abstracts. Thus, there is a difference between our formalism and CyFr, necessitated by the differences in the nature of the information represented.
7 Conclusions
We have examined the need for a language to represent the content of textual items, and the requirements for an appropriate matching and retrieval scheme. We feel that extending ScreenTalk and INFERACT with synonymy, semantic filtering and commonsense knowledge would meet these requirements.
References
[1] C.K. Ogden, Basic English: International Second Language, New York: Harcourt, Brace & Jovanovich.
[2] R. Chandrasekar and S. Ramani, "Interactive Communication of Sentential Structure and Content: An alternate approach to man-machine communication", under publication.
[3] J.R. Kaplan, "Cooperative Responses from a Portable Natural Language Query System", Artificial Intelligence, Vol. 19, pp 165-187, October 1982.
[4] R.C. Schank, J.L. Kolodner, G. DeJong, "Conceptual Information Retrieval", in [5].
[5] R.N. Oddy, S.E. Robertson, C.J. van Rijsbergen, P.W. Williams (eds.), Information Retrieval Research, London: Butterworths, 1981.