In Proceedings of the Second Conference on Information and Knowledge Management (CIKM '93)
Translating Description Logics to Information Server Queries

Premkumar T. Devanbu
Artificial Intelligence Principles Research Department, 2B417, AT&T Bell Laboratories, Murray Hill, NJ 07974, USA
[email protected]
Abstract

Description Logic (DL) systems can be useful as front-ends for databases, particularly in applications that involve browsing and exploring, such as data mining. Such use of DL systems raises some pragmatic and theoretical issues. In this paper, we adopt a general architecture for "loosely coupling" DL systems with databases; in the context of this architecture, we have built a system called qindl that can generate, from a specification, a translator from a description logic to a database query language. qindl was used to generate translators from classic, a typical DL, to three different database query languages. We also present one view of safety of classic when it is used to query databases, and present a semantic basis for this view.
1 Introduction

There is currently a great deal of interest in "data mining" applications, where a large source of data is combed for useful knowledge [15]. Following [4], we distinguish between data mining approaches, where autonomous tools run freely over databases to search for interesting general properties and categories of the data, and the interactive data archaeology approach, which (to quote the authors of [4]) is an "iterative, dialectic process that involves constant human intervention". In the imacs [4] system, the authors use the example of marketers searching a database of department store sales records for useful patterns. Likewise, public health officials may wish to explore a medical records database to identify groups that could benefit from certain education campaigns. As a final example, consider programmers who, during software development, want to browse through the source code of a large software system to develop a model of it. This could also be considered a kind of data archaeology, supported by systems such as codebase [16]. This process involves composing queries, retrieving answers, inspecting the answers, and formulating new queries. Some of these queries may represent interesting categories (e.g., a programmer may be interested in the category "a function which calls only library functions and modifies only global variables"), which can then be saved for later use, perhaps in other queries. Thus, the user creates and evolves a particular view of the raw data. This evolving view captures the state of the data archaeologist's knowledge as s/he explores the database and discovers useful facts.

In this paper, we discuss some technical issues that arise in using description logics (DLs), also known as the KL-ONE family of languages [18], as a "front end" for data archaeology (see, e.g., [4, 2]). We begin with a brief background description of DLs, and then recapitulate the architecture used to interface DLs to information servers (from [5]). Two interesting issues arise in this use of description logics: translating DL expressions to database queries, and safety of DL expressions when used as database queries. These are the main concern of this paper.
2 Description Logics

Description logics deal with descriptions of individuals; relationships between individuals, called roles; and concepts, which denote sets of individuals. All of these are represented in DLs with potentially complex descriptions (the classic system [3], which we used in our work, is an example DL). Individuals are specific named objects that occur in the domain of interest. In a marketing database, these may be individuals like JoeCustomer, Hoboken, and Cheapmart. They have descriptions associated with them, like the fact that JoeCustomer lives in Hoboken and shops at the Cheapmart. Roles are binary relationships between individuals, such as lives-in, shops-at, or located-in. Thus, part of the formal description of JoeCustomer would be that the filler of his lives-in role is Hoboken; the filler of the located-in role for Cheapmart might be Newark. Concepts are descriptions of sets that define the properties individuals have to satisfy in order to be members. Thus, we can define a concept such as HOMEBODY, defined as those who shop only at stores that are located in the towns they live in. JoeCustomer doesn't satisfy this description, so he is not an instance of HOMEBODY; he would be if he lived in Newark. Some concepts, like PERSON, cannot be described in terms of others, and are considered primitive. The others are defined. A full range of description operators is available to form concept and individual descriptions. We present just one concept description, as an example (adapted from [5]) of the general syntactic structure of descriptions, and to illustrate some of the constructs:
(define-concept GOODCUSTOMER
  (and CUSTOMER
       (all purchased LUXURYITEM)
       (atleast 1 child)
       (fills works-for NJLaw)
       (same-as lives-in commutes-to)
       (all creditRating (oneof Hi Very-Hi))))
This definition denotes objects in the intersection of the following sets:

- the set of individuals that are instances of the concept CUSTOMER, probably a primitive specified elsewhere;
- the set of individuals all of whose fillers for the role purchased are instances of the concept LUXURYITEM;
- the set of individuals with at least one filler for the child role;
- the set of individuals the filler of whose works-for slot is the individual NJLaw;
- the set of individuals the filler of whose lives-in slot is identical to the filler of the commutes-to slot;
- the set of individuals all the fillers of whose creditRating slots are either Hi or Very-Hi.

In classic [3], one can use these and other similar constructors to define new concepts, roles, and individuals; classic will compute IS-A (subsumption) relationships between defined concepts and individuals, and among defined concepts, based on the meanings of the descriptions; it can also detect inconsistent descriptions, and perform other operations.

There are several reasons why DL systems are useful for data archaeology. First, description logics are designed for knowledge representation: they provide a natural and formal way for the user to accumulate the gleaned knowledge (for an example of the use of DLs as a mechanism for storing and retrieving knowledge about software, see [6]). Second, they can be used as a query language for creating interesting queries, ones that can be reasoned with. Finally, the inference facility in DL systems is helpful in organizing the knowledge and queries1 (by classification) and detecting inconsistencies in them.

However, DL systems are primarily in-core systems, and their algorithms and data structures have traditionally been designed to handle only a limited number of individuals. Database management systems, on the other hand, have been developed to handle extremely large amounts of data. The motivation of this work, then, is to exploit the large-scale efficiencies of databases in conjunction with DL systems, for use in data archaeology and other applications. Since there are a variety of different databases in use, we seek a general way of interfacing DLs with databases.
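To make this concrete, here is a schematic classic-style rendering of the HOMEBODY concept described above, in the same notation as the GOODCUSTOMER example; the use of a role chain inside same-as, and the simplifying assumption that lives-in and the shops-at/located-in chain are single-valued, are ours, not classic's:

(define-concept HOMEBODY
  (and PERSON
       (same-as (lives-in) (shops-at located-in))))

Read this as saying that the town an individual lives in coincides with the location of the store s/he shops at, which is one plausible formalization of the informal gloss given earlier.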
3 A Generic Interface Architecture

We recapitulate here (from [5]) the general architecture for integrating a DL knowledge base management system (DLMS) with a database management system (DBMS); see Figure 1.
1 See Anwar et al. [2] for a more detailed description of this use of description logics.
[Figure 1: A Generic Architecture. A DLMS (with its KB) organizes class descriptions, incrementally classifies new descriptions, and integrates individuals; definitions and queries in the description language flow through a DL translator and a query translator to a DBMS (with its DB), and retrieved instances and descriptions flow back.]
The DBMS could be any standard database system which manages a (large) collection of data; it can perform query processing in a standard query language like SQL or relational algebra, and it may also perform the usual concurrency control and recovery functions. Its query evaluation component typically also incorporates various optimization facilities. The DLMS is a knowledge management system based on a description logic; examples include classic [3], back [14], and loom [11]. A knowledge engineer inserts descriptions of concepts, individuals, and roles into the DLMS, which infers subsumption relationships between these descriptions and builds a classification hierarchy of concepts, with the individuals at the bottom. There are primitive and defined concepts; the former, such as person, cannot be given formal, if-and-only-if definitions, while the latter are concepts that can be defined in terms of other (primitive or defined) concepts, such as married-person or has-only-daughters. The knowledge engineer can assert "IS-A" relationships from primitive concepts to others; the subsumption inference infers IS-A relationships between defined concepts. InstanceOf relationships between individuals and primitive concepts can also be asserted: an individual can be declared to be an instance of a particular concept. An individual may also be inferred to be an instance of a defined concept. The subsumption inference has set-theoretic semantics, with respect to which it is usually sound and sometimes complete (depending on the DL system). Subsumption between concepts may be expensive to compute (depending on the expressiveness of the underlying DL), but cannot be computed in the database. Though it may be easier in some cases to check whether an individual belongs to a concept, it is often not practical to do so for extremely large collections of individuals, because these are processed one at a time.

During a typical session with a DLMS, the user may define a number of concepts (both primitive and defined) and insert several individuals. As the description of each new defined concept and individual is inserted, the DLMS infers all the IS-A relationships. At any point, the user may ask for the instances of some particular defined concept, and
the system retrieves the subsumed individuals by quickly following a set of pre-computed links. DLMSs are usually in-core systems, and their algorithms and data structures are designed to handle several hundred or a few thousand individuals. In data-intensive applications, a loose coupling architecture can exploit the DBMS's efficient query processing subsystem to optimize the handling of large numbers of individuals. With this arrangement, an a priori association is made between primitive concepts and roles on the one hand, and certain relations (or materialized views) in the database on the other: primitive concepts can be associated with unary relations, and roles with binary relations. Then, each defined concept is translated directly into a database query when it is classified. When a user asks for the instances of a concept, we proceed as follows. If it is a primitive concept, we have an associated relation or materialized view, and we can return the tuples therein after translation to DLMS form. If it is a defined concept, we use the translated query and submit it to the DBMS, which optimizes it, (efficiently) evaluates it, and returns a set of matching tuples. These can then be translated one tuple at a time into DLMS form, and directly inserted, without classification, into the KB.2 Thus, we use the DBMS's query processing mechanism to "classify" an otherwise unmanageable number of individuals, and find the instances of the desired concept. Further details are omitted for brevity, and may be found in [5].

This architecture is quite general; it can be used to couple any DLMS and DBMS. Indeed, the "back end" doesn't even have to be a database: it can be any "data-gathering" device3 that can be used to efficiently search a large body of data for instances that match a defined concept. This general approach can be used to build DL-based data mining tools that will work with a wide range of data sources. It does, however, raise two important issues.

Query Translation: To couple a DLMS to a DBMS, we must first implement a compiler to translate from concept definitions in the DL to queries in the DBMS query language. If we had a general tool that could generate such translators (from a specification), this architecture could be used to interface a range of DLMSs and DBMSs. We have built a translator generator, called qindl (Query Interfacer for Description Logics), for this purpose.

Safety: A database query needs to be safe; i.e., the user should be warned if he submits a query that would cause the DBMS to search through an infinite set, or even a large finite set of irrelevant answers. Safety of database query languages per se is well understood; here, we are effectively using definitions in DLs as database queries, so we need to identify and reject unsafe DL definitions, should one be submitted as a query to be translated and then evaluated in the DBMS. We have developed an appropriate semantically based notion of safety for DL definitions, as well as a simple algorithm, correct with respect to the given semantics, that performs a syntactic check for safety.

2 If there are too many, sampling techniques could be used to select a few "typical" ones for insertion into the DLMS.

3 In fact, our system has been used to generate translators from classic to genoa [7], an applications generator that creates analyzers to extract information directly from source code; the classic-genoa combination can be used in a system such as codebase [16] to mine knowledge about software directly from the source code.
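Before turning to these issues, the retrieval flow just described can be summarized in a short sketch. Every helper named here (primitive-p, associated-relation, fetch-relation, translated-query, submit-to-dbms, tuple->individual, insert-individual, known-instances) is hypothetical, introduced only for illustration:

;; A minimal sketch of the instance-retrieval step under loose coupling.
;; All helper functions named below are hypothetical, for illustration only.
(defun retrieve-instances (concept kb dbms)
  (if (primitive-p concept)
      ;; primitive concepts map a priori to a relation or materialized view
      (mapcar #'tuple->individual
              (fetch-relation dbms (associated-relation concept)))
      ;; a defined concept was translated to a DBMS query when classified;
      ;; matching tuples come back and are inserted without reclassification
      (let ((tuples (submit-to-dbms dbms (translated-query concept))))
        (dolist (tuple tuples)
          (insert-individual kb (tuple->individual tuple)))
        (known-instances kb concept))))

The point of the split is visible in the two branches: primitive concepts are answered straight from their associated relations, while defined concepts ride on the DBMS's optimizer, and only their matching tuples ever reach the in-core KB.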
We address these issues now, beginning with a description of our general-purpose translator generator.
4 The Translator Generator

Just as the parser-generator tool Yacc simplifies the task of building parsers by generating a parser from a grammar specification, our DL-to-query-language translator generator, qindl, simplifies the task of building translators from description logics to database query languages: it generates a translator from a high-level translation specification. A DL-to-query-language translator has the following task: given the description of a defined concept, generate a query for the DBMS that finds the individuals matching this concept. In classic, this operation is implemented by the cl-concept-instances function, which, given a concept, returns all its known instances. The translator essentially has to "implement" this function in the database query language.4

The design of the qindl specification language is directly influenced by the compositionality of description logics. The syntax of a description logic like classic provides several operators, such as and, at-least, all, fills, etc., which are both syntactically and semantically compositional. The syntactic correctness of any nested part of an expression is independent of the contents of any other distinct part of that expression; likewise, the semantics of an expression (the set of individuals denoted by it in a model of the DL) is determined by evaluating the semantics of its operands, in any order, and combining the results according to the meaning of the top-level operator. In some sense, evaluating a concept definition as a query in a database is like finding its semantics: the set of individuals denoted by that concept in a model. Thus, if we can provide a generic translation rule for each operator in the description logic, and translations for each primitive concept and role, then any description composed of these operators and primitive concepts and roles can be translated into a query. We simply treat a definition as a tree, where each node corresponds to a subexpression (from now on, we use the terms "node" and "subexpression" interchangeably) and its children are the subexpressions thereof, and translate in a bottom-up fashion. A qindl specification is a set of translation rules, generally one for each operator in the DL. From this specification, we generate code that actually does the pattern matching, tree traversal, and translation operations.

4 The question naturally arises as to whether the query language is powerful enough to implement this function, since languages such as SQL and relational algebra are expressively limited. As classic stands right now, this can be done; indeed, the set of rules in qindl to translate classic to a given query language forms the basis of an inductive proof of this. The main difficulty lies in classifying individuals that are known to have several distinct unknown fillers; since these are generally not allowed in databases, the query languages are indeed powerful enough.
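Schematically, the bottom-up scheme just described can be rendered as a short sketch; node-operator, node-children, find-rule, and apply-rule are hypothetical helpers, not part of qindl:

;; Sketch of compositional, bottom-up translation over a definition tree.
;; All four helper functions are assumed names, for illustration only.
(defun translate-node (node rules)
  "Translate the children first, then combine their queries at this node."
  (let ((child-queries
          (mapcar (lambda (child) (translate-node child rules))
                  (node-children node))))
    ;; each operator has a generic rule that combines its operands' queries
    (apply-rule (find-rule (node-operator node) rules)
                node
                child-queries)))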
To illustrate the translation process using qindl, we first consider the translation from classic to relational algebra (RA), which is fairly simple, and then go on to the more complex features needed to translate from classic to SQL.5 In both cases, we assume the following: corresponding to each primitive concept (resp. role) in the DLMS, there is a unary relation or defined view (resp. binary relation or defined view) in the DBMS. In the examples below, we show this correspondence by using the same name in lowercase form for the DLMS concept (e.g., worker) and in capitalized form for the DBMS relation (e.g., Worker). All the binary relations (corresponding to roles) have two attributes (or columns), creatively named fst and snd.

As the first example of a qindl specification, we present below the translation rule for an expression of the form (fills r b), where r is a role and b is a constant. (Just as a reminder, this description stands for those individuals that are related to b, among others, by r.)

((fills (expr ?r) (atom ?b))
 ((return (project_1
            (select_{2 = (db-equivalent ?b)} (query ?r))))))

5 For an explanation of relational algebra and SQL, see any standard database text, such as Elmasri and Navathe [8].
The rule has two parts: a pattern, indicating the syntactic structure of the fills expression to be matched and translated, and an action side consisting of several actions. The pattern side includes several match variables, to be matched against the various subexpressions; in this case, ?r matches the expression that is the role part of the fills expression, and ?b matches the constant expression. The action side of the rule returns an RA expression, which first selects, from the tuples resulting from the evaluation of the query corresponding to ?r, those that have the database equivalent of the value of ?b in their right-hand column, and then projects out the first column of the result.

In the same vein, translation rules can be written for all of the other operators in the description logic. Here is the rule for a (same-as p q) expression, which denotes individuals that have the same values related to them by the roles p and q, under the general constraint that p and q have exactly one value for each object:

((same-as (expr ?r1) (expr ?r2))
 ((return (project_1
            (select_{2 = 4} (product (query ?r1) (query ?r2)))))))

We take the cross product of the RA expression for ?r1 with that for ?r2, select out those resulting 4-tuples that have the same second and fourth columns, and project out the first column of the result. Of course, a query can consist of an expression using many different operators; as long as the target query language is also compositional, as RA is, the translation can be made in a context-free, bottom-up manner. A qindl specification for a translator from classic to RA has 7 rules, for and, all, fills, at-least, at-most,6 same-as, and one-of. From this, qindl generates about 600 lines of Common Lisp code, which comprises an optimized pattern matcher and rule evaluator that performs translations rapidly. For example, an expression like

(and worker (fills employer at&t) (same-as (boss) (advisor)))

is translated into the equivalent RA query

Worker ∩ project_1(select_{2 = at&t}(Employer)) ∩ project_1(select_{2 = 4}(Boss × Advisor))

6 atleast and atmost are handled with aggregate functions [8] in relational algebra; all restrictions are done with set differences, or "anti-joins".
(and c1 (fills r1 b1) (fills r2 b2))

select x from C1 x, R1 y, R2 z
where x = y.fst and y.snd = B1
  and x = z.fst and z.snd = B2

Figure 2: A classic definition and an equivalent SQL query
This translation takes a fraction of a millisecond in Allegro Common Lisp on a SparcStation 1+; in practice, we expect the translation time to be overwhelmingly dominated by query evaluation time in the DBMS.

Translation from a DL to a database query language that is not compositional, like SQL, cannot be performed in a simple, "context-free", bottom-up fashion. SQL is not a context-free language, because the syntactic validity of one part of a query might depend on the variables introduced in a different part. Thus, the translation of one DL subexpression might depend on a distant parent expression. Furthermore, to produce more optimized queries, various pieces of the SQL translations of parts of definitions need to be merged in a non-compositional manner to yield the resulting query. To handle these issues, qindl has facilities for associating attributes with nodes. These may be available either only at the node where they are generated and its parent nodes (an upward attribute), or at all nodes below it (a downward attribute). For convenience, we also provide local attributes, which are used only at the node where they are generated. In the following, we describe how these are used to implement a translator from classic to SQL.

We clarify this with an example. First, we note that the translation of (fills r1 b1) is select x from R1 x where x.snd = B1; likewise, the translation of (fills r2 b2) is select x from R2 x where x.snd = B2. Also, note that the SQL for c1 is simply select x from C1 x. When these expressions occur in a conjunction, as in Figure 2, it is clear that the SQL for c1 and the two fills expressions must be merged by the translation in a complicated, non-compositional manner; we describe some sources of this complication, and how they are addressed in qindl. First, within a large query, the translation of each conjunct changes slightly: different variables are used. This can be handled by always generating new "tuple variables" for every role and primitive concept. Second, the translations of the conjuncts are not simply appended together, with intervening operations, as in the case of RA; they are interwoven in a more complicated manner.7 The select part comes from one subexpression and not from the others; the from parts from each subexpression are picked out and merged into one from clause; and the where parts likewise are picked out and merged into one whole. This can still be handled within the same bottom-up translation scheme as in the case of RA; however, each node (subexpression), instead of generating just one translation, can generate several distinct upward attributes, which are passed to the parent node (subexpression).

7 The intersect operator in SQL could have been used, but the translation shown in Figure 2 can lead to more efficient evaluation.
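One way to realize this attribute machinery is sketched below; the tnode structure and all accessor names are our assumptions for illustration, not qindl's actual internals:

;; Per-node attribute storage (a sketch). Upward attributes are read by the
;; parent from the child's table; a downward attribute (such as the anchor
;; variable discussed below) is found by searching up the ancestor chain,
;; so a value set by a parent is visible throughout its subexpressions.
(defstruct tnode operator children parent (attrs (make-hash-table)))

(defun assign-attr (node key value)
  ;; store an upward or local attribute at this node
  (setf (gethash key (tnode-attrs node)) value))

(defun lookup-attr (node key)
  ;; look here first, then up the ancestors (downward-attribute lookup)
  (multiple-value-bind (value found) (gethash key (tnode-attrs node))
    (cond (found value)
          ((tnode-parent node) (lookup-attr (tnode-parent node) key))
          (t nil))))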
In the classic-to-SQL translation, we associate four attributes with each node: select, from, where, and query; each translation rule computes these attributes for a classic construct. If a node is the root (top-level) construct, its query attribute is returned. Otherwise, the attributes are created and stored at the node, and are available for retrieval: nodes above this node can access these values and merge them appropriately to create their own select, from, where, or query attributes. Finally, we observe that new where conditions, x = y.fst and x = z.fst, are added to the where clause when the fills expressions are used in a conjunct. In this case, a selected variable from the parent and expression, which we call an anchor variable, has to be passed into the child fills expression; this is accomplished in the qindl rule by using a downward attribute. This happens in an and expression, where all the subexpressions should use a single select term, and in a role-chaining expression in classic, which gives rise to a join expression in SQL (such as the expression (all r1 (fills r2 b1)), which leads to a role chain between R1 and R2). In the case of both conjunctions and role chains, an anchor passed in from above must be used to constrain the value of the tuples resulting from the subexpression. The following qindl rule, which translates fills expressions, illustrates these different types of attributes and their use.

((fills (atom ?R) (atom ?B))
 ((local ?Tuplevar (newvariablename))
  (local ?Link (cond ((isboundp ?Anchor)
                      `(equal ,?Anchor (dot ,?Tuplevar fst)))
                     (t T)))
  (assign select `(dot ,?Tuplevar fst))
  (assign from   `(,(DatabaseRelation ?R) ,?Tuplevar))
  (assign where  `(and ,?Link (equal (dot ,?Tuplevar snd) ,(DBNAME ?B))))
  (assign query  `(query (select ,select) (from ,from) (where ,where)))))
The above rule translates a fills expression. The role name and the constant are bound to ?R and ?B, respectively. We then generate a new tuple variable name (?Tuplevar) for the relation ?R. If the downward attribute ?Anchor is bound coming into this node, this expression occurs as a subexpression of another, and the two-place tuple ?R will have to be chained to the parent, either as a conjunct or in a role chain; the ?Link condition sets this up. The from and where attributes are self-evident: the where is the conjunction of the ?Link predicate with the condition induced by the fills expression itself, i.e., that the second column of ?R must be the value of ?B. The query attribute is just a data structure that stores the select, from, and where parts; the precise SQL syntax can readily be generated from it (a sketch of such a renderer appears at the end of this section).

The qindl specification for classic-to-SQL translation has nine rules (about 100 lines of code), and generates about 1200 lines of Common Lisp code. This specification has more rules than there are classic constructs, for the following reason: in some cases, it is preferable to translate certain operators occurring in a particular combination in a special way, rather than translate each one separately, so as to produce an SQL expression that is easier for the DBMS to evaluate. These special cases can be specified by writing additional rules.
In case of a conflict, the qindl execution mechanism breaks the tie by simply choosing the first rule in the specification that matches the node being translated.

There are also other considerations that impact the efficiency of query evaluation. For example, we could define, as in [5], the materialized view for a primitive concept to have extra columns, one for each single-valued role (functionally dependent attribute) having that concept as source (the fst domain). This could be used to produce more efficient queries for certain combinations of fills and same-as expressions, by avoiding the need to perform joins. If such special "optimizing" views are available, additional qindl rules can be written to produce queries that exploit them. In this discussion, we have assumed the simplest possible schema, with one binary relation for each role and a unary relation for each primitive concept.

We have also implemented a translator from classic to genoa [7], which is a very different language; briefly, it is a kind of query-by-example language for the parse trees of programs. genoa is actually a simpler language than SQL; the qindl specification for this translation is purely bottom-up, and uses only upward attributes. It has 7 rules, one for each construct, and compiles to about 700 lines of Lisp.
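For concreteness, here is the renderer promised above: a sketch of how the query attribute built by the fills rule might be flattened into SQL text. The names render-term and render-query are illustrative, and the term shapes follow the rule's backquoted output:

;; Flatten the (query (select ...) (from ...) (where ...)) structure into
;; SQL text (a sketch; the structure shape follows the fills rule above).
(defun render-term (term)
  (cond ((atom term) (string-downcase (princ-to-string term)))
        ((eq (first term) 'dot)    ; (dot x fst) => "x.fst"
         (format nil "~a.~a" (render-term (second term)) (render-term (third term))))
        ((eq (first term) 'equal)  ; (equal a b) => "a = b"
         (format nil "~a = ~a" (render-term (second term)) (render-term (third term))))
        ((eq (first term) 'and)    ; (and p q ...) => "p and q and ..."
         (format nil "~{~a~^ and ~}" (mapcar #'render-term (rest term))))
        (t (format nil "~{~a~^ ~}" (mapcar #'render-term term)))))

(defun render-query (q)
  ;; q has the shape (query (select S) (from F) (where W))
  (let ((sel (second (second q)))
        (frm (second (third q)))
        (whr (second (fourth q))))
    (format nil "select ~a from ~a where ~a"
            (render-term sel) (render-term frm) (render-term whr))))

For an unanchored fills node, ?Link defaults to T, so (render-query '(query (select (dot x fst)) (from (r1 x)) (where (and t (equal (dot x snd) b1))))) yields "select x.fst from r1 x where t and x.snd = b1"; a real renderer would elide the vacuous conjunct.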
5 Safety

Since we are using definitions for querying databases, an important question arises concerning the safety of the queries produced by translation from a description logic definition. Suppose we have a survey database containing information about people, what they eat (meat, vegetables, etc.), where they live (apartment, house, etc.), and their occupations (plumber, programmer, etc.). Now, one might want to identify the vegetarians in this database by constructing a definition such as

(all eat PlantProduct)     (1)

and posing it as a query: we ask for all those that eat only plant products. In principle, we want the DBMS to return all those pids for which the database does not contain a tuple of the form ⟨pid, fid⟩ where the fid identifies a non-plant-product food item. Unfortunately, this query can lead to disaster: strictly speaking, by this definition, anything that doesn't eat something else should be returned, including houses, occupations, etc. This is clearly not what the user intended, and clarifies why (1) is an unsafe query.

Informally, safety is a property of database queries whereby we regard as unsafe those queries that result in answers that are irrelevant in some sense. The main pragmatic goal is that if safety can be determined a priori, quickly, by a fast safety-checking algorithm, we can check queries for safety before executing them. This algorithm should (for efficiency's sake) inspect only the syntax of the query; it should not look at the extensions of the relations in the database. Thus, whether a query is safe or not, the user is given some sort of answer, and is never given reams of irrelevant answers. Additionally, we would like a clear, semantically motivated definition of safety that coincides with the results of the syntactic algorithm.

To this end, we define safety for a definition (query) δ with respect to the domain of δ. Given a classic definition δ and a knowledge base KB, the DOM function, which returns the domain of δ, identifies just those individuals in the KBMS (or DBMS) that must be examined to compute the instances of δ. The definition of DOM reflects the semantics of classic, as well as a notion of relevance. If the semantics of classic require that, to process a certain query, one must examine individuals other than those in the domain of the query, then the query should be considered unsafe.
Definition 1

DOM(C, KB)                   = Instances(C, KB)
DOM(H, KB)                   = ∅
DOM(ρ, KB)                   = Fst(ρ, KB) ∪ Snd(ρ, KB)
DOM((and δ1 δ2 ... δn), KB)  = ⋃_{i=1..n} DOM(δi, KB)
DOM((all ρ δ), KB)           = DOM(ρ, KB) ∪ ToldNone(ρ, KB) ∪ Instances(δ, KB)
DOM((fills ρ β), KB)         = DOM(ρ, KB) ∪ {β}
DOM((atleast n ρ), KB)       = DOM(ρ, KB)
DOM((atmost n ρ), KB)        = DOM(ρ, KB) ∪ ToldNone(ρ, KB)
DOM((oneof β1 β2 ... βn), KB) = {β1, β2, ..., βn}
DOM((same-as (ρ11 ... ρ1n) (ρ21 ... ρ2m)), KB)
                             = ⋃_{i=1..n} DOM(ρ1i, KB) ∪ ⋃_{j=1..m} DOM(ρ2j, KB)

Here C denotes a user-defined primitive concept, H a host concept, ρ a role, δ a description, β an individual, and n a number.
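Read operationally, Definition 1 amounts to the following recursion. This is a sketch only: instances, fst-column, snd-column, and told-none stand for the accessors explained below, and the predicates primitive-concept-p, host-concept-p, and role-p are assumed; none of these names come from classic itself.

;; Definition 1 as a recursive computation over query syntax (a sketch).
(defun dom (q kb)
  (cond ((host-concept-p q) '())                 ; host concepts: empty domain
        ((primitive-concept-p q) (instances q kb))
        ((role-p q) (union (fst-column q kb) (snd-column q kb)))
        (t (case (first q)
             (and     (reduce #'union
                              (mapcar (lambda (part) (dom part kb)) (rest q))))
             (all     (destructuring-bind (role c) (rest q)
                        ;; per Definition 1, the all-branch also takes the
                        ;; instances of the filler description
                        (union (dom role kb)
                               (union (told-none role kb) (instances c kb)))))
             (fills   (destructuring-bind (role b) (rest q)
                        (adjoin b (dom role kb))))
             (atleast (dom (third q) kb))        ; (atleast n role)
             (atmost  (destructuring-bind (n role) (rest q)
                        (declare (ignore n))
                        (union (dom role kb) (told-none role kb))))
             (oneof   (copy-list (rest q)))
             (same-as (reduce #'union
                              (mapcar (lambda (chain)
                                        (reduce #'union
                                                (mapcar (lambda (r) (dom r kb))
                                                        chain)))
                                      (rest q))))
             (otherwise '())))))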
Instances(C, KB) simply refers to the known instances of a given primitive concept C. In classic, there are two types of primitive concepts: host concepts (denoted by H), which are things like integer, string, etc., and "classic" concepts (denoted by C), which are primitive concepts defined by the user. Host concepts, in principle, have an infinite number of instances; however, they are usually used in conjunction with other constructs in a definition. Their primary function is to act as "filters" in queries such as "a person, whose age is an integer". Asking for the direct instances of integer in classic is an error. Since host concepts primarily work as filters, and by themselves aren't intended to contribute any new individuals for consideration during query processing, their domain is taken to be empty.8

Fst(ρ, KB) and Snd(ρ, KB) are the sets of individuals occupying the first and second positions of the tuples of a role ρ. ToldNone(ρ, KB), for a given role ρ, is the set of individuals that have been asserted to have no fillers for the ρ role. In classic, we have the open-world assumption with respect to roles; rather than inferring the nonexistence of a role filler for an individual, classic is explicitly told that certain individuals do not participate in certain roles, either by inheritance or by the closing of a role.9 The function ToldNone returns such individuals; since they are used in processing all and atmost restrictions, we include them in the domain of these constructs. Some relational databases, however, cannot store the information that certain individuals have no fillers of a certain type; in such cases, we take ToldNone by definition to be empty. In other databases, we could explicitly store the information that some individuals have no fillers of a certain type, using null values; in this case, ToldNone will not be empty. However, null values in databases usually have "don't know" rather than "doesn't exist" semantics, so this approach may not be widely applicable. Alternatively, one can also store in the schema the generic information that certain categories of things don't have certain fillers. In this simplified treatment, we ignore these possibilities, and assume a simplified closed-world database model with just ground atomic facts.
8 classic also has test concepts, where the membership condition is defined procedurally via an escape hatch; for the purposes of the domain definition, they are treated the same as ordinary primitive concepts: the domain is the set of individuals known to be instances of the test concept.

9 On the other hand, classic uses the CWA for primitive concepts: if an individual is not explicitly asserted to be an instance of a primitive concept, it is assumed not to be a member. classic also makes the unique names assumption.
In such databases, the closed-world assumption dictates that any individual that does not explicitly occur in the Fst attribute of a role relation should be inferred to have no fillers for that role. However, these individuals are taken to be irrelevant to queries that involve all and atmost restrictions, and are therefore omitted from the domain. As we shall see, this leads us to determine that in such databases, all and atmost queries are unsafe.

We adopt the following standard definition of safety (see, for example, Ullman [17]):

Definition 2 A query δ is safe if, for any knowledge base KB, Answer(δ, KB) ⊆ DOM(δ, KB); i.e., the answer to the query is drawn from the elements in the domain.

From Definitions (2) and (1), the following theorem can be shown by a simple induction on the size of a query, appealing to the semantics of classic.

Theorem 1 A query evaluation algorithm for a safe query δ need only examine the individuals in DOM(δ, KB) to compute the answer.

This tells us that the domain of a definition is "closed" in some sense: if the answer to the query is contained in the domain, the individuals that the query evaluation algorithm must look at are also contained in the domain. We again emphasize that Definition (1) works slightly differently if the instances are in the database than if the instances are entirely within classic: for databases, the ToldNone function is always empty. We shall argue below that these definitions lead us to determine that classic-thing is a safe query in databases (though it makes us look at all the individuals in the database), whereas (all eat vegetable) is not. This is precisely the characterization we want for safety.

Let us examine this issue more closely by considering each classic construct in turn, and analyzing how individuals matching definitions using these constructs are retrieved in classic, and how they should be retrieved in a database application. Clearly, all primitive concepts in classic are safe: the answer set is simply an enumeration of the set of individuals asserted to be instances of these concepts. Next, we can see that a query (and δ1 δ2 ... δn) is safe if one of the δi's is safe. The query (atleast n ρ)10 is safe, since we only have to look at the domain as defined above to compute it. Likewise, it is easy to see that fills and one-of are safe.

The constructs that lead to unsafe queries are (atmost n ρ) and (all ρ δ). Intuitively, these lead to negated existential queries: they retrieve individuals for whom fillers of a certain kind do not exist (the all query essentially looks for individuals with at most 0 fillers that are not instances of δ). classic, with its "open world" retrieval algorithm, explicitly knows those individuals that have no fillers for a particular role, and thus searches for candidate instances only within the domain of the query (as defined above) and finds answers; these constructs are safe as far as the classic processor is concerned. In the case of the database, however, these queries are unsafe: essentially, any individual that has no fillers for the role ρ is a correct answer, and this can lead to many irrelevant answers, as we saw in the earlier example. (Of course, if such a construct occurs in the context of an and operation with a safe conjunct, its unsafety no longer matters.) The algorithm for detecting safety, then, is this:

10 Note that in classic, n cannot be zero in this construct.
Algorithm 1 If the top-level construct is one of {atleast, fills, one-of}, the query is safe. If the top-level construct is one of {atmost, all}, the query is unsafe. If the top-level construct is and, the query is safe if one of the conjuncts is safe.
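Rendered as code, the check is a few lines. This is a sketch; the list representation, with the operator in head position, follows the classic expressions shown earlier:

;; Algorithm 1 as a purely syntactic check (a sketch).
(defun safe-query-p (q)
  (cond ((atom q) t)                                 ; primitive concepts are safe
        ((member (first q) '(atleast fills one-of oneof)) t)
        ((member (first q) '(atmost all)) nil)
        ((eq (first q) 'and) (some #'safe-query-p (rest q)))
        (t nil)))                                    ; conservatively flag anything else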
The following theorem certifies that the algorithm above has the needed properties:
Theorem 2 Algorithm (1) correctly classifies every description logic query as safe or unsafe.

Proof: For brevity, we present just an outline. First, we analyze the structures of the different kinds of queries that are flagged as safe, and appeal to the semantics of classic and the definitions of the domain to show that, in each case, the answer to a query is contained in the domain thereof. Next, we consider in turn the structures of the different kinds of queries that the algorithm flags as unsafe, and present a construction that, in each of those cases, generates a database where the answer to the flagged query contains elements that are not present in the domain. Since the algorithm always terminates, it always decides.

In this treatment of safety of DL queries, we adopt the view taken by Ullman [17] (pages 151 through 156): a query is safe if it "pays attention only to the data it is given". In other words, if the user mentions certain primitive concepts and roles in the query, then the individuals that occur in the database relations corresponding to these concepts and roles should be the only ones that influence the answer to the query. Thus, if the user chooses to ask for instances of a concept like classic-thing, one should treat it as a safe query. This "user knows best" approach, though uniform and consistent, might present serious pragmatic difficulties in processing such open-ended queries. This issue has motivated a more pragmatic, if less formal, view of safety in the treatment of [5]: queries such as classic-thing are prohibited, and all queries must be composed of primitive concepts and roles that correspond to real relations in the database. Development of a semantic model of this pragmatic notion of safety is an open issue.

6 Related Work

The unique contribution of this work hinges on the use of a description logic as a query language for an external data source. Mays et al. [12] have used an object-oriented database to provide persistence in their knowledge representation language, K-REP. Their primary interest is in storing large portions of their taxonomy in a database, and providing transaction control, versioning, etc., to enable many knowledge engineers to make changes simultaneously. They are not concerned with treating descriptions as queries over a large collection of individuals from external information sources (such as a DBMS, or genoa). The Stanford Knowledge Sharing Technology project [9] is also concerned with translations; however, the goal there is much more ambitious: to translate knowledge bases from one knowledge representation formalism to another, accounting for differences in expressive power, ontology of the KB, etc. Full details of their translation language were not available at the time of writing, so a close comparison with qindl is not possible. Since using descriptions as queries for external data sources is not their primary concern, safety is not discussed in [9].
Illarramendi et al. [10] describe the use of back [14] as a tool to facilitate the integration of schemas from different databases into one federated database; back's classification assists in this process. Again, the main focus there is not the use of descriptions as queries.

Finally, the design of the qindl specification language combines elements of both attribute grammar formalisms [1] and transformational methods [13]. The downward and upward attributes in qindl are similar to the inherited and synthesized attributes in attribute grammars.11 Attribute grammars, however, are used to handle parsing and semantic attribute propagation in programming languages; in the case of description logics, there is no need for parsing, since the concepts are stored internally as term structures. Binding of template variables such as ?r and ?b on the left-hand side of a rule is done by first-order unification, rather than by a parser processing raw text. Furthermore, attribute grammars normally have one grammar rule (with attribute equations) per syntactic construct; in qindl, there may be several alternate rules for each description logic construct. These alternates produce different translations for various usages of a construct that are syntactically identical, but offer different opportunities for optimization. While one might simulate these alternate rules with attribute grammars, it is likely to be awkward and cumbersome to do so. Furthermore, as a pragmatic issue, attribute grammars can lead to circular propagation of attributes, and potential non-termination; in qindl, query translation is accomplished in a strictly top-down, left-to-right fashion that is guaranteed to terminate, and the execution semantics disallows circular attribute propagation.

A qindl rule is a specialized kind of transformation.12 As in the classic transformational paradigm, there is a pattern to be matched on the left-hand side, which gives rise to several bindings of template variables; these bindings are then used in the execution of the right-hand side. However, because of the simple kind of unification used to match the patterns on the left-hand side of qindl rules, it is possible to combine the unifications required by a set of qindl rules into a single, efficient, decision-tree-based procedure. This procedure checks for matches, generates bindings of pattern variables, and selects the appropriate rule to be executed; the decision tree enables the translator generated from the qindl rules to translate description logic concepts efficiently. In addition, the upward and downward attributes in qindl provide a simple way of associating attributes with (sub-)term structures, and using them to generate translations. Thus, the qindl specification language provides a particular combination of features from attribute grammars and transformational methods that is well suited to specifying translations from description logics to database query languages, and allows for the generation of efficient translators.

11 We chose a different terminology to emphasize the fact that they are used somewhat differently.

12 Transformational methods include a wide range of approaches, from simple local first-order term rewrite rules to complex algorithm-replacement procedures.
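To make the matching step concrete, the following toy matcher performs the kind of simple first-order matching described above, ignoring qindl's (expr ...) and (atom ...) annotations; the ?-prefix convention mirrors variables like ?r and ?b, and all names here are illustrative:

;; A toy first-order matcher for rule left-hand sides (a sketch).
(defun pattern-var-p (x)
  (and (symbolp x)
       (plusp (length (symbol-name x)))
       (char= (char (symbol-name x) 0) #\?)))

(defun match (pattern expr bindings)
  "Return an extended binding alist, or :fail."
  (cond ((eq bindings :fail) :fail)
        ((pattern-var-p pattern)
         (let ((pair (assoc pattern bindings)))
           (cond ((null pair) (acons pattern expr bindings))
                 ((equal (cdr pair) expr) bindings)  ; must rebind consistently
                 (t :fail))))
        ((atom pattern) (if (eql pattern expr) bindings :fail))
        ((atom expr) :fail)
        (t (match (rest pattern) (rest expr)
                  (match (first pattern) (first expr) bindings)))))

For instance, (match '(fills ?r ?b) '(fills employer at&t) '()) yields the bindings ((?b . at&t) (?r . employer)), which would then drive the execution of a rule's action side.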
7 Conclusion

We have described a general facility for using description logic systems as a querying and browsing facility for a variety of data sources. We have developed an applications generator, qindl, that generates translators from classic to
query languages, and demonstrated its generality by building translators from classic to RA, SQL, and genoa. We have also developed a semantically motivated notion of safety of DL definitions when they are used as queries, and presented a simple way to check whether a definition is safe in this sense. We continue to address the following open issues: How does one define views on a database to aid in using classic as a query language? How does one configure the qindl rules to produce better-optimized queries? How does the design of the schema (views) in the database affect the safety of queries?
References

[1] Aho, A., Sethi, R., and Ullman, J., Compilers: Principles, Techniques and Tools, Addison-Wesley, 1991.

[2] Anwar, T., Beck, H., and Navathe, S., Knowledge Mining by Imprecise Querying: A Classification-Based Approach, Proceedings of the Eighth International Conference on Data Engineering, 1992.

[3] Borgida, A., Brachman, R. J., McGuinness, D. L., and Resnick, L. A., classic: A Structural Data Model for Objects, Proceedings ACM SIGMOD '89, Portland, Oregon, 1989, pp. 58-76.

[4] Brachman, R. J., Selfridge, P. G., Terveen, L. G., Altman, B., Borgida, A., Halper, F., Kirk, T., Lazar, A., McGuinness, D. L., and Resnick, L. A., Knowledge Representation Support for Data Archaeology, ISMM International Conference on Information and Knowledge Management, Baltimore, MD, November 1992.

[5] Borgida, A., and Brachman, R., Loading Data into Description Reasoners, Proceedings ACM SIGMOD '93, Washington, DC, 1993.

[6] Devanbu, P., Brachman, R., Selfridge, P., and Ballard, B., LaSSIE: A Knowledge-Based Software Information System, Communications of the ACM, 34:5, May 1991.

[7] Devanbu, P., genoa: A Language- and Front-End-Independent Code Analyzer, International Conference on Software Engineering, Melbourne, Australia, May 1992.

[8] Elmasri, R., and Navathe, S., Fundamentals of Database Systems, Benjamin/Cummings, 1989.

[9] Fikes, R., Cutkosky, M., Gruber, T., and Van Baalen, J., Knowledge Sharing Technology, unpublished manuscript, Knowledge Systems Laboratory, Stanford University, 1992.

[10] Illarramendi, A., Blanco, J., and Goñi, A., A Uniform Approach to Design a Federated System Using back, Terminological Logic Users Workshop Proceedings, KIT Report 95, Technische Universität Berlin, October 1991.

[11] MacGregor, R. M., Inside the loom Description Classifier, ACM SIGART Bulletin, 2(3), 1991.

[12] Mays, E., Lanka, S., Dionne, R., and Weida, R., A Persistent Store for Large Shared Knowledge Bases, IEEE Transactions on Knowledge and Data Engineering, Vol. 3, No. 1, March 1991.
[13] Partsch, H., and Steinbrüggen, R., Program Transformation Systems, ACM Computing Surveys, Vol. 15, No. 3, September 1983.

[14] Peltason, C., The back System: An Overview, ACM SIGART Bulletin, 2(3), June 1991.

[15] Piatetsky-Shapiro, G., and Frawley, W. J., Eds., Knowledge Discovery in Databases, AAAI Press, 1991.

[16] Selfridge, P. G., Knowledge Representation Support for a Software Information System, Proceedings of the 7th IEEE Conference on AI Applications, 1991, pp. 134-140.

[17] Ullman, J., Principles of Database and Knowledge-Base Systems, Vol. 1, Computer Science Press, 1988.

[18] Woods, W. A., "KL-ONE Languages: A Framework for Progress", in Principles of Semantic Networks: Explorations in the Representation of Knowledge, Sowa, J. (Ed.), Morgan Kaufmann, 1990.