DOI 10.1007/s10115-003-0102-0 Springer-Verlag London Ltd. © 2004 Knowledge and Information Systems (2004) 6: 315–344
Contextual Partitioning for Comprehension of OODB Schemas Huanying Gu1 , Yehoshua Perl2 , Michael Halper3 , James Geller2 and Erich J. Neuhold4 1 Department of Health Informatics, UMDNJ, Newark, NJ, USA 2 Computer Science Department, NJIT, Newark, NJ, USA 3 Department of Mathematics and Computer Science, Kean University, Union, NJ, USA 4 Fraunhofer IPSI, Darmstadt, Germany
Abstract. Object-oriented databases (OODBs) have been utilized for complex modeling tasks within a variety of application domains. The OODB schema, typically expressed in a graphical notation, can serve as a useful presentation tool for the information contained in the underlying OODB. However, such a schema can be a large, complex network of classes and relationships. This may greatly hinder its effectiveness in helping users gain an understanding of the OODB’s contents and data organization. To facilitate this orientation process, a theoretical framework is presented that guides the refinement of an existing schema’s subclass-of relationship hierarchy – the backbone of any OODB. The framework sets forth three rules which, when satisfied, lead to the establishment of a collection of contexts, each of which exhibits an internal subclass-of tree structure. A formal proof of this result is presented. An algorithmic methodology, involving a human–computer interaction, describes how the approach can be applied to a given OODB schema. An application of the methodology to an example OODB schema is included. Keywords: Comprehension; Context; Object-oriented database modeling; Object-oriented database schema; Schema partitioning; Subclass hierarchy; User orientation
1. Introduction Object-oriented database (OODB) systems (Kim and Lochovsky, 1989; Zdonik and Maier, 1990; Bertino and Martino, 1993; Loomis, 1995; Cattell and Barry, 1997) have been utilized for data-modeling tasks within a wide variety of domains ranging from
Received 28 Mar 2001 Revised 26 Jul 2002 Accepted 14 Oct 2002
telecommunications and engineering to document processing, healthcare, and E-business. OODBs typically comprise large, complex bodies of information. As such, gaining an orientation to and comprehending the contents of an OODB can be difficult for a user (Wickens et al, 1997). The abstraction provided by an OODB’s schema – ordinarily displayed using a methodology such as OMT (Rumbaugh et al, 1991) or UML (Booch et al, 1999; Rumbaugh et al, 1999; Fowler and Scott, 2000) – can play a major role in supporting the user’s efforts. However, an OODB schema itself may be large and hard to understand. For an example of a project within the engineering field involving large schemas, see STEP (Standard for the Exchange of Product Model Data) (Schenck and Wilson, 1994; Fowler, 1996). An extensive telecommunications OODB schema can be found in Geller et al (1993). Additionally, large OODB schemas for medical vocabularies were developed in Gu, Halper, Geller et al (1999), Gu et al (2000), and Liu et al (1999, 2002). In this paper, we present both a theoretical paradigm and a methodology that aid in the process of comprehending OODB schemas. Our approach toward achieving this goal is based on a combination of the following two processes: schema trimming (i.e., the elimination of all but the highest priority schema constructs) and partitioning. The overall outcome of our technique is a representation of a large, source schema as a partition composed of meaningful, manageably sized collections of classes. Each such collection of classes is called a context, and the whole approach is thus referred to as contextual partitioning. (See Section 3.1 for further discussion of and references to the notion of context.) The subclass-of relationship hierarchy within each context will be a tree, and each of these trees will ordinarily fit easily onto a single computer screen. 
Our work has proceeded from the assumption that a forest of disjoint trees is much easier to comprehend than a unified ‘tangled’ multiple inheritance structure within an OODB schema. The forest hierarchical view, resulting from the partitioning of the schema into contexts, functions as a skeleton which promotes comprehension of the original schema. Contextual partitioning yields a set of views (one per context) that we call contextual views. Concentrating on one such view at a time, the user can gain local comprehension. Once this is achieved, the user can progress toward an orientation of the full schema by reviewing larger views, each comprising two contexts and the relationships between them. Such a view is called a bi-contextual view. The user can achieve an orientation to the entire original schema by stepwise review of many bi-contextual views, one at a time. The advantage of this process is that at any given time the user is focusing on a relatively small unit of knowledge encompassing two subject areas, namely, the two contexts and all their interrelations. The contextual partitioning technique has as its basis a set of three rules for the refinement of the subclass-of hierarchy of the original OODB schema. These rules constitute a theoretical paradigm that, when followed, guarantees the existence of a meaningful forest hierarchy within the subclass-of hierarchy of the refined schema. (A preliminary presentation of the three rules of contextual partitioning appeared in Perl et al, 1996.) It will be noted that the refinement of the schema is not intended as a replacement for the original schema. We have no interest at all in altering the modeling of that schema or re-engineering it. Neither the classes nor the relationships – and their respective cardinalities – is changed. The sole purpose is to provide a technique which lends support to gaining an understanding of the content of the existing OODB schema. 
This support is obtained by the display of contextual and bi-contextual views. By studying these smaller views, the user will gain comprehension of the original schema. We also present a methodology, based on the theoretical paradigm, for applying the rules of contextual partitioning and finding the forest hierarchy. This methodology relies on an interaction between a human domain expert (e.g., the OODB designer) and
the computer. The expert is asked to follow a series of steps and make some decisions that refine the OODB schema’s subclass-of hierarchy according to the three rules. The computer supports the process by performing computationally intensive steps and presenting the expert with needed data. The process leads to a partition of the schema comprising a forest of tree-structured contextual views. We will demonstrate our methodology by applying it to a subschema of a university OODB (whose entire schema appeared originally in Mehta et al, 1996, 1998). In Gu, Perl, Geller et al (1999), we presented a related methodology for partitioning an electronic medical vocabulary modeled as a semantic network. The rest of this paper is organized as follows. In Section 2, we describe the notions of schema trimming and partitioning as well as schema complexity. The three rules of contextual partitioning are introduced in Section 3. Section 4 contains a proof that the rules of contextual partitioning guarantee the existence of a forest hierarchical view (i.e., one consisting of a collection of trees). The methodology for partitioning the OODB schema into contexts is described in Section 5. In Section 6, we apply the methodology to a subschema of a university OODB. Section 7 contains conclusions.
2. Schema Trimming and Partitioning To quantify the notion of ‘schema complexity’ (and get a handle on the difficulty of schema comprehension), we focus on two factors: (1) the total number of object classes; and (2) the relationship density d defined as the ratio of the number of relationships to the number of classes. For two schemas containing the same number of classes, we take the one with higher relationship density to be more complex. Likewise, for two schemas with the same relationship density, the one with more classes is more complex. Throughout the paper, we will be utilizing a subschema of a large OODB schema that captures a university environment. Figure 1 shows the subschema, which concentrates on only some of the academic aspects of the university structure (Mehta et al, 1996, 1998). For example, only a few details of publications and educational records appear. Several relevant classes which do not participate in the subclass hierarchy and which do not contribute to our discussion, such as the classes Course and Section, have also been omitted to simplify the figure. The graphical conventions used in the figure are as follows (Halper et al, 1993): A class is drawn as a rectangle. A subclass-of relationship is a bold arrow directed from the subclass toward the superclass. An ordinary relationship is a labeled arrow going from the source class to the target class. The subschema contains 32 classes, about one fifth the size of the whole schema. We will demonstrate our techniques on this subschema. In addition to its classes, the subschema in Fig. 1 contains 61 relationships. Therefore, its relationship density d = 61/32 = 1.91. A user must expend a substantial effort becoming oriented to this subschema. This is somewhat disturbing because (a) it is only a subschema of the original schema, and (b) it has been simplified by excluding all attributes. 
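The complexity measure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `relationship_density` and `compare_complexity` are hypothetical and do not come from the paper, and the comparison deliberately returns `None` whenever neither of the two stated criteria applies, since the text does not define an ordering for that case.

```python
def relationship_density(num_relationships: int, num_classes: int) -> float:
    """Relationship density d = (number of relationships) / (number of classes)."""
    if num_classes <= 0:
        raise ValueError("a schema must contain at least one class")
    return num_relationships / num_classes


def compare_complexity(a, b):
    """Compare two schemas given as (num_classes, num_relationships) tuples.

    Returns 'a' or 'b' for the more complex schema, or None when neither
    criterion from the text applies (hypothetical convention).
    """
    (classes_a, rels_a), (classes_b, rels_b) = a, b
    # Compare densities by cross-multiplication to avoid float equality tests.
    lhs, rhs = rels_a * classes_b, rels_b * classes_a
    if classes_a == classes_b and lhs != rhs:
        return "a" if lhs > rhs else "b"
    if lhs == rhs and classes_a != classes_b:
        return "a" if classes_a > classes_b else "b"
    return None


# The subschema of Fig. 1: 32 classes and 61 relationships.
print(round(relationship_density(61, 32), 2))
```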
To facilitate stepwise comprehension of the schema, we will describe two approaches that can be used to identify views of lower complexity. The preliminary orientation gained from each of those lower-complexity views can later help a user in comprehending the entire original schema. There are, of course, various ways that the complexity of an OODB schema can be reasonably reduced. One way is to eliminate all properties (i.e., attributes and ordinary relationships), leaving only the classes themselves and the subclass-of relationships – a process we call schema trimming. This process forms an abstract view of the OODB
Fig. 1. A subschema of the university OODB.
schema, which itself is an abstraction of the OODB’s constituent objects. The hierarchy formed by the subclass-of relationship – arguably the most important construct in OODB schemas (Kim, 1991) – serves as the conceptual ‘backbone’ of an OODB schema. Schema trimming allows us to focus our attention on this backbone. Figure 2 shows the result of applying schema trimming to the schema in Fig. 1. We call the product of schema trimming the hierarchical view of a schema. Note that the relationship density of the hierarchical view in Fig. 2 is d = 27/32 = 0.84. It contains the same number of classes as the original schema, but it has a lower relationship density and is easier to comprehend. Furthermore, the inclusion of only the
Fig. 2. Hierarchical view of university OODB after applying schema trimming.
subclass-of relationships with their uniform nature, in contrast to the varied semantics of ordinary, user-defined relationships, promotes enhanced comprehension. It is possible for a class in an OODB schema to be specialized into many subclasses and also be generalized into many superclasses. Thus, the hierarchical view of an OODB schema will, in general, be a directed acyclic graph (DAG). Therefore, even the hierarchical view may be difficult to utilize for comprehension purposes in cases where its relationship density is relatively high. A large number of classes coupled with a complicated multiple inheritance hierarchy could very well leave a user disoriented. Therefore, schema trimming by itself, although helpful, is not always a sufficient approach. Another approach to the reduction of complexity is to partition a large schema into smaller contextual views. Eliminating the ‘cross-view’ relationships (i.e., those whose
source class is in one contextual view and whose target class is in another) – thus isolating the contextual views from each other – and focusing attention on individual contextual views simplifies the overall schema comprehension task. Specifically, for initial orientation, a user can first pick one contextual view at a time and study its internal relationships (i.e., those that have both their source class and target class in that view). Then, the user can move on to study bi-contextual views for the cross-view relationships between every two contextual views. In this way, the task of studying all relationships is divided into a number of smaller, well-organized, and more manageable tasks.

We stress here that the elimination of the cross-view relationships does not alter the original schema. That schema with all its classes and relationships stays intact, and thus its semantics is not changed at all. When we discuss the elimination of the cross-view relationships, we are referring to the derivation of the contextual views only. These views serve as an interface for comprehending portions of the original schema. As a matter of fact, all these ‘eliminated’ relationships are included in the bi-contextual views which are designed to facilitate comprehension of cross-view relationships. This is accomplished by displaying only a small portion of the cross-view relationships at any one time. By reviewing the bi-contextual views – one at a time – the user will gain an orientation to all the cross-view relationships of the original schema. With this established orientation, the user will be ready to handle the original unchanged schema with all its complexity.

Consider, for example, a large schema partitioned into K contextual views with N classes in each. Suppose further that the number of internal relationships of each view is αN and the number of cross-view relationships between each pair of contextual views is βN. 
Then, the relationship density of the entire schema is

\[ d = \frac{K\alpha N + \frac{K(K-1)}{2}\beta N}{KN} = \alpha + \frac{K-1}{2}\beta \]

The density of each of the K contextual views is obviously α. Furthermore, the consideration of a bi-contextual view only involves a relationship density of

\[ d = \frac{2\alpha N + \beta N}{2N} = \alpha + \frac{\beta}{2} \]

Therefore, when only focusing on a bi-contextual view derived from the partition, a user is confronted with a cross-view relationship density that is smaller by a factor of K − 1 compared to that of the cross-view relationships in the entire schema. The user will, of course, have to review K(K−1)/2 such bi-contextual views in order to cover the whole schema. However, the overall task is partitioned into many small subtasks of lower complexity. Thus, the task of comprehending the schema as a whole is facilitated. To facilitate the partitioning, we need to consider several issues. From the technical side, only a limited size view can be displayed on a computer screen. Thus, the partition should consist of a manageably sized set of views, each of which fits on a computer screen (in a legible layout) and which together constitute the complete schema. From the conceptual side, in order to support comprehension, each view should comprise a logical unit which describes some aspect of the application. As we shall discuss in the next section, each such logical unit is selected to be a ‘context’ (hence our term: ‘contextual partitioning’). The need for a contextual partitioning of the schema seems to introduce a vicious cycle, as one must comprehend the schema in order to partition it into contexts. Hence, the problem of logical partitioning into contexts is a hard problem. Moreover, the combination of the practical and conceptual considerations makes such a schema partitioning problem even harder.
A possible line of attack is to combine schema trimming and schema partitioning by first trying to partition the hierarchical view and then using that partition to impose a partition on the original schema. Obviously, this problem is much simpler than the original since the hierarchical view has fewer relationships. Furthermore, if the hierarchical view has a forest structure, then there exist efficient polynomial-time algorithms for various partitioning criteria (Kundu and Misra, 1977; Perl and Schach, 1981; Becker et al, 1982; Becker and Perl, 1983; Becker and Schach, 1984; Agasi et al, 1993; Lucertini et al, 1993; Becker and Perl, 1995). However, if it does not have a forest structure due to multiple inheritance, then we are faced with what is called the ‘simple graph partitioning problem’, which is known to be NP-complete (Garey and Johnson, 1979) and thus probably has no polynomial-time algorithm. In this paper, we will show that in a hierarchical view it is possible to identify a forest, the semantics of which helps to support schema comprehension. The definition of the simple graph-partitioning problem is as follows: Given a graph, partition it into components such that each component has at most K nodes, and at most L edges connect nodes from different components. Note that the requirement of at most K nodes per component corresponds to the computer screen-capacity issue raised above. The constraint on the number of cross-component edges reflects the fact that partitioning into contextual views tends to minimize the number of cross-view subclass-of relationships.
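While finding such a partition is hard in general, checking whether a proposed partition meets the two bounds is straightforward. The sketch below verifies a candidate partition against the limits K and L from the definition above. The function name, the edge-list representation, and the four-node example graph are assumptions made for illustration.

```python
from collections import Counter


def satisfies_partition_bounds(edges, component_of, max_nodes, max_cross_edges):
    """Check a candidate partition against the simple graph partitioning bounds.

    edges: iterable of (u, v) node pairs.
    component_of: dict mapping every node to its component label.
    max_nodes: K, the largest allowed component size.
    max_cross_edges: L, the largest allowed number of cross-component edges.
    """
    sizes = Counter(component_of.values())
    if any(size > max_nodes for size in sizes.values()):
        return False
    cross = sum(1 for u, v in edges if component_of[u] != component_of[v])
    return cross <= max_cross_edges


# Hypothetical path graph a-b-c-d split into two components.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
component_of = {"a": 0, "b": 0, "c": 1, "d": 1}
print(satisfies_partition_bounds(edges, component_of, max_nodes=2, max_cross_edges=1))
```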
3. Three Rules of Contextual Partitioning 3.1. Category-of and Role-of Specialization Relationships Our identification of a meaningful forest structure partition of a hierarchical view is based on subtleties in the nature of the subclass-of relationship. In previous work (Neuhold and Schrefl, 1988; Neuhold et al, 1989, 1990; Geller et al, 1991), two major types of subclass-of (specialization) relationships have been distinguished: category-of and role-of. According to our own definition (Geller et al, 1991), category-of is a specialization relationship used for class refinement in the case where both the superclass and the subclass are in the same context. On the other hand, role-of is used in the case where the superclass and the subclass are in different contexts, with instances of the subclass functioning in the role-of instances of the superclass. An obvious question facing a domain expert who wishes to exploit these two kinds of specialization relationships is: When do I use one and not the other? According to the definition, this is the same as asking: When do I define a switch of context between a superclass and its subclass? There is no scientific answer to these questions. The use of category-of as opposed to role-of is a decision which must be made by the expert based on an overall understanding of the domain of interest. As an example to demonstrate the notion of context, consider the classes in Fig. 2. The class Student is subclass-of the class Person, and Graduate Student is subclass-of Student. Furthermore, Assistant is subclass-of Graduate Student, and Teaching Assistant is subclass-of Assistant. However, our understanding of the academic domain tells us that information about a student and a graduate student are both in the same context, namely, that of learning, while Person is in a different context, that of personal life. The other two classes, Assistant and Teaching Assistant, are in a third context of employment. 
Thus, Graduate Student is category-of Student because it represents a refinement in the same context. Similarly, Teaching Assistant is category-of Assistant. In contrast, Student is role-of Person and Assistant is role-of Graduate
Student because in each case the two are in different contexts. Let us emphasize that this determination is not always so easy. We will discuss this issue again below. We accept the situation that for some domain experts two classes are in the same context while for others they are in different contexts. Such discrepancies can arise due to differences in perspective, preference, emphasis, etc. In fact, in our view, it is important to give an expert the freedom to determine the context of each class – which is an important aspect of the job of deriving a successful partition. At bottom, the assignment of classes to contexts is a judgment call on the part of the expert. It should be noted that despite extensive research in the areas of artificial intelligence, knowledge representation, knowledge-base systems, natural language processing, etc. (e.g., Guha, 1991; Shoham, 1991; Buvač and Mason, 1993; McCarthy, 1993; Buvač and Fikes, 1995; Iwanska, 1995; Miller, 1995), there is no widely accepted definition of ‘context’. Even so, it is believed that such a notion is an important organizational construct (Lenat and Guha, 1990). The notion of context has been utilized as a means for the logical integration of disparate information resources on the Web (Goh et al, 1994; Madnick, 1999). Building a gigantic knowledge-base in the CYC project (Lenat, 1995) was found to be doomed to failure if contexts were not introduced as structuring mechanisms. Work following this line has assumed that a context is a first-class object used to parameterize axiom schemas (Guha, 1991; McCarthy, 1993; Buvač and Fikes, 1995). However, this approach has not provided clarity regarding the nature of contexts themselves. As a workshop on the notion of context in natural language processing (Iwanska, 1995) showed, researchers agree that they disagree on what contexts are. In this paper, we are not attempting to promulgate a universally accepted notion of context. 
Instead, we just assume a context to be a collection of object classes concentrating on a specific subject. In this way, contextual partitioning can be applied by different domain experts armed with their favorite definitions of context. In other words, we are not participating in the research effort aimed at defining ‘context’. We accept the fact that contexts exist in human thinking as a construct which helps in organizing knowledge. With this in mind, we suggest a theoretical framework and a methodology to help domain experts handle context, whatever that concept might exactly mean to each individual. It is our belief that partitioning a complex schema into such contexts is greatly preferable to leaving the schema without such an organization, particularly when the goal is to promote comprehension. Our theoretical paradigm guarantees the existence of an assignment of classes to contexts which results in a forest view of the DAG-structured hierarchical view. The accompanying methodology finds such a forest. In that regard, we will be providing some guidance on the judgment about whether a context switch is warranted or not. In order to ensure that a forest-structured hierarchical view can be found within a given schema, the assignment of classes to contexts must always satisfy three rules, which will be introduced below. We refer to the refinement that involves the relationships category-of and role-of and which satisfies these rules as contextual partitioning. As we shall see in the next section, all three rules are concerned directly or indirectly with the category-of relationship.
3.2. The Equicontext Equivalence Relation In the discussion of our theoretical paradigm, we will be using a new mathematical relation, called equicontext (or ‘in the same context’), from one object class to another. A pair of two classes belongs to the equicontext relation if both classes are in the same context.
Let us compare the relationship category-of to the equicontext relation. Category-of is directed and asymmetric, while equicontext is undirected and symmetric. The existence of a category-of between two classes implies the equicontext relation between them, but the opposite is not necessarily true. If two classes, A and B, are both category-of C, then A and B are in the same context (i.e., (A, B) ∈ equicontext) since, by definition, both are in the same context as C; however, A and B are not category-of one another. The first rule of contextual partitioning pertains to the equicontext relation:

Rule 1. The equicontext relation is an equivalence relation, i.e., it is reflexive, symmetric, and transitive.

An equivalence relation partitions the elements of a set into disjoint subsets, such that every two elements of the same subset are related and no two elements of different subsets are related. As such, Rule 1 implies Rule 1′.

Rule 1′. The classes of a schema are partitioned by the equicontext relation into disjoint contexts.

Rule 1 forces a domain expert – who is interested in employing contextual partitioning – to explicitly specify contexts in the schema and to resolve any ambiguous situations. The symmetry and transitivity of equicontext imply the following: Any two classes between which there exists a path of category-of relationships (regardless of the directions of the category-of relationships) are in the same context. Once again, we do not claim to have a unique way of assigning classes to contexts. Because this is an expert judgment regarding a real-world environment, there are usually many alternative ways to carry out this assignment. Furthermore, we do not claim that all contexts are naturally disjoint. On the contrary, contexts can overlap. However, in order to achieve our goal of enhancing the comprehensibility of large OODB schemas, the domain expert utilizing contextual partitioning must decide on disjoint contexts. 
Consider, for example, the class Teaching Assistant in Fig. 2. From one side, it belongs to the ‘employment’ context, as does its superclass Assistant. From the other side, it also belongs to the ‘teaching’ context, as does its superclass Instructor. However, Rule 1 requires that Teaching Assistant belong to only one context. When we demonstrate our methodology below, we will see that the choice made is the ‘employment’ context. As can be gathered, the partitioning of classes into disjoint contexts is often a difficult task involving subtle analysis. It is certainly possible that different domain experts will make different decisions in this regard. We do, however, require that any partition satisfy the three rules we set forth in order to guarantee that a forest structure can be identified.
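One consequence of Rule 1 is that classes connected by a path of category-of relationships, in either direction, must share a context. The following sketch groups classes accordingly with a simple union-find structure, using the category-of designations given in Section 3.1. It is an illustration only: the function name is hypothetical, and the result is merely the finest grouping forced by the category-of edges, since an expert may still place classes in the same context without a category-of path between them.

```python
def contexts_from_category_of(classes, category_of_edges):
    """Group classes into the connected components of the category-of graph.

    classes: iterable of class names.
    category_of_edges: iterable of (subclass, superclass) pairs designated
    category-of. Role-of edges are simply left out, since they mark a
    switch of context.
    """
    parent = {c: c for c in classes}

    def find(c):
        # Follow parent links to the representative, halving paths as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for sub, sup in category_of_edges:
        parent[find(sub)] = find(sup)  # direction is irrelevant for grouping

    groups = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())


classes = ["Person", "Student", "Graduate Student", "Assistant", "Teaching Assistant"]
category_of = [("Graduate Student", "Student"), ("Teaching Assistant", "Assistant")]
for context in contexts_from_category_of(classes, category_of):
    print(sorted(context))
```

The three groups correspond to the personal life, learning, and employment contexts of the example in Section 3.1.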
3.3. Category-of Refinement is Exclusive In the model of OODBs that we employ, it is assumed that we may refer to the same ‘real-world object’ at several different levels of the schema’s subclass-of hierarchy. In other words, information pertaining to a single real-world object is distributed among several instances (of different classes) up the hierarchy from the point (class) at which an object was instantiated. (This has been called object slicing, Kuno et al, 1995.) The category-of relationship is used when we refine the objects of a class, say, C when both C and its subclass are in the same context. This means that an instance of a class can have a portion of its information appearing in an instance of a category-of subclass. In the contextual partitioning paradigm, we further require that such a category refinement be mutually exclusive, as specified by the following rule.
Fig. 3. Multiple superclasses (Case 2).
Rule 2. Two classes which are category-of the same superclass cannot both have an instance representing the same real-world object. That is, the real-world objects corresponding to the instances of the different (category-of) specialization classes of a given class form disjoint sets.

We can restate Rule 2 more formally by introducing a new relation ‘same real-world object’ (abbreviated ‘SRWO’) defined from one instance of an OODB to another instance. A pair (x, y), where x and y are instances of arbitrary classes in an OODB, belongs to relation SRWO if both x and y represent the same real-world object. The relation SRWO is obviously an equivalence relation. In the following, we will also use the notation x ∈ extent(C) to denote the fact that x is an instance of class C (i.e., the extent of C is the set of instances of C).

Rule 2 (Restatement). Let A and B be classes in an OODB schema (with A ≠ B) such that A category-of C and B category-of C for some class C. Then there do not exist instances a ∈ extent(A) and b ∈ extent(B) such that (a, b) ∈ SRWO.

The problem is that in some real-world situations this rule is not satisfied; the extents of two category-of specialization classes are not disjoint. Figure 3, known as the Nixon Diamond (Shastri, 1988, 1989), illustrates the problem. The class Republican Quaker is a non-empty subclass of both Quaker and Republican. (President Nixon was an instance of the class Republican Quaker.) It is not permitted in contextual partitioning for both the class Republican and the class Quaker to be designated category-of Person. We need to give the domain expert guidelines on how to deal with such a situation. Those guidelines will be discussed in Section 5.
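The restated Rule 2 lends itself to a mechanical check. In the sketch below, each class extent is abstracted to the set of real-world object identifiers that its instances represent, so that two instances are in SRWO exactly when they carry the same identifier. The function name, this representation, and the identifier "nixon" are assumptions made for illustration.

```python
from itertools import combinations


def rule2_violations(category_of_edges, extent_objects):
    """Find pairs of category-of siblings whose extents share a real-world object.

    category_of_edges: iterable of (subclass, superclass) category-of pairs.
    extent_objects: dict mapping a class to the set of real-world object
    identifiers represented by its instances (hypothetical representation).
    Returns a list of (class_a, class_b, superclass, shared_objects).
    """
    children = {}
    for sub, sup in category_of_edges:
        children.setdefault(sup, []).append(sub)

    violations = []
    for sup, subs in children.items():
        for a, b in combinations(sorted(subs), 2):
            shared = extent_objects.get(a, set()) & extent_objects.get(b, set())
            if shared:
                violations.append((a, b, sup, shared))
    return violations


# The Nixon Diamond: both classes designated category-of Person.
edges = [("Quaker", "Person"), ("Republican", "Person")]
extents = {"Quaker": {"nixon"}, "Republican": {"nixon"}}
print(rule2_violations(edges, extents))
```

An empty result means that no violation was found among the supplied extents; it does not show that the designation is sound for future instances.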
3.4. Uniqueness of a Root Class The third rule of contextual partitioning introduces the notion of the root (or defining) class of a context. Rule 3. For each context, there exists a unique class R that serves as the context’s root class, with every other class in the context being a descendent of R. In other words, each context has one class which is its defining class such that all other classes in the context are specializations of it. There is a directed path of category-of relationships from each class of the context to its root class. Here, we are using the notion of a directed tree where all the directions are towards the root rather than away from it. In graph theory terminology, the root is a sink. For example, the classes Student and Employee are the defining classes of the learning and employment contexts, respectively.
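Rule 3 can likewise be checked for a single context once its classes and category-of relationships are known. The sketch below looks for classes without a category-of parent inside the context and then confirms that every other class reaches the single root. The function names and the two-class 'learning' example are assumptions for illustration.

```python
def context_roots(context, category_of_edges):
    """Classes of the context that have no category-of parent inside it."""
    context = set(context)
    with_parent = {sub for sub, sup in category_of_edges
                   if sub in context and sup in context}
    return sorted(context - with_parent)


def satisfies_rule3(context, category_of_edges):
    """True if the context has exactly one root and all classes descend from it."""
    context = set(context)
    roots = context_roots(context, category_of_edges)
    if len(roots) != 1:
        return False

    # Walk downward from the root along reversed category-of edges.
    children = {}
    for sub, sup in category_of_edges:
        if sub in context and sup in context:
            children.setdefault(sup, []).append(sub)
    reached, stack = {roots[0]}, [roots[0]]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in reached:
                reached.add(child)
                stack.append(child)
    return reached == context


learning = {"Student", "Graduate Student"}
print(satisfies_rule3(learning, [("Graduate Student", "Student")]))
```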
Fig. 4. Extended Nixon diamond.
Fig. 5. Extended Nixon diamond satisfying Rule 3.
Rule 3 is not overly restrictive. In the case where a context has several root classes, a new class T can be created with all the root classes as its category-of specialized classes (children). T will function as the required unique root of the context. As an example, we identify three different contexts in Fig. 4: ‘personal’, ‘religious’, and ‘political’. The contexts ‘religious’ and ‘political’ do not have unique roots. In order to satisfy Rule 3, we have to introduce two new classes as role-of Person. The various religious orientations are category-of the new class Religious Person, and the political orientations become category-of the new class Political Person (see Fig. 5; note that we are using a dashed bold arrow to represent a role-of relationship; a category-of relationship is denoted in the same way as subclass-of: using a solid bold arrow). The role-of relationships of Republican Quaker and Democrat Catholic will be discussed in Section 5.
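The repair described above, introducing a new class T above several roots, can be sketched as follows. The function name `add_unique_root` is hypothetical, and the sketch models only the category-of edges inside one context; the role-of link from the new class to Person would be recorded separately.

```python
def add_unique_root(context, category_of_edges, new_root):
    """Return (context, edges) extended with new_root if several roots exist.

    context: set of class names in one context.
    category_of_edges: list of (subclass, superclass) category-of pairs.
    new_root: name of the class T to introduce (chosen by the expert).
    """
    context = set(context)
    edges = list(category_of_edges)
    with_parent = {sub for sub, sup in edges if sub in context and sup in context}
    roots = sorted(context - with_parent)
    if len(roots) <= 1:
        return context, edges  # Rule 3 needs no repair here
    # Every former root becomes a category-of child of the new class.
    edges += [(root, new_root) for root in roots]
    return context | {new_root}, edges


religious = {"Buddhist", "Jewish", "Quaker", "Catholic"}
context, edges = add_unique_root(religious, [], "Religious Person")
print(sorted(edges))
```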
4. Contextual Partitioning Results in a Forest Structure In this section, we will prove that when the three rules of contextual partitioning are adhered to, the category-of (specialization/generalization) hierarchy exhibits a forest structure. The result is stated as the following theorem.
Fig. 6. Schema demonstrating contradiction. [Figure omitted; classes shown: A, B, C, D, E, F.]
Theorem 4.1. Using contextual partitioning, a class has at most one parent (that is, generalization class) to which it has a category-of relationship.

Proof. Suppose to the contrary that there exists a class A which is category-of both classes B and C. Hence, A and B are in the same context. Similarly, A and C are in the same context. By the transitivity of equicontext (Rule 1), B and C are also in the same context. By Rule 3, the joint context of the classes B and C has a unique root class, say, D, such that both B and C are descendants of D with respect to category-of (Fig. 6). In other words, there is a sequence of category-of relationships from B (C) up to D. Let E (F) be the closest descendant of D on the path of category-of relationships from B (C) to D, such that E ≠ F. In other words, E and F denote the last distinct classes on the respective paths upward from B and C to D. Furthermore, if E and F are not children of D, then redefine D to denote the common parent class of E and F on the paths.

Let a be an instance of A (i.e., a ∈ extent(A)). Then there exists $b_a$ ∈ extent(B) such that (a, $b_a$) ∈ SRWO, since A is category-of B. In other words, instances a and $b_a$ represent the same real-world object. Similarly, there exists $c_a$ ∈ extent(C) such that (a, $c_a$) ∈ SRWO. Thus, both $b_a$ and $c_a$ represent the same real-world object due to the transitivity of the relation SRWO. As noted above, B (C) has a sequence of category-of relationships up to E (F). Hence, E (F) has an instance $e_a$ ($f_a$) such that ($b_a$, $e_a$) ∈ SRWO (($c_a$, $f_a$) ∈ SRWO), where the correspondence is derived transitively along the sequence of category-of relationships. But, as shown above, the instances $b_a$ and $c_a$ represent the same real-world object. Therefore, it follows again from the transitivity of SRWO that $e_a$ and $f_a$ also represent the same real-world object.
However, by Rule 2, the extents of E and F, which are each category-of D, may not both contain an instance representing the same real-world object, which is a contradiction. □

Corollary 4.1. The category-of hierarchy has a forest structure, i.e., it consists of one or more trees.

Proof. A directed graph which contains no cycles and in which each vertex has at most one parent is a forest. Since the category-of hierarchy is a subhierarchy of the directed acyclic subclass-of hierarchy, the category-of hierarchy has no cycle. By the theorem, each class (vertex) has at most one category-of parent. Hence, the category-of hierarchy is a forest. □
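The two conditions used in the proof of the corollary (at most one parent per vertex, no cycles) are easy to test mechanically. The following Python sketch is an illustration only; the function name is_forest and the edge-list representation of category-of relationships as (child, parent) pairs are assumptions made here, not part of the paper.

```python
def is_forest(category_of_edges):
    """Return True if the (child, parent) edges form a forest.

    Checks the two conditions from the proof of Corollary 4.1:
    each vertex has at most one parent, and there is no cycle.
    """
    parent = {}
    for child, par in category_of_edges:
        if child in parent and parent[child] != par:
            return False  # two category-of parents: Theorem 4.1 is violated
        parent[child] = par
    # With at most one parent per vertex, a cycle can only be found by
    # walking parent links upward and revisiting a vertex.
    for start in parent:
        seen = set()
        node = start
        while node in parent:
            if node in seen:
                return False
            seen.add(node)
            node = parent[node]
    return True


ok = is_forest([("B", "D"), ("C", "D"), ("A", "B")])
# A is category-of both B and C, the situation excluded by Theorem 4.1.
bad = is_forest([("B", "D"), ("C", "D"), ("A", "B"), ("A", "C")])
```

Here ok is True and bad is False, since the second edge list gives A two category-of parents.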
The tree structures of the forest serve as the backbones of the schema, and they will help in its comprehension. Furthermore, these trees partition the original schema into manageable views obtained as vertex-induced subgraphs. That is, each contextual view will have the nodes of one tree plus all the relationships whose source and target nodes are in that same tree.
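As a rough illustration of how such vertex-induced views might be computed, the Python sketch below groups classes by the root of their category-of tree and keeps only the relationships whose source and target fall in the same tree. The function name contextual_views, the tuple layout (name, source, target), and the specific parent links used in the example are assumptions for illustration.

```python
def contextual_views(classes, category_of, relationships):
    """Split a schema into one view per category-of tree.

    classes       -- iterable of class names
    category_of   -- dict: child -> category-of parent (a forest)
    relationships -- list of (name, source, target) tuples
    Returns a dict: root -> (set of classes, list of internal relationships).
    """
    def root_of(cls):
        while cls in category_of:
            cls = category_of[cls]
        return cls

    members = {}
    for cls in classes:
        members.setdefault(root_of(cls), set()).add(cls)

    views = {}
    for root, nodes in members.items():
        # Vertex-induced subgraph: keep a relationship only if both ends
        # lie in the same tree.
        internal = [r for r in relationships if r[1] in nodes and r[2] in nodes]
        views[root] = (nodes, internal)
    return views


classes = ["Student", "Graduate Student", "Undergraduate Student",
           "Employee", "Assistant"]
# Parent links below are illustrative assumptions.
category_of = {"Graduate Student": "Student",
               "Undergraduate Student": "Student",
               "Assistant": "Employee"}
relationships = [("supervisor", "Employee", "Employee"),
                 ("has-supervisor", "Graduate Student", "Professor")]
views = contextual_views(classes, category_of, relationships)
```

In this small example the view rooted at Student has no internal relationships, because has-supervisor leaves the tree, while the view rooted at Employee keeps supervisor.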
5. A Methodology for Finding a Forest Hierarchy In this section, we will describe a methodology, based on the theoretical framework, for identifying the forest-structured view of a given schema. This methodology involves human–computer interaction, with a domain expert (e.g., schema designer) being called upon to make judgment decisions based on an understanding of the application. The computer’s role is to provide results of algorithmic procedures which do not involve complex intuitive decisions but do require many computational steps. The output of the methodology is a forest of contexts, based on a refined specialization schema of the original hierarchical OODB schema. Every subclass-of relationship appearing in the original is replaced with either a category-of relationship or a role-of relationship. As we will discuss, the role-of ’s themselves will be differentiated into three types: role-of/regular, role-of/intersection, and role-of/forced. During the partitioning process, the category-of relationships will be left in place to form the forest, and all role-of relationships, irrespective of their type, will be deleted. The methodology is specified algorithmically (in pseudocode) in the following. Comments describe important aspects of the algorithm, and further discussion appears afterward. The major steps of the algorithm have been labeled with numbers appearing at the left-hand margin. We have also explicitly denoted places where the domain expert is called on to make judgment decisions. The input to the algorithm is a schema S. The output is a forest (view) F . In the algorithm, we use the notation superclasses(C) to denote the ‘set object’ that initially holds all the superclasses of a class C. That is, at the outset: superclasses(C) = {U | C subclass-of U }; |superclasses(C)| is the set’s cardinality. Additional notation: L is a list of classes derived from a topological sort of the schema; Q is a queue of classes assumed to be initially empty. 
forest Create_forest_view(schema S)
{
     // Apply schema trimming to the input schema. (See Section 2.)
(1)  S = Schema_trim(S);

     // Apply topological sort, producing a “top-down” ordered list of the
     // classes in the trimmed schema.
(2)  L = Topological_sort(S);

     // Examine all classes in a top-down manner to determine the root classes
     // of contexts. This decision is made by the domain expert by comparing
     // the meaning and importance of a given class with the meaning and
     // importance of each of its superclasses. A class deemed to be a defining
     // class of its own context is made into the root of the new context by
     // disconnecting it from its parents’ contexts. This is done by changing
     // its outgoing subclass-of relationships to role-of’s.
(3)  for (i = 1; i ≤ |L|; i++)
     {
         Display the class Ci (∈ L) and the set superclasses(Ci);
         if (Ci is the defining class of a context) then      ←− Domain expert judgment
             // Make Ci the root of the context. Change its
             // outgoing subclass-of relationships to role-of’s.
             for (each U ∈ superclasses(Ci))
             {
                 Change “Ci subclass-of U” to “Ci role-of U”;
                 Remove(U, superclasses(Ci));
             }
     }

     // Insert into a queue Q all classes having multiple superclasses (i.e.,
     // multiple parents with respect to subclass-of, not role-of). Such
     // classes are enqueued in bottom-up order. For all other classes
     // (i.e., those having a single superclass), add them to the contexts
     // of their respective parents by changing the subclass-of relationship
     // to category-of.
(4)  for (i = |L|; i ≥ 1; i--)
         if (|superclasses(Ci)| > 1) then
             Enqueue(Ci, Q);
         else if (|superclasses(Ci)| == 1) then
         {
             Change “Ci subclass-of U” to “Ci category-of U” [where U ∈ superclasses(Ci)];
             Remove(U, superclasses(Ci));
         }
(4a) Q2 = Q;      // Make a copy of the queue Q

     // In bottom-up order, the domain expert determines the major
     // superclass of each class having multiple superclasses.
(5)  while (NOT Empty(Q))
     {
         C = Dequeue(Q);
         Display the class C and the set superclasses(C);
         if (∃ only one class M s.t. M is the major superclass of C) then      ←− Domain expert judgment
         {
             // C belongs in the context of M. Therefore, make C a category
             // of M and make C a role of all its other superclasses.
             Change “C subclass-of M” to “C category-of M”;
             Remove(M, superclasses(C));
             for (each U ∈ superclasses(C))
             {
                 Change “C subclass-of U” to “C role-of U”;
                 Remove(U, superclasses(C));
             }
         }
         else
             // The domain expert has not been able to decide on a major
             // superclass for C. Therefore, let C be the root of a new
             // context by making it a role of all its superclasses.
             for (each U ∈ superclasses(C))
             {
                 Change “C subclass-of U” to “C role-of/intersection¹ U”;
                 Remove(U, superclasses(C));
             }
     }

     // Resolve contradictory diamonds.²
(6)  while (NOT Empty(Q2))
     {
         C = Dequeue(Q2);
         for (each unordered pair {U1, U2} s.t.
              C role-of/intersection U1 & C role-of/intersection U2)
         {
             A = Lowest_common_ancestor(U1, U2);
             if (all relationships on the paths of hierarchical relationships
                 from U1 to A and U2 to A are category-of’s) then
             {
                 // Arbitrarily, change either the category-of from U1 to its
                 // parent or from U2 to its parent to role-of.
                 Y = Random_select(U1, U2);      // Choose U1 or U2 randomly.
                 Change “Y category-of PY” to “Y role-of/forced³ PY”
                     [where PY is the parent of Y on the path to A];
             }
         }
     }

(7)  F = Remove_role_ofs(S);
     return F;
}
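For readers who prefer executable code, Steps 2 to 5 and Step 7 of the pseudocode can be approximated in Python as follows. This is a minimal sketch under several assumptions, not the authors' implementation: schema trimming (Step 1) and diamond resolution (Step 6) are omitted, the domain expert is modeled by two hypothetical callbacks (is_root and pick_major), the expert is only consulted in Step 3 for classes that have at least one superclass, and the schema is reduced to a dictionary mapping each class to its set of superclasses. The topological sort uses the standard-library graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def create_forest_view(superclasses, is_root, pick_major):
    """Return (category_of, role_of) for a subclass-of hierarchy.

    superclasses -- dict: class -> set of direct superclasses
    is_root      -- callback(cls, supers) -> bool (expert judgment, Step 3)
    pick_major   -- callback(cls, supers) -> superclass or None (Step 5)
    """
    supers = {c: set(us) for c, us in superclasses.items()}
    # Step 2: superclasses are listed before their subclasses (top-down).
    order = list(TopologicalSorter(supers).static_order())
    for c in order:
        supers.setdefault(c, set())
    category_of, role_of = {}, []

    # Step 3: classes that define a context become role-of their superclasses.
    for c in order:
        if supers[c] and is_root(c, supers[c]):
            role_of += [(c, u, "regular") for u in sorted(supers[c])]
            supers[c].clear()

    # Step 4: bottom-up; enqueue multi-parent classes, settle single parents.
    queue = []
    for c in reversed(order):
        if len(supers[c]) > 1:
            queue.append(c)
        elif len(supers[c]) == 1:
            category_of[c] = supers[c].pop()

    # Step 5: the expert picks a major superclass, or none at all.
    for c in queue:
        major = pick_major(c, supers[c])
        if major is not None:
            category_of[c] = major
            role_of += [(c, u, "regular") for u in sorted(supers[c] - {major})]
        else:
            role_of += [(c, u, "intersection") for u in sorted(supers[c])]
        supers[c].clear()

    # Step 7: the category-of links alone constitute the forest.
    return category_of, role_of


# A fragment of the university hierarchy; expert answers are hard-coded.
fragment = {
    "Student": {"Person"}, "Employee": {"Person"},
    "Graduate Student": {"Student"}, "Instructor": {"Employee"},
    "Assistant": {"Graduate Student", "Employee"},
    "Teaching Assistant": {"Assistant", "Instructor"},
}
roots = {"Student", "Employee", "Instructor"}
majors = {"Assistant": "Employee", "Teaching Assistant": "Assistant"}
category_of, role_of = create_forest_view(
    fragment, lambda c, s: c in roots, lambda c, s: majors.get(c))
```

With these hard-coded answers, Assistant ends up category-of Employee and Teaching Assistant category-of Assistant, with the remaining links recorded as role-of relationships.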
After applying schema trimming to the input schema at Step 1, a topological sort is performed at Step 2, producing a list of all the schema’s classes ordered according to a topological sort order (Aho et al, 1983). Step 3 involves the identification of the root classes of contexts. Here, an interaction between the computer and the domain expert takes place, with the expert being called upon to make judgment decisions. The classes are scanned top-down according to the order defined in Step 2. During this process, the classes which are the defining classes (roots) of the respective contexts are identified by the domain expert. This decision is made by comparing the meaning (and importance) of the class with its superclasses’ meanings in the application. Those classes chosen as roots define new contexts of their own instead of extending the contexts of their superclasses. As the roots of the contexts are being identified, their subclass-of relationships to their respective superclasses are changed to role-of relationships. This type of role-of relationship is called a role-of/regular because it denotes a switch of context between the
1 The role-of/intersection relationship is a kind of role-of relationship. This additional qualification is required by Step 6. See below for further details.
2 The notion of diamond will be defined in the discussion below.
3 The role-of/forced relationship, like the role-of/intersection relationship, is a kind of role-of. It will be discussed further below.
generalized class and the specialized class. In effect, these role-of ’s mark the boundaries between contexts in the schema. After Step 3, each class either will have only ordinary subclass-of relationships to its superclasses or will have only role-of relationships to the classes that were formerly its superclasses.4 In other words, a class will have either all superclasses or all role-superclasses; none will have a combination of these. At Step 4, those classes with multiple outgoing subclass-of relationships are added to a queue in bottom-up order for subsequent bottom-up processing. (Below, we will indicate the need for such bottom-up processing.) Those having zero or one subclass-of relationship or any number of role-of ’s are omitted from the list. Furthermore, a class with a single outgoing subclass-of has that relationship changed to category-of ; this adds the class to the context of its superclass. Step 5 calls upon the domain expert to examine successively the set of superclasses of each class enqueued at Step 4. For each class C in the queue, the expert is asked to designate one superclass as C’s ‘major’ superclass, i.e., the superclass in whose context C belongs. The subclass-of relationship between C and the chosen superclass is changed to category-of . All other subclass-of relationships coming from class C are changed to role-of . If the domain expert is unable to choose a major superclass, then C is deemed to start a new context, as discussed further below. In our experience, the expert has ordinarily been able to determine quite easily which superclass, among the multiple superclasses, should be designated major, i.e., which one should have a category-of relationship directed to it. However, there is a minority of cases where this decision is not simple. In such cases, the domain expert must make this decision based on the partial context information already accumulated in the bottom-up processing. 
We offer the following two guidelines for making this decision.

Step 5, Case 1. One of the superclasses is definitional, describing the essence or definition of the subclass, while the other superclasses describe the functionality or usage of (the instances of) the subclass. In this case, the partial context to which the class and its descendants belong should be examined. (This is where the bottom-up processing is utilized.) It should first be determined whether the category-of relationships appearing in this partial context are, in general, functional or definitional. If they are definitional, then the definitional superclass is chosen to be the major superclass. If they are functional, then the functional superclass is preferred. If there happen to be several functional superclasses, the one which matches the function appearing in the partial context of the class will be selected. Finally, if the class being considered is currently the only one in its context, its definitional superclass will be chosen. As mentioned above, the subclass-of relationship between the class and the chosen major superclass is denoted category-of; the other subclass-of's are changed to role-of's. We refer to this type of role-of relationship as role-of/regular since a switch of context from the class to the superclass has occurred.

As an example of this case, consider the class Teaching Assistant having two superclasses, Assistant and Instructor (Fig. 2). The superclass Assistant is definitional. After all, what is a teaching assistant? Answer: an assistant. The superclass Instructor is functional. To see this, consider the question: What does a teaching assistant do? A teaching assistant instructs. Thus, Teaching Assistant should be category-of Assistant and role-of Instructor.
4 From now on, we will use the term ‘superclass’ only with respect to the ordinary subclass-of relationship. The more general class in a role-of relationship will be called a ‘role-superclass’. Likewise, we will use the term ‘category-superclass’.
Fig. 7. Multiple superclasses. [Figure omitted; classes shown: Animal, Feline, Canine, Domesticated, Wild, Cat, Dog, Cheetah, Wolf.]
Fig. 8. Multiple superclasses resolved by Case 1. [Figure omitted; classes shown: Animal, Feline, Canine, Domesticated, Wild, Cat, Dog, Cheetah, Wolf.]
Let us consider another example. Figure 7 models the classifications of dog, cat, cheetah, and wolf according to their biological families, feline and canine, and according to their status as wild or domesticated. Therefore, each of the four kinds of animals has two superclasses. The animal families give definitional information, while being wild or domesticated is a functional description. The results of applying a Case 1 analysis appear in Fig. 8, where no class has two category-superclasses.

Step 5, Case 2. More than one superclass is definitional, but it is possible to distinguish the major superclass from the minor ones by linguistic analysis of the name of the subclass. For example, when the main characteristic of one superclass is expressed in the subclass name as a noun, while that of another superclass is expressed in the subclass name as an adjective, then the noun defines the major superclass. If both main characteristics are expressed grammatically as nouns, then the second noun is considered major.5 In this case, we are following the structure of a noun phrase consisting of a head noun, appearing last, together with a modifier noun. Reconsidering the above example, the name of the class Teaching Assistant consists of the noun ‘Assistant’ and the adjective ‘Teaching’. Therefore, according to Case 2, the class Assistant is chosen as the major superclass, and Instructor as a minor superclass. Figure 3 illustrates another example of Case 2, featuring four classes. The class Republican Quaker has two superclasses, Republican and Quaker. By Case 2, the major superclass for Republican Quaker is Quaker (Fig. 9).

5 There are well-known exceptions to this rule: A toy gun is a toy and not a gun.
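The head-noun part of Case 2 lends itself to a simple mechanical suggestion that the domain expert can accept or override. The Python sketch below is a deliberately crude approximation: it only checks whether the subclass name ends with the name of exactly one superclass, and it performs no real grammatical analysis. The function name major_by_head_noun is hypothetical.

```python
def major_by_head_noun(subclass_name, superclasses):
    """Suggest the superclass named by the final word(s) of the subclass name.

    Returns None when no superclass, or more than one, matches, in which
    case the decision is left entirely to the domain expert.
    """
    matches = [u for u in superclasses if subclass_name.endswith(" " + u)]
    return matches[0] if len(matches) == 1 else None


quaker = major_by_head_noun("Republican Quaker", ["Republican", "Quaker"])
assistant = major_by_head_noun("Teaching Assistant", ["Assistant", "Instructor"])
# The exception noted in the footnote: the heuristic suggests Gun, not Toy.
toy_gun = major_by_head_noun("Toy Gun", ["Toy", "Gun"])
```

The first two calls reproduce the choices made in the text (Quaker and Assistant). The third call returns Gun, which is exactly the kind of exception the footnote warns about, so the suggestion should be treated as a hint rather than a decision.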
Fig. 9. Multiple superclasses resolved by Case 2. [Figure omitted; classes shown: Person, Quaker, Republican, Republican Quaker.]
Fig. 10. An ‘intersection’ of two classes in the same context. [Figure omitted; classes shown: Professor, Computer Science Professor, Computer Engineering Professor, Professor with Joint Appointment in CS & CE.]
There may be times when the domain expert is just unable to decide on a major superclass. In their judgment, all superclasses are of the same or indistinguishable importance, and each contributes to the definition of the subclass in an equal or indistinguishable way. In such a situation, the semantics of the class is a combination of the semantics of all its superclasses, and the class could reasonably belong to the context of any of these superclasses. However, Rule 1 forbids a class from being a member of more than one context. From the other side, there is no reason to prefer one context over the others. Each choice of context will disassociate the class from the other contexts. The conflict is resolved by letting the subclass start a new context, representing an ‘intersection’ of the contexts of the superclasses. This is done by changing all its superclasses to role-superclasses. That is, all subclass-of relationships become role-of relationships. (See the ‘else’ clause in Step 5 in the algorithm.) We refer to this type of role-of as ‘role-of/intersection’ (written ‘r/i’ in the figures). This designation stresses the fact that there is no actual switch of context but rather a joining of a number of different contexts at that point in the schema. It may seem feasible to leave the subclass in the context of its superclasses, assuming they all belong to the same context. However, the rules of contextual partitioning forbid two or more category-of relationships emanating from the same class. (See the theorem in Section 4.) An example of this situation is illustrated in Fig. 10. There, we see that the class Professor with Joint Appointment in CS & CE has two superclasses, Computer Science Professor and Computer Engineering Professor. Both of these are in the same context, namely, that of the class Professor. In this case, the subclass-of links between Professor with Joint Appointment in CS & CE and its superclasses must be changed to role-of/intersection’s. 
Step 6, the last major step of the methodology, is carried out automatically by the computer via a structural analysis of the schema. It deals with the issue of contradictory
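Fig. 11. A contradictory diamond resolved. [Figure omitted. Left: the source C has role-of/intersection (r/i) links to U1 and U2, whose parents PU1 and PU2 lie on paths up to the sink A. Right: the same diamond after one link toward A has been changed to role-of/forced (r/f).]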
diamonds (to be defined shortly) within the schema’s hierarchy.6 Let C be a class having multiple parent classes, and let U1 and U2 be two of them. Furthermore, let A be the lowest common ancestor class of U1 and U2 . The structure containing (i) the class C, (ii) the class A, (iii) all classes which are both descendants of A and ancestors of either U1 or U2 , and (iv) all hierarchical relationships between the respective classes in (i)–(iii) is called a diamond. The class C is called the source of the diamond, and A is its sink. At Step 6, all diamonds having sources which were enqueued at Step 4 are scanned by the computer to search for and resolve what we refer to as contradictory diamonds. Such a diamond D is characterized by a source C that is a role-of/intersection of its two parents in D, with both these parents as well as all other classes in D belonging to the one context (see the left side of Fig. 11). Let us explain the nature of the contradiction. Since C is an ‘intersection’ of its two parents, say, U1 and U2 , both cannot belong to the same context. Otherwise, since the intersection of a context with itself is just the original context, the class C must belong to this common context, too. Such a situation implies that C is category-of both U1 and U2 , which is a contradiction. To ensure that no contradictory diamonds will be included in the partition, the following transformation is applied to each such diamond during this step: One of the respective category-of relationships emanating from U1 and U2 on the path to A is changed to role-of . (The choice of U1 or U2 can be made arbitrarily.) We distinguish this kind of ‘forced’ role-of from the other two kinds, and we call it a ‘role-of/forced’ (‘r/f’ in the figures). After this transformation, U1 and U2 are no longer in the same context, and C can be in the intersection context of the contexts of U1 and U2 . 
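The detection and resolution of contradictory diamonds can be sketched in Python as follows. The sketch simplifies the description above in two ways that should be read as assumptions: it treats two parents as lying in the same context whenever their upward chains of category-of links meet, and it picks the link to convert deterministically instead of randomly. The names category_chain and resolve_contradictory_diamonds are hypothetical, and the class names of Fig. 10 are used purely as illustration.

```python
from itertools import combinations


def category_chain(cls, category_of):
    """Classes reached from cls by following category-of links upward."""
    path = [cls]
    while path[-1] in category_of:
        path.append(category_of[path[-1]])
    return path


def resolve_contradictory_diamonds(category_of, intersections):
    """Convert one category-of link per contradictory diamond.

    category_of   -- dict: child -> category-of parent (mutated, acyclic)
    intersections -- dict: source C -> list of its role-of/intersection parents
    Returns the list of (class, former parent) links now role-of/forced.
    """
    forced = []
    for source, parents in intersections.items():
        for u1, u2 in combinations(parents, 2):
            common = (set(category_chain(u1, category_of))
                      & set(category_chain(u2, category_of)))
            if not common:
                continue  # the parents already lie in different contexts
            # Pick a parent that is not itself the meeting point.
            y = u1 if u1 not in common else u2
            forced.append((y, category_of.pop(y)))
    return forced


category_of = {"Computer Science Professor": "Professor",
               "Computer Engineering Professor": "Professor"}
intersections = {"Professor with Joint Appointment in CS & CE":
                 ["Computer Science Professor", "Computer Engineering Professor"]}
forced = resolve_contradictory_diamonds(category_of, intersections)
```

After the call, Computer Science Professor is no longer category-of Professor (the link is recorded in forced), so the two parents of the joint-appointment class no longer share a context.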
After Steps 1–6 have been carried out, a forest hierarchy of category-of relationships is obtained simply by deleting all three kinds of role-of relationships from the schema (Step 7). This completes the methodology.

The methodology utilized both top-down and bottom-up processing. The determination of context membership for classes is performed top-down since the context of the root class defines the context of its descendants. When scanning the schema top-down, an expert can identify which class defines a new context instead of continuing the context of one of its superclasses which had been processed previously. On the other hand, when determining the context of a class in a bottom-up manner (choosing from among its superclasses), it is important to know which descendants of the given class belong to the same context. This knowledge helps to decide which of the superclasses fits best in the already partially constructed context.

6 It should be noted that after Step 5 has been completed only category-of and role-of relationships appear in the hierarchy of the schema. All subclass-of relationships have, by that time, been replaced by one of those two kinds of relationships.
6. Applying the Methodology to a Subschema of the University OODB We will now apply our methodology step-by-step to the subschema of the university OODB referred to above (Fig. 1). Step 1, schema trimming, results in the view shown in Fig. 2. Topological sort (Step 2) is then carried out on the classes in the hierarchy of Fig. 2. Because the hierarchy in Fig. 2 is not connected and some of its elements are just singletons or small collections of classes, we only demonstrate the methodology on the large component rooted at Person. In Step 3, the hierarchy is scanned top-down by a domain expert following the topological sort order to identify those classes which define new contexts. All subclass-of relationships from such identified classes to their superclasses are changed to role-of relationships. Since the class Person is the unique root of the entire hierarchy, it starts a new context, which we call the ‘Personal’ context. Working top-down, we can see that the classes Alumnus, Student, and Employee are subclasses of Person. The class Alumnus describes the context of former students. The class Student defines the Learning context. Employee starts the Employment context. These three contexts are different from the Personal context defined by Person. Thus, these three classes are considered to start three new contexts, and each of them is made role-of Person. The other class which is considered to start a new context is Instructor. It defines the Teaching context, which is different from the Employment context because it concentrates on a specific kind of activity central to the university environment. Thus, Instructor is identified as the root of the Teaching context, and we make Instructor role-of Employee. From the viewpoint of one domain expert, there are no more classes in this hierarchy that start new contexts. Altogether, five classes were identified as root classes (see Fig. 12). 
Of course, there is always the possibility of ambiguity when choosing roots for new contexts. With respect to the current hierarchy, another domain expert might argue that the class Academic Admin also starts a new ‘Administration’ context, perhaps of special importance, although the number of employees modeled is small. (Note that many more kinds of non-academic employees have been omitted from this subschema.) In what follows, we will continue to give preference to the first expert’s opinion. We will comment on the differences that emerge between the two opinions. This will serve to demonstrate that there are alternative legitimate ways to partition with contexts, each satisfying the rules of contextual partitioning.

During Step 4, all classes which have more than one superclass are enqueued in bottom-up order (reversing the order of the topological sort). These classes are Dept. Chairperson, Teaching Assistant, Academic Admin, and Assistant. Since each class with multiple superclasses may have at most one category-of relationship emanating from it, the domain expert must next decide on one major superclass for each of these four classes (Step 5).
Fig. 12. Hierarchical view of Fig. 2 after applying Step 3. [Figure omitted.]
Dept. Chairperson has two superclasses: one is Professor and the other is Academic Admin. The main function of a chairperson is to lead a department. A chairperson also functions as a professor, e.g., by teaching a course, but this is usually a secondary function. Hence, by Case 1, the class Dept. Chairperson belongs to the same context as Academic Admin. We make it category-of Academic Admin and role-of Professor. The class Academic Admin has two superclasses: Professor and Employee. The purpose of giving an academic administrator a professor’s appointment is to provide a tenured academic position in the case of a resignation from the administrative position. Hence, by Case 1, the superclass Employee is the major superclass. (According to the alternative opinion mentioned above, the class Academic Admin is already role-of its two superclasses. Therefore, no such further analysis is needed.)
Fig. 13. Result of applying Steps 1–5 of the methodology to Fig. 1. [Figure omitted.]
The class Teaching Assistant has two superclasses: Assistant and Instructor. Since Assistant is a definitional superclass, while teaching is a function of a teaching assistant, Assistant is chosen as the major superclass (Case 1). As a matter of fact, we could come to the same conclusion using Case 2, as discussed above. The class Assistant has two superclasses: Graduate Student and Employee. Since Assistant describes student employment rather than academic studies, it is in the same context as Employee; i.e., it is category-of Employee and role-of Graduate Student. After applying Steps 1–5, each class which originally had multiple superclasses is category-of at most one of those original superclasses, and is role-of the rest. Figure 13 shows the result of applying Steps 1–5 of our methodology to the schema in Fig. 1.
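The four Step 5 decisions just described can be recorded and checked with a short Python sketch. The superclass pairs and the chosen major superclasses are taken from the discussion above; the function name refine_multiple_superclasses and the dictionary layout are assumptions made for illustration.

```python
def refine_multiple_superclasses(superclasses, major):
    """Turn each subclass-of link into category-of (major) or role-of (rest)."""
    category_of, role_of = {}, {}
    for cls, supers in superclasses.items():
        chosen = major[cls]
        if chosen not in supers:
            raise ValueError(f"{chosen!r} is not a superclass of {cls!r}")
        category_of[cls] = chosen
        role_of[cls] = [u for u in supers if u != chosen]
    return category_of, role_of


# Classes enqueued at Step 4, with their superclasses.
superclasses = {
    "Dept. Chairperson": ["Professor", "Academic Admin"],
    "Teaching Assistant": ["Assistant", "Instructor"],
    "Academic Admin": ["Professor", "Employee"],
    "Assistant": ["Graduate Student", "Employee"],
}
# Major superclasses chosen by the first domain expert.
major = {
    "Dept. Chairperson": "Academic Admin",
    "Teaching Assistant": "Assistant",
    "Academic Admin": "Employee",
    "Assistant": "Employee",
}
category_of, role_of = refine_multiple_superclasses(superclasses, major)
```

Each of the four classes ends up with exactly one category-of parent and one role-of parent, which is the property the subsequent forest construction relies on.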
Fig. 14. Removing all role-of relationships from Fig. 13. [Figure omitted.]
The next step, Step 6, deals with finding diamonds and resolving contradictory cases. There are three diamonds in Fig. 13: one with source Dept. Chairperson and sink Employee; another with source Teaching Assistant and sink Employee; and the last with source Assistant and sink Person. None of these three diamonds is a contradictory case, and thus Step 6 is unnecessary in this situation. All subclass-of relationships in Fig. 2 have already been refined properly into either category-of or role-of in Fig. 13. Finally, in Step 7, the forest hierarchy consisting of the category-of relationships of Fig. 13 is obtained by deleting all role-of relationships. Figure 14 shows the different contexts as trees of the forest. To summarize, the hierarchy rooted at Person, consisting of 20 classes, is partitioned into five trees. (Following the alternative expert’s opinion, the 20 classes are partitioned into six contexts.) Two of the trees are singletons, containing only Person and Alumnus,
Fig. 15. Learning contextual view. [Figure omitted; classes shown: Student, Graduate Student, Undergraduate Student.]
Fig. 16. Teaching contextual view. [Figure omitted; classes shown: Instructor, Adjunct, Faculty Member, Special Lecturer, Professor, Ph.D. Advisor.]
respectively. The Learning context contains three classes; the Teaching context contains six classes; the Employment context contains nine classes. According to the alternate opinion, the Employment context contains four classes and the Administrative context consists of five classes. This partition is more appealing due to the reduction in the size of the largest context and the overall more even distribution of the sizes of the various contexts. With regard to the issue of enhanced comprehension capabilities afforded by the partition, let us focus on the Learning contextual view in Fig. 15 (consisting of the three classes Student, Undergraduate Student, and Graduate Student) and the Teaching contextual view in Fig. 16 (consisting of the six classes Instructor, Special Lecturer, Faculty Member, Adjunct, Professor, and Ph.D. Advisor). There are no internal relationships for either contextual view. In order to review the cross-view relationships between these two contextual views, see their bi-contextual view in Fig. 17. There are two cross-view relationships directed from the Learning contextual view to the Teaching contextual view. These are both named has-supervisor. One goes from Undergraduate Student to Faculty Member, and the other from Graduate Student to Professor. The only relationship going from the Teaching contextual view to the Learning contextual view is the converse relationship supervisees from Professor to Student. It is clearly much easier to get an orientation to each of these contextual views separately from Figs. 15 and 16 and then acquire a familiarity with their cross-view
Fig. 17. Learning and Teaching bi-contextual view. [Figure omitted; classes shown: Student, Graduate Student, Undergraduate Student, Instructor, Adjunct, Faculty Member, Special Lecturer, Professor, Ph.D. Advisor; relationships shown: supervisees, has-supervisor (twice).]
Fig. 18. Employment contextual view. [Figure omitted; classes shown: Employee, Assistant, Teaching Assistant, Research Assistant, Academic Admin, Dept. Chairperson, Provost, College Dean, President; relationship shown: supervisor.]
Fig. 19. Employment and Teaching bi-contextual view. [Figure omitted; classes shown: Employee, Assistant, Teaching Assistant, Research Assistant, Academic Admin, Dept. Chairperson, Provost, College Dean, President, Instructor, Adjunct, Faculty Member, Special Lecturer, Professor, Ph.D. Advisor; relationships shown: supervisor, has-supervisor, has-research-assistant.]
relationships through the bi-contextual view of Fig. 17 than it is to get such knowledge from Fig. 1. There, these aspects are hidden in the overall structure of a large schema.
As another example, consider the Employment contextual view in Fig. 18. It has only one internal relationship supervisor from class Employee to itself. When considering the bi-contextual view in Fig. 19 involving the Employment and Teaching contextual views, we see that there is one role-of relationship from the Teaching contextual view to the Employment contextual view and three role-of relationships in the other direction. Furthermore, there are two cross-view (non-hierarchical) relationships between them: has-supervisor from the Employment contextual view to the Teaching contextual view, and has-research-assistant going the other way. To demonstrate the difference between orientation while concentrating only on a bi-contextual view versus acquiring it from the full schema, the reader is invited to find these two relationships in Fig. 1, before looking at Fig. 19.
We draw the reader’s attention to our explanation in Section 2 that stresses the fact that we do not alter any aspect of the original schema. All the contextual views, e.g., the Learning, Teaching, and Employment contextual views (Figs. 15, 16, and 18), and the bi-contextual views, e.g., ‘Learning and Teaching’ (Fig. 17) and ‘Employment and
Teaching’ (Fig. 19), are displayed for a single purpose. They serve as vehicles to facilitate the user’s orientation to the original schema. This orientation is gained by dividing the comprehension process involving the large, complex schema into many small, simple tasks of studying the contextual and bi-contextual views. These examples provide anecdotal support corresponding to the computational analysis of Section 2. Both demonstrate the reduced complexity encountered in comprehending the full schema through pair-wise consideration of its contextual views.
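As an informal illustration of the pair-wise view described above, the following Python sketch groups the relationships of the Learning and Teaching contexts by direction and compares the context sizes of the two partitions discussed earlier. The class names, relationship names, and context sizes are taken from the text; the function name bi_contextual_view and the data layout are assumptions made only for this sketch and are not part of the methodology itself.

```python
from collections import defaultdict

# Context membership of each class, as listed in the text for Figs. 15 and 16.
CONTEXT_OF = {
    "Student": "Learning",
    "Undergraduate Student": "Learning",
    "Graduate Student": "Learning",
    "Instructor": "Teaching",
    "Special Lecturer": "Teaching",
    "Faculty Member": "Teaching",
    "Adjunct": "Teaching",
    "Professor": "Teaching",
    "Ph.D. Advisor": "Teaching",
}

# (name, source class, target class) triples described in the text for Fig. 17.
RELATIONSHIPS = [
    ("has-supervisor", "Undergraduate Student", "Faculty Member"),
    ("has-supervisor", "Graduate Student", "Professor"),
    ("supervisees", "Professor", "Student"),
]

# Context sizes under the two partitions compared in the text.
FIRST_OPINION = {"Learning": 3, "Teaching": 6, "Employment": 9}
ALTERNATE_OPINION = {"Learning": 3, "Teaching": 6, "Employment": 4, "Administrative": 5}


def bi_contextual_view(ctx_a, ctx_b, context_of, relationships):
    """Hypothetical helper: group the relationships that lie entirely within
    the two given contexts by (source context, target context)."""
    grouped = defaultdict(list)
    for name, src, dst in relationships:
        direction = (context_of[src], context_of[dst])
        if set(direction) <= {ctx_a, ctx_b}:
            grouped[direction].append((name, src, dst))
    return dict(grouped)


view = bi_contextual_view("Learning", "Teaching", CONTEXT_OF, RELATIONSHIPS)
for (src_ctx, dst_ctx), rels in sorted(view.items()):
    kind = "internal" if src_ctx == dst_ctx else "cross-view"
    print(f"{src_ctx} -> {dst_ctx} ({kind}): {rels}")

# The alternate opinion reduces the size of the largest context.
print(max(FIRST_OPINION.values()), max(ALTERNATE_OPINION.values()))
```

Keys whose source and target contexts coincide would hold internal relationships; for the Learning and Teaching pair no such keys appear, which matches the observation that neither contextual view has internal relationships.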
7. Conclusions
OODB schemas can be large, complex networks of knowledge, particularly in the scope of cooperative information environments. Therefore, gaining an orientation and an understanding of the content of such schemas and their underlying databases can be a daunting task. In this paper, we have addressed the issue of facilitating OODB schema comprehension by presenting both a theoretical paradigm and a methodology for identifying a meaningful forest view of a given original OODB schema. (It will be noted that the original schema is not altered by this process.) The extraction of the forest view employs two approaches: schema trimming and partitioning. We presented three rules which express limitations and refinements to the OODB schema. A technique, called contextual partitioning, based on these rules was introduced. A formal result guaranteeing the existence of a forest view in a contextually partitioned OODB schema was proven. In addition, a human–computer interactive methodology was developed for finding such a forest view, based on the theoretical paradigm. The forest view functions as a skeleton of the original schema and facilitates schema comprehension efforts. Once the user has gained orientation to the various parts of the schema displayed as contextual and bi-contextual views, he or she is ready to handle the original schema with all its complexity. The methodology was demonstrated for a subschema of a university OODB.
Acknowledgements. We would like to thank Zong Chen for drawing some of the figures in this paper.
References
Agasi E, Becker RI, Perl Y (1993) A shifting algorithm for constrained min-max partition on trees. Discrete Applied Mathematics 45:1–28
Aho AV, Hopcroft JE, Ullman JD (1983) Data structures and algorithms. Addison-Wesley, Reading, MA
Becker RI, Perl Y (1983) Shifting algorithms for tree partitioning with general weighting functions. Journal of Algorithms 4:101–120
Becker RI, Perl Y (1995) The shifting algorithm technique for the partitioning of trees. Discrete Applied Mathematics 62:15–34
Becker RI, Schach S (1984) A bottom-up algorithm for weight- and height-bounded minimal partition of trees. International Journal of Computer Mathematics 16:211–228
Becker RI, Perl Y, Schach S (1982) A shifting algorithm for min-max tree-partitioning. Journal of the ACM 29:56–67
Bertino E, Martino L (1993) Object-oriented database systems: concepts and architectures. Addison-Wesley, New York
Booch G, Rumbaugh J, Jacobson I (1999) The Unified Modeling Language user guide. Addison-Wesley, Reading, MA
Buvač S, Fikes R (1995) A declarative formalization of knowledge translation. In CIKM-95, Proceedings of the 4th international conference on information and knowledge management, Baltimore, MD, pp 340–347
Buvač S, Mason IM (1993) Propositional logic of context. In Proceedings of the 11th national conference on artificial intelligence (AAAI-93), Washington, DC, pp 412–419
Cattell RGG, Barry DK (eds) (1997) The object database standard: ODMG 2.0. Morgan Kaufmann, San Francisco, CA
Fowler J (1996) STEP for data management, exchange and sharing. Technology Appraisals, Twickenham, UK
Fowler M, Scott K (2000) UML distilled 2nd edn. Addison-Wesley, Reading, MA
Garey MR, Johnson DS (1979) Computers and intractability. Freeman, New York
Geller J, Perl Y, Cannata P et al (1993) Structural integration: concepts and case study. Journal of Systems Integration 3(2):131–161
Geller J, Perl Y, Neuhold E (1991) Structure and semantics in OODB class specifications. SIGMOD Record 20(4):40–43
Goh CH, Madnick SE, Siegel MD (1994) Context interchange: overcoming the challenges of large-scale interoperable database systems in a dynamic environment. In Adam N, Bhargava B, Yesha Y (eds). CIKM-94, Proceedings of the 3rd international conference on information and knowledge management, Gaithersburg, MD, pp 337–346
Gu H, Halper M, Geller J et al (1999) Benefits of an object-oriented database representation for controlled medical terminologies. JAMIA 6(4):283–303
Gu H, Perl Y, Geller J et al (1999) A methodology for partitioning a vocabulary hierarchy into trees. Artificial Intelligence in Medicine 15(1):77–98
Gu H, Perl Y, Geller J et al (2000) Representing the UMLS as an OODB: modeling issues and advantages. JAMIA 7(1):66–80
Guha RV (1991) Contexts: a formalization and some applications. PhD thesis, Stanford University
Halper M, Geller J, Perl Y et al (1993) A graphical schema representation for object-oriented databases. In Cooper R (ed). Interfaces to database systems. Springer, London, pp 282–307
Iwanska L (1995) Context in natural language processing. In Working notes of workshop W13, IJCAI, Montreal, Canada
Kim H-J (1991) Algorithmic and computational aspects of object-oriented database schema design. In Gupta R, Horowitz E (eds). Object-oriented databases with applications to CASE, networks, and VLSI CAD. Prentice-Hall, Englewood Cliffs, NJ, pp 26–61
Kim W, Lochovsky FH (eds) (1989) Object-oriented concepts, databases, and applications. ACM Press, New York
Kundu S, Misra J (1977) A linear tree-partitioning algorithm. SIAM Journal of Computation 6:131–134
Kuno HK, Ra YG, Rundensteiner EA (1995) The object-slicing technique: a flexible object representation and its evaluation. Technical report CSE-TR-241-95, University of Michigan
Lenat DB (1995) CYC: a large-scale investment in knowledge infrastructure. Communications of the ACM 38(11):33–38
Lenat DB, Guha RV (1990) Building large knowledge-based systems: representation and inference in the CYC project. Addison-Wesley, Reading, MA
Liu L, Halper M, Geller J et al (1999) Controlled vocabularies in OODBs: modeling issues and implementation. Distributed and Parallel Databases 7(1):37–65
Liu L, Halper M, Geller J et al (2002) Using OODB modeling to partition a vocabulary into structurally and semantically uniform concept groups. IEEE Transactions on Knowledge and Data Engineering 14(4):850–866
Loomis MES (1995) Object databases: the essentials. Addison-Wesley, Reading, MA
Lucertini M, Perl Y, Simeone B (1993) Most uniform path partitioning and its use in image processing. Discrete Applied Mathematics 42:227–256
Madnick SE (1999) Meta-data Jones and the Tower of Babel: the challenge of large-scale semantic heterogeneity. In Proceedings of the 3rd IEEE meta-data conference, Bethesda, MD
McCarthy J (1993) Notes on formalizing context. In 13th international joint conference on artificial intelligence, Chambery, France, pp 555–560
Mehta A, Geller J, Perl Y et al (1996) Computing access relevance for path-method generation in OODBs and IM-OODB. Journal of Intelligent Information Systems 7(1):75–100
Mehta A, Geller J, Perl Y et al (1998) The OODB Path–Method Generator (PMG) using access weights and precomputed access relevance. VLDB Journal 7(1):25–47
Miller GA (1995) WordNet: a lexical database for English. Communications of the ACM 38(11):39–41
Neuhold EJ, Schrefl M (1988) Dynamic derivation of personalized views. In VLDB’88, Long Beach, CA, pp 183–194
Neuhold EJ, Geller J, Perl Y et al (1989) Separating structural and semantic elements in object-oriented knowledge bases. In Advanced database system symposium, Kyoto, Japan, pp 67–74
Neuhold EJ, Geller J, Perl Y et al (1990) A theoretical underlying Dual Model for knowledge-based systems. In Proceedings of the 1st international conference on systems integration, Morristown, NJ, pp 96–103
Perl Y, Geller J, Gu H (1996) Identifying a forest hierarchy in an OODB specialization hierarchy satisfying disciplined modeling. In Proceedings of CoopIS’96, Brussels, Belgium, pp 182–195
Perl Y, Schach S (1981) Max-min tree-partitioning. Journal of the ACM 28:5–15
Rumbaugh J, Blaha M, Premerlani W et al (1991) Object-oriented modeling and design. Prentice-Hall, Englewood Cliffs, NJ
Rumbaugh J, Jacobson I, Booch G (1999) The Unified Modeling Language reference manual. Addison-Wesley, Reading, MA
Schenck DA, Wilson PR (1994) Information modeling the EXPRESS way. Oxford University Press, New York
Shastri L (1988) Semantic networks: an evidential formalization and its connectionist realization. Morgan Kaufmann, San Mateo, CA
Shastri L (1989) Default reasoning in semantic networks: a formalization of recognition and inheritance. Artificial Intelligence 39(3):283–356
Shoham Y (1991) Varieties of context in artificial intelligence and mathematical theories of computation. Academic Press, London
Wickens CD, Gordon SE, Liu Y (1997) An introduction to human factors engineering. Addison-Wesley, Reading, MA
Zdonik SB, Maier D (1990) Fundamentals of object-oriented databases. In Zdonik SB, Maier D (eds). Readings in object-oriented database systems. Morgan Kaufmann, San Mateo, CA, pp 1–32
Author Biographies
Huanying Gu received her B.S. degree in Computer Science from Huazhong University of Science and Technology and her M.S. degree in Computer Science from Nanjing University of Science and Technology, both in China. She received her Ph.D. in Computer Science from the New Jersey Institute of Technology in 1999. She is now an Assistant Professor of Biomedical Informatics at the University of Medicine and Dentistry of New Jersey (UMDNJ). During her graduate studies, she participated in the Object-Oriented Healthcare Vocabulary Repository (OOHVR) Project, funded by the National Institute of Standards and Technology (NIST) Advanced Technology Program. Her research interests include object-oriented modeling, OODB systems, knowledge representation, controlled medical terminologies, and biomedical informatics. She has published numerous papers in international journals and conferences.
Yehoshua Perl received his Ph.D. degree in Computer Science in 1975 from the Weizmann Institute of Science, Israel. He was appointed lecturer and senior lecturer in Bar-Ilan University, Israel, in 1975 and 1979, respectively. Since 1985, he has been in the Computer Science Department at New Jersey Institute of Technology (NJIT), where he was appointed professor in 1987. He received the Harlan Perlis Research Award of NJIT in 1996. Dr Perl is the author of about 100 papers in international journals and conferences. His publications are in the following areas: object-oriented databases, ontologies, design and analysis of algorithms, data structures, sorting networks, and medical informatics. Highlights of his research include, among others: the shifting algorithm technique for tree partitioning, analysis of interpolation search, the design of periodic sorting networks, modeling medical vocabularies using object-oriented databases, and enhancing the semantics of object-oriented databases.
Michael Halper received the B.S. degree (with honors) in Computer Science from NJIT in 1985; the M.S. degree in Computer Science from Fairleigh Dickinson University in 1987; and the Ph.D. degree in Computer Science from NJIT in 1993. During his graduate studies, he was the recipient of a Garden State Graduate Fellowship from the State of New Jersey. Dr Halper is an Associate Professor of Computer Science at Kean University, and a visiting researcher at NJIT’s OODB & AI Laboratory. His research interests include conceptual and object-oriented data modeling, part–whole modeling, extensible data models, object-oriented database (OODB) systems, and medical informatics. He has worked on the OOHVR project, funded by the National Institute of Standards and Technology (NIST) Advanced Technology Program, to model controlled terminologies using OODB technology. Dr Halper has published numerous papers in international journals, conferences, and workshops. He is a member of the Honor Society of Phi Kappa Phi.
James Geller received an Electrical Engineering Diploma from the Technical University Vienna, Austria, in 1979. His M.S. degree (1984) and his Ph.D. degree (1988) in Computer Science were received from the State University of New York at Buffalo. He spent the year before his doctoral defense at the Information Sciences Institute (ISI) of USC in Los Angeles, working with their Intelligent Interfaces group. James Geller is currently professor in the Computer Science Department of the New Jersey Institute of Technology, where he is also Director of the OODB & AI Laboratory and Vice Chair of the M.S. and Ph.D. programs in Biomedical Informatics. Dr Geller has published numerous journal and conference papers in a number of areas including knowledge representation, parallel artificial intelligence, medical informatics, and OODB systems. His current research interests concentrate on object-oriented modeling of medical vocabularies, and on Web mining. James Geller is a past SIGART Treasurer.
Erich J. Neuhold received his M.S. in Electronics and his Ph.D. in Mathematics and Computer Science at the Technical University of Vienna, Austria, in 1963 and 1967, respectively. From 1963 to 1972, he was a research scientist at IBM. From 1972 to 1983, he was Professor of Computer Science at the University of Stuttgart, Germany. In 1983 and 1984, he was Laboratory Director at Hewlett Packard, Palo Alto. From 1984 to 1986, he was Professor of Computer Science, Technical University of Vienna. Since 1986 he has been Director of the Institute for Integrated Publication and Information Systems (IPSI) in Darmstadt, Germany. Since 1989 he has also been Professor of Computer Science at the Darmstadt University of Technology. His primary research interests are in heterogeneous multimedia database systems, Web technologies, persistent information and knowledge repositories (XML, RDF, etc.), content engineering, user interfaces, mobile technology, and security in Web applications like e-learning and e-commerce.
Correspondence and offprint requests to: Huanying Gu, Department of Health Informatics, UMDNJ, Newark 07107, NJ, USA. Email:
[email protected]