The Role of Domain Ontologies in Database Design: An Ontology Management and Conceptual Modeling Environment

VIJAYAN SUGUMARAN
Oakland University, Rochester, MI
and
VEDA C. STOREY
Georgia State University, Atlanta, GA
________________________________________________________________________
Database design is difficult because it involves a database designer understanding an application and translating the design requirements into a conceptual model. However, the designer may have little or no knowledge about the application or task for which the database is being designed. This research presents a methodology for supporting database design creation and evaluation that makes use of domain-specific knowledge about an application stored in the form of domain ontologies. The methodology is implemented in a prototype system, the Ontology Management and Database Design Environment. Initial testing of the prototype illustrates that the incorporation and use of ontologies is effective in creating entity-relationship models.

Categories and Subject Descriptors: H.2.1 [Database Management]: Logical Design - data models
General Terms: Design, Experimentation
Additional Key Words and Phrases: Ontology, conceptual modeling, entity-relationship modeling, database design, integrity constraints, ontology management and database design environment
________________________________________________________________________
This is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version. An earlier version of this manuscript was presented at the International Conference on Information Systems (ICIS’03), Seattle, Washington, December 14th – 17th, 2003. This research was supported by Oakland University and Georgia State University.
Authors' addresses: V. Sugumaran, Department of Decision and Information Sciences, School of Business Administration, Oakland University, Rochester, MI 48309; email: [email protected]; V.C. Storey, Computer Information Systems Department, J. Mack Robinson College of Business, Georgia State University, Atlanta, GA 30302; email: [email protected].
Copyright 200x by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or [email protected].
ACM Transactions on Database Systems, Vol. V. No. N, Month 20YY, Pages 1 – 28.
________________________________________________________________________
1. INTRODUCTION
Database design is difficult because it relies heavily on a database designer understanding the users’ database requirements and representing them so they accurately capture the real-world application being modeled. A database designer, however, may not have knowledge of the domain being modeled and, thus, relies upon the user to articulate the design requirements. The conceptual modeling phase of database design focuses on building a high-quality representation of selected phenomena in some domain. Database designers generate conceptual models (scripts) using (a) conceptual modeling grammars (e.g., the entity-relationship modeling grammar) and (b) conceptual modeling methods, and they work within an organizational context [Weber, 2002a]. Much progress has been made on developing methodologies and guidelines to assist the designer in conceptual database design, especially in understanding how to apply design constructs [Bodart et al. 2002; Weber 1996; Shanks et al. 2002; Siau et al. 1997; Parsons and Wand, 2000]. However, it would also be useful if a designer had at his or her disposal knowledge about an application domain in the form of a repository of application-specific knowledge [LloydWilliams 1997].

Ontologies have been proposed as an important way to represent real world knowledge and, at some level, to support interoperability [Swartout 1999; Gualtieri 2005; Knight et al. 2006]. A domain ontology essentially provides knowledge of the relevant terms of a domain. If a knowledge base of such terms were available and could be processed automatically, this could both speed up the design process and make the results more relevant. Ontologies help to capture the semantics of a domain by acting as a surrogate for the meaning of terms. Semantics, for the purpose of this research, is defined as the meaning, or essential message, of the terms used in a conceptual model; that is, of words and phrases representing entities and relationships.

Domain ontologies can support database design in two ways. First, domain knowledge, as stored in an ontology, can help when creating a database design because it can suggest what terms (e.g., entities) might appear in an application domain and how they are related to other terms. The relationships in an ontology map to business rules and have implications for semantic integrity constraints. Second, the comparison of constructs in a partial design with those in an ontology could highlight missing constructs.

The objectives of this research are to:
• develop a methodology for creating or evaluating entity-relationship models using domain ontologies that capture and represent some of the semantics of an application domain;
• implement the methodology in a prototype;
• empirically assess the effectiveness of the methodology.
The contribution of the research is to provide a way to create database designs that capture and represent the semantics of an application. It also demonstrates how domain ontologies can be used to facilitate database design and provides a prototype for assessment of ontology use. Using the methodology should minimize the amount of work on the part of the designer and result in the development of more comprehensive and consistent conceptual models.

2. RELATED RESEARCH
2.1 Database Design
Much progress has been made on developing methodologies for database design. Methodologies that support distinct conceptual and logical design have been proposed and research carried out on automating various aspects of these design phases [Denis and Wixom 2000; Alter 1999]. This has resulted in the development of CASE tools with sophisticated diagramming capabilities. These tools have proven useful for consistency checking of the syntactic aspects of a design [Stewart and Arora 2003]. However, they could be even more useful if they had access to knowledge about the semantics of an application domain [Noah and Williams 2002].

2.2 Conceptual Modeling
Conceptual modeling deals with understanding the real world and representing it in such a way that it can be translated into a design that captures the main aspects of an application. Using the entity-relationship model, this means applying the entity, relationship, and attribute constructs. A relationship is of the form A verb phrase B where
A and B are entities. In linguistics, the semantic counterparts of verb phrases are events. Events are phenomena that occur in time, either statically (one sits in front of a computer) or dynamically (one walks to have lunch) [Guarino, 2005]. Bunge [1977] and others consider events to be only dynamic phenomena that include some notion of time. Understanding relationships, therefore, includes understanding and classifying events, although not all relationships correspond to events. Prior research on conceptual modeling has distinguished entities from attributes and relationships and focused on defining the nature of relationships [Dey et al. 1999; Wand et al. 1999; Weber 1996; Shanks et al. 2002; Weber 2002a].

2.3 Ontologies
An ontology defines a set of constructs used to represent real-world phenomena [Allen and March 2005]. An ontology should be a way of describing one’s world, although there are many different definitions, descriptions, types, and approaches to ontology development [Weber 2002b; Welty and Guarino 2001]. Ontologies generally consist of terms, their definitions, and axioms relating them [Gruber 1993]. They can also be thought of as the consequence of capturing, representing, and using surrogates for the meanings of terms [Purao and Storey 2005]. This research focuses on domain ontologies that specify conceptualizations specific to a domain [Weber 2002b]. For example, a domain ontology for auctions could consist of terms such as item, bid-price, bidder, and seller and the relationships among them.

Several research projects have attempted to use some form of ontologies for conceptual design. Bergholotz and Johannesson [2001] propose an ontology to classify relationships based upon data abstractions and speech acts. Storey [2005] presents an ontology for classifying the verb phrases of relationships based upon research in linguistics and semantic data models.
Purao and Storey [2005] develop a multi-layered ontology for classifying relationships by using data abstractions and other data modeling primitives and by separating domain-dependent and domain-independent aspects of the relationship constructs. Storey et al. [1998] classify entities into various categories. All of these are interactive classification schemes that attempt to capture semantics by mapping terms or phrases to predefined categories. Other research on the classification of verb phrases includes Dahchour’s [2001] proposal to understand relationships using metaclasses. Another project is found in Siau [2004]. These research efforts, however, concentrate on the classification schemes and mostly require user interaction to make the mapping from the term name to the classification category.

Database design is a good problem for the application of ontologies because having a surrogate for the meaning of constructs could help a designer with little knowledge of an application domain. Furthermore, conceptual modeling using the entity-relationship model has only three main constructs to consider. However, most ontologies are not sufficient for capturing the level of semantics needed for conceptual modeling. For example, there is no mechanism for representing semantic integrity constraints that need to be reflected in a design.

2.4 Ontology Libraries
Ontology development is fundamentally a difficult problem. Research on creating ontologies and using ontologies has been motivated by the Semantic Web and knowledge reuse [Ding and Foo 2002; Pinto and Martins 2004; Lammari and Metais 2004]. The Semantic Web is intended to extend the World Wide Web by capturing and representing the semantics of an application domain in domain ontologies [Hendler 2001]. Research on the Semantic Web instigated the need for libraries of ontologies, the most well-known
of which are the DAML ontologies, which were developed specifically for the Semantic Web (www.daml.org). Another motivation for the development of ontology libraries is knowledge reuse, which is important because of the overwhelming amount of information that is constantly being generated. It is intended that such ontology libraries will be publicly available [McGuinness 2002].

The creation of ontologies requires a great deal of effort to identify domain terms and how they are related, organize the terms into hierarchies, separate instances (e.g., Charlottetown) from types (e.g., city), and so forth. Much ontology development is carried out on a manual basis, although tools and techniques are being developed for automatic or semi-automatic generation of ontologies. Ding and Foo [2002] suggest that small ontologies are useful because they enable semantic agreement among the people using them with little time lag. Although research on semi-automatic or automatic ontology evolution and generation has been carried out [Embley et al. 2005; Al-Muhammed et al. 2005; Storey et al. 2005; Oberle et al. 2004], current efforts are at an early stage of development and still require user intervention. Embley’s [2004] methodology for generating ontologies, for example, considers the information stored in tables on the Web for a specific domain. Ontology integration issues must also be addressed; they bear some resemblance to those studied in research on heterogeneous databases [Ding and Foo 2002].

3. ONTOLOGY FOR DATABASE DESIGN
This section presents an ontology that incorporates domain knowledge for database design. The method used to develop the ontology is outlined and the components and representation issues discussed. The research is intended to use existing libraries of ontologies as they are developed.
As mentioned above, there has been much call for the development of ontologies for different application domains that can be shared amongst different applications and interest groups [Embley 2004; Heflin and Hendler 2000; Fensel et al. 2001; Brewster and O'Hara 2004; Noy et al. 2004]. This research on database design is one such application. As research progresses on automated and semi-automated ontology development [Ding and Foo 2002], the availability of useful ontologies will increase. Although the actual development of ontologies is not the focus of this research, the following outlines the procedure followed for the manual, but systematic, development of the ontologies used in this research.

3.1 Components and Development
The main components of an ontology are terms, relationships, and axioms. The relationships that typically appear in an ontology are subclass-of (is_a), instance-of, synonym, and related-to [Gomez-Perez 1999; McGuinness 2002]. This research extends these main concepts to include domain relationships that capture additional constraints needed for database design and which represent, to some extent, semantic integrity constraints. These additional constraints capture various aspects of a business application. The ontology for the auction domain was developed using the following steps:

Step 1: Identify category of website. For this research, it was the retail category.
Step 2: Specify target domain and initial domain knowledge. The target domain was auction. For the auction domain, the keyword “auction” was used as input to a search on Google.
Step 3: Crawl and scan web pages. From the websites for auctions, the standard format of the auction domain was extracted. This enabled the researchers to identify the main terms of the application domain. It also enabled the researchers to infer which terms were somehow related to each other. Then, the domain relationships were identified.
Step 4: Extract concepts. The terms extracted were in the form of nouns and verbs which translate loosely into entities and relationships.
Step 5: Analyze and cluster extracted features. A content analysis of the extracted concepts from the web pages identified which terms might belong together based upon the frequency of their appearances together. For example, item and bid occur together. From these, inferences were made based upon common sense knowledge of the application. For example, an item is a pre-requisite for a bid. The relationships between the terms were generated by examining how close the terms appeared to each other on the website and by adding constraints based upon knowledge of the application domain. For example, in order for a customer to buy an item, a customer needs to bid on it. To be a seller, one must have an item to sell, and so forth. These translate into constraints, for example, pre-requisite constraints.

3.2 Ontology Components and Representation
The auction domain ontology used in this research is a lightweight ontology when compared, for example, to upper level ontologies such as CYC [Lenat et al. 1995], OWL ontologies (http://protege.stanford.edu/plugins/owl/owl-library/index.html), or SUMO [Pease et al. 2002]. In ontology development, concepts or terms are often organized into taxonomies with binary relationships. Most ontologies focus heavily on terms, often organized as a class and subclasses (e.g. the DAML ontologies). To a much lesser extent, instances of terms are included. Representing and reasoning with these basic components is not enough to create a heavyweight ontology capable of supporting complex reasoning [Corcho et al. 2003a]. Therefore, before implementing an ontology, one should decide on the application needs in terms of expressiveness and inference. This research extends the relationship component of an ontology [Sugumaran and Storey 2002].
For conceptual modeling, the basic relationships (synonym, is_a, and related_to) employed in an ontology are limiting because of their inability to capture domain specific knowledge (business rules). We model other types of relationships, which we call domain relationships, the purpose of which is to define and enforce business rules. Four types of domain relationships are defined: a) pre-requisite, b) mutually-inclusive, c) mutually-exclusive, and d) temporal.

Pre-requisite relationship: One term/relationship depends upon another. For example, in the online auction, a bid requires an item. However, information about the item can exist without requiring a bid. Consider two terms A and B. If A requires B, then, the prerequisite constraint is: A B. Term B may occur by itself; however, for term A to occur, term B is required.

Temporal relationship: One term or relationship must occur before another. If A must precede B, the temporal constraint is: A B. For example, to bid on an item, one must register as a bidder. Temporal relationships can be difficult to establish. Considerable research examines the dimensions of time and how the notion of time can be incorporated into information systems [Allen 1983]. There has been some ontology development work done to capture time. For example, the DAML ontology includes the terms time-point, time-unit, month, day-of-the week, time lapse (between one time and another), hour, year, calendar, start and end time, duration, interval sequence, upper and lower bounds, etc. This ontology, thus, captures basic time measurement and other notions such as duration, overlap, meets, etc. However, it does not provide adequate representation for temporal dependence.

Mutually inclusive relationship: One term may require another for its existence. If A requires B and B also requires A, a mutually inclusive relationship exists between A and
B, A B, signifying that these two terms occur simultaneously. To be a bidder, one must make a bid. Likewise, a bid cannot be made without a bidder.

Mutually exclusive relationship: One term or relationship cannot occur at the same time as another. For example, an individual cannot be the buyer and seller of the same item. This operates on the relationship for availability and the relationship for closed bidding. If A and B are mutually exclusive, then only one of them can occur in a given context. The mutually exclusive relationship is: A B.

An ontology is conceptually represented as a semantic network, where the nodes correspond to the ontology’s concepts or terms and the arcs correspond to various relationships, as illustrated in Figure 1. This semantic network is converted into a set of facts, using standard templates. These templates are designed to be application domain independent so that they can be processed by a number of applications and promote interoperability. The internal representation of the ontology components stored in the domain ontology repository is given below.
[Figure 1: semantic network of the auction domain. Nodes: Transaction, Seller, Account, Customer, Shipper, Bidder, Buyer, Sell, Buy, Item, Bid, Vendor, Category, Product. Arcs are labeled with relationship types.
Legend. Basic relationships: Is_a (Is-a), Syn (Synonym), Rel (Related-to). Domain relationships: Pre (Pre-requisite), Mu-In (Mutually-Inclusive), Mu-Ex (Mutually-Exclusive), Temp (Temporal).]
Fig. 1. Partial Auction Domain Ontology
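As a rough illustration of how a semantic network such as the one in Figure 1 might be held in memory and flattened into facts, the following Python sketch stores terms and typed relationships and renders them with the (term: ...) and (relationship: ...) templates that this section goes on to define. It is a minimal sketch, not the prototype's implementation. The class names, the id chosen for Bidder, and the id of the second relationship are assumptions made for illustration; the fact layout, the seven relationship types, and the Bid/Item prerequisite fact are taken from the paper.

```python
from dataclasses import dataclass

# Relationship types named in the paper: three basic and four domain types.
BASIC_TYPES = {"is-a", "synonym", "related_to"}
DOMAIN_TYPES = {"prerequisite", "mutually-inclusive", "mutually-exclusive", "temporal"}


@dataclass(frozen=True)
class Term:
    id: int
    name: str

    def to_fact(self) -> str:
        # Template: (term: id name)
        return f"(term: {self.id} {self.name})"


@dataclass(frozen=True)
class Relationship:
    id: int
    term1: Term
    rtype: str
    term2: Term

    def __post_init__(self):
        # Reject relationship types the ontology does not define.
        if self.rtype not in BASIC_TYPES | DOMAIN_TYPES:
            raise ValueError(f"unknown relationship type: {self.rtype}")

    def to_fact(self) -> str:
        # Template: (relationship: rel-id tm1: term1 tid1: id1 rtype: type tm2: term2 tid2: id2)
        return (f"(relationship: {self.id} tm1: {self.term1.name} tid1: {self.term1.id} "
                f"rtype: {self.rtype} tm2: {self.term2.name} tid2: {self.term2.id})")


# Ids 1 (Bid) and 3 (Item) follow the paper's example; id 2 for Bidder is hypothetical.
bid = Term(1, "Bid")
bidder = Term(2, "Bidder")
item = Term(3, "Item")

relationships = [
    Relationship(5, bid, "prerequisite", item),          # fact given in the paper
    Relationship(6, bidder, "mutually-inclusive", bid),  # hypothetical relationship id
]

facts = [t.to_fact() for t in (bid, bidder, item)] + [r.to_fact() for r in relationships]
for fact in facts:
    print(fact)
```

Keeping the relationship type as a validated string, rather than one class per type, mirrors the single relationship template with an rtype: slot.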
The term component is expressed as a fact using the following template: (term: id name), where term: is the keyword indicating the type of fact, namely, the fact representing terms. id is an integer that uniquely identifies the term, and name is a string of characters that represents the term itself. For example, the concept of “Bid” in the auction domain is expressed as (term: 1 Bid). The relationship component is represented using the following format:
(relationship: rel-id tm1: term1 tid1: id1 rtype: relationship-type tm2: term2 tid2: id2)

In the above fact, the keyword relationship: indicates that it is a relationship fact. It also contains other keywords, namely, tm1:, tid1:, rtype:, tm2:, tid2:, which help separate the specific values of the ids and the terms within the fact. In addition, rel-id is an integer that uniquely identifies the relationship fact, term1 is a string of characters that corresponds to the first term, id1 is the identifier of the first term, and relationship-type represents the type of the relationship, which is one of is-a, synonym, related_to, prerequisite, mutually-inclusive, mutually-exclusive, or temporal. The second term in the relationship is represented by term2 with id2 as its identifier. An example of a relationship fact is:

(relationship: 5 tm1: Bid tid1: 1 rtype: prerequisite tm2: Item tid2: 3)

The relationship fact (id 5) indicates that the term “Bid” (id 1) has a prerequisite relationship with the term “Item” (id 3). That is, the Item must exist in order to place a bid on it.

In terms of the entity-relationship model, the ontology currently captures entities and relationships. Part of the difficulty with capturing attributes is that existing ontologies do not distinguish them. Therefore, rules need to be written to do so. Consider, for example, the computer science department ontology from the DAML library. It contains the classes (which roughly map to entities) of student, subject, faculty, workshop paper, journal, etc. The ontology is very good at identifying subclasses (e.g., full-time or part-time professors, journal or conference papers). However, it does not contain attributes such as name, address, and date-of-birth for professor, or date-of-submission for research papers.

4. ONTOLOGY-BASED APPROACH FOR DATABASE DESIGN
The two primary ways in which an ontology could be used to improve a design are: 1) conceptual model generation and 2) conceptual model validation.
The role of an ontology is to provide a comprehensive set of terms, definitions, and relationships for the domain for which a conceptual model is to be created. Therefore, the first task is to generate a design “from scratch,” using the terms and relationships in the ontology as representative of the domain. The second task is to use the ontology to check for missing terms or inconsistencies in an existing, or partial, design.

4.1 Conceptual Model Generation from Scratch
The steps involved in creating an E-R model from scratch using an ontology are shown in Figure 2. The steps are: a) identify initial user terms, b) map the initial terms to domain terms in the ontology and expand the model, c) check for consistency, d) generate the E-R model, and e) map the E-R model to a relational model. These steps are accomplished based on a set of rules which are stored in a knowledge base and summarized in Figure 3.

Step a - Identification of Initial Terms: The user provides his or her application requirements or a textual description of the problem domain and desired functionalities. Consider the user requirement: “I want to design a database to support an auction website. It needs to keep track of products, their seller and buyers and prices at which they are sold.” This requirements fragment can be parsed using simple natural language processing techniques (part-of-speech tagged parsing [Mason and Grove-Stephensen 2002]) to identify key terms.
[Figure 2: process flow diagram. Processes: NL Processing for Initial Terms (Step a); Map to Domain Terms and Expand (Step b); Check for Consistency (Step c); Generate E-R Model (Step d); Map To Relational Model (Step e). Inputs, outputs, and stores: Application Requirements, Initial Set of Terms, Domain Terms, Ontology Information, Domain Ontologies, Consistency Checking Rules, Rules to Execute, Conflict Information, Conflict Resolution Information, Database Designer, Designer Feedback, Consistent Terms, E-R Models, Info About Prior E-R Models, New E-R Model, E-R Model, Relational Models, New Relational Model.]
Fig. 2. Steps for Creating E-R Model from Scratch
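Step a can be approximated, for illustration only, without a full part-of-speech tagger. The sketch below is a stand-in and not the parser used in the prototype: it tokenizes the requirement text, strips a trailing plural "s", and keeps the tokens that match a small hand-written noun lexicon. The lexicon contents and the function name are assumptions; the requirement text is the one quoted in Step a.

```python
import re

# Hypothetical noun lexicon standing in for a part-of-speech tagger.
NOUN_LEXICON = {"product", "seller", "buyer", "item", "bid", "bidder", "account"}


def candidate_entities(requirement: str) -> list[str]:
    """Return capitalized candidate entity names in order of first appearance."""
    found = []
    for token in re.findall(r"[a-z]+", requirement.lower()):
        # Crude singularization: drop one trailing "s" only when the result is a known noun.
        if token.endswith("s") and token[:-1] in NOUN_LEXICON:
            token = token[:-1]
        if token in NOUN_LEXICON:
            name = token.capitalize()
            if name not in found:
                found.append(name)
    return found


requirement = (
    "I want to design a database to support an auction website. "
    "It needs to keep track of products, their seller and buyers "
    "and prices at which they are sold."
)
entities = candidate_entities(requirement)
print(entities)
```

A real tagger would also return verbs and verb phrases as candidate relationships, which this lookup does not attempt.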
For the purposes of this research, simple natural language processing techniques are used. The part-of-speech tagger identifies the nouns and noun phrases (potential entities) as well as verbs and verb phrases (potential relationships). Here, the potential entities are Product, Seller, and Buyer. The ontology further suggests transitive inferences. For example, an item is related to a category. Since a seller is related to an item, it is possible that a seller is related to a category, and corresponding relationships are needed.

A designer might need to refine a fragment of an existing entity-relationship model. An ontology could check that the names of the entities are the “best” ones for the application domain. The inclusion of synonyms helps to identify the most appropriate label for an entity or relationship, assuming that the terms used in the ontology are the most appropriate. For example, “Customer has Account” could be refined to “Customer owns Account.” The ontology can also suggest possible missing entities and relationships. The simplest way to do so is to trace through the possible inferences that can be made from one term to another. For example, the ontology can expand the E-R diagram by identifying Bidder and Seller as subclasses of Customer. Relationships between Bidder and Seller and Product are then added to the E-R diagram. The user selects which terms to include in the model or can provide appropriate terms directly.

Step b - Map to domain terms and expand list of terms: After the initial set of terms is identified, the terms are compared to those in the ontology for that domain to expand the terms. There are two ways this is done. First, the “synonym” and “is_a” relationships from the ontology are used to verify the appropriate use of the terms in the design. For example, the user requirement mentions the term Product. However, in the auction domain, the term Item is used to refer to things being auctioned off.
Hence, the appropriate term to use is Item. From the auction domain ontology, Item is a synonym of Product and, hence, the term Item will be used. Similarly, suppose that, in the domain ontology, there is a
relationship Electronic-item is-an item. Then, this might suggest that Electronic-item is needed as an additional entity. Since Item is an entity in the original model, an inference is made that Electronic-item is also an entity. Second, the initial set of terms is expanded using the “related-to” relationship to identify additional terms. For example, if Customer buys Item and Item related-to Bid, then a relationship might be needed between Customer and Bid (Customer places Bid). Rule #1 (Figure 3) is applied. Several rules are used to check whether the term appears in the ontology. These include: 1) whether the specified term is a synonym of a typically used term (Rule #2); and 2) whether the term is a subclass of another term (Rule #3). There are also rules to transitively compute other terms that are related to the original term by following the “related_to” relationships. At the end of this step, an expanded list of terms is presented to the user for feedback.

Step c - Consistency checking: After gathering the set of potential terms, additional domain knowledge can be used to check whether the terms and concepts selected are complete and consistent. The four types of domain relationships, pre-requisite, temporal, mutually-inclusive, and mutually-exclusive, ensure that all the relevant terms and concepts have been included or excluded. For each term, the domain relationships defined in the ontology are first identified. Then, for each of those relationships, the associated terms are gathered and the corresponding rules enforced to ensure that the terms are consistent. For example, if a particular term has a pre-requisite relationship with another term, then the rule would check whether the related term is also included. If not, it will be selected automatically. Similarly, if two terms are mutually inclusive and if only one of them is included in the model, then the consistency checking rule would add the other term to the model and inform the user.
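Steps b and c can be sketched as simple passes over relationship triples. The sketch below is a simplified, non-interactive illustration: where the rules in the system ask the user for feedback, it simply applies the suggestion, and it covers only the synonym, related_to, prerequisite, and mutually-inclusive cases. The triples reuse examples from this paper; the tuple encoding and the function names are assumptions.

```python
# Ontology as (term1, relationship-type, term2) triples; this encoding is hypothetical.
ONTOLOGY = [
    ("Product", "synonym", "Item"),            # Item is the preferred domain term
    ("Item", "related_to", "Bid"),
    ("Bid", "prerequisite", "Item"),           # a bid requires an item
    ("Bidder", "mutually-inclusive", "Bid"),
    ("Item", "mutually-inclusive", "Seller"),
]


def map_and_expand(terms):
    """Step b: replace synonyms with preferred terms, then follow related_to links."""
    preferred = {a: b for a, r, b in ONTOLOGY if r == "synonym"}
    selected = {preferred.get(t, t) for t in terms}
    frontier = list(selected)
    while frontier:  # transitive closure over related_to
        current = frontier.pop()
        for a, r, b in ONTOLOGY:
            if r == "related_to" and a == current and b not in selected:
                selected.add(b)
                frontier.append(b)
    return selected


def check_consistency(terms):
    """Step c: add terms required by prerequisite and mutually-inclusive relationships."""
    selected = set(terms)
    added = []  # terms added automatically, in the order they were added
    changed = True
    while changed:  # repeat until a full pass adds nothing
        changed = False
        for a, r, b in ONTOLOGY:
            if r == "prerequisite" and a in selected and b not in selected:
                missing = b
            elif r == "mutually-inclusive" and (a in selected) != (b in selected):
                missing = b if a in selected else a
            else:
                continue
            selected.add(missing)
            added.append(missing)
            changed = True
    return selected, added


expanded = map_and_expand({"Product", "Seller", "Buyer"})
consistent, added = check_consistency(expanded)
print(sorted(expanded), sorted(consistent), added)
```

The list of added terms is what an interactive version would present to the user for confirmation instead of applying silently.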
In the auction domain, Item and Seller are mutually inclusive. Rule #4 (Figure 3) checks for mutually inclusive relationships.

Step d - Generate E-R model: Entities are created based upon the terms identified in the previous step. Relationships between entities are generated by taking into account the application domain semantics inherent in the ontology. The verb phrases gathered from natural language processing of the problem description are used as the starting point for identifying relationships between entities. Previously generated E-R models and typical data model fragments are stored in the E-R model repository. This repository is used to identify commonly occurring relationships within an application domain, and to generate the relationships between entities. For example, a typical entity-relationship model in the auction domain would have entities such as Buyer, Seller, Item, Bid, and Transaction and relationships that correspond to “Seller sells Item”, “Buyer places Bids”, “Item receives Bids” and, possibly, others. Binary relationships in a given application domain are stored as facts within the repository using a predefined template. Rules pattern match against these facts to identify potential relationships for the terms/entities identified in the previous steps. Rule #5 (Figure 3) identifies one or more relationships for two given entities that are to be included in the model. For example, if the list of entities under consideration contains Buyer, Item, and Bid, they can be matched against the auction E-R model fragments, and appropriate relationships identified. At the end of this step, an overall E-R model is generated and presented to the user for feedback. The user can add or delete any relationship.

Step e - Map to Relational Model: Once the E-R model is created, the corresponding relational data model is generated using well-established rules of mapping from the entity-relationship to relational model [Teorey et al. 1986].
The E-R model is stored in the E-R model repository.
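The matching of entity pairs against stored model fragments in Step d (Rules #5 and #7 in Figure 3) might look like the following sketch. The repository entries reuse the relationship phrases quoted in Step d; the tuple layout, the function name, and the choice to treat each fragment as directed from its first entity to its second are assumptions, and the user feedback loop is omitted.

```python
from itertools import permutations

# Hypothetical E-R model repository: (entity1, verb phrase, entity2) fragments.
REPOSITORY = [
    ("Seller", "sells", "Item"),
    ("Buyer", "places", "Bid"),
    ("Item", "receives", "Bid"),
]


def candidate_relationships(entities, existing=()):
    """Suggest repository relationships for entity pairs not yet related in the model."""
    related = {(a, b) for a, _, b in existing}
    suggestions = []
    # Sorting makes the order of suggestions deterministic.
    for x, y in permutations(sorted(entities), 2):
        if (x, y) in related or (y, x) in related:
            continue  # the pair already has a relationship in the model
        for a, phrase, b in REPOSITORY:
            if (a, b) == (x, y):
                suggestions.append((a, phrase, b))
    return suggestions


suggested = candidate_relationships({"Buyer", "Item", "Bid"})
for a, phrase, b in suggested:
    print(f"Candidate relationship: {a} {phrase} {b}")
```

Calling the function with `existing=[("Buyer", "places", "Bid")]` leaves only the Item/Bid suggestion, which corresponds to the condition in Rule #7 that no relationship is yet specified between the two entities.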
Fig. 3. Sample Rules (Implemented as JESS Rules in the System)

Rule #1 – Finding Related Terms
If:
Term X is selected and
Term X has a “related_to” relationship with Term Y and
Term Y is currently not selected
Then:
Display a message with this information to the user and
Get user feedback and
Select Term Y if desired

Rule #2 – Rule for Checking Synonyms
If:
Term X specified by user
There exists a Term Y in the ontology
Term X is a synonym for Term Y
Then:
Term Y is more appropriate
Display a message with this information to the user
Get user feedback and
Replace Term X with Term Y

Rule #3 – Rule for Checking Subclass
If:
Term X specified by user
There exists a Term Y in the ontology
Term X has “is_a” relationship with Term Y
Then:
Term X is a subclass of Term Y
Display a message with this information to the user
Get user feedback and

Rule #4 – Rule for Mutually-inclusive Relationship
If:
Term X is to be selected and
Term X has a “mutually-inclusive” relationship with Term Y and
Term Y is currently not selected
Then:
Display a message with this information to the user and
Get user feedback and
Select both Term X and Term Y

Rule #5 – Rule for Identifying Relationships for an Entity
If:
Entity X is included in the model and
Entity Y is included in the model and
There exists a relationship R1 between Entity X and Entity Y in the repository
Then:
R1 is a candidate relationship
Display a message with this information to the user

Rule #6 – Rule for Prerequisite Relationship
If:
Term X is to be selected and
Term X has a “prerequisite” relationship with Term Y and
Term Y is currently not selected
Then:
Term Y should be automatically selected
Display a message with this information to the user and
Get user feedback and
Select Term X as well as Term Y if desired

Rule #7 – Rule for Identifying New Relationships
If:
Entity X is included in the model
Entity Y is also included in the model
No relationship between Entity X and Entity Y specified in the model
There exists a relationship between Entity X and Entity Y in the repository
Relationship R is the relationship phrase between those entities
Then:
Display a message with this information to the user and
Get user feedback and
Suggest the new relationship
Check for other relationships

Rule #8 – Rule for Checking Mutually-Exclusive Relationship
If:
Term X has been selected and
There exists a “mutually-exclusive” relationship between Term X and Term Y and
Term Y is also currently being selected
Then:
Term X and Term Y should not be selected together
Display a message with this information to the user and
Get user feedback and
Select either Term X or Term Y

Rule #9 – Rule for Checking Dependent Terms
If:
Term X to be deleted and
Currently Term X has been selected and
There exists a “pre-requisite” relationship between Term X and Term Y
Or a “mutually-inclusive” relationship between Term X and Term Y and
Term Y is also currently selected
Then:
Term X cannot be deleted because Term Y requires it
Display a message with this information to the user and
Get user feedback and
Abort deletion of Term X or direct the user to delete Term Y first to proceed.
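To make the consistency rules concrete, the following Python sketch approximates Rules #4, #6, #8, and #9 over a small in-memory ontology. The class and function names, the tuple representation, and the treatment of mutually-inclusive and mutually-exclusive relationships as symmetric are assumptions for illustration; OMDDE implements these checks as JESS rules, and the user-feedback step is omitted here.

```python
from collections import defaultdict


class Ontology:
    """Hypothetical container for domain relationships between terms."""

    def __init__(self, relationships):
        # relationships: iterable of (term_x, rel_type, term_y)
        self.rels = defaultdict(set)
        for x, rel_type, y in relationships:
            self.rels[(x, rel_type)].add(y)
            if rel_type in ("mutually-inclusive", "mutually-exclusive"):
                self.rels[(y, rel_type)].add(x)  # assumed symmetric

    def related(self, term, rel_type):
        return self.rels.get((term, rel_type), set())


def select_term(ontology, term, selected):
    """Add `term` plus the terms it requires; return (selection, conflicts)."""
    result = set(selected)
    conflicts = set()
    pending = [term]
    while pending:
        current = pending.pop()
        if current in result:
            continue
        # Rule #8: a mutually-exclusive term is already selected.
        clash = ontology.related(current, "mutually-exclusive") & result
        if clash:
            conflicts.update((current, other) for other in clash)
            continue
        result.add(current)
        # Rules #4 and #6: gather mutually-inclusive and prerequisite terms.
        for rel_type in ("prerequisite", "mutually-inclusive"):
            pending.extend(ontology.related(current, rel_type))
    return result, conflicts


def can_delete(ontology, term, selected):
    """Rule #9: deletion is blocked while a selected term depends on `term`."""
    dependents = {
        other for other in selected
        if other != term and (
            term in ontology.related(other, "prerequisite")
            or term in ontology.related(other, "mutually-inclusive")
        )
    }
    return not dependents, dependents


# Auction relationships mentioned in the text.
auction = Ontology([
    ("Bid", "prerequisite", "Item"),
    ("Bid", "prerequisite", "Bidder"),
    ("Item", "mutually-inclusive", "Seller"),
])
chosen, problems = select_term(auction, "Bid", set())
item_ok, item_dependents = can_delete(auction, "Item", chosen)
```

Selecting Bid pulls in Item and Bidder as prerequisites and then Seller through Item; deleting Item is refused while Bid and Seller remain selected.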
[Figure 4: flow diagram. Labels: Database Designer; E-R Model Information; Domain Terms; Ontology Information; Domain Ontologies; Consistency Checking Rules; Rules to Execute; Conflict Information; Conflict Resolution Information; Designer Feedback; E-R Models; Info About Prior E-R Models; Consistent Terms; Check Appropriateness of Terms (Step a); Identify Missing Terms (Step b); Generate Augmented E-R Model (Step c); E-R Model.]
Fig. 4. Validating Existing E-R Model
4.2 Validation of Existing E-R Model
An existing E-R model can be validated by checking whether it includes some of the entities that are typically part of similar designs in that application domain. The model can also be checked to assess whether the entities and relationships are internally consistent. The E-R validation is shown in Figure 4 and consists of the steps: a) check appropriateness of terms, b) identify missing terms, and c) generate augmented E-R model.
Step a - Appropriateness of Terms: The names of the entities are checked for appropriateness using the domain knowledge contained in the ontology. The terms in the ontology are the commonly used terms in a domain. The ontology also contains synonyms. If the model uses a synonym, the original term is suggested because one of the purposes of an ontology is to contain the most commonly used terms in a given domain. For example, the designer might have used Consumer buys Item with the ontology identifying synonyms (Consumer, Customer). Then, the designer would need to decide which label to use. The default is the one in the ontology, e.g., Customer. Of course, this could be modified to refine user verification.
Step b - Identifying Missing Terms: If entities appear in is_a or related_to relationships, they are used to gather related terms. The domain relationships also ensure that they are internally consistent. For example, Rule #6 (Figure 3) checks if the prerequisites for a given term are also included in the model. If not, they are added with the user’s approval. The terms in the ontology may correspond to either an entity or an attribute in an entity-relationship model. Currently, it is not possible to easily distinguish which term should map to an entity and which to an attribute. Since ontologies usually contain classes (represented as nouns), the implementation assumes that a missing term (one in
the ontology, but not in the entity-relationship model) is a potential missing entity. For example, the computer science ontology contains the term “Lecturer” which could suggest a missing entity. If there is a relationship in the ontology that does not appear in the E-R model, it could be a missing construct. For example, if “Publication” and “Professor” appear together in a computer science ontology, this suggests a relationship.
Step c - Augmenting the E-R Model: For terms identified in the previous step, appropriate relationships are suggested based on the relationships gleaned from the E-R model fragments stored in the E-R model repository. The repository will build up over time. For the entities included in the model, pattern matching is done against the entities in the repository, and candidate relationships are identified and suggested as potential new relationships to be considered. Rule #7 (Figure 3) shows a rule for identifying additional relationships. With the designer’s feedback, an augmented E-R model is generated that meets the requirements specified by the designer.
4.3 Consistency Checking
The domain relationships allow one to express the application domain semantics and business rules. For example, in an online auction domain, a Bid needs (pre-requisite) Item and Bidder. The temporal constraint for Bid shows that an Account must be created before a bid can be placed. The mutually exclusive constraint for the term Seller indicates that the seller of an item cannot also be its buyer. Thus, while adding a term into or deleting a term from the model, the domain relationships ensure that the selection or deletion action results in a consistent set of terms. A set of heuristics performs the consistency checking, two of which are described below.
4.3.1 Heuristic for Term Selection
For a given term, this heuristic identifies other terms that must be included. For example, if there are pre-requisite relationships for a term, those terms must also be included.
If the term has a mutually inclusive relationship, the mutually inclusive term must be included. For example, in the auction domain, the mutually inclusive relationship for Bid implies that, whenever Bid is included in a design, the associated terms Item, Buyer, and Seller should also be included. On the other hand, if the term has a mutually exclusive relationship, those mutually exclusive terms should not be part of the design. Enforcing this relationship involves determining if two or more terms in a model are mutually exclusive. If so, this conflict must be resolved with user feedback. The term selection heuristic begins by gathering the domain relationships that exist in the ontology for the term selected. For each relationship type, it creates a set to store the associated terms; that is, different sets are created for pre-requisite, mutually-inclusive, and temporal relationships. For each term added to those sets, its domain relationships and, in turn, the corresponding terms are gathered. Thus, for the initial term, associated terms are gathered recursively and included in the model with user feedback. For each term added, its mutually-exclusive terms are also gathered and checked to determine whether they are already included. If they are present, the current term cannot be added and user feedback is sought to resolve conflicts. Rule #8 (Figure 3) shows an example of how to check for mutually-exclusive relationships.
4.3.2 Heuristic for Term Deletion
This heuristic ensures consistency when deleting a term from the solution. When a term is deleted, there cannot be any other terms that depend upon it. Therefore, if the term to be deleted is a pre-requisite or a mutually-inclusive term for another term that is
currently included in the model, that term should not be deleted. On the other hand, when a term is deleted, its pre-requisite and mutually inclusive terms can also be deleted, provided no other terms require them. The term deletion heuristic also starts by gathering the pre-requisite, mutually-inclusive and temporal relationships for the term to be deleted. The terms in these relationships are checked against the terms that are currently selected. If at least one term is selected, then the deletion of the initial term would violate the relationship. Hence, the term cannot be deleted and the user is informed of the violation. Rule #9 (Figure 3) checks for dependent terms still selected when deleting a term. If there is another term that depends upon the term to be deleted, then this action is aborted.
5. OMDDE ARCHITECTURE AND IMPLEMENTATION
The architecture of the Ontology Management and Database Design Environment (OMDDE) is shown in Figure 5. It has an HTML client, a web server that supports Java technologies, and a relational database. The client is a basic web browser for browsing or modifying existing ontologies, or creating new ones. The server side consists of: 1) an Ontology Management Component, 2) a Database Design Component, and 3) Repositories. A prototype of OMDDE has been developed using Java technologies, Servlet/JSP (Java Server Pages) and JESS [Friedman-Hill 2006]. The Model-View-Controller architecture separates data access functionality from the presentation and control logic. For example, the ontology creation and management, E-R model generation, and E-R model validation modules are implemented as servlets that use JESS rules and facts stored in the knowledge base. The visualization module is implemented using JSP technology.
The E-R model generation and validation activities use Java servlets for: gathering user input; identifying key terms from the user’s description of the requirements; accessing and retrieving domain ontology information; identifying missing elements from the model; and passing the results to the JSP pages. The consistency checking and E-R model generation tasks are accomplished through rules written in JESS that use domain specific knowledge in the ontology and the E-R model repositories. The components of OMDDE are briefly described below.
[Figure 5: architecture diagram. Labels: Client Side; Server Side; Ontology Creation & Management Activity; E-R Modeling Activity; Ontology Management Component; Browse Ontology; Ontology Creation & Management; Database Design Component; E-R Model Generation; E-R Model Validation; E-R Model Visualization; Parts of Speech Tagger; Domain Ontology Repository; E-R Models Repository.]
Fig. 5. Ontology Management and Database Design Environment Architecture
5.1 Ontology Management Component
The ontology management component manages the domain ontologies. It consists of: a) a browse ontology module, and b) an ontology creation and management module. This component facilitates the creation, storage, and use of ontologies. The ontologies are stored in a simple format, expressed as unordered facts in JESS as shown in Figure 6. It is very difficult to automatically, or semi-automatically, use someone else’s ontology. However, our prior research attempted to assess the quality of ontologies, which will also impact how effectively they can be used [Burton-Jones et al. 2005].

(domain: $?domain-name)
(ontology-id: ?ont-id)
(ontology-name: $?ont-name)
(date-created: ?date version: ?ver-number)
(concept-term: id name)
(relationship: rel-id tm1: term1 tid1: id1 rtype: rel-type tm2: term2 tid2: id2)
Fig. 6. Example Ontology Facts
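The fact templates in Figure 6 map naturally onto simple records. The Python sketch below mirrors the concept-term and relationship facts and applies the synonym check of Rule #2. The class names, the direction of the synonym relationship (`term1` is the synonym, `term2` the preferred ontology term), and the sample data are assumptions for illustration, not OMDDE's storage code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptTerm:
    """Analogue of the (concept-term: id name) fact."""
    term_id: int
    name: str


@dataclass(frozen=True)
class RelationshipFact:
    """Analogue of the (relationship: ...) fact, without the term ids."""
    rel_id: int
    term1: str
    rel_type: str  # e.g. "is_a", "synonym", "related_to"
    term2: str


def preferred_term(user_term, relationships):
    """Return the ontology term that `user_term` is a synonym for, if any."""
    for fact in relationships:
        if fact.rel_type == "synonym" and fact.term1 == user_term:
            return fact.term2
    # No synonym fact applies, so the user's own term is kept.
    return user_term


terms = [ConceptTerm(1, "Customer"), ConceptTerm(2, "Item")]
facts = [RelationshipFact(1, "Consumer", "synonym", "Customer")]
suggestion = preferred_term("Consumer", facts)
```

With the (Consumer, Customer) synonym pair from the text, the sketch suggests Customer; in OMDDE the designer would still confirm or reject the replacement.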
Browse Module: The browse module provides mechanisms for navigating the ontologies stored in the repository. The user can search the repository through key words or select specific ontologies to view by name or id.
Ontology Creation and Management Module: This module assists the user in creating or modifying an ontology. It facilitates the basic steps for creating ontologies and managing and evolving existing ontologies.
5.2 Database Design Component
E-R Model Generation Module: This module implements the steps described in Section 4 for creating an entity-relationship model from the users’ requirements. The module interfaces with a natural language parser to obtain the problem description tagged with parts of speech. The nouns and noun phrases are the starting point for identifying terms and checking them against the ontology. This module implements various rules for identifying entities and relationships. An example rule in JESS is shown in Figure 7. This rule is used to find related terms. For example, if a particular term ($?X)¹ is initially specified by the user and if it is related to another term ($?Y), then it may be of interest and should be included in the list of potential terms for user feedback.

(defrule find-related-terms ""
  (concept-term: ?t-id1 $?X)
  (selected ?tid1 $?X)
  (relationship: ?rid tm1: $?X tid1: ?t-id1 rtype: related_to tm2: $?Y tid2: ?t-id2)
  ?rem1
  =>
  (bind ?str1 (str-cat "Since the term \"" (implode$ $?Y) "\" is related to " "\"" (implode$ $?X) "\", "))
  (bind ?str2 (str-cat "it may be of interest to the user."))
  (bind ?str3 "Click on 'OK' to include it to the list, or 'Cancel' to abort.")
  (bind ?str (str-cat ?str1 ?str2 ?str3))
  (bind ?*d* (new YNDialog ?*current-pnl* "Related Term" ?str "OK" "Cancel")))
Fig. 7. JESS Rule for finding Related Terms

¹ This corresponds to a multi-field variable in JESS.
After the initial terms are identified, they are refined using the relationships embedded within the ontology and the entity-relationship models from the repository. The system suggests relationships for the E-R model, generates the overall entity-relationship model, and passes it to the visualization module for display. These rules rely on both the ontologies, and the E-R models stored in the repositories and include capabilities for incrementally creating the E-R model fragments, obtaining user feedback, and building the overall model. E-R Model Validation Module: This module checks the consistency of an E-R model, or portion of one, created by the user. It identifies terms that are in the user’s model, but not in the repositories so they can be considered for updating the ontology. Browse/Visualization Module: The user can view the E-R diagram that has been created up to a given point. 5.3 Repositories There are two repositories: 1) the domain ontology repository, and 2) the E-R model repository. Domain Ontology Repository: The domain ontology repository stores the ontologies by application domains. The user can browse the ontologies to learn more about a domain. The ontologies can be created by different domain analysts and evolve over time. The contents of the ontology are represented as facts and standard templates with the format for some of the facts shown in Figure 6. E-R Model Repository: This repository contains E-R models that have been created over time. By examining the E-R models in a domain and their corresponding requirements, generic relationships that exist between entities can be identified. The E-R models are stored as fragments within the repository. Each E-R model is broken down into fragments, with each fragment corresponding to a relationship that exists within the model. For each fragment, its identifier, the E-R model of which it is a part, the entities involved, the name of the relationship, and its purpose, are stored. 
This information is stored as JESS facts, the format of which is shown in Figure 8. Rules search for potential relationships for entities based on user requirements. Relevant fragments from the E-R model are identified by pattern matching on entity names in the left-hand-side conditions of a rule, and the corresponding relationship names are suggested to the user by its right-hand side.

(model-entity: $?ename1)
(model-relationship: $?ename1 rel-name: $?rname other-entity: $?ename2)
(model-fragment: ?frag-id)
(er-model-consists-of: ?frag-id1 ?frag-id2 ?frag-id3 …)
(purpose: $?functional-requirement supported-by: ?frag-id)
(pattern: ?frag-id EName1: $?ename1 rel-phrase: $?verb-ph EName2: $?ename2)
Fig. 8. Example E-R Model Facts
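The pattern facts in Figure 8 can be read as (entity, verb phrase, entity) records. A minimal Python sketch of the repository lookup, in the spirit of Rules #5 and #7, follows. The `PatternFact` class, the function name, the fragment identifiers, and the choice to ignore direction when checking for an existing relationship are hypothetical; OMDDE performs this matching with JESS rules rather than an explicit loop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternFact:
    """Analogue of the (pattern: ...) fact for one E-R model fragment."""
    frag_id: str
    entity1: str
    rel_phrase: str
    entity2: str


def candidate_relationships(model_entities, existing, repository):
    """Suggest repository relationships between model entities not yet related.

    existing: iterable of (entity1, rel_phrase, entity2) already in the model.
    """
    present = set(model_entities)
    # Direction is ignored when deciding whether a pair is already related.
    related_pairs = {frozenset((e1, e2)) for e1, _, e2 in existing}
    suggestions = []
    for fact in repository:
        pair = frozenset((fact.entity1, fact.entity2))
        if {fact.entity1, fact.entity2} <= present and pair not in related_pairs:
            suggestions.append((fact.entity1, fact.rel_phrase, fact.entity2))
    return suggestions


# Fragments based on the auction relationships named in the text.
repository = [
    PatternFact("f1", "Seller", "sells", "Item"),
    PatternFact("f2", "Buyer", "places", "Bid"),
    PatternFact("f3", "Item", "receives", "Bid"),
]
suggested = candidate_relationships(
    ["Buyer", "Item", "Bid"],
    [("Buyer", "places", "Bid")],
    repository,
)
```

For a model that already contains "Buyer places Bid", only "Item receives Bid" is suggested; "Seller sells Item" is skipped because Seller is not in the model.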
5.4 Sample Session This section illustrates the creation of an ontology and its use for generating and validating an E-R model. The user (database designer) selects either ontology
management or conceptual modeling functions as shown in Figure 9. The system supports the ontology management activities: a) creation of a new ontology, b) modification of an existing ontology, and c) browsing ontologies of a particular domain. The conceptual modeling functions supported by OMDDE enable the user to: 1) create an initial version of the model using the knowledge contained in the ontology as a starting point, or 2) validate a conceptual model the user has created against the ontology to check for the appropriate use of terms and for missing constructs.
Fig. 9. OMDDE Initial Screen and Defining Relationships between Terms
5.4.1 Ontology Creation and Management Activities
The “Ontology Management Assistant” and “Database Design Assistant” subsystems implement the ontology management functions and the database design functions, respectively. When the user wants to create a new ontology, the system prompts the user for the domain and ontology names. The system allows the user to define terms, basic relationships, and domain relationships. The system presents the terms that have been added to the ontology so the user can select two terms to establish a relationship between them. The basic relationships currently supported are: is_a, synonym, and related_to; the domain relationships supported are: pre-requisite, mutually inclusive, mutually exclusive, and temporal. When a relationship is created, consistency-checking rules are enforced to ensure that relationships and constraints are consistent. In “Edit Ontology” (Figure 9), the system presents the names of ontologies that exist within each application domain and prompts the user to select one to be modified. Any aspect of the ontology can be updated and the system ensures that the changes are consistent. For example, the user may update an ontology by adding or modifying terms and relationships. The system ensures that the changes are consistent by invoking a set of rules that check whether: a term or relationship is already in the ontology (if it is identified as a new term or relationship), the basic and domain relationships defined for
terms are consistent, and the descriptions of terms and relationships are complete. Similarly, when a term is being removed from the ontology, it checks which relationships the term appears in and deletes those relationships as well. 5.4.2 Conceptual Modeling Activities The Database Design Assistant subsystem controls the user interactions during conceptual modeling activities. The user can create an entity-relationship model or augment and validate an existing model through the interface shown in Figure 9. The database designer can explicitly check whether the model uses appropriate terms and relationships or is missing any major constructs. This is accomplished by: 1) mapping the entities and relationships of the current E-R model to the domain ontology; and 2) enforcing the constraints and relationships through consistency checking rules. The system checks for missing and superfluous entities and relationships and presents a report to the user. The designer can start the E-R model creation process through the “Create Conceptual Model” link shown in Figure 9. The system then provides another screen where the user can input natural language text describing the problem. Through natural language processing and consulting the ontology, the initial entities and relationships are identified. The user may also choose to enter initial entities and relationships based upon his or her requirements. These are augmented by applying the heuristics and rules described in the previous sections so a richer model can be developed. The designer can validate a partially completed E-R model through “Validate Model” in Figure 9. The designer can input information about the entities and relationships using the interface shown in Figure 10. At the end of the process, the system generates a report that suggests new entities, new relationships, name changes for entities and relationships, and so forth. 
In Figure 10, the validation report suggests adding “Bidder” and “Seller” entities, and three other relationships. The designer can request that the updated E-R model be presented. If the system identifies terms that are not part of the ontology, this information is also reported so the ontology can be updated.
Fig. 10. Gathering E-R Model Information and Validation Feedback
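The core of such a validation report can be sketched as two set differences between the entity names in the designer's model and the terms in the domain ontology. The function name, the dictionary keys, and the sample terms below (including the hypothetical "Auctioneer") are assumptions for illustration only; the actual report also covers relationships and name changes.

```python
def validation_report(model_entities, ontology_terms):
    """Compare a model's entity names against the ontology's terms."""
    model = set(model_entities)
    ontology = set(ontology_terms)
    return {
        # Ontology terms absent from the model are potential missing entities.
        "suggested_entities": sorted(ontology - model),
        # Model terms absent from the ontology are candidates for updating it.
        "unknown_terms": sorted(model - ontology),
    }


report = validation_report(
    ["Item", "Bid", "Account", "Auctioneer"],
    ["Item", "Bid", "Bidder", "Seller", "Account"],
)
```

Under these sample inputs the report suggests adding Bidder and Seller and flags Auctioneer as a term the ontology does not yet contain.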
6. ASSESSMENT
Our goal is to provide a designer with an ontology that contains knowledge about an application domain that he or she is interested in. The resulting conceptual model, when supported with information from a domain ontology, would be of higher quality than when such domain knowledge is not available. To assess the usefulness of the system, three controlled experiments were conducted. The first experiment aimed at assessing the value of the system when used by novice designers. The second experiment focused on determining whether the system would also help experienced designers, by comparing the performance of OMDDE versus a CASE tool. In order to determine whether an ontology would be more useful compared to another source of knowledge for E-R modeling, a third experiment was conducted in which one group used OMDDE and the other group used Wikipedia to obtain relevant domain knowledge. Each experiment and the results are described below.
6.1 Validation with Novice Designers
The objective of this experiment is to test whether designers who have access to the ontologies in the Ontology Management and Database Design System can generate better quality entity-relationship models than those who do not have access to it. The repository of E-R models is intended to build up over time. Thus, the testing was carried out with an empty E-R model repository. The quality of the model is not a single measure, but a composite measure of different facets [Batra et al. 1990; Lindland et al. 1994; Moody and Shanks 1998; Bock and Yager 2002; Turk et al. 2003]. Bock and Yager [2002], for example, identify the following facets: entity, identifier attribute, various types of relationships (unary 1:1, binary 1:M, binary M:N, ternary 1:M, and ternary M:N), and generalization. We focus on the entity and the relationship facets, with the quality of the E-R model generated by the subjects measured in terms of the scores for these two facets.
We contend that, using the Ontology Management and Database Design System, the designer can gain an understanding of the domain that will help in selecting appropriate entities and relationships. Using the system during E-R modeling should help the user to identify and incorporate relevant (or correct) entity-relationship constructs. Although the quality of an E-R model depends on various facets of the model, researchers have argued that it is incorrect to compute an overall quality score by adding the individual facet scores [Batra et al. 1990; Bock and Yager 2002]. This is because such a total score would lack construct validity. Hence, the analysis of model quality is conducted at the individual facet level. The following hypotheses were tested: H1:
Designers using OMDDE will produce E-R models with higher scores compared to the models generated by designers without the use of the system.
H1a: Designers using OMDDE will perform better in identifying appropriate “Entities” for the E-R model than the designers without access to OMDDE.
H1b: Designers using OMDDE will perform better in identifying appropriate “Relationships” for the E-R model than the designers without access to OMDDE.
6.1.1 Output Measurement
The correctness of the E-R model is evaluated by a scoring scheme, which specifies how to score a solution on each facet (entities and relationships). A “correct” solution was developed for each case by the researchers to provide a benchmark for evaluating the
subjects’ solutions. Each solution was evaluated by two independent evaluators. Our scoring scheme is a simplified, commonly-used scoring mechanism for judging the quality of E-R models [Batra et al. 1990; Bock and Yager 2002; Turk et al. 2003]. It focuses on the correct identification of appropriate entities and relationships based on a given problem description. For every correct entity or relationship included in the model, one point is awarded. If the subjects fail to identify an entity or a relationship that should be part of the model, one point is subtracted from the score. On the other hand, if the E-R model contains superfluous entities and relationships, a penalty of -0.5 points is assessed. If an incorrect name is used for an entity or a relationship, -0.25 point is assessed. Table 1 summarizes the scoring scheme. Table 1. Scoring Scheme
Error Classification and Points
Facet | Major Error (-1.0 Points) | Medium Error (-0.5 Points) | Minor Error (-0.25 Points) | Maximum Possible Points for both applications
Entity | Missing | Superfluous Entity | Incorrect Name | 10
Relationship | Missing | Superfluous Relationship | Incorrect Name | 7
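The scoring scheme can be written as a small function. The Python sketch below follows one literal reading of the description: each identified construct earns one point, and the penalties in Table 1 are then subtracted. The function name, the parameter names, and the assumption that a misnamed construct still counts as identified are illustrative choices; the paper does not spell out how the evaluators combined the components.

```python
def facet_score(identified, missing, superfluous, misnamed):
    """Score one facet (entities or relationships) of a subject's E-R model.

    identified:  constructs matching the answer key (misnamed ones included,
                 which is an assumption of this sketch)
    missing:     answer-key constructs absent from the model (-1.0 each)
    superfluous: constructs not in the answer key (-0.5 each)
    misnamed:    identified constructs with an incorrect name (-0.25 each)
    """
    return 1.0 * identified - 1.0 * missing - 0.5 * superfluous - 0.25 * misnamed


# A model with all 10 answer-key entities and no errors reaches the maximum.
perfect = facet_score(identified=10, missing=0, superfluous=0, misnamed=0)

# Nine entities identified (one misnamed), one missing, two superfluous.
partial = facet_score(identified=9, missing=1, superfluous=2, misnamed=1)
```

Under this reading a missing construct costs both the point it would have earned and the one-point penalty; a scheme that instead deducts penalties from the facet maximum would give different totals.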
6.1.2 The Experiment Students in an introduction to database management system class were solicited to participate in the study. The subjects had no prior knowledge of the study and its objectives. Two E-R modeling cases of comparable complexity were developed by the researchers for the travel and online auction domains. The researchers had developed ontologies for these domains for a prior research project. The travel domain ontology was developed by integrating several existing ontologies: a) DAML (http://www.daml.org/), b) Workshop on Evaluation of Ontology-based Tools [Corcho et al. 2003b], c) SchemaWeb (http://taga.umbc.edu/ontologies/travel.owl), and d) Protégé website (http://protege.stanford.edu/plugins/owl/owl-library/travel.owl). Similarly, the auction ontology was developed from: the auction ontology available at the SchemaWeb website (http://taga.umbc.edu/ontologies/auction.owl), and by visiting various online auction websites. The information in the ontologies is used as a basis for improving E-R models. Neither of the evaluators of the subjects’ responses was familiar with the content of the ontologies. The subjects were given a two hour training session on the system. The first half of the training focused on the concept of an ontology, the major components of an ontology and how ontologies could be used in conceptual modeling. The subjects also used the system on a test case. This allowed the subjects to become familiar with the system. It was also intended to reinforce the concepts covered in the first half of the training. The subjects were randomly assigned to two groups: group A and group B. They filled out a questionnaire, which asked for their experience in Management Information Systems (MIS) related courses and/or experience. Each group analyzed the travel and the auction cases and developed the corresponding E-R models. One case was analyzed using OMDDE and the other case was analyzed without the benefit of the system. 
Participants were trained on the format of the solution to be generated. The E-R model was represented by listing the names of the entities, and the binary relationships that exist using the triple . The specific assignment of the groups and the cases is shown in Table 2.
Table 2. Experimental Design
Group | Travel | Online Auction
Group A | w/system | w/o system
Group B | w/o system | w/system
The solutions generated by the participants were transcribed into a common format so that the evaluators could not determine whether the solution was generated with the benefit of the system. The solutions were numbered for reference, randomized and sent to the two evaluators who were not involved in the experiment. They were provided the answer key for each case and the scoring scheme.
6.1.3 Data Analysis and Results from the Experiment
The subject pool consisted of 60 students (MIS majors) enrolled in a database course. They had been taught the fundamentals of data modeling, but were not considered experts in E-R modeling. Two separate evaluators graded the data models based on the answer key provided to them. The inter-rater reliability was high for both entity scores (0.95) and relationship scores (0.90), well above the recommended 85 percent [Kassarjian 1977]. Multivariate Analysis of Variance (MANOVA) was conducted to test the difference in performance between the various groups with the results presented in Table 3. For the multivariate analysis, the entity and relationship scores were specified as the dependent variables, the fixed factor was the OMDDE system (with or without) and Case (Travel and Online Auction) was specified as the covariate. Order effect was not significant. The role of the Case used by the subjects was tested to see if it had any significant influence on the overall model quality. As seen in Table 3a, only the presence of the system was shown to have a significant impact on the E-R model scores; the Case did not have any impact. Also, Table 3b shows that the mean scores for entities and relationships were both higher when the students used the system. As the overall MANOVA is significant at 0.000 (Table 3a), this supports H1 that designers using the system would produce E-R models with higher scores compared to the models generated without the use of the system.
Table 3. MANOVA Results
Table 3a. Overall MANOVA of E-R Model Scores
Source | F (Pillai’s Trace) | Significance | Partial Eta Squared
CASE | 0.877 | 0.421 | 0.030
OMDDE (System) | 129.354 | 0.000 | 0.822
Table 3b. MANOVA of Scores for Individual Facets of the E-R Model
Source | Dependent Variable | Mean With OMDDE | Mean Without OMDDE | df | F | Significance | Partial Eta Squared
OMDDE (System) | Entity Score | 8.900 | 6.508 | 1 | 240.907 | 0.000 | 0.809
OMDDE (System) | Relationship Score | 6.225 | 5.100 | 1 | 81.611 | 0.000 | 0.589
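As a reading aid for the effect sizes in Table 3b, partial eta squared is related to the F statistic by the standard identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error). The Python sketch below uses that identity and its inverse. The function names are hypothetical, and because the error degrees of freedom are not reported in the table, the second function only back-solves the value implied by the reported figures; it is a consistency check, not a result from the paper.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def implied_error_df(f_value, df_effect, eta_sq):
    """Error degrees of freedom implied by a reported F and partial eta squared."""
    return f_value * df_effect * (1.0 - eta_sq) / eta_sq


# Reported values from Table 3b (df_effect = 1 for both facets).
entity_df = implied_error_df(240.907, 1, 0.809)
relationship_df = implied_error_df(81.611, 1, 0.589)
```

Both facets imply nearly the same error degrees of freedom, which suggests the reported F values and effect sizes are mutually consistent.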
The results shown in Table 3b also indicate significant differences by treatment on the individual facet scores. The group that used the OMDDE system outperformed the group that did not use the system on both the entity score and the relationship score. The mean entity score for the group that had access to the system is 8.900, whereas the score for the group without access to the system is 6.508. As seen in Table 3b, this difference is significant (p < 0.001), thus substantiating the hypothesis (H1a) that designers using OMDDE will perform better in identifying appropriate “Entities” for the E-R model than designers without access to OMDDE. Similarly, the mean relationship score is 6.225 for the group using the system and 5.100 for the group without the system. This difference is also significant (p < 0.001, Table 3b) and supports the hypothesis (H1b) that designers using OMDDE will perform better in identifying appropriate “Relationships” for the E-R model than designers without access to OMDDE.

6.2 Validation with Experienced Designers

The second experiment investigated whether designers using the OMDDE system would create better quality E-R models than designers using a CASE tool. Typically, CASE tools provide a graphical environment and support a number of modeling methodologies. Since UML is widely used for modeling, we selected the Poseidon for UML CASE tool (http://gentleware.com) because of its intuitive interface and ease of use. It facilitates interoperability, and its HTML documentation generator allows models to be exported to HTML for sharing. It also supports code generation in Java, C++, XML, HTML, C#, VB.Net and PHP. The following hypotheses were tested.

H2: Designers using OMDDE will produce E-R models with higher scores compared to the models generated by designers using a CASE tool.
H2a: Designers using OMDDE will perform better in identifying appropriate “Entities” for the E-R model than the designers using a CASE tool.
H2b: Designers using OMDDE will perform better in identifying appropriate “Relationships” for the E-R model than the designers using a CASE tool.

To test these hypotheses, experienced IT professionals were used as subjects. Six systems analysts from industry with over ten years of experience in data modeling and application development, as well as four MIS faculty who have taught introduction to database management, participated in the experiment. The same experimental procedure was followed as in the initial validation. The subjects had considerable background in conceptual modeling using UML and database design, but were not familiar with ontologies or the role ontologies can play in conceptual modeling. Hence, the subjects were provided with some training on ontologies and how they are used in database design. They were familiarized with the OMDDE system by practicing with an example. The subjects also used the Poseidon CASE tool to generate E-R models when not using OMDDE, and were trained on Poseidon for creating conceptual models using class diagrams.

The subjects were again randomly assigned to two groups. Each group analyzed the same travel and auction cases and developed E-R models. The first group started with the OMDDE system for the travel case and then used the CASE tool for the auction case. The second group started with the CASE tool for the travel case and then used OMDDE for the auction case. The subjects provided written comments about the ease of use and usefulness of the OMDDE system. The solutions generated by the subjects were transcribed into a common format and randomized before being evaluated.
The answer key and the scoring scheme from the previous experiment were used for evaluating the models. Two independent evaluators graded the data models. The inter-rater reliability was high for both the entity and relationship scores (0.94 and 0.90, respectively). The results from the data analysis are given in Table 4.

Table 4. MANOVA Results
Table 4a. Overall MANOVA of E-R Model Scores
Source            F (Pillai’s Trace)   Significance   Partial Eta Squared
CASE              0.494                0.619          0.058
OMDDE (System)    55.468               0.000          0.874
Table 4b. MANOVA of Scores for Individual Facets of the E-R Model
Source: OMDDE (System)
Dependent Variable    Mean With OMDDE   Mean With CASE Tool   df   F        Significance   Partial Eta Squared
Entity Score          9.050             7.100                 1    37.528   0.000          0.688
Relationship Score    6.625             5.275                 1    83.736   0.000          0.831
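The partial eta squared values in Table 4b can be related to the F statistics through the standard identity: partial eta squared = (F x df_effect) / (F x df_effect + df_error). The sketch below illustrates that identity only. The paper does not report the error degrees of freedom, so the value 17 is an assumption, chosen solely because it reproduces the reported effect sizes; the function name is likewise hypothetical.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    explained = f_value * df_effect
    return explained / (explained + df_error)


# ASSUMPTION: the error degrees of freedom are not reported in the paper.
# 17 is used here only because it reproduces the effect sizes in Table 4b.
ASSUMED_DF_ERROR = 17

# F values and df = 1 are taken from Table 4b.
entity_eta = partial_eta_squared(37.528, 1, ASSUMED_DF_ERROR)
relationship_eta = partial_eta_squared(83.736, 1, ASSUMED_DF_ERROR)

print(round(entity_eta, 3))        # 0.688
print(round(relationship_eta, 3))  # 0.831
```

Read this way, an effect size of 0.831 means that the treatment accounts for roughly 83% of the variance in the relationship score that is not attributed to other effects in the model.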
The cases used and the order in which the subjects analyzed the cases did not have any effect on the scores. As shown in Table 4a, only the OMDDE system had a significant impact on the E-R model scores. As the overall MANOVA is significant (p < 0.001, Table 4a), this supports the hypothesis (H2) that designers using the OMDDE system would produce E-R models with higher scores than the models generated using a CASE tool.

Table 4b shows that the group that used the OMDDE system outperformed the group that used the CASE tool on both the entity score and the relationship score. The mean entity score for the group that used OMDDE is 9.050, whereas the score for the group that used the CASE tool is 7.100. This difference is significant (p < 0.001), thus substantiating the hypothesis (H2a) that designers using OMDDE will perform better in identifying appropriate “Entities” for the E-R model than designers using a CASE tool. Similarly, the mean relationship score is 6.625 for the group using the system and 5.275 for the group using the CASE tool. This difference is also significant (p < 0.001, Table 4b) and supports the hypothesis (H2b) that designers using OMDDE will perform better in identifying appropriate “Relationships” for the E-R model than designers using a CASE tool.

The subjects provided informal feedback on the system. Selected comments are shown in Figure 11. The comments suggest that the system performed reasonably well in suggesting entities and relationships and in validating the E-R model. The subjects remarked that the system is intuitive and easy to use, and that ontologies could play an important role in conceptual modeling and help to quickly generate a consistent and complete E-R model. The comments were positive, but the subjects also suggested improvements, such as adding navigational and explanation facilities.
“Ontologies definitely help with appropriate use of domain terminologies. The embedded knowledge in the ontology can help identify related entities as well as relationships. The tool is helpful for users not familiar with the domain. The system could be improved by making it a little bit more flexible.”

“The tool is intuitive and easy to use. It will be useful for creating E-R models quickly and also checking for consistency and completeness. The domain information can be used by a novice user in creating a new model or adding elements to an already existing model. The system suggests things that one may not think of normally.”

“OMDDE is helpful in creating E-R models and validating them. It enables the user to look at different aspects of an application domain. It would be a useful modeling tool. A graphical interface for inputting the E-R diagram or importing the information from a file for validation would be quite attractive.”

Fig. 11. Sample Comments from the Experiment
6.3 Comparing Knowledge Sources

The third experiment was conducted to test whether designers who have access to ontologies can generate better quality entity-relationship models than those who consult a different source of knowledge. For this experiment, Wikipedia (www.wikipedia.org), a free encyclopedia built collaboratively using wiki software, was used as the alternate knowledge source. Similar to the previous two experiments, the following hypotheses were tested:

H3: Designers using OMDDE will produce E-R models with higher scores compared to the models generated by designers using Wikipedia.
H3a: Designers using OMDDE will perform better in identifying appropriate “Entities” for the E-R model than the designers using Wikipedia.
H3b: Designers using OMDDE will perform better in identifying appropriate “Relationships” for the E-R model than the designers using Wikipedia.

Forty undergraduate students from a database management systems course participated in the experiment voluntarily. They did not have any prior knowledge of the study and were given a two-hour training session. Subjects were introduced to the concept of ontology and its use in E-R modeling and interacted with the OMDDE system on a test case. They were also shown how to navigate the Wikipedia website to search for information on various topics. The subjects also filled out a questionnaire regarding their MIS coursework and experience.

The same experimental procedure was followed as in the prior two experiments. The subjects were randomly assigned to two groups, and the same travel and auction cases were used to develop the E-R models. The first group used the OMDDE system for the travel case and Wikipedia for the auction case. The second group started with Wikipedia for the travel case and then moved on to the auction case using the OMDDE system. At the end of the experiment, the subjects also provided feedback on their experience with the OMDDE system. The solutions generated by the subjects were transcribed into a common format and randomized before being evaluated by two independent evaluators. The answer key and the scoring scheme from the previous experiments were used to evaluate the models. The inter-rater reliability was high for both the entity and relationship scores (0.97 and 0.93, respectively). The results from the data analysis are given in Table 5.
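The counterbalanced procedure described above (random assignment to two groups, with each group using OMDDE for one case and Wikipedia for the other) can be sketched as follows. This is a minimal illustration and not the authors' actual procedure: the subject identifiers, the fixed seed, the equal group sizes, and the function name `assign_counterbalanced` are all assumptions. Only the pairing of groups, cases, and knowledge sources comes from the text.

```python
import random


def assign_counterbalanced(subjects, seed=0):
    """Randomly split subjects into two halves with mirrored tool/case pairings."""
    shuffled = list(subjects)
    # A fixed seed is used only so the illustration is reproducible.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2  # ASSUMPTION: equal-sized groups
    # Pairings from the text: group 1 uses OMDDE on the travel case and
    # Wikipedia on the auction case; group 2 uses the reverse.
    plans = (
        {"travel": "OMDDE", "auction": "Wikipedia"},
        {"travel": "Wikipedia", "auction": "OMDDE"},
    )
    schedule = {}
    for index, subject in enumerate(shuffled):
        schedule[subject] = plans[0] if index < half else plans[1]
    return schedule


# Hypothetical subject identifiers; the study reports forty participants.
subjects = [f"S{n:02d}" for n in range(1, 41)]
schedule = assign_counterbalanced(subjects)

omdde_on_travel = sum(1 for plan in schedule.values() if plan["travel"] == "OMDDE")
print(omdde_on_travel)  # 20
```

Because every subject works on both cases and uses both knowledge sources, the case and the order of use can be analyzed as separate factors alongside the effect of the system.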
Table 5. MANOVA Results
Table 5a. Overall MANOVA of E-R Model Scores
Source            F (Pillai’s Trace)   Significance   Partial Eta Squared
CASE              0.109                0.897          0.013
OMDDE (System)    58.693               0.000          0.880
Table 5b. MANOVA of Scores for Individual Facets of the E-R Model
Source: OMDDE (System)
Dependent Variable    Mean With OMDDE   Mean With Wikipedia   df   F        Significance   Partial Eta Squared
Entity Score          9.313             7.475                 1    86.783   0.000          0.836
Relationship Score    6.788             5.687                 1    46.225   0.000          0.731
As indicated in Table 5a, only the OMDDE system had a significant impact on the E-R model scores. As the overall MANOVA is significant (p < 0.001, Table 5a), this supports the hypothesis (H3) that designers using the OMDDE system would produce E-R models with higher scores than designers using Wikipedia. The cases used and the order in which the subjects analyzed the cases did not have any effect on the scores.

Similar to the previous experiments, the group that used the OMDDE system outperformed the group that used Wikipedia on both the entity and the relationship scores. As seen in Table 5b, the mean entity score for the group that used OMDDE is 9.313, whereas the score for the group that used Wikipedia is 7.475. This difference is significant (p < 0.001), thus substantiating the hypothesis (H3a) that designers using OMDDE will perform better in identifying appropriate “Entities” for the E-R model than designers using Wikipedia. Similarly, the mean relationship score is 6.788 for the group using the OMDDE system and 5.687 for the group using Wikipedia. This difference is also significant (p < 0.001, Table 5b) and supports the hypothesis (H3b) that designers using OMDDE will perform better in identifying appropriate “Relationships” for the E-R model than designers using Wikipedia.

The subjects provided positive feedback on the ease of use and usefulness of the OMDDE system. They indicated that it was useful in identifying entities and relationships, but also suggested some modifications. With respect to Wikipedia, the subjects commented that it was time consuming to go through the information contained in the wikis and that there was no support for identifying entities and relationships. Also, the quality of information contained in the wikis may not be consistent.

7. CONCLUSION

This research has demonstrated how domain knowledge, stored in an ontology, can be used to assist in the generation of complete and consistent database designs, both when the designs are generated from scratch and when a partial design exists. A framework for these two tasks has been presented which uses ontology primitives such as relationships and constraints between terms to generate a consistent design. A prototype has been
implemented, called the Ontology Management and Database Design Environment, which supports entity-relationship modeling activities.

This research is an initial step to illustrate how domain ontologies, which capture knowledge about specific application domains, can be used for the creation and validation of entity-relationship models for conceptual modeling. The initial results demonstrate the feasibility and usefulness of the research. The prototype helped both naïve and experienced modelers create better quality E-R models than they produced with no tool, with a CASE tool, or with Wikipedia. The results suggest that domain ontologies can indeed be an asset in supporting conceptual modeling activities.

Our future work will refine the prototype, assess its effectiveness for a variety of design tasks, and incorporate organization-specific knowledge for data modeling into the methodology and prototype. It is believed that ontologies, eventually, will help to solve the “grand problem” of incorporating semantics into information processing [Embley 2004]. This research has shown one potential application of useful ontologies.

ACKNOWLEDGMENTS

The authors wish to thank the associate editor, Mary Fernandez, and the anonymous reviewers for their detailed and helpful comments on an earlier version of this paper.

REFERENCES

ALLEN, J. F. 1983. Maintaining Knowledge about Temporal Intervals. Communications of the ACM, Vol. 26, No. 11, pp. 832 – 843.
ALLEN, G. AND MARCH, S. 2005. The Effects of State-Based and Event-Based Data Representations on User Performance in Query Formulation Tasks. MIS Quarterly, forthcoming.
AL-MUHAMMED, M., EMBLEY, D.W., AND LIDDLE, S.W. 2005. Conceptual model based semantic web services. In Proceedings of the Twenty Fourth International Conference on Conceptual Modeling (ER’05), Klagenfurt, Austria, 26-28 October 2005, pp. 288-303.
ALTER, S. 1999. Information Systems: A Management Perspective. Addison-Wesley, 1999.
BATRA, D., HOFFER, J. A., AND BOSTROM, R. P. 1990. Comparing Representations Developed Using Relational and EER Models. Communications of the ACM, Vol. 33, No. 2, pp. 128 – 139.
BERGHOLTZ, M., JOHANNESSON, P. 2001. Classifying the Semantics of Relationships in Conceptual Modeling by Categorization of Roles. In Proceedings of the 6th International Workshop on Applications of Natural Language to Information Systems (NLDB’01), Madrid, Spain, June 28-29, 2001, pp. 199 – 203.
BOCK, D. B., AND YAGER, S. E. 2002. Improving Entity Relationship Modeling Accuracy with Novice Data Modelers. Journal of Computer Information Systems, Vol. XXXXII, No. 2, pp. 69 – 75.
BODART, F., PATE, A., SIM, M., AND WEBER, R. 2002. Should Optional Properties be Used in Conceptual Modeling? A Theory and Three Empirical Tests. Information Systems Research, Vol.12, No.4, 2002, pp.384-405.
BREWSTER, C., O'HARA, K. 2004. Knowledge representation with ontologies: the present and future. IEEE Intelligent Systems, Vol. 19, No. 1, pp. 72 - 81.
BUNGE, M. 1997. Treatise on Basic Philosophy: Vol. 3: Ontology I: The Furniture of the World. D. Reidel Publishing Co., Inc., New York, NY., 1977.
BURTON-JONES, A., STOREY, V., SUGUMARAN, V., AHLUWALIA, P. 2005. A Semiotic Metrics Suite for Assessing the Quality of Ontologies. Data and Knowledge Engineering, Vol. 55, No. 1, pp. 84 – 102.
CORCHO, O., FERNANDEZ-LOPEZ, M., AND GOMEZ-PEREZ, A. 2003a. Methodologies, tools and languages for building ontologies. Where is their meeting point? Data and Knowledge Engineering, Vol. 46, No. 1, pp. 41 – 64.
CORCHO, O., GOMEZ-PEREZ, A., GUERRERO-RODRIGUEZ, D.J., PEREZ-REY, D., RUIZ-CRISTINA, A., SASTRETORAL, T., AND SUAREZ-FIGUEROA, M.C. 2003b. Evaluation experiment of ontology tools' interoperability with the WebODE ontology engineering workbench. 2nd Int. Workshop on Evaluation of Ontology-based Tools, Oct 20, Sanibel Island, Florida, USA.
DAHCHOUR. 2001. Integrating generic relationships into object models using metaclasses. PhD thesis, Department of Computing Science and Engineering, Université catholique de Louvain, Belgium, March 2001.
DENNIS, A. AND WIXOM, B. H. 2000. Systems Analysis and Design. John Wiley, 2000.
DEY, D., STOREY, V.C., AND BARRON, T.M. 1999. Improving Database Design through the Analysis of Relationships. ACM Transactions on Database Systems, Vol.24, No.4, pp.453-486.
DING, Y., AND FOO, S. 2002. Ontology Research and Development: Part 2 – A Review of Ontology Mapping and Evolving. Journal of Information Science, Vol.28, No.5, pp. 375-388.
EMBLEY, D.W. 2004. Toward Semantic Understanding—An Approach Based on Information Extraction Ontologies. Proc of the Fifteenth Australasian Database Conference (ADC’04), Dunedin, New Zealand, January 18 – 22. In Schewe, K-D., and Williams, H. (Eds.), Conferences in Research and Practice in Information Technology, Vol. 27.
EMBLEY, D.W., TAO C., AND LIDDLE, S.W. 2005. Automating the extraction of data from HTML tables with unknown structure, Data & Knowledge Engineering, Volume 54, Number 1, July 2005, pp. 3-28.
FENSEL, D., HARMELEN, F.V., HORROCKS, I., MCGUINNESS, D.L. AND PATEL-SCHNEIDER, P.F. 2001. OIL: An Ontology Infrastructure for the Semantic Web. IEEE Intelligent Systems (March/April), 2001, pp. 38-45.
FRIEDMAN-HILL, E. 2006. JESS, the Expert System Shell, Sandia National Laboratories, Livermore, CA, 2004. URL: http://herzberg.ca.sandia.gov/jess
GOMEZ-PEREZ, A. 1999. Tutorial on Ontological Engineering. Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, July 31 - Aug 6, 1999.
GRUBER, T.R. 1993. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, Vol.5, 1993, pp.199-220.
GUALTIERI, A., AND RUFFOLO, M. 2005. An Ontology-Based Framework for Representing Organizational Knowledge. Proceedings of the 5th International Conference on Knowledge Management (I-KNOW’05), Graz, Austria, June 29 – July 1, 2005, pp. 71 – 78.
GUARINO, N. 2005. Personal correspondence on events, 2005.
HEFLIN, J., AND HENDLER, J. 2000. Dynamic Ontologies on the Web. Proceedings of 17th AAAI-2000, AAAI/MIT Press, Menlo Park, 2000, pp. 443-449.
HENDLER, J. 2001. Agents and the Semantic Web. IEEE Intelligent Systems, March/April, 2001, pp.30-36.
KASSARJIAN, H. H. 1977. Content Analysis in Consumer Research. Journal of Consumer Research, Vol. 4, No. 1, 1977, pp. 8 – 18.
KNIGHT, C., GAŠEVIĆ, D., AND RICHARDS, G. 2006. An Ontology-Based Framework for Bridging Learning Design and Learning Content. Educational Technology & Society, 9 (1), 23-37.
LAMMARI, N., AND METAIS, E. 2004. Building and maintaining ontologies: a set of algorithms. Data and Knowledge Engineering, Vol. 48, No. 2, 2004, pp. 155 – 176.
LENAT, D.B. 1995. CYC: A Large-scale Investment in Knowledge Infrastructure. Communications of the ACM, 38, 11, 33 -38.
LINDLAND, O. I., SINDRE, G., AND SOLVBERG, A. 1994. Understanding Quality in Conceptual Modeling. IEEE Software, Vol. 11, No. 2, 1994, pp. 42 – 49.
LLOYD-WILLIAMS, M. 1997. Exploiting Domain Knowledge during the Automated Design of Object Oriented Databases. 16th International Conference on Conceptual Modeling (E-R’97), Los Angeles, CA., 3-6 November.
MASON, O., AND GROVE-STEPHENSEN, I. 2002. Automated Free Text Marking with Paperless School. Proceedings of 6th International Computer Assisted Assessment Conference, Loughborough, July 9th and 10th, 2002, pp.213-219.
MCGUINNESS, D. L. 2002. Ontologies Come of Age. In Dieter Fensel, Jim Hendler, Henry Lieberman, and Wolfgang Wahlster, editors. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, 2002.
MOODY, D.L., AND SHANKS, G.G. 1998. What Makes a Good Data Model? A Framework for Evaluating and Improving the Quality of Entity Relationship Models. The Australian Computer Journal, Vol. 30, No. 3, 1998, pp. 97 – 110.
NOAH, S.A. & WILLIAMS, M.D. 2002. Knowledge-Based Approaches to Database Design Diagnosis; Improving Performance with a Domain Specific Thesaurus Structure. In: Ishii, N. (ed) ACI 2002: Proceedings of the 2002 IASTED International Conference on Artificial and Computational Intelligence, pp. 366-371, Calgary, Alberta: ACTA Press.
NOY, N.F., RUBIN, D.L., MUSEN, M.A. 2004. Making biomedical ontologies and ontology repositories work. IEEE Intelligent Systems, Vol. 19, No. 6, pp. 78 - 81.
OBERLE, D., VOLZ, R., MOTIK, B., STAAB, S. 2004. An extensible ontology software environment. In Staab, S., and Studer, R., Handbook on Ontologies, pp. 311-333. Springer, 2004.
PARSONS, J., AND WAND, Y. 2000. Emancipating Instances from the Tyranny of Classes in Information Modeling. ACM Transactions on Database Systems, Vol.25, No.2, June 2000, pp.228-268.
PEASE, A., NILES, I., AND LI, J. 2002. The Suggested Upper Merged Ontology: A Large Ontology for the Semantic Web and its Applications. In Working Notes of the AAAI-2002 Workshop on Ontologies and the Semantic Web, Edmonton, Canada, July 28-August 1, 2002.
PINTO, H. S., AND MARTINS, P. J. 2004. Ontologies: How can they be Built? Knowledge and Information Systems, Vol. 6, No. 4, 2004, pp. 441 – 464.
PURAO, S., AND STOREY, V.C. 2005. A Multi-layered Ontology for Comparing Relationship Semantics in Conceptual Models of Databases. Journal of Applied Ontology 1, 1, 117-139.
SHANKS, G., TANSLEY, E., NURELINI, J., TOBLIN, D., AND WEBER, R. 2002. Representing Part-Whole Relationships in Conceptual Modeling: An Empirical Evaluation. International Conference on Information Systems, Barcelona, Spain, 15-18 December 2002.
SIAU, K. 2004. Relationship Construct in Modeling Information Systems. Editorial Preface - Journal of Database Management. Vol 15(3). July-Sept 2004. pp. i-v.
SIAU, K., WAND, Y., AND BENBASAT, I. 1997. The Relative Importance of Structural Constraints and Surface Semantics in Information Modeling. Information Systems, Vol. 22, No.23, 1997, pp.155-170.
STEWART, D.B., ARORA, G. 2003. A tool for analyzing and fine tuning the real-time properties of an embedded system. IEEE Trans on Software Engineering, Vol. 29, No. 4, pp. 311 - 326.
STOREY, V.C. 2005. Classifying and Comparing Relationships in Conceptual Modeling. IEEE Transactions on Knowledge and Data Engineering 17, 11, 1-13.
STOREY, V.C., CHIANG, R., AND CHEN, L. 2005. Ontology Creation: Extraction of Domain Knowledge from Web Documents. In Proceedings of the 24th International Conference on Conceptual Modeling (ER’05), Lecture Notes in Computer Science, Springer-Verlag, Klagenfurt, Austria 24-28 October, 2005.
STOREY, V.C., DEY, D., ULLRICH, H., AND SUNDARESAN, S. 1998. An Ontology-Based Expert System for Database Design. Data and Knowledge Engineering, Vol.28, No.1, 1998, pp.31-46.
SUGUMARAN, V., STOREY, V. 2002. Ontologies for Conceptual Modeling: Their Creation, Use, and Management. Data & Knowledge Engineering, Vol. 42, No. 3, 2002, pp. 251 – 271.
SWARTOUT, W. 1999. “Ontologies”, IEEE Intelligent Systems, January-February, 1999, pp.18-19.
TEOREY, T.J., YANG, D. AND FRY, J.P. 1986. A Logical Design Methodology for Relational Databases Using the Extended Entity-Relationship Model. ACM Computing Surveys (18:2), 1986, pp. 197-222.
TURK, D.E., VIJAYASARATHY, L. R., AND CLARK, J. D. 2003. The Value of Conceptual Modeling in Database Development: An Experimental Investigation. In Proceedings of the Evaluation of Modeling Methods in Systems Analysis and Design Conference, 2003, Velden, Austria.
WAND, Y., STOREY, V.C. AND WEBER, R. 1999. Analyzing the Meaning of a Relationship. ACM Transactions on Database Systems, Vol.24, No.4. December, 1999, pp.494-528.
WEBER, R. 1996. Are Attributes Entities? A Study of Database Designers' Memory Structures. Information Systems Research, Vol. 7, No.2, June 1996, pp.137-162.
WEBER, R. 2002a. Conceptual Modeling and Ontology: Possibilities and Pitfalls. In S. Spaccapietra, S.T. March, and Y. Kambayashi (Eds.): ER 2002, LNCS 2503, pp. 1-2, Springer-Verlag Berlin Heidelberg.
WEBER, R. 2002b. Ontological Issues in Accounting Information Systems. Sutton, S. and Arnold, V., (Eds.), Researching Accounting as an Information Systems Discipline. Sarasota, FL: American Accounting Association.
WELTY, C., AND GUARINO, N. 2001. Supporting Ontological Analysis of Taxonomic Relationships. In Laender, A.H.F. and Storey, V.C. (Eds.), Special Issue on E-R 2000, Data and Knowledge Engineering, Vol.39, No.1, 2001, pp. 51-74.