
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 11, NOVEMBER 2005

Comparing Relationships in Conceptual Modeling: Mapping to Semantic Classifications

Veda C. Storey

Abstract: Much of the research that deals with understanding the real world and representing it in a conceptual model uses some form of the entity-relationship model as a means of representation. This research proposes an ontology for classifying relationship verb phrases based upon the domain and context of the application within which the relationship appears. The classification categories to which the verb phrases are mapped were developed based upon prior research in databases, ontologies, and linguistics. The usefulness of the ontology for comparing relationships when used in conjunction with an entity ontology is discussed. Together, these ontologies can be effective in comparing two conceptual database designs for integration and validation. Empirical testing of the ontology on a number of relationships from different application domains and contexts illustrates the usefulness of the research.

Index Terms: Design representation, design concepts, design methodologies, logical design data models, logical design schema and subschema.

1 INTRODUCTION

The amount of data available in both traditional and Web-based databases continues to increase. So does the need for new methods to design and integrate databases and to make the data meaningful for users [5]. The conceptual modeling phase of database design focuses on building a high-quality representation of selected phenomena in some domain [66]. Database designers generate conceptual models (scripts) using 1) conceptual modeling grammars (e.g., the entity-relationship modeling grammar) and 2) conceptual modeling methods, and work within an organizational context [65]. Most conceptual modeling methods are concerned with “things,” often referred to as entities, and associations among things, referred to as relationships [8], [64]. Entities and relationships are, thus, fundamental to conceptual modeling.

A major challenge during conceptual modeling is to identify which construct to use in the creation of a conceptual model, that is, whether something should be represented as an entity, a relationship, or an attribute. Useful guidelines are emerging for choosing the appropriate construct (e.g., [6], [51], [53], [54]). The verb phrase of a relationship, however, is usually selected by a designer, sometimes without a great deal of thought given to finding the one that best reflects the semantics of an application. This can lead to problems when designs later need to be compared and their respective databases integrated. During database integration, entities and relationships are compared and those representing the same real-world situation are combined.

A database, in essence, represents a conceptualization, or simplified view, of the world. If a common ontology could be developed that would facilitate the sharing of terms, then two database designs that were mapped to the common ontology could be compared, based upon the common mapping. Fully automated techniques for doing this are unlikely, so solutions to database integration problems should aid integrators, but require minimal work on their part [5].

Having some mechanism to compare the semantics of relationship verb phrases to ascertain whether they are the same would be useful when comparing relationships during database integration. It would also be valuable for comparing conceptual designs during view integration or comparing a new design to a domain model (or template) for design generation and reuse.

The objective of this research, therefore, is to develop an ontology for the semantic classification of relationship verb phrases that will assist in the comparison of relationships for generating, evaluating, or integrating database designs. The comparison will be semiautomated because some interaction with the user (a designer) will be required. To address this objective, the research:

1. develops an ontology for classifying relationship verb phrases,
2. demonstrates the effectiveness of the ontology through an empirical study, and
3. develops a prototype of an interactive system that uses the ontology for comparing relationship verb phrases.

The ontology is intended to serve as part of a database design tool that will compare and integrate relationships by identifying similarities and differences in entities, verb phrases, and cardinalities. The contribution of the research reported in this paper is to develop a classification scheme for relationship verb phrases that would be part of a semiautomated approach to sharing and reusing knowledge for the creation and integration of database designs. The research is restricted to business databases, which should make it practical and be a step forward in organizing real-world knowledge for databases [45].

The author is with the Department of Computer Information Systems, College of Business Administration, Georgia State University, PO Box 4015, Atlanta, GA 30302. E-mail: [email protected].
Manuscript received 20 Apr. 2004; revised 8 Nov. 2004; accepted 24 Mar. 2005; published online 19 Sept. 2005.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-0110-0404.
1041-4347/05/$20.00 © 2005 IEEE

Published by the IEEE Computer Society

STOREY: COMPARING RELATIONSHIPS IN CONCEPTUAL MODELING: MAPPING TO SEMANTIC CLASSIFICATIONS


Fig. 1. Intradomain relationship integration.

2 RELATED RESEARCH

A model is a representation of the real world which is constructed by an abstraction process in which some details are included and others omitted [22]. Conceptual models are intended to provide an accurate, complete representation of someone’s or some group’s understanding of a domain, but need to be adapted for different purposes [6]. The entity-relationship model is one of the most widely used models in conceptual modeling [14], [50]. It also has been analyzed by other research dealing with topics on knowledge representation formalisms, including graph-based notations [29], [55]. In the entity-relationship model, a relationship is usually defined as an association between two or more entities [13] and expressed as A verb phrase B (A vp B), where A and B are entities. Entities are denoted as nouns, connected by a verb (or verb phrase) [14], [15], [71] based upon normal usage in English. Nonbinary relationships are important and worthy of detailed analysis [23]. However, binary relationships are capable of representing a great many real-world applications and are the focus of this research.

It can be difficult to classify and compare relationship verb phrases for various reasons. First, generic verb phrases may represent different concepts. For example, “has” can represent possession, part-of, or other interpretations. Second, even relationships that are intended to have well-understood semantics (is-a [7], part-of, instance-of [61], and member-of [9]) can be subject to multiple interpretations. Third, the meanings of the verb phrases may be dependent upon the application domain and context. For example, Player switches Teams captures the notion of players moving from one team to another, whereas Train switches Cars refers to the assembly of rail cars in a particular order.

2.1 Database and View Integration

There is increasing reliance on data collected in databases and data warehouses and continued growth of interorganizational systems and the World Wide Web. As a result, users of information systems need increasing access and use of data in heterogeneous databases [65]. Database integration is the process of combining two or more databases into one. Before databases can be integrated, differences among them must first be identified and resolved [38]. Database integration involves taking a set of databases and producing a single, unified description of the input schemas (the integrated schema) and associated mapping information [49]. Problems associated with integrating heterogeneous databases [32] are similar to those of view integration, where different aspects, or views, of one database are to be integrated. Central to automating database integration is the need to develop semiautomated approaches to discovering semantic relatedness [5]. This is difficult because it requires some means of capturing or representing the meaning of verb phrases. (See Biskup and Embley [5] for an excellent summary of the problems and vast research on database integration and its automation.)

In database design automation, there is a need to incorporate domain knowledge for database design tools to advance [45]. This would be useful both for checking the completeness of a given design and for generating a new design from a generic schema. Consider the relationships, Playwright revises Script and Playwright updates Script in Fig. 1. This is intradomain integration. The two relationships can be integrated into Playwright changes Script. In Fig. 2, the relationships, Playwright writes Script and Composer composes Score are from different, but related, domains. They can be integrated to produce Creator creates Work, for a design of a higher-level, arts database [58]. This is interdomain integration. The higher-level design can serve as a domain model. A domain model, for the purpose of this research, is a description of concepts in the domain of interest expressed in the form of a conceptual model [1, http://c2.com/cgi/wiki?DomainModel]. From this template, more specific models can be created.

For both intradomain and interdomain integration, methods are needed to identify and reconcile the differences in the various constructs of the entity-relationship models.

2.2 Capturing Semantics

Mechanisms for capturing some of the semantics of the real world are clearly needed to compare the entities and verb phrases of relationships. For this research, “semantics” is defined as the meaning, or essential message, of the terms used in the conceptual model, that is, of words and phrases representing entities and verbs. Some semantics are well-known and unambiguous. For example, in Employer hires Applicant, the verb phrase means that an employer takes an applicant into his or her employ. On the other hand, semantics may vary with context. For instance, Company posts Job-Opening describes an act of communication, often via some form of publication, whereas Company posts Payment communicates a change in financial records. Company posts Employee (e.g., to its branch office in Siberia) is an assignment of an employee to a particular location and could be viewed as moving an employee. When Company posts Offer-Letter, however, it contacts someone. The semantics of a verb phrase thus depend upon the application domain and the context within which it occurs.

Fig. 2. Interdomain relationship integration; higher-level relationship created.
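The context dependence of “posts” can be made concrete with a small lookup. The following is a minimal sketch, not the paper's implementation: the table `POSTS_SENSES` and the function `classify` are hypothetical names, and the category labels are a reading of the four examples above. The fallback value uses the default category described later in Section 3.8.

```python
# Hypothetical lookup: (verb phrase, object entity) -> semantic category.
# The labels paraphrase the "posts" examples in the text; they are an
# illustration, not an authoritative part of the ontology.
POSTS_SENSES = {
    ("posts", "Job-Opening"): "communication",
    ("posts", "Payment"): "change",
    ("posts", "Employee"): "motion",
    ("posts", "Offer-Letter"): "contact",
}


def classify(verb_phrase, object_entity, default="associated-with"):
    """Return the category for a verb phrase given the entity it acts on."""
    return POSTS_SENSES.get((verb_phrase, object_entity), default)


for entity in ("Job-Opening", "Payment", "Employee", "Offer-Letter"):
    print(f"Company posts {entity}: {classify('posts', entity)}")
```

The same verb maps to four different categories, which is why a verb phrase cannot be classified from the verb alone.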

2.2.1 Ontologies for Capturing Semantics

An ontology is a way of describing one’s world [67] and generally consists of terms, their definitions, and axioms relating them [30]. Ontologies are found in many areas, including the semantic Web [3], [34], machine understanding [33], natural language processing and text interpretation [19], [20], domain-specific applications [25], semantic heterogeneity [35], and conceptual database design [2], [24]. Unfortunately, there are many different definitions, descriptions, types, and approaches associated with ontology development [11], [16], [20], [31], [44], [46], [47], [66], [69]. (See Weber [66] for an overview.)

For the purposes of this research, an ontology is a structure that organizes knowledge in a systematic way; in this case, so it can be used to capture the semantics of relationship verb phrases. It is an application ontology [66] that provides a structure to specify representational terms [43] in the business domain for database design. It meets the criterion of Swartout [60], who describes an ontology as a “set of concepts or terms that can be used to describe some area of knowledge or build a representation of it.” A database can be thought of as a representation of a real-world situation. Therefore, if the constructs of a database conceptual model can be mapped to a common structure, they can be compared. In this way, ontologies can support design automation by functioning as semantic maps.

2.2.2 Ontologies for Classifying Large Bodies of Knowledge

There have been significant attempts to classify a large body of knowledge. Those that are most relevant to this research are: WordNet [27], [39] because it provides a classification of verbs that has proven useful [27], SPEDE [16] because it focuses on business domains and on reusing previously acquired knowledge [48], and the classification scheme of Bergholtz and Johannesson [2] because it classifies relationship semantics.

The well-known lexical database, WordNet, is a comprehensive online catalog of English terms that attempts to classify all parts of the English language [27]. It organizes them into synonym sets with underlying word senses. For example, the verb “exchange” has five word senses: give and receive, replace with another, change or switch over, hand over and receive another, or convert. WordNet contains over 21,000 verb word forms that are divided into 15 files, based on semantic criteria. WordNet also identifies a set of most frequently used, common verbs. The meanings of verbs often depend heavily upon the nouns with which they occur [39]. In database design, these are entities which, in turn, depend upon the application domain and context. The WordNet classification of verbs has been able to classify all of the synonym sets of WordNet [27].

Previous attempts to adapt WordNet concepts to conceptual modeling have been motivated by the desire to make the words that appear in conceptual models consistent [10]. It includes other types of verb phrases found in design and organizes relationships into user-classified and standard (generalization, aggregation, possession, and instantiation). Other research attempts to classify all “semantic relationships” based upon their verbs [37], [71]. These have been applied to conceptual modeling and other design activities [36].

The tool set, SPEDE [17], includes a database of generic business processes which, combined with an ontology, facilitates reuse. The SPEDE ontology is based on a set of verbs from a collection of generic business processes, the Process Classification Framework, produced by the American Productivity and Quality Center. The ontology includes 19 classifications (modify, identify, adjust, move, etc.). At a minimum, a relationship ontology should be able to accommodate these classifications.

These approaches to classifying verb phrases provide some indications of how to develop classification categories. First, they recognize common, generic verb phrases. Second, they highlight the importance of data abstractions in conceptual modeling. Third, they identify the need to deal with the domain-dependent nature of relationships. However, none of these efforts to classify verb phrases has been fully developed for relationship comparison in database design and integration or for creating automated methods to support these activities.

TABLE 1 Results from Entity Comparisons Using Entity Ontology

TABLE 2 Entity Classification Scheme Based upon Entity Ontology

2.3 Relationships in Conceptual Modeling

A relationship is of the form A verb phrase B. To compare two relationships, $R_1 = A_1\ vp_1\ B_1$ and $R_2 = A_2\ vp_2\ B_2$, there are three constructs to consider:

1. the entities,
2. the verb phrases, and
3. the cardinalities.

Some prior work has been carried out on comparing entities. This research focuses on verb phrases. The cardinalities of a relationship (whether the entity occurrences are optional or mandatory) describe some characteristics of the relationship [64] and need to be considered in future research.
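The three constructs can be held in one small record. The sketch below is illustrative only: the `Relationship` class and `parse_relationship` helper are hypothetical names, and the parser assumes each entity name is a single token (for example, a hyphenated name such as Job-Opening), so that everything between the first and last token is the verb phrase.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Relationship:
    """A binary relationship of the form 'A verb phrase B'."""
    left: str                     # entity A
    verb_phrase: str              # vp
    right: str                    # entity B
    # Cardinalities are left optional; the text defers them to future research.
    cardinalities: Optional[Tuple[str, str]] = None


def parse_relationship(text: str) -> Relationship:
    """Split 'A vp B' into its parts, assuming single-token entity names."""
    tokens = text.split()
    if len(tokens) < 3:
        raise ValueError(f"expected 'A verb-phrase B', got {text!r}")
    return Relationship(tokens[0], " ".join(tokens[1:-1]), tokens[-1])


r1 = parse_relationship("Playwright writes Script")
r2 = parse_relationship("Composer composes Score")
print(r1.verb_phrase, r2.verb_phrase)
```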

2.3.1 Entity Comparison

Consider again the example in Fig. 1 of intradomain integration. Playwright and Writer need to be recognized as synonyms so the relationships can be integrated. The interdomain integration in Fig. 2 requires more effort. Playwright and Composer are both people, so it is possible that they refer to the same thing. The relationship verb phrases then need to be compared. Both writes and composes could be classified as verbs of creation, so an appropriate, integrated relationship is Creator creates Work. This would require input from the designer, but still would support the integration process.

Our prior research proposed a methodology for comparing an entity-relationship model based primarily on entities and attributes [57], [58]. Entities are compared using an entity ontology [59] that classifies an entity into 57 categories based upon the user’s responses to, at most, eight simple “yes/no” questions (e.g., does the entity have weight? can it be bought or sold? does it move?). The result of comparing two entities is to identify them as being the “same” (identical or synonyms), “close” (candidates to be the same because they share one or more characteristics (e.g., has weight, cannot move), but verification is needed from the user), or “different.” For example, Worker and Employee would both be classified as “person.” They are “close” (share the same classification based upon the questioning scheme) and so are candidates to be the same during automated comparison of two models. Table 1 summarizes the comparison results for entities. Examples are shown in Table 2, which indicate that Pilot and Building must be different and that Project and Assignment might be the same.

That research, however, considers only the structure of the relationship when comparing two designs. To be more effective, a comparison of the relationship verb phrases is needed. In the relationships Employee moves Equipment and Employee transports Equipment, the entities are the same; the verb phrases both capture some type of “motion.” The relationships are thus candidates to be the same (with confirmation needed from the designer). If they were the same, only one would be needed in the design.
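The “same / close / different” outcome and the candidate test for two relationships can be sketched as follows. This is a simplified illustration under stated assumptions: `compare_entities` and `candidate_match` are hypothetical names, the entity classes other than “person” are invented placeholders, and the real entity ontology derives classes from the yes/no questions rather than from a fixed table.

```python
def compare_entities(a, b, synonyms, entity_class):
    """Return "same", "close", or "different" for two entity names."""
    if a == b or frozenset((a, b)) in synonyms:
        return "same"
    class_a, class_b = entity_class.get(a), entity_class.get(b)
    if class_a is not None and class_a == class_b:
        return "close"  # a candidate only; the designer must confirm
    return "different"


def candidate_match(r1, r2, synonyms, entity_class, verb_category):
    """True when two (A, verb phrase, B) triples may denote one relationship."""
    (a1, vp1, b1), (a2, vp2, b2) = r1, r2
    entities_ok = all(
        compare_entities(x, y, synonyms, entity_class) != "different"
        for x, y in ((a1, a2), (b1, b2))
    )
    cat1, cat2 = verb_category.get(vp1), verb_category.get(vp2)
    return entities_ok and cat1 is not None and cat1 == cat2


# Hypothetical inputs; "structure" is a placeholder class, not from the paper.
SYNONYMS = {frozenset(("Playwright", "Writer"))}
ENTITY_CLASS = {"Worker": "person", "Employee": "person",
                "Pilot": "person", "Building": "structure"}
VERB_CATEGORY = {"moves": "motion", "transports": "motion",
                 "writes": "creation", "composes": "creation"}

print(compare_entities("Worker", "Employee", SYNONYMS, ENTITY_CLASS))
print(candidate_match(("Employee", "moves", "Equipment"),
                      ("Employee", "transports", "Equipment"),
                      SYNONYMS, ENTITY_CLASS, VERB_CATEGORY))
```

A `True` result only flags the pair for the designer; it does not merge the relationships automatically.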

2.3.2 Relationship Comparison

The most notable prior work on relationship verb phrase classification was carried out by Bergholtz and Johannesson [2], who proposed an ontology for classifying and analyzing relationships to identify their similarities and differences. In their ontology, “Common” refers to data abstractions. Speech-act relationships consider the roles of the entities based upon three speech-act classes [52]: commissives, directives, and declaratives. Organization relationships exist between an actor and another object within an organization. Activity relationships have a predetermined extension in time. RoleLabel and RoleCat provide classifications of the roles and relationships. The ontology has been applied to a few relationships, but not tested rigorously.

TABLE 3 Sources for Verb Phrase Classification Categories

3 RELATIONSHIP ONTOLOGY

This section proposes an ontology for classifying verb phrases of relationships. The ontology provides a set of classification categories into which relationship verb phrases can be mapped for comparison purposes. The classification categories were derived from results of prior research on database design, linguistics, business applications, and results from testing. Table 3 summarizes the sources consulted in the development of the ontology. The ontology in this work draws its expressive power as a classification mechanism from these areas. They have been accepted and proven to some degree, applied previously, and cover a wide range of verb phrases appropriate for database design applications. The initial categories were then refined based upon feedback from a pilot study and initial testing.

The ontology provides a set of classification categories from which a user selects an appropriate interpretation. Although there may be no perfect way to neatly divide verb phrases into predetermined dimensions [27], there have been useful attempts to develop exhaustive classifications of verbs of the English language [27], [37]. The best interpretation of many verb phrases depends upon the purpose and context within which a verb phrase occurs and the nouns that surround it [27]. In this research, the verb phrases exist between entities.

The ontology classification categories are summarized in Fig. 3 and explained below. Not shown in the diagram are the individual verb files.

3.1 Common Verb Phrases

These are based upon WordNet’s list of common verbs: have, be, run, make, set, go, take, and get. Of these, has occurs commonly in database design and has been analyzed because it is used in reference to the common data abstraction, part-of [56]. Take and get can easily appear in database design (e.g., Student takes Course, Employee gets Promotion). Set, make, and run each have a number of word senses in WordNet. The forms of these verbs that would be useful for database applications can be captured by the other categories in the ontology, most notably the verb files categories. Be and go are not of a form that would normally appear in database design. Table 4 shows possible interpretations for these verbs that were derived from dictionary interpretations, WordNet senses, and design experience.

3.2 Data Abstractions

Data abstractions are well-accepted in conceptual modeling and have been applied previously to relationship classification [2], [40], [41]. Research on semantic relationships from cognitive science identifies them as meronymic relationships [18]. Although Winston et al. [71] identify seven interpretations of meronymic relationships, those most relevant to database design are component (part-of) and member-of [42], [56]. Design implications such as semantic integrity constraints, primary-key selection, and inheritance that are associated with these abstraction categories highlight the importance of identifying the correct interpretation [28].

3.3 Cause

“Cause” occurs frequently, capturing the notion of bringing about some effect. In WordNet, it has two senses: 1) to cause to happen or occur, not always intentionally, and 2) to do something in a specified manner.

3.4 WordNet Verb Files

The WordNet verb files adapted are: change, communication, competition, consumption, contact, cognition or perception, creation, motion, possession, social interaction, and static. Cognition and perception were combined based upon pretesting results. Verbs of bodily care are not relevant to business applications.

Fig. 3. Ontology classification categories.

3.5 Business Process

The category transaction/exchange/trade captures general business dealings. Evaluate or observe are part of business assessment. Business processes were explicitly identified by SPEDE [16].

3.6 Temporal

The importance of temporal representation has long been recognized [4]. An obvious consideration is whether both the “before” and “after” relationships need to be included in a design. For example, Employee starts Job and Employee ends Job capture different semantics, even though the relationships have the same structure.

3.7 Event

Many business operations are affected by events. WordNet recognizes events and other interactions [27].

3.8 Associated with

A relationship has been defined as a connection or association between entities [9], [14], [62]. This is a default category, intended to capture additional associations that might not have been captured by the other categories. The need for this category was identified during the initial testing.

Table 5 summarizes the classification categories and provides examples for a particular application domain and context.
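The categories described in Sections 3.1 through 3.8 can be collected into a simple structure for lookup. The grouping below is a sketch based only on the names that appear in the text; the exact granularity (for example, whether “evaluate or observe” is one category or two, or how the temporal category is subdivided) is an assumption, and Fig. 3 and Table 5 remain the authoritative list. `ONTOLOGY`, `DEFAULT_CATEGORY`, and `group_of` are hypothetical names.

```python
# Category groups as named in Sections 3.1-3.8; granularity is an assumption.
ONTOLOGY = {
    "common verb phrases": ["have", "be", "run", "make", "set", "go", "take", "get"],
    "data abstractions": ["part-of", "member-of"],
    "cause": ["cause"],
    "WordNet verb files": [
        "change", "communication", "competition", "consumption", "contact",
        "cognition or perception", "creation", "motion", "possession",
        "social interaction", "static",
    ],
    "business process": ["transaction/exchange/trade", "evaluate or observe"],
    "temporal": ["temporal"],
    "event": ["event"],
    "associated with": ["associated-with"],
}

# Section 3.8 describes this as the default when nothing more specific fits.
DEFAULT_CATEGORY = "associated-with"


def group_of(category):
    """Return the group a category belongs to, or None if it is not listed."""
    for group, members in ONTOLOGY.items():
        if category in members:
            return group
    return None


print(group_of("motion"))
print(group_of("part-of"))
```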

TABLE 4 Interpretations of Common Verbs


TABLE 5 Classification Categories

4 VALIDATION OF CLASSIFICATION SCHEME/ONTOLOGY

To assess the validity of the ontology, dictionary words were first classified. This was followed by an empirical study.

4.1 Dictionary Classifications

More than 350 verbs from online dictionaries were classified by a graduate student in information systems and by the researcher. From these, verbs related to business applications were classified. Table 6 shows sample results. This exercise highlighted the need for additional categories dealing with transactions and events and for associated-with as a general category.

TABLE 6 Classification of Dictionary Verbs

4.2 Empirical Study

An empirical study was conducted to assess whether the classification scheme could serve as a stable, viable starting point for a set of categories designed to classify relationship verb phrases. The study was performed with 31 subjects, each of whom had some general knowledge of information systems, such as that obtained in an introductory MBA course. Three subjects had some exposure to data modeling concepts.

A variety of verb phrases were extracted from arbitrarily selected articles in Business Week and the Wall Street Journal. From these, relationships were generated that either included the verb phrase from the article or included a verb phrase that would be used in a database constructed to capture data on the events of the article. Still more relationships were generated based upon WordNet and the researcher’s examples. In all, 30 relationships were generated for four different application domains. Four relationships contained common verbs (e.g., has, get).

The subjects were shown the examples in Table 5. For each relationship, the application domain and context were given to the study participants. The subjects were able to submit their results using an online form. Examples include:

• Investor buys Stock: application domain (finance); context (investments),
• Recruiter hires Sales-manager: application domain (management); context (employee relations),
• Plant builds Engine: application domain (manufacturing); context (production), and
• Customer opens Account: application domain (retail); context (customer relations).

For each relationship, subjects were asked to indicate one primary classification of the verb phrase. These were considered their “best” answers. If the subjects felt that other classifications were also appropriate, they could indicate their preferences in the space provided for additional classifications (still from the categories given). For example, a subject might classify News-Item triggers Advertisement as “cause” as a first choice and “creation” as a second choice.

4.2.1 Analyses

The data analysis assessed the extent to which the subjects agreed on the classifications. For example, for Company announces Investment, 28 subjects classified the verb phrase as communication; three classified it as event.

4.2.2 Type of Response

Two analyses on type of responses were performed:

• Best answer: Only the subjects’ best choice was considered.
• All answers: All possible classifications produced by the subjects were considered.

Table 7 describes the analysis for the best answer. A similar analysis was carried out when considering all answers.

TABLE 7 Analysis Perspectives

4.2.3 Error Variance Estimation

Multiple classifications were accepted. However, most statistical techniques consider multiple classifications to be a form of error variance. To adjust for error variance inflation, a two-step data analysis process was performed. In the first step, the raw data were analyzed. This produced a result that conflated error variance and the variance associated with multiple classifications. To partial out the variance associated with multiple classifications, a second analysis was performed. This analysis considered classifications in which four or more subjects agreed. Four was chosen as the minimum number of entries needed to support a reasonable classification. (For example, for Company posts Earnings, 27 subjects selected “communication” as an appropriate interpretation, whereas four chose “event,” which might be considered a more general answer.) The error variance from multiple classifications was then added to the raw variance. Sensitivity analyses were performed for classifications in which at least five, six, and seven subjects agreed.

4.2.4 Type of Statistic

Each subject classified 30 relationships. Both the independent and dependent variables are nominal data, so contingency tables were used. First, a 26 (classification questions) by 20 (classifications) contingency table was generated to ascertain the extent to which the subjects agreed on the classifications. The other four questions related to common verb phrases and were analyzed separately.

Two contingency table statistics were measured. The chi-square statistic ($\chi^2$) was employed as the test of statistical significance. Goodman and Kruskal’s lambda ($\lambda$) was employed to measure statistical magnitude. It measures the proportion of explained variance (equivalent of $R^2$ in regression). Lambda produces a score from 0 to 1, where 1 denotes all variance explained (no randomness); 0 signifies complete randomness [26].

The error variance associated with the classification was moderate ($\lambda = 0.658$ for the best (first) answer and 0.567 when considering all answers). When the ability of users to pick multiple choices was adjusted for, the classification scheme explained more of the variance, with the adjusted variance being 0.845 for the best answer and 0.853 for all answers. The analysis was also run considering 5, 6, and 7 as acceptable thresholds of agreement among the subjects. The results are summarized in Figs. 4 and 5. The analysis demonstrates that, when multiple classifications are considered, the classification scheme explains at least 77 percent of the variance for agreement among seven or more subjects and as much as 85 percent of the variance for agreement among four or more. Two relationships contributed considerably to the variance. When these were omitted from the analysis, the explained variance rose to 0.80-0.86 for the best answer and 0.81-0.86 for all answers.

Fig. 4. Results for best answer (first selection).

Approximately 3 percent of all responses were for the default category “associated-with.” These are correct responses, but not as specific as a different selection from the ontology categories. For the four relationships that contained common verb phrases, the subjects agreed in 88.7 percent of the responses, not including the default category.

Overall, the results are encouraging. The classification categories were generally effective in approximately 80-85 percent of the cases. This demonstrates that the ontology facilitates semantically sound verb-phrase mapping to the classification categories while accommodating legitimate variations in usage. It also suggests that it is feasible to develop such an ontology and the one proposed in this research is a reasonable first start.
The testing was carried out on four different application domains with the number of entities in the schemata ranging from 8 to 13 and the number of relationships ranging from 6 to 9.
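For readers who want to see how the lambda statistic in Section 4.2.4 is computed, the sketch below applies the standard Goodman and Kruskal formula for predicting the column variable (classification) from the row variable (relationship): $\lambda = (\sum_i \max_j n_{ij} - \max_j n_{\cdot j}) / (N - \max_j n_{\cdot j})$. It also shows the agreement threshold from Section 4.2.3 as a simple filter. The first two rows use the counts reported in the text; the third row is hypothetical, so the printed value does not reproduce the paper's results. How the paper combined the filtered counts into its adjusted figures is not specified here, and the function names are assumptions.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the column category from the row category."""
    n = sum(sum(row) for row in table)
    column_totals = [sum(col) for col in zip(*table)]
    best_without_rows = max(column_totals)           # guess the modal column
    best_with_rows = sum(max(row) for row in table)  # guess each row's mode
    if n == best_without_rows:
        raise ValueError("lambda is undefined when all responses share one category")
    return (best_with_rows - best_without_rows) / (n - best_without_rows)


def supported(row, categories, min_agree=4):
    """Categories chosen by at least `min_agree` subjects for one relationship."""
    return [c for c, count in zip(categories, row) if count >= min_agree]


CATEGORIES = ["communication", "event", "creation", "change"]
TABLE = [
    [28, 3, 0, 0],  # Company announces Investment (counts from the text)
    [27, 4, 0, 0],  # Company posts Earnings (counts from the text)
    [0, 0, 30, 1],  # Plant builds Engine (hypothetical counts)
]

lam = goodman_kruskal_lambda(TABLE)
print(round(lam, 3))
print(supported(TABLE[0], CATEGORIES))
print(supported(TABLE[1], CATEGORIES))
```

With a threshold of four, “event” counts as a supported second classification for Company posts Earnings but not for Company announces Investment.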

Fig. 5. All answers: agreement classification categories by four, five, six, and seven participants.

5 PROTOTYPE

The ontology is intended to be implemented as part of an interactive tool set for comparing and integrating relationships. This is similar to Dahlgren’s [20] system where the ontology serves as a tool to interactively classify terms. A prototype has been developed with the user most likely to be a database designer. The architecture is shown in Fig. 6 and is comprised of the following:

1. The acquisition module enables the user to provide relationships.
2. The ontology classification module consists of the classification categories from which a user selects an appropriate one.
3. The database contains classified relationships, organized by user, application domain, and context.
4. The inference module compares relationships.

5.1 Acquisition Module

A user enters a relationship to be classified of the form A verb phrase B. There is no restriction on the number of relationships the user can enter or on the domain or context. The user helps to classify the verb phrase through an interactive dialogue that asks the user to select an appropriate interpretation if the system cannot automatically classify it.

5.2 Classification Module

To classify a verb phrase, the system first checks whether the verb phrase matches one of its classification categories (e.g., part-of, change, etc.); if so, the verb phrase is automatically classified. For is-a relationships, the system assumes the relationship is of the form Entity Type 1 verb phrase Entity Type 2. It does not check for instances (tokens), which are often confused with types [21]. For all other verb phrases, the user is first asked to identify the appropriate application domain and context. The user is then given the option of viewing prior classifications (for the same domain and context) with which the user may agree or disagree. There may be more than one prior classification because different users can interpret the same verb phrase differently. If the user disagrees with prior classifications, the user may select a more appropriate classification from the categories. The user is allowed to view prior classifications in an attempt to minimize the work on the part of the user. Ideally, there would be agreement among users. However, this is an explicit recognition of, and attempt to deal with, ambiguities in the English language and the different word senses that verb phrases may have. The option of allowing the user to view prior classifications can be turned off.
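The classification steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the prototype's actual code: the class name `Classifier`, the `choose` callback that stands in for the interactive dialogue, and the normalization of verb phrases are all assumptions, and the category set lists only the category names mentioned in this section.

```python
from dataclasses import dataclass, field

# Assumption: only the category names mentioned in this section; the full
# set of classification categories is defined by the ontology itself.
BUILT_IN_CATEGORIES = {"is-a", "part-of", "change", "associated-with"}


@dataclass
class Classifier:
    # Prior classifications keyed by (verb phrase, domain, context).
    prior: dict = field(default_factory=dict)
    # Viewing prior classifications is an option that can be turned off.
    show_prior: bool = True

    def classify(self, verb_phrase, domain, context, choose):
        """Return a classification category for verb_phrase.

        `choose` is a hypothetical callback standing in for the interactive
        dialogue: it receives the prior classifications shown to the user
        (possibly an empty list) and returns the category the user selects.
        """
        phrase = verb_phrase.strip().lower().replace(" ", "-")
        # Step 1: a verb phrase that matches a category is classified automatically.
        if phrase in BUILT_IN_CATEGORIES:
            return phrase
        # Step 2: otherwise the user decides, optionally after viewing
        # prior classifications for the same domain and context.
        key = (phrase, domain, context)
        shown = list(self.prior.get(key, [])) if self.show_prior else []
        category = choose(shown)
        # Step 3: keep every distinct selection, since users may disagree.
        selections = self.prior.setdefault(key, [])
        if category not in selections:
            selections.append(category)
        return category
```

Keeping a list of selections per key, rather than a single value, reflects the point that different users may legitimately interpret the same verb phrase differently.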


Fig. 6. Prototype system for relationship comparison.

5.3 Database

The relationship ontology system:

1. provides a set of categories into which verb phrases can be classified for semiautomated comparison of relationships,
2. builds up a base of classified verb phrases, organized by domain and context, and
3. facilitates the comparison of relationships for integration purposes.

A classified verb phrase is represented as:

[Database Name, R(A verb phrase B), User, Application Domain, Context, classification].

As mentioned, when classifying a relationship verb phrase, the user can view any previous classifications of the verb phrase for the same domain and context. The user could then select the appropriate classification or identify a more accurate interpretation. Each user’s selection is stored so that, over time, a set of feasible interpretations accumulates. This is intended to minimize the interaction with, and burden on, the user while capturing meaningful classifications of verb phrases. It would also be feasible to allow the user to select more than one interpretation and store each one.
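One way to picture the stored record and the accumulation of interpretations is the following Python sketch. It is an assumption-laden illustration rather than the prototype's schema: the names `ClassifiedVerbPhrase` and `ClassificationStore` are hypothetical, the field names simply mirror the six-part representation given above, and an in-memory dictionary stands in for the database.

```python
from collections import defaultdict
from typing import NamedTuple


class ClassifiedVerbPhrase(NamedTuple):
    """One stored classification; fields mirror the representation in the text."""
    database_name: str
    entity_a: str          # A in R(A verb phrase B)
    verb_phrase: str
    entity_b: str          # B in R(A verb phrase B)
    user: str
    application_domain: str
    context: str
    classification: str


class ClassificationStore:
    """In-memory stand-in (an assumption) for the prototype's database."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, record):
        # Records are grouped by verb phrase, application domain, and context.
        key = (record.verb_phrase, record.application_domain, record.context)
        self._by_key[key].append(record)

    def interpretations(self, verb_phrase, application_domain, context):
        """Distinct classifications stored so far, in first-seen order."""
        records = self._by_key.get((verb_phrase, application_domain, context), [])
        return list(dict.fromkeys(r.classification for r in records))
```

Because each user's selection is kept as its own record, the set returned by `interpretations` grows over time into the "set of feasible interpretations" described above, and a user who selects more than one interpretation would simply contribute more than one record.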

5.4 Inferencing Module

When using the ontology for database integration, entity matches across databases, DB1 and DB2, must first be identified. The comparison of entities could be done in a semiautomatic manner (e.g., using some type of entity ontology). A user would classify the verb phrases from DB1 based upon the ontology and a (possibly different) user would do the same for the verb phrases from DB2. The pairs of relationships can then be compared based upon their classifications to ascertain which are indeed representing the same information and can be integrated. Two comparisons of relationships with the same entities are shown in Table 8. The comparisons take into account whether one or more users selected the same classification for the verb phrase. If so, R1 and R2 are candidates to be compared. If not, they must be different.

TABLE 8 Comparison of Relationships

Sections 5.4.1 and 5.4.2, as well as Tables 9 and 10, describe applications of the research to interdomain and intradomain relationship comparisons. They can be found in the Appendix on the Computer Society Digital Library at http://www.computer.org/tkde/archives.htm.
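The comparison rule for two relationships whose entities have already been matched can be sketched as a simple set intersection. This is a minimal sketch under stated assumptions, not the inference module itself: the function name `compare_relationships` and the return labels are hypothetical, and entity matching is assumed to have been done beforehand.

```python
def compare_relationships(classifications_r1, classifications_r2):
    """Compare two relationships, R1 and R2, that connect matched entities.

    Each argument is the collection of classifications that users selected
    for the relationship's verb phrase. Returns a (verdict, shared) pair.
    """
    shared = set(classifications_r1) & set(classifications_r2)
    if shared:
        # At least one classification in common: R1 and R2 are candidates
        # to be compared further and possibly integrated.
        return ("candidate", sorted(shared))
    # No classification in common: the relationships are treated as different.
    return ("different", [])
```

A "candidate" verdict only means that the pair deserves a closer look; as the conclusion notes, an overall comparison would also need to consider entities and cardinalities.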

6 CONCLUSION

An ontology for classifying the semantics of verb phrases of relationships in an entity-relationship model has been presented. The ontology is based upon prior research on conceptual modeling, linguistics, and relationship classifications. Verb phrases are mapped to the classification categories that best reflect their meanings. Empirical testing suggests that it is feasible to develop such an ontology and that the one proposed is a reasonable start for relationship verb phrase classification and comparison. The ontology has been implemented in a prototype that accepts binary relationships and interactively classifies the relationship verb phrase. The user’s selected classification is stored by application domain and context with no restrictions on the number of relationships or domains. Further work is required to complete the prototype and incorporate it into a database design tool that includes components for comparing entities and cardinalities so that an overall comparison of relationships can be made. Other work dealing with inconsistent classifications and associated problems needs to be examined (e.g., Bayesian analysis) and more advanced inferencing techniques developed. The research could also be extended to include nonbinary relationships. Then, the system’s database of classified relationships should be built up over time from one application to the next and with a large number of users. The usefulness of the system should be further assessed. Eventually, a set of real-world examples for database design should emerge.

ACKNOWLEDGMENTS

This research was supported by Georgia State University. The author thanks Cecil Chua and Yi Ding for work on the implementation, as well as the many subjects who participated in the study. Special thanks to Niki Fowler for her help in the preparation of this manuscript and to Jerry Kane and Brydan Rogers for their assistance. Many thanks to the editor, associate editor, Mohammed J. Zaki, and the reviewers for most helpful comments on previous versions of this manuscript.

REFERENCES

[1] G. Arango, Domain Analysis Methods. Software Reusability, Ellis Horwood, 1994.
[2] M. Bergholtz and P. Johannesson, “Classifying the Semantics of Relationships in Conceptual Modelling by Categorization of Roles,” Proc. Sixth Int’l Workshop Applications of Natural Language to Information Systems (NLDB ’01), pp. 28-29, June 2001.
[3] T. Berners-Lee, J. Hendler, and O. Lassila, “The Semantic Web,” Scientific Am., vol. 284, no. 5, pp. 34-43, May 2001.
[4] C. Bettini and A. Montanari, “Temporal Representation and Reasoning,” Data and Knowledge Eng., vol. 44, no. 2, pp. 139-264, Feb. 2003.
[5] J. Biskup and D.W. Embley, “Extracting Information from Heterogeneous Information Sources Using Ontologically Specified Target Terms,” Information Systems, vol. 28, no. 3, 2003.
[6] F. Bodart, A. Pate, M. Sim, and R. Weber, “Should Optional Properties Be Used in Conceptual Modelling? A Theory and Three Empirical Tests,” Information Systems Research, vol. 12, no. 4, pp. 384-405, 2002.
[7] R.J. Brachman, “What IS-A Is and Isn’t: An Analysis of Taxonomic Links in Semantic Networks,” Computer, Oct. 1983.
[8] M. Brodie, “On the Development of Data Models,” On Conceptual Modeling, M.L. Brodie, J. Mylopoulos, and J.W. Schmidt, eds., pp. 19-47, New York: Springer-Verlag, 1984.
[9] M. Brodie, “Association: A Database Abstraction,” Proc. Entity-Relationship Conf., 1981.
[10] J.F.M. Burg and R.P. van de Riet, “COLOR-X: Using Knowledge from WordNet for Conceptual Modeling,” WordNet: An Electronic Reference System and Some of Its Applications, C. Fellbaum, ed., pp. 353-377, Cambridge, Mass.: MIT Press, 1998.
[11] Comm. ACM, special issue on ontology, M. Gruninger and J. Lee, eds., vol. 45, no. 2, pp. 39-65, Feb. 2002.
[12] R. Chaffin, D.J. Hermann, and M. Winston, “An Empirical Taxonomy of Part-Whole Relation Identification,” Language and Cognitive Processes, vol. 3, no. 1, pp. 17-48, 1998.
[13] P. Chen, “Entity-Relationship Modeling: Historical Events, Future Trends, and Lessons Learned,” Software Pioneers: Contributions to Software Eng., M. Broy and E. Denert, eds., pp. 100-114, Berlin: Springer-Verlag, June 2002.
[14] P. Chen, “The Entity-Relationship Approach,” Information Technology in Action: Trends and Perspectives, pp. 13-36, Englewood Cliffs, N.J.: Prentice Hall, 1993.
[15] P. Chen, “English, Chinese, and ER Diagrams,” Data and Knowledge Eng., vol. 23, no. 1, pp. 6-16, June 1997.
[16] H. Cottam, “Ontologies to Assist Process Oriented Knowledge Acquisition,” http://www.spede.co.uk/papers/papers.htm, 2000.
[17] H. Cottam, N. Milton, and N. Shadbolt, “The Use of Ontologies in a Decision Support System for Business Process Re-Engineering,” http://www.psychology.nottingham.a...search/ai/themes/ka/UseofOnto.html, 2000.
[18] D.A. Cruse, “On the Transitivity of Part-Whole Relation,” J. Linguistics, vol. 15, pp. 29-38, 1986.
[19] K. Dahlgren, “A Linguistic Ontology,” Int’l J. Human-Computer Studies, vol. 43, pp. 809-818, 1995.
[20] K. Dahlgren, Naive Semantics for Natural Language Understanding. Hingham, Mass.: Kluwer Academic, 1988.
[21] J. Davis and R.D. Bonnell, “A Framework for Constructing Visual Knowledge Specifications in Acquiring Organizational Knowledge,” Knowledge Acquisition: An Int’l J., vol. 3, no. 1, pp. 79-113, Mar. 1991.
[22] J.P. Davis and R.D. Bonnell, “Modeling Semantics with Concept Abstractions in the EARL Data Model,” Proc. Eighth Int’l Conf. Entity-Relationship Approach, pp. 107-117, 1998.
[23] D. Dey, V.C. Storey, and T.M. Barron, “Improving Database Design through the Analysis of Relationships,” ACM Trans. Database Systems, vol. 24, no. 4, pp. 453-486, Dec. 1999.
[24] J. Dullea and I.-Y. Song, “A Taxonomy of Recursive Relationships and Their Structural Validity in ER Modeling,” Conceptual Modeling—ER ’99, Proc. 18th Int’l Conf. Conceptual Modeling, J. Akoka, M. Bouzeghoub, I. Comyn-Wattiau, and E. Metais, eds., pp. 384-389, 1999.
[25] D. Embley, D.M. Campbell, Y.S. Jiang, Y.K. Ng, R.D. Smith, S.W. Liddle, and D.W. Quass, “A Conceptual-Modeling Approach to Web Data Extraction,” Data and Knowledge Eng., 1999.
[26] B.S. Everitt, The Analysis of Contingency Tables. Chapman and Hall, 1977.
[27] C. Fellbaum, “Introduction,” WordNet: An Electronic Lexical Database, pp. 1-19, Cambridge, Mass.: The MIT Press, 1998.
[28] R.C. Goldstein and V.C. Storey, “Data Abstractions: ‘Why and How,’” Data and Knowledge Eng., vol. 29, no. 3, pp. 1-18, 1999.
[29] L.C. Gray and R.C. Bonnell, “A Comprehensive Conceptual Analysis Using ER and Conceptual Graphs,” J. Experimental and Theoretical Artificial Intelligence, vol. 4, pp. 95-106, 1992.
[30] T.R. Gruber, “A Translation Approach to Portable Ontology Specifications,” Knowledge Acquisition, vol. 5, pp. 199-220, 1993.
[31] M. Gruninger and J. Lee, “Ontology Applications and Design,” Comm. ACM, vol. 45, no. 2, pp. 39-41, Feb. 2002.
[32] “Information Integration,” IEEE Intelligent Systems, M.A. Hearst, ed., pp. 12-24, Sept./Oct. 1998.
[33] J. Hendler, “Agents and the Semantic Web,” IEEE Intelligent Systems, pp. 30-36, Mar./Apr. 2001.

[34] IEEE Intelligent Systems, special issue on the semantic Web, pp. 32-79, Mar./Apr. 2001.
[35] Z. Kedad and E. Metais, “Dealing with Semantic Heterogeneity during Data Integration,” Conceptual Modeling—ER ’99, Proc. 18th Int’l Conf. Conceptual Modeling, J. Akoka, M. Bouzeghoub, I. Comyn-Wattiau, and E. Metais, eds., pp. 325-339, Nov. 1999.
[36] P.W. Kuczorz and S.J. Cosby, “Implementation of Meronymic (Part-Whole) Inheritance for Semantic Networks,” Knowledge-Based Systems, vol. 2, no. 4, pp. 219-227, Butterworth and Co. Ltd., 1989.
[37] T.Y. Landis, D.J. Harrmann, and R. Charrin, “Development Differences in the Comprehension of Semantic Relations,” Z. Psychologie, vol. 195, no. 2, pp. 129-139, 1987.
[38] E. Lim and R. Chiang, “Accommodating Instance Heterogeneities in Database Integration,” Decision Support Systems, vol. 38, pp. 213-231, 2004.
[39] G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K.J. Miller, “Introduction to WordNet: An On-Line Lexical Database,” Int’l J. Lexicography, vol. 3, no. 4, pp. 235-244, 1990.
[40] R. Motschnig-Pitrik, “A Generic Framework for the Modeling of Contexts and Its Applications,” Data and Knowledge Eng., vol. 32, pp. 145-180, 2000.
[41] R. Motschnig-Pitrik and J. Mylopoulos, “Class and Instances,” Int’l J. Intelligent and Cooperative Systems, vol. 1, no. 1, pp. 61-92, 1992.
[42] R. Motschnig-Pitrik and V.C. Storey, “Modelling of Set Membership: The Notion and the Issues,” Data and Knowledge Eng., vol. 16, pp. 147-185, 1995.
[43] K. Mahalingam and M.N. Huhns, “A Tool for Organizing Web Information,” Computer, pp. 80-83, June 1997.
[44] J. Mylopoulos, “Information Modeling in the Time of the Revolution,” Information Systems, vol. 23, pp. 127-155, 1998.
[45] S.A. Noah and M.D. Williams, “Knowledge-Based Approaches to Database Design Diagnosis; Improving Performance with a Domain Specific Thesaurus Structure,” Proc. 2002 IASTED Int’l Conf. Artificial and Computational Intelligence, pp. 366-371, 2002.
[46] N.F. Noy and C.D. Hafner, “The State of the Art in Ontology Design: A Survey and Comparative Review,” AI Magazine, vol. 18, no. 3, pp. 53-74, 1997.
[47] D.E. O’Leary, “Impediments in the Use of Explicit Ontologies for KBS Development,” Int’l J. Human-Computer Studies, vol. 46, pp. 327-337, 1997.
[48] K. O’Hara, N.R. Shadbolt, and G. Van Heust, “Generalised Directive Models: Integrating Model Development and Knowledge Acquisition,” Int’l J. Human-Computer Studies, vol. 49, pp. 497-522, 1998.
[49] C. Parent and A. Spaccapietra, “Database Integration: An Overview of Approaches and Issues,” Comm. ACM, vol. 41, no. 5, pp. 166-178, May 1998.
[50] J. Parsons, “Effects of Local Versus Global Schema Diagrams on Verification and Communication in Conceptual Data Modeling,” J. Management Information Systems, vol. 19, no. 3, pp. 155-183, Winter 2002-2003.
[51] J. Parsons and Y. Wand, “Emancipating Instances from the Tyranny of Classes in Information Modeling,” ACM Trans. Database Systems, vol. 25, no. 2, pp. 228-268, June 2000.
[52] J.R. Searle, “A Taxonomy of Illocutionary Speech Acts,” Expressions and Meaning: Studies in the Theory of Speech Acts, pp. 1-29, New York: Cambridge Press, 1979.
[53] G. Shanks, E. Tansley, J. Nurelini, D. Toblin, and R. Weber, “Representing Part-Whole Relationships in Conceptual Modeling: An Empirical Evaluation,” Proc. Int’l Conf. Information Systems, Dec. 2002.
[54] K. Siau, Y. Wand, and I. Benbasat, “The Relative Importance of Structural Constraints and Surface Semantics in Information Modeling,” Information Systems, vol. 22, no. 2-3, pp. 155-170, 1997.
[55] J.F. Sowa, Conceptual Structures: Information Processing in Mind and Machine. Reading, Mass.: Addison Wesley, 1984.
[56] V.C. Storey, “Understanding Semantic Relationships,” Very Large Data Bases (VLDB) J., vol. 2, no. 4, pp. 455-488, Oct. 1993.
[57] V.C. Storey, R. Chiang, D. Dey, R.C. Goldstein, and S. Sundaresan, “Common Sense Reasoning and Learning for Database Design Systems,” ACM Trans. Data Base Systems, vol. 22, no. 4, Dec. 1997.
[58] V.C. Storey and D. Dey, “A Methodology for Learning Across Application Domains for Database Design Systems,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 1, pp. 13-28, Jan./Feb. 2002.
[59] V.C. Storey, D. Dey, H. Ullrich, and S. Sundaresan, “An Ontology-Based Expert System for Database Design,” Data and Knowledge Eng., vol. 28, no. 1, pp. 31-46, 1998.
[60] W. Swartout, “Ontologies,” IEEE Intelligent Systems, pp. 18-19, Jan./Feb. 1999.
[61] T.L. Teorey, D. Yang, and J.P. Fry, “A Logical Design Methodology for Relational Databases Using the Extended Entity-Relationship Approach,” ACM Computing Surveys, vol. 18, no. 2, pp. 197-222, 1986.
[62] J.D. Ullman and J. Widom, A First Course in Database Systems. Prentice Hall, 2002.
[63] H. Ullrich, S. Purao, and V.C. Storey, “An Ontology for Classifying the Semantics of Relationships in Database Design,” Proc. Fifth Int’l Conf. Applications of Natural Language to Information Systems (NLDB ’00), June 2000.
[64] Y. Wand, V.C. Storey, and R. Weber, “Analyzing the Meaning of a Relationship,” ACM Trans. Database Systems, vol. 24, no. 4, pp. 494-528, Dec. 1999.
[65] R. Weber, “Conceptual Modelling and Ontology: Possibilities and Pitfalls,” Proc. 21st Int’l Conf. Conceptual Modeling (ER), S. Spaccapietra, S.T. March, and Y. Kambayashi, eds., pp. 1-2, 2002.
[66] R. Weber, “Ontological Issues in Accounting Information Systems,” Researching Accounting as an Information Systems Discipline, S. Sutton and V. Arnold, eds., Sarasota, Fla.: Am. Accounting Assoc., 2002.
[67] R. Weber, Ontological Foundations of Information Systems. Melbourne: Coopers & Lybrand, 1997.
[68] R. Weber, “Are Attributes Entities? A Study of Database Designers’ Memory Structures,” Information Systems Research, vol. 7, no. 2, pp. 137-162, June 1996.
[69] C. Welty and N. Guarino, “Supporting Ontological Analysis of Taxonomic Relationships,” Data and Knowledge Eng., A.H.F. Laender and V.C. Storey, eds., special issue on ER 2000, vol. 39, no. 1, pp. 51-74, 2001.
[70] M. Winslett, “Peter Chen Speaks Out,” SIGMOD Record, vol. 33, no. 1, 2004.
[71] M.E. Winston, R. Chaffin, and D. Hermann, “A Taxonomy of Part-Whole Relations,” Cognitive Science, vol. 11, pp. 417-444, 1987.

Veda C. Storey received the BS degree from Mt. Allison University, New Brunswick, Canada, and the Master of Business Administration degree from Queen’s University, Ontario, Canada. In addition, she received the Associate of the Royal Conservatory of Music for flute performance from the University of Toronto, Canada. She is the Tull Professor of Computer Information Systems and Computer Science, Georgia State University, and has research interests in ontologies, intelligent systems, and real-world knowledge. Her research has been published in the ACM Transactions on Database Systems, IEEE Transactions on Knowledge and Data Engineering, Information Systems Research, and MIS Quarterly. She has served on the editorial boards of a number of journals including Management Science, Information Systems Research, MIS Quarterly, and Data and Knowledge Engineering.
