Reverse Engineering Relational Schemas to

Reverse Engineering Relational Schemas to Object-Oriented Schemas

Shekar Ramanathan Julia Hodges

Technical Report No. MSU-960701 July 1, 1996 Department of Computer Science Mississippi State University Box 9637 Mississippi State, MS 39762-9637 [email protected]

Reverse Engineering Relational Schemas to Object-Oriented Schemas

Shekar Ramanathan Julia Hodges

Abstract Due to the wide use of object-oriented technology in software development, reverse engineering of relational schemas to object-oriented schemas is gaining a lot of interest. One of the major problems with existing approaches for this schema mapping is the extensive amount of information that must be gathered either automatically or from the user. This paper presents an object-centered approach (as opposed to other relation-centered approaches) for doing the schema mapping. The procedure maps a 3NF relational schema into an object-oriented schema without the explicit use of inclusion dependencies and provides a greater scope for automation. Introduction Object-oriented technology is being widely used in software development these days. The object-oriented data model is growing in popularity in the area of database development (Catell 1991). Consequently, there is also a growing interest in re-engineering relational databases to object-oriented databases. Reverse engineering of relational databases is concerned with the task of extracting structures corresponding to a particular conceptual model such as the entityrelationship (ER) model, the extended entity-relationship (EER) model, the object-oriented (OO) model, and so on (Fahrner 1995b). The motivation for carrying out such a reverse engineering could be to gain an understanding of the relational schema, to migrate the database from a relational model to some other model, to document the structure of the relational schema, and so on (Premerlani and Blaha 1994). One of the problems with the existing approaches for reverse engineering a relational database is the amount of information that has to be gathered for carrying out the mapping process. Different methods require different amounts of information for schema mapping. This information may be derived either automatically or obtained from the user. Inclusion dependencies, functional dependencies, and attribute information such as homonyms and synonyms are examples of such information. A majority of the techniques use a relation-centered approach. That is, they start with 1

the relational schema and all the available information (again, either derived automatically or obtained from the user) and then examine the relational structure to identify a corresponding object structure. On the other hand, in an object-centered approach, the focus is directly on the goal -- the object-oriented constructs of interest. Once we know what constructs to extract, we can proceed with the process on a case-by-case basis. The advantage of this approach is that only the information that is required is made available to identify a particular object construct (say, a class). The major contributions of this paper to the research in reverse engineering of relational schemas to object-oriented schemas are: •

an object-centered approach as opposed to a relation-centered approach to schema mapping

•

heuristics to identify inheritance structures and cardinality constraints without the explicit use of inclusion dependencies.

•

a method that provides greater potential for automation, especially for modern RDBMS

•

certain heuristics to cope with database optimizations involving the use of BLOBs (binary large objects), horizontal and vertical partitioning

The rest of this paper is organized as follows. First there is an outline of the major steps for carrying out schema mapping. Following that, there is a description of a relational schema that is used in this paper as a running example. The main schema mapping process is then described. That is followed by a discussion of a few heuristics to cope with certain optimization structures. Finally there is a brief review of the literature in related areas and the conclusions of this paper.

The Overall Process Formally, a schema mapping of a source schema S1M1 defined on a data model M1 to a target schema S2 M 2 defined on a model M2 may be defined as a transformation T such that T(S1M1) = S2M2. The transformations are usually a collection of rules that map concepts from the source data model to the target data model. One schema mapping that relational database designers often use is the mapping from an entity-relationship (ER) schema to a relational schema (Elmasri

2

and Navathe 1992). Another example of a schema mapping technique that is gaining a lot of interest is from the object-oriented model to the relational model (Blaha, Premerlani, and Rumbaugh 1988; Premerlani et al. 1990). The first task of mapping from a relational to an object-oriented schema is to identify what object-oriented constructs are we interested in extracting from the relational schema. Once they are specified, then the actual procedure for carrying out the mapping can be specified to target only the constructs of interest.

Main Goals 1. Identifying Object Classes. Those relations that correspond to object-classes should be identified. 2. Identifying Relationships. There are mainly three types of relationships that can be represented in an object model. They are the associations, generalizations/specializations, and aggregations (Rumbaugh et al. 1991). Identifying each of these constructs constitutes a step in the mapping process. a. Identifying associations. Since the object model allows associations themselves to be modeled as classes, we need to either establish just a connection between two object classes or identify relationships that correspond to association classes. b. Identifying inheritance. Inheritance structures capture the generalization/specialization relationships between object classes that have been identified so far. c. Identifying aggregation. The aggregation relationship models the composition of one object with other objects. Such complex objects have to be identified. 3. Establishing Cardinalities. Establishing the cardinalities of associations is important in order to facilitate the implementation of the object schema. The different possible cardinalities are one-one, one-many, and many-many.

Assumption The procedure outlined below makes an assumption that the source relational schema is at least in third normal form (3NF). A relational schema is in 3NF if in each relation, there are no transitive dependencies and every non-key attribute is fully functionally dependent on the key of the relation (Elmasri and Navathe 1992). Even though the assumption of 3NF is a restriction on the proposed method, most of the production databases are highly likely to satisfy the 3NF

3

requirement. There are only two likely reasons for a database not to be in 3NF: (1) due to performance reasons, or (2) poor database design. If it is the latter, then the database has to be normalized in any case to improve its design. On the other hand, the developers of optimized databases must be willing to accept the loss of some object-oriented abstraction caused by mapping a non-3NF schema.

An Example Schema Table 1 contains the relational schema that the mapping procedure uses for illustration of the steps. In the table, PK stands for primary key and FK stands for foreign-key. The name in parentheses following the name of the foreign key refers to the table of which it is a primary key. Note that in many modern RDBMSs, all this information can be obtained automatically from the catalog of the relational database.

Table 1. An Example Relational Schema. Employee Department Project Works-On Hardware-Project Software-Project Supplier Dependent

SSN, Name, Age, Sex, Dept# PK = {SSN} FK = {Dept#(Department)} Dept#, Name, Location PK = {Dept#} FK = None Proj#, Name, Leader-SSN, Budget, Target-Date PK = {Proj#} FK = {Leader-SSN(Employee)} SSN, Proj#, Start-Date PK = {SSN, Proj#} FK = {SSN(Employee)}, {Proj#(Project)} Proj#, Supplier# PK = {Proj#} FK = {Proj#(Project)}, {Supplier#(Supplier)} Proj#, Language, Lines-of-code PK = {Proj#} FK = {Proj#(Project)} Supplier#, Name, Phone, Address PK = {Supplier#} FK = None SSN, Dependent-Name, Dependent-Age PK = {SSN, Dependent-Name} FK = {SSN(Employee)}

4

The Mapping Procedure Almost all of the following steps require just the information about the primary and foreign keys of the relations. Where the processing requires additional information, it is noted explicitly in the corresponding step under a separate heading. Each step of the mapping procedure is described below in three parts. The first part specifies what additional information is required to carry out the step, and the second part describes the procedure itself. The two parts are separated since the way the required information for a step is obtained is orthogonal to the actual processing in the step itself. The third part illustrates the step with the help of an example. In the following discussion, two attributes are considered the "same" if they have the same domain. For example, "SSN" and "Leader-SSN" are considered the same since they have the same domain (namely, the set of all social security numbers). Such similarity of the attributes can be determined based on the primary keys to which the foreign keys refer.

Identify Object Classes •

Additional Required Information: None

•

Process: A relation can qualify to be mapped to an object class in two cases. First, every relation that has only one attribute in the primary key corresponds to an object class. Second, if a relation's key has more than one attribute and at least one of them is not a foreign key, then the relation corresponds to an object class. In either case, the attributes of the relation become the attributes of the object class. The corresponding relations are called class-relations.

•

Example: The relations Employee, Department, Project, Hardware-Project, SoftwareProject, and Supplier qualify to be object classes since their respective primary keys have only one attribute. The relation Dependent also qualifies to be an object class since one of the attributes of its primary key, Dependent-Name, is not a foreign key.

Identify Relationships As explained earlier, three kinds of relationships can be can be expressed in the object model: association, inheritance, and aggregation.

5

Identify Associations •


•

Process: Associations between object classes can be identified based on two cases. First, every relation whose primary key is entirely composed of foreign keys is a class that represents an association between all the classes corresponding to the foreign keys. The attributes of the relation become the attributes of the association class. Call the corresponding relation as class-relation. Second, for every class-relation that is not an association class, establish an association between this class and the class corresponding to each foreign key, unless it also happens to be a part of the primary key.

•

Example: The primary key of Works-On is entirely composed of foreign keys. Therefore, it qualifies to be an association class. Based on the presence of foreign keys in the relations, an association relationship can be established between Employee and Department; Department and Employee; Project and Employee; and Hardware-Project and Supplier. Notice that an association is not established between the class-relations Software-Project and Project even though Proj# appears as a foreign key of Software-Project.

Identify Inheritance •


•

Process: Every pair of class-relations (CR1,CR2) that have the same primary key may be involved in an inheritance relationship. The inheritance relationship CR1 "is-a" CR2 exists if the primary key of CR1 is also a foreign key and that refers to the class-relation CR2.

•

Example: The class-relation Hardware-Project "is-a" Project since both of them have the same primary key (Proj#) and, in addition, Hardware-Project.Proj# is a foreign key that references Project.Proj#. On the other hand, notice that even though the class-relations Hardware-Project and Software-Project have the same primary key, in neither of the cases do the foreign keys refer to each other's primary key. Hence, there is no inheritance relationship between the two.

Identify Aggregation •


•

Process: Consider a class-relation CR1 whose primary key has more than one attribute and at least one of them is not a foreign key. Let CR2 be the class-relation corresponding to the largest subset of foreign keys in the primary key of CR1. If such a class-relation exists, then CR1 "is-part-of" CR2 . Notice that this definition recognizes aggregations of association classes, too.

•

Example: The class-relation Dependent's primary key has two attributes and one of them (Dependent-Name) is not a foreign key. The largest subset of foreign keys in the primary

6

key is SSN and it references the class-relation Employee. Therefore, class-relation Dependent "is-part-of" Employee.

Identify Cardinalities Identifying cardinalities of associations is necessary in order to facilitate the implementation of the object schema in a programming language. Note that each many-many association has already been accounted for when association classes were identified in an earlier step. •

Additional Required Information: Whether a foreign key attribute is nullable or not.

•

Process: Consider an association A(CR1, CR2) between two class-relations CR1 and CR2 (identified earlier). If the primary key of CR1 is a foreign key in CR2, then every instance of CR2 is associated with at most one instance of CR1 if the foreign key is nullable, and with exactly one instance of CR1 if the foreign is not nullable. Finally, if in an association A(CR1 , CR2 ), class-relation CR2 contains a foreign key corresponding to CR1 but CR1 does not contain a foreign key corresponding to CR2, then more than one instance of CR2 may be associated with each instance of CR1.

•

Example: The nullability of the foreign key can be obtained either from the catalog of the database or from the user. Assuming that Employee.Dept# is not nullable, we can conclude that every instance of the class-relation Employee is associated with exactly one instance of the class-relation Department. Since class-relation Department does not contain any foreign key corresponding to Employee, we can conclude that more than one instance of Employee may be associated with each instance of Department. A similar reasoning can be used to explain the cardinalities of other associations, too. Notice that a many-many association between Employee and Project was already identified based on the discovery of an association class.

The final object-schema following the schema mapping is shown in Figure 1. The notation of Object Modeling Technique (OMT) is used in the figure (Rumbaugh et al. 1991). Rectangular boxes represent object classes. Lines connecting the object classes represent associations. A filled circle at the end of an association indicates the "zero or more" cardinality of the association. Aggregation is represented using a little diamond-shaped box near the end of the aggregate object. Inheritance is represented by a triangle connecting the super-class and all the sub-classes. Notice that in none of the steps outlined above is user intervention required for actually carrying out a particular step. All the information required for the step comes mostly from the 7

information on primary keys and foreign keys of relations. The method thus provides a greater potential for automation of the process. A user may be involved in this mapping process at certain points to confirm a particular construct that the mapping procedure identified (e.g., inheritance), or provide naming assistance. For example, in the object schema in Figure 1, the mapping procedure did not give any names for the associations. The user will have to do that.

Department Dept# Name Location

Dependent SSN Dependent-Name Dependent-Age

Works-On SSN Proj# Start-Date

Employee SSN Name Age Sex Dept#

Project Proj# Name Leader-SSN Budget Target-Date

Software-Project Proj# Language Lines-of-code

Hardware-Project Proj# Supplier#

Supplier Supplier# Name Phone Address

Figure 1. The Mapped Object Schema.

8

Coping with Optimizations The presented schema mapping process handles all typical 3NF relational schemas properly. However, the procedure may not produce the intended results if the database employs peculiar design constructs, perhaps, due to optimization concerns. Horizontal partitioning of a relation is the splitting of a single logical relation into two tables having exactly the same attributes. The presented method tends to treat these two tables distinctly. The problem can be easily solved by carrying out a simple preprocessing of the relational schema where duplicate relations that have the exactly same schema are removed from consideration in the mapping process. A more difficult problem is the case of vertical partitioning. Vertical partitioning takes place when the attributes (columns) of a single logical relation are split across more than one table. Recognizing vertical partitioning requires the use of inclusion dependencies. If two relations R1 and R2 have the same primary key and the inclusion dependency R1[PK] = R2[PK] holds, then R1 and R2 vertically partition a relation whose attributes are the union of the attributes of R1 and R 2. Another type of optimization structure in the database schema relates to the storing of multivalued attributes (e.g., arrays). Many RDBMS provide BLOB (binary large objects) data type. Multi-valued attributes are usually represented by packing them into BLOB attributes. The applications retrieve the BLOB value and unpack it before using it. Such use of BLOB values corresponds to the aggregation construct in the object-oriented model. However, the database user must be consulted before making such an interpretation because a BLOB could be used for just about any type of data (e.g., multimedia data such as image, audio, etc.).

Related Work Fahrner and Vossen (1995b) have written a survey article on the database transformations that are based on the entity-relationship (ER) model. A majority of the work on reverse engineering of relational databases to conceptual models has concentrated on extracting ER and extended ER (EER) models (Andersson 1994; Markowitz and Makowsky 1990; Jeusfeld and

9

Johnen 1994; Johannesson and Kalman 1989; Hainaut et al. 1993). Chiang, Barron, and Storey (1994) provide a comprehensive method for extracting an EER schema from a 3NF relational schema. Their method uses inclusion dependencies in the process. They also propose heuristics for automatic extraction of inclusion dependencies thus increasing the potential for automation. Fahrner and Vossen (1995a) describe a method for transforming relational database schemas to object schemas that satisfy ODMG-93. They attempt to deal with non-3NF databases by incorporating normalization as one of the intermediate steps in the process. They make extensive use of inclusion and exclusion dependencies, too. Johannesson and Kalman (1989) first classify relations into primary, secondary, or ternary relations and then provide alternative interpretations of inclusion dependencies based on the interaction between the keys and inclusion dependencies. Premerlani and Blaha (1994) try to address the problems associated with reverse engineering of relational databases that are optimized or not well designed (1994). They use information from various sources such as user input (for semantic interpretation), analysis of the schema, and analysis of the database instance. The Penguin system (Keller 1994; Keller and Hamon 1994) uses a semantic model called the Structural Model as an intermediate model for allowing integration of multiple heterogeneous views on the underlying data. The Structural Model is a network of nodes where each node represents a relation and each edge represents a link (relationship). Although the Structural Model provides a middle ground for mapping from existing relational schemas to object schemas, the current work on Penguin has concentrated on using relational databases to add persistence to existing object-oriented applications. A study of many of these methods shows that different methods have different strengths and different focus. One common observation is that the user's intervention is required at some point for successfully carrying out a complete schema mapping process.

10

Conclusions This paper presented an object-centered method for mapping a 3NF relational schema to an object-oriented schema. A lot of the schema mapping approaches described in literature make extensive use of inclusion dependencies. Automatic extraction of inclusion dependencies may not be practical because it often involves examining the database extent. On the other hand, the mapping process described in the paper does not make explicit use of inclusion dependencies. Instead, it carries out the mapping process using the information from the primary and foreign keys of the relations. In most RDBMSs, the information about the keys can be extracted automatically from the database catalogue. User's input is required mainly to confirm certain extracted objectoriented constructs (e.g., inheritance) and to give names to discovered associations. The proposed method, thus, could be easily incorporated into a simple interactive schema mapping tool. It is however true that inclusion dependencies are not totally dispensable. They are needed when the relational schema contains certain unusual constructs and optimization structures (e.g., vertical partitioning). Assuming that such scenarios occur less frequently, this paper fills in the need for a method that uses relatively less information and provides a greater scope for automation for reverse engineering conventionally designed 3NF relational databases.

11

References Andersson, Martin. 1994. Extracting an entity-relationship schema from a relational database through reverse engineering. In Proceedings of the 13th. Intl. Conference on ER Approach, Pericles Loucopoulos, editor, 403-419, Manchester, UK, December 1994. Blaha, Michael R., William J. Premerlani, James E. Rumbaugh. 1988 Relational database design using an object-oriented methodology. Communications of the ACM 31(4):414-427. Catell, Rick. 1991. What are next generation database systems? Communications of the ACM 34(10):31-33. Chiang, Roger H.L., Terence M. Barron, and Veda C. Storey. 1994. Reverse engineering of relational databases: Extraction of an EER model from a relational database. Data and Knowledge Engineering (12):107-142. Elmasri, Ramez, and Shamkant Navathe. 1992. Fundamentals of database systems. Redwood City, CA: The Benjamin/Cummings Publishing Company, Inc.. Fahrner, Christian, and Gottfried Vossen. 1995. A survey of database design transformations based on the entity-relationship model. Data and Knowledge Engineering (15): 213-250. Fahrner, Christian, and Gottfried Vossen. 1995. Transforming relational database schemas into object oriented schemas according to ODMG-93. In Proceedings of the 4th International Conference on Deductive and Object-Oriented Databases, National University of Singapore. Hainaut, J-L., C. Tonneau, M. Joris, and M. Chandelon. 1993. Schema transformation techniques for database reverse engineering. In Proc. 12th Intl. Conf. on the EntityRelationship Approach, Ramez Elmasri, Vram Kouramajian, and Bernhard Thalheim, editors, Texas, USA. Jeusfeld, Manfred A., and Uwe A. Johnen. 1994. An executable meta model for re-engineering of database schemas. Technical Report 94-19, Technical University of Aachen. Johannesson, Paul, and Katalin Kalman. 1989. A method for translating relational schemas into conceptual schemas. In Proc. 8th Intl. Conf. on the Entity-Relationship Approach, Frederick H. Lochovsky, editor, Toronto, Canada.. Keller, Arthur M. 1994. Penguin: Objects for programs, relations for persistence. Submitted for publication, April 1994. Keller, Arthur M., and Catherine Hamon. 1994. A C++ binding for Penguin: A system for data sharing among heterogeneous object models. In Proceedings of the 4th International Conference on Foundations on Data Organization and Algorithms, David B. Lomet, editor, Chicago, Illinois. Markowitz, Victor M., and Johann A. Makowsky. 1990. Identifying extended entity-relationship object structures in relational schemas. IEEE Transaction on Software Engineering 14(8):777-790.

12

Premerlani, William J., Michael R. Blaha, James E. Rumbaugh, and Thomas A. Varwig. 1990. An object-oriented relational database. Communications of the ACM, 33(11):99-108, November 1990. Premerlani, William J., and Michael R. Blaha. An approach for reverse engineering of relational databases. Communications of the ACM 37(5):42-49,134 Rumbaugh, James E., Michael R. Blaha, William J. Premerlani, Frederick Eddy, and William Lorensen. 1991. Object-oriented modeling and design. Englewood Cliffs, NJ: PrenticeHall, Inc.

13