Requirements Eng (2004) 9: 217–228 DOI 10.1007/s00766-004-0201-9

ORIGINAL ARTICLE

Peretz Shoval · Revital Danoch · Mira Balabam

Hierarchical entity-relationship diagrams: the model, method of creation and experimental evaluation

Received: 24 April 2003 / Accepted: 25 May 2004 / Published online: 5 October 2004 © Springer-Verlag London Limited 2004

Abstract A bottom-up method for creating a hierarchy of entity-relationship diagrams (HERD) from a given “flat” ER diagram (ERD) is proposed. The hierarchy consists of simple and interrelated diagrams (ER structures) with external relationships to other structures. The HERD-tree diagram, which provides the most general view of the conceptual schema, is located at the top of the hierarchy. The method is based on packaging operations, which group entities and relationships according to certain criteria. These operations are applied in several steps on a given (presumably large-scale) ERD. We describe the new constructs, which are added to the ER model to enable the creation of HERD, and a bottom-up method for creating HERD. We also evaluate HERD from the point of view of user comprehension and preference, based on an experimental comparison with flat ERDs.

Keywords Conceptual schema · Data modeling · Entity relationship model · ER diagram · Experimental evaluation · User comprehension of data model

P. Shoval (✉) · R. Danoch · M. Balabam
Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva, 84105 Israel
E-mail: [email protected]

1 Introduction

Conceptual modeling is an important stage in designing a successful database application. The concepts in a data model are usually represented diagrammatically. A conceptual schema diagram must be powerful in its semantic expressiveness and easily comprehensible, as it serves as a communication medium between professional designers and users (including managers) who interact during the stage of requirements analysis and modeling, to validate the design [19, 22]. Once approved by users (as a proper representation of reality), the
conceptual schema is converted into a specific database schema, depending on the data model and the DBMS that is used for implementation [20]. The major problem, however, is to create a good conceptual schema that is semantically correct, complete, easy to use, and comprehensible.

The entity-relationship (ER) model [5] is one of the most widely used conceptual data models. An ER diagram (ERD) models the data structure of a reality in terms of entities, relationships, and attributes. However, in the case of a large-scale application, the ERD can be too complex and difficult to manage, especially for end users and managers [21]. As database application requirements increase in size and complexity, the comprehensibility and maintainability of the specification degrade rapidly [14]. Therefore, a mechanism is needed to improve ERD comprehensibility and to simplify its maintenance, in particular if we want to apply the ER model effectively to large-scale applications. Indeed, the common ER model includes some abstraction mechanisms that support comprehensibility, mainly generalization (subtyping of entity types) and aggregation (whole-part relationships). However, these abstraction mechanisms alone do not solve the problem of too much detail in too small a space [14].

Hierarchical structuring has been a key tool for abstraction, as it removes the complexity of large schemata generated by enterprise modeling [9]. Hierarchical structuring is commonly used in software engineering, for example, in traditional system analysis, where functional decomposition is carried out with data flow diagrams [7]. Another example of the use of layered diagrams is Statecharts [10]. Unified modeling language (UML) class diagrams include the package construct, but only as an organizational tool (like a folder in file management), and not as a first-class construct.
Evidence from research on memory indicates that hierarchical organization of materials serves as a retrieval plan: recall is facilitated by a general-to-specific search through the organizational structure to locate particular items [17]. From these applications, we can infer that hierarchical
organization and layering of diagrams is likely to be effective for ER diagrams.

In this paper, we propose a structured method consisting of several packaging operations and steps for creating hierarchical entity-relationship diagrams (HERD) in a bottom-up manner.¹ Given a large-scale flat ER diagram, the method creates from it a hierarchy of simple and interrelated diagrams. To enable this, we enrich the common ERD by introducing new constructs: structures, external relationships and composite relationships. A structure is a (partial) ERD that consists of entities and relationships, but may also contain other structures (hence, it is similar to a package in UML), and is related to other structures by external relationships. An external relationship is a relationship of an entity within a structure with other entities that belong to other structures. A composite relationship is a relationship between substructures or between entities and substructures within a structure.

The method of creating HERD includes packaging (grouping) operations that are applied in several steps. In the first step, the packaging operations are applied on the flat (bottom-level) ERD, which creates leaf-level structures and external relationships. In the subsequent steps, the packaging operations are applied recursively on the bottom-level ERD as well as on the already created structures, thus creating higher level structures (possibly with external relationships to other entities) and composite relationships among subordinate structures and entities. In the last step, a top-level diagram, the HERD-tree, is created, which depicts the tree structure of the entire schema.

As a new data model, HERD must be evaluated. A model can be evaluated according to various dimensions, e.g., quality, comprehensibility, learnability, ease of use, maintainability, preference by users or professionals, etc. [22].
We evaluated HERD from the users’ viewpoint as they are expected to be able to understand the meaning of their reality, as expressed by the data model. The evaluation is based on an experimental comparison of HERD with flat ERD. The rest of this paper is structured as follows: Section 2 presents related studies on abstraction of ERD. Section 3 describes the HERD model, and Section 4, the method for creating HERD diagrams. Section 5 presents an experimental comparison of HERD with flat ERDs. Section 6 presents the results and Section 7 summarizes the paper and suggests further research issues.

¹ An earlier version of the method has been presented in [6].

2 Related studies

Several abstraction methods can be found in the ER literature, some of which are reviewed in this section. Teorey et al. [21] proposed an entity clustering technique that integrates object clustering concepts with the traditional design of ER schemas to produce higher levels
of abstractions in a bottom-up process. The fundamental components of their technique are the grouping operations that define which collections of entities and relationships comprise higher level objects (entity clusters). A precedence order of grouping operations is defined according to the internal strength of the relationships among the entities in the cluster. This method, however, has several shortcomings: (1) external relationships are not well-defined, (2) relationship cardinalities and degrees are not detailed, (3) abstractions (generalization, aggregation) are not considered, (4) no information is provided on entities in the “external side” of the relationship, and (5) the root entity cluster (the highest level entity cluster representing the entire conceptual schema) is very complex because it describes all the relationships existing in the data model.

Jaeschke et al. [11], based on the works of Feldman and Miller [8], Teorey et al. [21], Rauh and Stickel [18], and others, refined and extended approaches for clustering an existing ER diagram. They claim that their approach can be used in top-down as well as in bottom-up design processes. The main idea is to determine the major entity types and the coarse relationship types between them. Then, these relationship types are refined iteratively, top-down, by complex and simple relationship clustering. Like Teorey et al. [21], they also used entity clustering. After determining the major entity types, a detailed design of the different relationship clusters can be realized concurrently and independently by different project groups. A major drawback of this method is that an entity may have many relationships, which generates many diagrams at the end of the process, one for each refinement. Furthermore, there is no limitation with respect to the participation of an entity or an entity cluster in the different refinements, and therefore there is no way to find all the locations of an entity or an entity cluster.
Thus, the data model can become very complex, which complicates its maintenance.

Another abstraction method is leveled entity relationship (LER) [9]. This method consists of three basic object types: aspects, entities, and relationships. An aspect is the outward appearance of an inherent feature of the entity. An entity may be atomic (like an ER entity) or it may have an internal structure. An LER relationship may directly associate two entities (like an ER relationship), or it may associate entities by linking subentities within them. A complete LER diagram consists of two sets of diagrams: a set of “glass-box views” defining the entities of interest in a particular schema, and a top-level diagram of the system. A major weakness of LER is that it introduces diagrams that differ significantly from “standard” ER diagrams and changes their interpretation [15]. LER diagrams are complex compared to standard ER diagrams. Moreover, LER does not deal with structural constraints (e.g. cardinalities) and generalizations.

Moody [14, 15] proposed a technique for leveling ER diagrams that resembles the organization of a city street directory. The data model is divided into subject areas,
in which a subject area is a partial ER diagram. In addition, he created a context data model (CDM) diagram, which provides an overview of the whole model, showing all the subject areas and how they fit together. ‘‘Foreign entities’’ are included in each subject area diagram to show relationships to entities in other subject areas. One drawback of this technique is that it is not clear how the model can be organized on any number of levels (as claimed by the author), as it is based on a partitioning of the ER diagram into partial diagrams with no provision for hierarchies. Another limitation is that the technique does not consider ternary relationships (only binary). Finally, the CDM details only the main relationships, without indicating how to determine the important relationships among the subject areas. The method we are proposing overcomes the limitations of the above-mentioned studies. It extends the standard ERD by adding three new constructs (structure, external relationship and composite relationship), but the diagrammatical notation is consistent with the standard ERD. HERD handles various constructs of the ER model, including weak entity types, chains of weak entity types, weak entity types having multiple strong entity types; n-ary relationships (not only binary), generalization hierarchies, and various types of attributes. The method of creating HERD diagrams is based on well-known packaging operations that were also used in earlier studies, but we apply them in a semi-algorithmic process, the result of which is a tree of leveled ER structures. The designer can improve the result of this process (e.g., combine two small related structures) by applying heuristic rules of improvement. The next two sections describe the model and the method for creating HERD.

3 The HERD model

HERD is a hierarchy of diagrams created from a given (presumably large-scale) flat ER diagram. The method applies packaging operations that gather entities and relationships into higher level ER diagrams called structures. Gathering entities and relationships is based on three types of grouping operations: dominance grouping, abstraction grouping, and accumulation. The process is iterative in that the grouping operations are applied on the bottom-level (flat) diagram, as well as on already created structures. A leaf-level structure includes entities, relationships and their attributes (hence, it is a partial ERD). A higher level structure also includes lower level structures and composite relationships, where a composite relationship is an aggregation of relationships among structures and entities. For ease of comprehension, a structure will also show the external relationships, i.e., relationships of entities within the structure with entities belonging to other structures. There is no limitation to the number of levels produced by the method (and the hierarchy of diagrams is not
necessarily balanced). On top of all the diagrams we create a top-level diagram called the HERD-tree, which shows the whole tree of structures in the hierarchy. To elucidate the model and the method, we use an example: the University schema. The flat, bottom-level ERD is presented in Fig. 1. As can be seen, this diagram (which is “squeezed” into a single page) cannot be easily understood. In reality, an ERD might be much more complex and spread over several pages, each page including arrows or page numbers in its margins directing the reader where to look further, as in road maps.

3.1 Structures

Three types of aggregation can be distinguished in the common ER model: (a) an entity that is an aggregation of attributes, (b) a relationship that is an aggregation of entities and attributes, and (c) a composite attribute that is an aggregation of attributes [1]. We extend the common ER model by introducing a new type of aggregation: structure. A structure is a high-level entity that is an aggregation of entities, relationships, and subordinate structures. A structure has the transitivity property, that is, if structure “A” is part of structure “B” and structure “B” is part of structure “C”, then structure “A” is part of structure “C”. A structure is also asymmetric, that is, if structure “A” is part of structure “B”, then structure “B” is not part of structure “A” [3]. A structure is represented as a large bold rectangle with a small rectangle (a “tab”) attached to its upper left corner, like a folder. A leaf-level structure contains only (elementary) entities and the relationships between them, along with their attributes (see examples in Figs. 2, 3, 4). A higher level structure also contains one or more subordinate structures (see examples in Figs. 5, 6, 7). Note that when a structure is contained within a superstructure, its details are not shown; rather, only its name and number are written within the bold rectangle.
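
The transitivity and asymmetry of the part-of relation between structures can be illustrated with a minimal Python sketch. It is not part of the method as published: the class name `Structure`, its fields, and the methods `contains` and `add` are hypothetical names chosen for illustration, and the entity names listed for Structure 1.5 are read off Fig. 4.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    """A HERD structure: an aggregation of entities and subordinate structures."""
    number: str                                       # structure number, e.g. "1.5"
    name: str
    entities: set = field(default_factory=set)        # elementary entity names
    substructures: list = field(default_factory=list)

    def contains(self, other: "Structure") -> bool:
        """Transitive part-of test: is `other` nested anywhere below `self`?"""
        return any(s is other or s.contains(other) for s in self.substructures)

    def add(self, child: "Structure") -> None:
        """Attach `child`, refusing any nesting that would break asymmetry."""
        if child is self or child.contains(self):
            raise ValueError(f"{child.name} cannot become part of {self.name}")
        self.substructures.append(child)


# Structures named in the University example (entity sets are illustrative).
journals = Structure("1.5", "Journals",
                     {"Journal", "Volume", "Paper", "Publisher", "Editor"})
staff = Structure("1.6", "University Staff")
staff_pubs = Structure("2.1", "Staff and Publications")
university = Structure("4.1", "University", {"Department"})

staff_pubs.add(journals)
staff_pubs.add(staff)
university.add(staff_pubs)

print(university.contains(journals))   # True: part-of is transitive
print(journals.contains(university))   # False: part-of is asymmetric
```

Rejecting a child that already contains its prospective parent is one simple way to keep the relation asymmetric; other representations (for example, a parent pointer per structure) would serve equally well.
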
Note also that a structure may have external relationships, which appear outside the frame of the structure (as will be described in the next section). Structures may be created on several levels, according to the packaging operations and steps (also to be detailed in a later section). Higher level structures are likely to contain more substructures than lower level structures. The top-level structure contains mainly lower level structures, but it may also contain elementary entities and relationships (see example in Fig. 7).

It is important to note that every entity belongs to one structure only (at any level), so the structure hierarchy is a strict tree. This ensures that structures are non-overlapping (disjoint), which minimizes redundancy between them [16]. Obviously, attributes of an entity are shown only once, that is, in the structure to which the entity belongs. Note, however, that an entity may appear more than once as part of an external relationship of some structures (as will be described later).

A structure is given a name that is supposed to describe its contents (the name may be identical to the name of a major entity contained in the structure). Every structure is given a unique decimal number to indicate its level in the hierarchy. For example, a leaf-level structure is numbered 1.x (where x is a serial number), a subsequent-level structure is numbered 2.x, etc. A higher level structure may contain structures from different levels (see example in Fig. 7). Structure numbers help in recognizing structures and navigating between them, as well as in finding entities and relationships that belong to each structure.

Fig. 1 Flat ERD of University schema

3.2 External relationships and abstractions

An entity contained in a structure may have relationships with other entities that are not part of that structure (besides its relationships with entities within the structure). Such relationships appear outside the frame of the structure; hence, we term them external relationships (similar to [21]). An external relationship shows the involved entities and the relationship’s attributes (if any). Those entities may be called external entities or
foreign entities, as suggested by Moody [15]. Semantically, an external relationship belongs to the higher level structure in which it is included within a composite relationship. For example, Structure 1.5 (Fig. 4) and Structure 1.6 (not shown²) have an external relationship Authors-of, which is included within the composite relationship Publications of Structure 2.1 (Fig. 5). For ease of comprehension, next to every entity of an external relationship, we write (in parentheses) the number of the structure to which it belongs.

² Due to space limitations we do not show the figures of all structures created for the example; the reader who follows the packaging operations and steps should be able to construct them.

Fig. 2 Structure produced in Step 1 due to Operations (a) and (c) (Sport Areas & Competitions – 1.1)
Fig. 3 Structure produced in Step 1 due to Operations (a) and (b) (Books – 1.2)
Fig. 4 Structure produced in Step 1 due to Operations (a) and (b) (Journals – 1.5)
Fig. 5 Structure produced in Step 2 due to Operation (b) (Staff and Publications – 2.1)
Fig. 6 Structure produced in Step 2 due to Operation (b) (Teams and Competitions – 2.2)

3.3 Composite relationships

For consistency, a higher level abstraction must preserve the relationships among the entities existing in a lower level abstraction [21]. In our case, a higher level structure must “hide” the specific relationships among the entities that belong to subordinate structures. Jaeschke et al. [11] introduced the concept of complex relationship clustering, in which a complex relationship is divided into several relationships and sometimes also entities. We adopt a similar approach: relationships that connect two or more elements (structures or entities) that belong to different structures are grouped together into a composite relationship. Hence, a composite relationship is actually a grouping of external relationships of subordinate structures. A composite relationship is represented as a bold diamond. It is given a name that is descriptive of the specific relationships that it groups. Note that sometimes a composite relationship may consist of only one specific relationship, but sometimes it groups two or more specific relationships. An example of a composite relationship of the former type is Publications (Fig. 5), which relates substructures 1.5 and 1.6 due to the single relationship Authors-of between Paper (which belongs to Structure 1.5; see Fig. 4) and Lecturer (which belongs to Structure 1.6; not shown). An example of a composite relationship of the latter type is Researches (Fig. 7), which relates three structures: Books 1.2 (Fig. 3), Staff-and-Publications 2.1 (Fig. 5) and Research-Proposals 1.3 (not shown). It groups several specific relationships that relate different entities contained in those three structures: one of them is a ternary relationship Books-for-research,
which relates the entities Book, Lecturer and Research-Proposal (Fig. 3); another specific relationship is Secondary-researchers, which relates the entities Research-Year and Lecturer (figures not shown). A composite relationship can be binary, if all the relationships it groups are binary, or ternary, if at least one of its component relationships is ternary, as can be observed in the previous example (Fig. 7). While every specific relationship that belongs to a composite relationship has its specific cardinalities (e.g. 1:n, m:n), a composite relationship does not define cardinalities because its member (specific) relationships may have different ones. Similarly, a composite relationship does not detail attributes, because each of its member relationships may have different attributes.

Fig. 7 Top-level structure (University – 4.1)
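
As an illustration of the two rules just stated (which elements a composite relationship links, and when it is binary or ternary), the following Python sketch may help. The names `Relationship`, `composite_arity` and `linked_elements` are hypothetical, and the placement of Research-Year in Structure 1.3 is an assumption, since that structure is not shown in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A specific (member) relationship and the entities it relates."""
    name: str
    entities: tuple


def composite_arity(members):
    """Ternary if at least one member relationship is ternary, else binary."""
    return "ternary" if any(len(r.entities) == 3 for r in members) else "binary"


def linked_elements(members, element_of):
    """Elements (substructure numbers or entities) related by the composite."""
    return {element_of[e] for r in members for e in r.entities}


# Element of the top-level structure in which each entity resides.
# Research-Year in 1.3 is assumed; the other placements follow the text.
element_of = {
    "Book": "1.2",
    "Lecturer": "2.1",
    "Research-Proposal": "1.3",
    "Research-Year": "1.3",
}

# Two of the member relationships of the composite relationship Researches.
researches = [
    Relationship("Books-for-research", ("Book", "Lecturer", "Research-Proposal")),
    Relationship("Secondary-researchers", ("Research-Year", "Lecturer")),
]

# The single member of the composite relationship Publications.
publications = [Relationship("Authors-of", ("Paper", "Lecturer"))]

print(composite_arity(researches))                       # ternary
print(composite_arity(publications))                     # binary
print(sorted(linked_elements(researches, element_of)))   # ['1.2', '1.3', '2.1']
```

Note that the sketch deliberately stores no cardinalities or attributes on the composite itself; these remain with the member relationships, as described above.
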

4 The method of creating HERD

4.1 Packaging operations

The method we have designed to create HERD is based on packaging operations that are applied in several steps. Packaging operations gather entities and relationships to form structures. We distinguish between three types: (a) dominance grouping, (b) accumulation, and (c) abstraction grouping:

(a) Dominance grouping: This operation groups weak entities together with their strong entities (see examples in Figs. 2, 3). If a weak entity depends upon more than one strong entity, it is arbitrarily grouped with one of the strong entities.³ If a weak entity depends on another weak entity, all weak entities in the “chain” are grouped together with the top-level strong entity (see example in Fig. 4).
(b) Accumulation: An entity that is related to only one other entity (that is, it has only one relationship) is grouped together with that entity. The reason for this is that such an entity cannot be detached from the related entity. Examples are the entity Author, which is related only to Book (Fig. 3); and the entities Publisher and Editor, which are related only to the entity Journal (Fig. 4). The accumulation operation can also be applied on a structure that is related to another structure (see Fig. 5), or to a structure that is related to an entity (see Fig. 6), depending on the packaging step in which it is applied.
(c) Abstraction grouping: Multilevel data objects that are related through generalization/specialization (super/subtypes) or aggregation (whole-parts) may be grouped into an entity cluster [21]. Subtypes are grouped together with their supertype (see Fig. 2), and participating entities (the “parts”) are grouped together with the aggregating entity (the “whole”).

³ This means that the method does not provide a deterministic solution and slightly different structures may be created, as will be elaborated later on.
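
Operations (a) and (b) can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the authors' implementation: the function names `dominance_owner` and `accumulation_targets` are hypothetical, the relationship list is only a fragment of Fig. 1, and the weak-entity chain Paper, Volume, Journal is an assumed reading of Fig. 4.

```python
from collections import Counter, defaultdict


def dominance_owner(entity, owners):
    """Operation (a): follow a chain of weak entities to the top-level strong entity.

    `owners` maps each weak entity to the list of entities it depends on.
    When several owners exist, the first is taken (the choice is arbitrary).
    """
    seen = set()
    while entity in owners:
        if entity in seen:
            raise ValueError("cyclic dependency among weak entities")
        seen.add(entity)
        entity = owners[entity][0]
    return entity


def accumulation_targets(relationships):
    """Operation (b): map each entity that has exactly one relationship,
    with exactly one other entity, to that other entity."""
    rel_count = Counter()
    partners = defaultdict(set)
    for _name, participants in relationships:
        for entity in participants:
            rel_count[entity] += 1
            partners[entity].update(p for p in participants if p != entity)
    return {e: next(iter(partners[e])) for e in rel_count
            if rel_count[e] == 1 and len(partners[e]) == 1}


# A fragment of the University schema (relationship name, participants).
relationships = [
    ("Book authors", ("Book", "Author")),
    ("Journal publisher", ("Journal", "Publisher")),
    ("Journal editors", ("Journal", "Editor")),
    ("Books for research", ("Book", "Lecturer", "Research Proposal")),
    ("Authors of", ("Paper", "Lecturer")),
    ("Publish in", ("Paper", "Volume")),
    ("Volumes of journals", ("Journal", "Volume")),
]

# Assumed weak-entity dependencies for the Journals structure.
owners = {"Volume": ["Journal"], "Paper": ["Volume"]}

print(accumulation_targets(relationships))
# {'Author': 'Book', 'Publisher': 'Journal', 'Editor': 'Journal'}
print(dominance_owner("Paper", owners))   # Journal
```

Research Proposal has a single relationship in this fragment but is not accumulated, because that relationship is ternary and therefore relates it to more than one other entity. Taking the first listed owner in `dominance_owner` mirrors the arbitrary choice described in footnote 3.
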

4.2 Packaging steps

The above packaging operations are applied in the following four steps:

4.2.1 Step 1: create leaf-level structures

The first step in the process is to form leaf-level structures, by applying the grouping operations (a), (b) and (c) on the flat (bottom-level) ERD. Every structure is given a number (1.x). Within the frame of each structure are the entities and relationships that are grouped by following one or more of the above operations. Outside
the frame are the external relationships. We add to every external entity the number of the structure to which it belongs. Initially, not every external entity already belongs to a structure, so structure numbers may be added at later steps, once an external entity is grouped within a structure. In our example we obtain seven leaf-level structures as a result of Step 1. We show only three of them: Fig. 2 shows Structure 1.1, the result of the application of operations (a) and (c); Fig. 3 shows Structure 1.2, the result of operations (a) and (b); and Fig. 4 shows Structure 1.5, the result of operations (a) and (b). As noted earlier, application of Step 1 does not provide a deterministic solution; that is, we may obtain slightly different structures, depending on the order in which the packaging operations are applied. For example, a weak entity that depends on more than one strong entity may be arbitrarily grouped with one of them, thus enabling construction of slightly different structures.

4.2.2 Step 2: create higher level structures

In this step, higher level structures are constructed by applying packaging operations (b) and (c) recursively on the leaf-level structures produced in Step 1, and on entities that have not been grouped yet. (Operation (a) is no longer relevant; it is applied only in Step 1.) For example, Structure 2.1 groups subordinate structures 1.5 and 1.6 due to operation (b) (Fig. 5). Similarly, Structure 2.2 groups subordinate Structure 1.1 and the (simple) entity Sport-Team, the only entity to which it is related (Fig. 6). Step 2 is applied recursively, until the grouping operations (b) and (c) can no longer be applied to the remaining elements (i.e., the entities and structures). Following every iteration of Step 2, we add structure numbers to the external entities of the already created structures. In the last iteration of Step 2, we obtain the top-level structure (in our example, Structure
4.1—University; see Fig. 7). Note that though this structure consists mainly of structures and composite relationships, in the example we can also see the entity Department (which could not be grouped into a structure earlier in the process because of its various relationships with other entities). In summary, in its first iteration Step 2 created two structures: 2.1 (Fig. 5) and 2.2 (Fig. 6). In the second iteration it created Structure 3.1 (not shown), and in the last iteration—Structure 4.1 (Fig. 7). 4.2.3 Step 3: change and improve structures Steps 1 and 2 applied the packaging operations according to the defined operations. In Step 3 the designer may apply heuristic rules to improve the resulting structures, because sometimes a structure may be too small or simple (it may consist of too few entities/ structures and relationships), or too large or complex. One such rule would be to eliminate a small structure by incorporating its elements within a related structure, thus reducing the total number of structures in the model. For example, Structure 2.2 (Fig. 6) is too small as it contains only one substructure and one entity. It may be eliminated by grouping entity Sport-Team within Structure 1.1 (see Fig. 8). Similarly, Structure 2.1 (Fig. 5) consists of two substructures. We could eliminate one of its substructures, say Structure 1.5 and replace it with its components (not shown). On the other hand, if a structure is too large, a rule could be to split it into two smaller structures. If we split a structure, we also have to update the external relationships. Another situation that could require making a change is when a weak entity has several strong entities, each placed in a different structure. In that case, we may opt to combine those structures (if they are not already too big). These are just a few examples of heuristic rules that might be applied by the designer to produce a comprehensive set of hierarchical diagrams. Note that after performing any

Fig. 8 Revised structure 1.1 produced by Step 3

Belongs to

Sport Teams & Competitions – 1.1

area name area description

1:1 Sport Area

1:1

Area of team

0:n 1:n Sport Team

Competition in area

Swimming

0:n

Teams in competition

2:2

style

Competition level

Undergraduate Student (1.7)

no of players team name

0:n

Football

date

1:n

result

224

of independent variables consists of user characteristics. These may include general work/computer/IS/database experience, education, intellectual ability, etc. The subjects of the experiments in these studies were usually students with varying degrees of training or experience in information systems and databases. Other independent variables were task characteristics, including task type (e.g., comprehension or problem solving), and task complexity. Dependent variables in data model research are generally divided into two main categories: user performance and user attitude. Performance is divided into three subcategories: model correctness (also referred to as skill knowledge), time used to create the solution, and declarative knowledge (understanding of the notation). In most cases, the correctness of the model is measured according to the degree to which it corresponds to a predefined solution. User attitude mainly includes preference for a certain model and perceived ease of use. Some studies take a designer/modeler perspective, and these are mainly concerned with measuring model correctness and determining which model/method will yield more accurate and precise solution. Other studies take an end-user perspective by measuring which model is more comprehensible or preferred by users who communicate with professionals. In this study, we take an end-user perspective, to determine the degree of comprehensibility of HERD diagrams as compared to that of ‘‘standard’’, flat ER diagrams.

change in structures, the respective higher level structure must be updated accordingly.

4.2.4 Step 4: create the HERD-tree

In Step 4 we create the HERD-tree diagram. Figure 9 shows the HERD-tree created from the structures obtained before any changes were made in Step 3 (that is, we disregard the change presented in Fig. 8). At the bottom of each structure are listed the entity names that the structure includes. This facilitates coordination: to find a certain entity and its attributes or relationships, one has to review the HERD-tree diagram, locate the structure to which that entity belongs, and view the details in that structure. The HERD-tree is built bottom-up: at the bottom level are the leaf-level structures (numbered 1.x), above them are the secondary-level structures (numbered 2.x), and so on, until the top-level structure is reached.
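The lookup described above (review the HERD-tree, locate the structure that lists the entity, then open that structure) can be sketched in a few lines. This is only an illustration: the class `Structure` and the function `find_path` are hypothetical names, and the small tree fragment below assumes that Structure 2.2 holds the entity Sport Team plus substructure 1.1, with the entity lists taken loosely from the labels of Fig. 9.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Structure:
    """One node of a HERD-tree: a numbered ER structure (hypothetical model)."""
    number: str                                   # e.g. "2.2"; the part before the dot is the level
    name: str
    entities: List[str] = field(default_factory=list)          # entities drawn directly in this structure
    substructures: List["Structure"] = field(default_factory=list)

    def level(self) -> int:
        # Leaf-level structures are numbered 1.x, the next level 2.x, and so on.
        return int(self.number.split(".")[0])


def find_path(root: Structure, entity: str) -> Optional[List[str]]:
    """Return the structure numbers from `root` down to the structure holding `entity`."""
    if entity in root.entities:
        return [root.number]
    for child in root.substructures:
        path = find_path(child, entity)
        if path is not None:
            return [root.number] + path
    return None  # the entity does not appear in this part of the tree


# Illustrative fragment only (assumed contents, not the full Fig. 9).
leaf = Structure("1.1", "Sport Teams & Competitions", ["Sport Area", "Competition"])
teams = Structure("2.2", "Teams and Competitions", ["Sport Team"], [leaf])

print(find_path(teams, "Competition"))  # ['2.2', '1.1']
print(find_path(teams, "Sport Team"))   # ['2.2']
print(find_path(teams, "Journal"))      # None
```

The returned path tells the reader which diagrams to open, from the higher-level structure down to the one that shows the entity's attributes and relationships.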

5 Experimental comparison of HERD and FLAT ERDs

5.1 Related studies on experimental comparisons of data models

Numerous studies have evaluated and compared various data models in experimental settings. An extensive review of prior research on data modeling is provided by Topi and Ramesh [22]. They surveyed 27 studies from 1978 to 2001 that employ social science methods, mainly laboratory experiments, to evaluate and improve the usability of data models/methods. Obviously, the data model is the most common independent variable. Many of the studies compared ERDs with normalized relations, or with object-class diagrams. The next category

5.2 Hypotheses of comprehension and preference We define eleven hypotheses which encompass three main user performance criteria: comprehension of models (diagrams), time to complete the task of com-

Fig. 9 HERD-tree (before changes applied by Step 3) [Tree diagram. Structures shown: University 4.1; Students & Sport Activities 3.1; Staff and Publications 2.1; Teams and Competitions 2.2; Sport Teams & Competitions 1.1; Books 1.2; Proposals 1.3; Courses 1.4; Journals 1.5; University Staff 1.6; Student & Sport 1.7. Entity names listed under the structures: Department, Sport Team, Graduate Student, Student, Undergrad. Student, Sport Area, Competition, Author, Book, Book Copy, Research Year, Research Proposal, Fund, Course, Course Offering, Publisher, Editor, Journal, Volume, Paper, Dependent, Lecturer, Employee, Administrative.]


1. There is no difference in the overall comprehension of the two models.
2. There is no difference in comprehension when dealing with attributes of an entity.
3. There is no difference in comprehension when dealing with a binary relationship.
4. There is no difference in comprehension when dealing with a ternary relationship.
5. There is no difference in comprehension when dealing with an abstraction (generalization).
6. There is no difference in comprehension when dealing with two binary relationships.
7. There is no difference in comprehension when dealing with more than two relationships.
8. There is no difference in comprehension when dealing with both an abstraction (generalization) and a relationship.
9. There is no difference in comprehension when dealing with a weak entity.
10. There is no difference in time to complete the comprehension tasks.
11. There is no difference in user preference of the two models.

5.3 Experimental design

A laboratory experiment was designed to test the hypotheses. Laboratory experiments offer the advantages of control, manipulation, and measurement of variables, but suffer from lack of external validity [12]. Realistic examples may enhance the external validity of the study. The experimental design is described in Fig. 10. The major dependent variable is the user comprehension of the diagrams. Additional criteria are, as said, time to complete the task of comprehension and user preference of a model. The independent variables include the two models (namely diagram sets), which were compared by using two examples (case studies) that we prepared for the experiment: a Hospital schema and a Parliament schema. First, we created a flat ERD for each example. Since each ERD could not fit into a single page, we had to split it into a number of pages, and paste them together at the margins of each page. Using the packaging operations and steps as described above, we created a set of HERD diagrams and a HERD-tree for each example. The controlled variables were the subjects (users) and the tasks. The subjects were asked to respond to questions aimed at measuring the comprehensibility of the two models. We prepared two questionnaires, each consisting of 41 "true"/"false" statements about facts appearing in the respective diagrams. We classified the statements into eight categories, according to the hypotheses above. Table 1 shows the number of statements in each category per example. We also prepared four sets of questionnaires for each example, each with a different ordering of the (same) statements. The


prehension, and preference of models. Nine hypotheses deal with comprehension: one is the overall comprehension of a model, and the other eight refer to the comprehension of specific facets (namely structural elements) of a model. The reason for this is that we need to distinguish between the comprehensibility of the various facets of each model. Comprehension of model facets was first measured by Batra et al. [2], who were followed by others, including Bock and Rian [4], Shoval and Shiran [20], and Kim and March [13]. Analysis of performance by model facets is a form of analysis of the task type, because modeling of specific facets can be seen as subtasks; the facet being modeled often moderates the impact of a specific modeling form on performance [22]. The eight facets in our experiment are: attributes of an entity, a binary relationship, a ternary relationship, an abstraction (generalization), two relationships, more than two relationships, an abstraction and a relationship, and a weak entity. The tenth hypothesis concerns the time it might take to complete the task of comprehension. The assumption is that if a model is more comprehensible, comprehension will take less time. With regard to the last hypothesis, it is assumed that users will prefer the model they find the most comprehensible. Following are the null hypotheses:

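Fig. 10 The experimental design [Diagram relating Data Models (HERD, Flat-ERD), Tasks/Examples (Hospital, Parliament) and Subjects (Group A, Group B) to Performance (user comprehension, overall and per 8 facets; time to complete task) and Preference of model.]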

Table 1 Number of statements in each facet/category

Facet                           Hospital   Parliament
Attributes of an entity             5          5
Binary relationship                 6          7
Ternary relationship                3          4
Abstraction                         2          3
Two relationships                   7          6
More than two relationships         7          6
Abstraction and relationship        6          6
Weak entity                         5          4
Total                              41         41
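As a rough illustration of how the counts in Table 1 can be used, the sketch below grades one questionnaire per facet as the fraction of correctly marked statements. The function name `grade_by_facet` and the sample subject are hypothetical, and computing the overall grade as total correct answers divided by 41 is an assumption about how an overall grade might be obtained.

```python
from typing import Dict

# Statements per facet in the Hospital questionnaire (taken from Table 1).
HOSPITAL_COUNTS: Dict[str, int] = {
    "Attributes of an entity": 5,
    "Binary relationship": 6,
    "Ternary relationship": 3,
    "Abstraction": 2,
    "Two relationships": 7,
    "More than two relationships": 7,
    "Abstraction and relationship": 6,
    "Weak entity": 5,
}
# The facet counts must add up to the 41 statements of the questionnaire.
assert sum(HOSPITAL_COUNTS.values()) == 41


def grade_by_facet(correct: Dict[str, int], counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of correct answers per facet, plus an overall fraction (assumed definition)."""
    grades = {facet: correct[facet] / n for facet, n in counts.items()}
    grades["Overall"] = sum(correct[f] for f in counts) / sum(counts.values())
    return grades


# Hypothetical subject: everything correct except one ternary-relationship statement.
correct_answers = dict(HOSPITAL_COUNTS)
correct_answers["Ternary relationship"] = 2

grades = grade_by_facet(correct_answers, HOSPITAL_COUNTS)
print(round(grades["Ternary relationship"], 3))  # 0.667
print(round(grades["Overall"], 3))               # 0.976
```

Because the facets contain between 2 and 7 statements, a single wrong answer moves a facet grade by very different amounts, which is worth keeping in mind when reading per-facet means.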


questionnaires were randomly assigned to the subjects to avoid any bias arising from the order of the statements. Forty-two students from the Software Engineering Department participated in the experiment. These students took the same courses, including the Databases course, where they studied the two models. They first studied and exercised with the "traditional" ER model, and then with the HERD model, on which they spent less time and did fewer exercises. (This fact might have influenced the experimental results, as will be shown.) The subjects were randomly divided into two main groups. Each subject in each group received a set of pages including the flat ERD of one case study, and then a set of HERD diagrams of the other case study. The division of subjects and tasks is described in Table 2. Subjects in Group "A" evaluated HERD diagrams of the Hospital example and an ER diagram of the Parliament example. Subjects in Group "B" evaluated an ER diagram of Hospital and HERD diagrams of Parliament. To avoid bias arising from the order of tasks, the subjects in each group were further randomly divided into two subgroups; in each subgroup they started working with a different example and model. Along with the diagrams of each example, the subjects received the respective questionnaire, and were asked to mark "true" or "false" next to each statement. (Recall that we prepared four sets of questionnaires, with different ordering of the statements; they were distributed to the subjects randomly.) To measure the time it took to complete the tasks, for each subject we recorded the start and end time of completing each questionnaire. After having completed the two tasks, each subject was given a separate questionnaire that rated each user's model preference on a 1–7 point scale.
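The assignment procedure just described can be illustrated with a short sketch. It is not the procedure actually used in the experiment: the function `assign_subjects`, the subject identifiers, the fixed seed, and the choice to form subgroups by alternating task order within an already shuffled group are all assumptions made for the example. The group-to-task mapping follows the description of Groups "A" and "B" in the text.

```python
import random
from typing import Dict, List


def assign_subjects(subject_ids: List[str], seed: int = 0) -> Dict[str, dict]:
    """Randomly split subjects into two groups and counterbalance task order within each."""
    rng = random.Random(seed)           # fixed seed only so that the sketch is repeatable
    shuffled = subject_ids[:]
    rng.shuffle(shuffled)

    half = len(shuffled) // 2
    groups = {"A": shuffled[:half], "B": shuffled[half:]}
    # Group A: HERD of Hospital and flat ERD of Parliament; Group B: the reverse.
    tasks = {
        "A": [("HERD", "Hospital"), ("Flat ERD", "Parliament")],
        "B": [("Flat ERD", "Hospital"), ("HERD", "Parliament")],
    }

    plan: Dict[str, dict] = {}
    for group, members in groups.items():
        for i, subject in enumerate(members):
            # Alternating the order inside a shuffled group yields two random subgroups.
            order = tasks[group] if i % 2 == 0 else tasks[group][::-1]
            plan[subject] = {
                "group": group,
                "tasks": order,
                # One of the four statement orderings, picked at random.
                "questionnaire_version": rng.randint(1, 4),
            }
    return plan


plan = assign_subjects([f"S{i:02d}" for i in range(1, 43)])
print(plan["S01"])
```

Every subject sees both models and both examples exactly once, so model and example are never compared within the same schema for the same person, while task order is balanced inside each group.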

6 Results

The comprehensibility of the models was measured by counting the number of correct answers given by each subject to the questions. Then the number of correct answers per facet/category was summed, for each example and model separately, and converted to a percentage scale. We also computed the overall grade for each subject, per example and per model, and then we applied the t-statistic to test the significance of differences between the mean grades per facet/category and the overall grades. Similarly, we computed the time it took each participant to complete each task, and based on that, computed the mean time to complete each task, per example and model. Then, we tested the significance of the differences between the mean times using the t-test. Finally, we computed and compared the average preferences of models, as expressed by each subject. The results are presented in Tables 3, 4, 5, and 6. Table 3 shows the results of the Hospital example. As indicated in the last column, we found no significant difference in the comprehension of almost all facets/

Table 2 Division of subjects and tasks

Subjects          Subgroup A-1            Subgroup A-2            Subgroup B-1            Subgroup B-2
No. of subjects   21                      21                      20                      20
First task        Flat ERD, Hospital      HERD, Parliament        Flat ERD, Parliament    HERD, Hospital
Second task       HERD, Parliament        Flat ERD, Hospital      HERD, Hospital          Flat ERD, Parliament

Table 3 Results of the Hospital example

No.  Facet/category                Model     Mean grade  No. of observations  t-Statistic  P(T≤t) two-tail  Which model is significantly better (at α=0.05)
1    Overall comprehension         HERD      0.8124      20                   0.843        0.404            -
                                   Flat ERD  0.7943      21
2    Attributes of entity          HERD      0.900       20                   0.565        0.580            -
                                   Flat ERD  0.867       21
3    Binary relationship           HERD      0.850       20                   1.759        0.087            HERD (weak)
                                   Flat ERD  0.816       21
4    Ternary relationship          HERD      0.700       20                   0.433        0.020            Flat ERD
                                   Flat ERD  0.857       21
5    Abstraction                   HERD      0.975       20                   1.000        0.324            -
                                   Flat ERD  0.929       21
6    Two relationships             HERD      0.839       20                   0.141        0.889            -
                                   Flat ERD  0.830       21
7    More than two relationships   HERD      0.726       20                   1.601        0.118            -
                                   Flat ERD  0.646       21
8    Abstraction and relationship  HERD      0.917       20                   1.019        0.315            -
                                   Flat ERD  0.873       21
9    Weak entity                   HERD      0.640       20                   0.361        0.720            -
                                   Flat ERD  0.667       21
10   Time (min)                    HERD      42.950      21                   0.919        0.063            Flat ERD (weak)
                                   Flat ERD  37.524      21

Table 4 Results of the Parliament example

No.  Facet/category                Model     Mean grade  No. of observations  t-Statistic  P(T≤t) two-tail  Which model is significantly better (at α=0.05)
1    Overall comprehension         HERD      0.813       21                   0.869        0.069            Flat ERD (weak)
                                   Flat ERD  0.8679      20
2    Attributes of entity          HERD      0.971       21                   0.063        0.950            -
                                   Flat ERD  0.970       20
3    Binary relationship           HERD      0.830       21                   0.818        0.077            Flat ERD (weak)
                                   Flat ERD  0.907       20
4    Ternary relationship          HERD      0.810       21                   1.803        0.080            HERD (weak)
                                   Flat ERD  0.713       20
5    Abstraction                   HERD      0.714       21                   0.225        0.032            Flat ERD
                                   Flat ERD  0.883       20
6    Two relationships             HERD      0.778       21                   0.197        0.035            Flat ERD
                                   Flat ERD  0.900       20
7    More than two relationships   HERD      0.825       21                   0.479        0.148            -
                                   Flat ERD  0.900       20
8    Abstraction and relationship  HERD      0.778       21                   0.037        0.971            -
                                   Flat ERD  0.775       20
9    Weak entity                   HERD      0.750       21                   0.247        0.031            Flat ERD (weak)
                                   Flat ERD  0.850       20
10   Time (min)                    HERD      51.524      21                   0.976        0.335            -
                                   Flat ERD  48.000      20

categories and in the overall comprehension. Only in the ternary relationship facet was there a significant advantage to Flat ERD, while in the binary relationship facet there was a slight (not significant) advantage to HERD. With respect to time, we found that it takes slightly (not significantly) less time to complete the comprehension task when working with a flat ERD. Table 4 shows the equivalent results of the Parliament example. As indicated in the last column, there were no significant differences in most of the categories and the overall grades. Only for the abstraction category and two relationships category was there a significant advantage to Flat ERD diagrams, while in the ternary relationships category there was a slight (not significant) advantage to HERD. Overall, there was a slight (insignificant) advantage to flat ERD. With respect to time, we found no significant difference between the two models. Tables 3 and 4 present the results for each example separately. Before attempting to combine the results of the two examples we had to first verify that they are not different (in terms of their comprehensibility) and that there is no effect of interaction. For this, we conducted a two-way ANOVA test (see Table 5), which disclosed that there was a significant difference between the two examples and some interaction effect. As such, we could not combine the results of the two examples. Obviously,

Table 5 ANOVA

Effect        F       p-level
Model         0.073   0.3950
Example       2.986   0.0879
Interaction   2.889   0.0931

Table 6 Model preference

Model     Mean   No. of observations  t-Statistic  Which model is significantly better?
HERD      5.683  41                   3.826        HERD (at 0.000)
Flat ERD  4.805  41

due to the inconclusive results obtained for each example separately, we could not expect different results even if we could combine them. Table 6 presents the results of user preferences of models. We used a paired sample t-test for each model. The t-test revealed that the subjects preferred the HERD diagrams.
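For readers who want to see what these statistics compute, the following sketch implements the two t-statistics using only the Python standard library. It is an illustration under assumptions: the function names are hypothetical, the independent-samples version assumes equal variances (the paper does not state which variant was used), the sample data are invented, and p-values are left out because they would require a t-distribution routine from a statistics package.

```python
import math
from statistics import mean, stdev, variance
from typing import Sequence


def two_sample_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Student's t for two independent samples, assuming equal variances (pooled)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def paired_t(x: Sequence[float], y: Sequence[float]) -> float:
    """Student's t for paired observations, e.g. one subject rating both models."""
    diffs = [xi - yi for xi, yi in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Invented data: correct answers of two small groups, and 1-7 preference ratings.
herd_scores = [8, 9, 7, 8]
flat_scores = [7, 8, 6, 7]
print(round(two_sample_t(herd_scores, flat_scores), 3))  # 1.732

herd_pref = [6, 5, 7, 6]
flat_pref = [5, 5, 5, 4]
print(round(paired_t(herd_pref, flat_pref), 3))          # 2.611
```

The comprehension grades come from different subjects per model within one example, which is why an independent-samples statistic fits there, whereas every subject rated both models for preference, which is the setting for a paired statistic.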

7 Summary and further research

We introduced extensions to the common ER diagram, and a method for creating hierarchical ER diagrams using a bottom-up method. The main idea was to group entities and relationships into higher level structures and composite relationships. Each such structure is a partial ER diagram that is small enough to be easily comprehensible, and has external relationships to related structures. We compared the comprehensibility of HERD to the flat ERD. Surprisingly, although users expressed a preference for HERD, we did not find a clear advantage with respect to comprehensibility. Although hierarchical models are commonly used in many areas, including software engineering, we could not prove that hierarchical ER diagrams are more comprehensible than nonhierarchical (flat) ER diagrams. To verify these results, we identify two experimental effects that might have biased the results in favor of flat-ERD: order of learning


and time of learning. With respect to order of learning, the subjects learned flat ER diagrams first, and then HERD. With respect to time of learning, they spent more time on learning and exercising with flat ER diagrams—two 3-h lectures, including several examples, plus three homework problems—compared to only one 2-h lecture on HERD, including a single example only, and no homework assignment. Another possible explanation for the results is that the questions of comprehension posed to each participant addressed all parts of the diagrams, while in practice it is likely that different people would only need to focus on or validate limited parts of the overall model. If so, the use of the hierarchical abstraction mechanism might be much more valuable than what we found in this study. All in all, the results of this experiment are not conclusive and further experiments are needed. In future, we plan to conduct more controlled experiments, controlling the order of learning and the time of learning effects. We also plan to use examples with varying numbers of entities and relationships, in order to explore the effects of task size/complexity on comprehension. We hypothesize that the advantage of hierarchy will appear as the schema increases in size and complexity. The evaluation of HERD as presented in this study took mainly user comprehension and preference points of view. Other factors which may affect the value of the model and the method are also worthy of consideration; for example, efficiency. Although the method utilizes well-defined packaging steps, following them and constructing the diagrams may require more work. This can be measured and contrasted with any advantages the model may have. Another issue worth pursuing is how to define more precisely the heuristic rules (applied in Step 3 of the method) and how to apply them. 
For example, how small/simple should two structures be in order to combine them, or how large/complex should a structure be in order to decompose it into smaller structures. A similar issue is the priority of operations when more than one operation can be applied. We noted (in Sect. 4.2) that the application of the grouping operations does not provide a deterministic solution, because it depends on the order in which the operations are applied, but we did not examine different alternatives and their impact on the resulting structures. Another, more general topic for further research is extending the method so as to enable creating HERD from top-down as opposed to the bottom-up method presented here. The objective would be to propose a method to create a top-level ER structure first (based on some description of user requirements), and then to decompose it following certain steps and rules of decomposition, until bottom-level ER diagrams are obtained.
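To make the open question about the Step 3 heuristics more concrete, the sketch below shows one way such rules could be parameterized. Everything here is hypothetical: the thresholds `MIN_ELEMENTS` and `MAX_ELEMENTS`, the class and function names, and the simplification of measuring size by entities and substructures only (relationships are ignored). The example reuses the case mentioned earlier, where a structure holding one entity and one substructure is eliminated by moving the entity into the substructure.

```python
from dataclasses import dataclass, field
from typing import List

MIN_ELEMENTS = 3  # hypothetical threshold: fewer elements counts as "too small"
MAX_ELEMENTS = 9  # hypothetical threshold: more elements counts as "too large"


@dataclass
class Structure:
    name: str
    entities: List[str] = field(default_factory=list)
    substructures: List["Structure"] = field(default_factory=list)

    def size(self) -> int:
        # Elements drawn directly in this structure (relationships are not counted here).
        return len(self.entities) + len(self.substructures)


def suggest_action(s: Structure) -> str:
    """Suggest 'merge', 'split' or 'keep' from the size thresholds above."""
    if s.size() < MIN_ELEMENTS:
        return "merge"
    if s.size() > MAX_ELEMENTS:
        return "split"
    return "keep"


def merge_into_substructure(small: Structure) -> Structure:
    """Eliminate a structure that has exactly one substructure by moving its entities into it."""
    if len(small.substructures) != 1:
        raise ValueError("expected exactly one substructure")
    target = small.substructures[0]
    target.entities.extend(small.entities)
    return target


s11 = Structure("Sport Teams & Competitions (1.1)", ["Sport Area", "Competition"])
s22 = Structure("Teams and Competitions (2.2)", ["Sport Team"], [s11])

print(suggest_action(s22))            # merge
revised = merge_into_substructure(s22)
print(revised.entities)               # ['Sport Area', 'Competition', 'Sport Team']
```

Making the thresholds explicit parameters would allow different settings to be compared experimentally, which is the kind of study the paragraph above calls for.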

References

1. Batini C, Ceri S, Navathe S (1992) Conceptual database design: an entity relationship approach. Benjamin Cummings, Redwood City
2. Batra D, Hoffer J, Bostrom R (1990) Comparing representations with relational and ER models. Commun ACM 33(2):126–139
3. Blaha M, Premerlani W (1998) Object-oriented modeling and design for database applications. Prentice-Hall, Englewood Cliffs
4. Bock D, Rian T (1993) Accuracy in modeling with extended Entity-Relationship and object-oriented data models. J Database Manag 4(4):30–39
5. Chen P (1976) The entity-relationship model – toward a unified view of data. ACM Trans Database Syst 1(1):9–36
6. Danoch R, Shoval P, Balaban M (2001) Hierarchical evolution of entity-relationship diagrams – a bottom-up approach. In: Proceedings of the 6th CAISE/IFIP8.1 int'l workshop on evaluation of modeling methods in systems analysis and design (EMMSAD'01). Interlaken, Switzerland
7. De Marco T (1978) Structured analysis and system specification. Yourdon Press, New York
8. Feldman P, Miller D (1986) Entity model clustering: a data model by abstraction. Comput J 29(4):348–360
9. Gandhi M, Robertson EL, Gucht DV (1994) Leveled entity-relationship model. In: Proceedings of the 13th international conference on the entity-relationship approach. Manchester, pp 420–433
10. Harel D (1988) On visual formalism. Commun ACM 31(5):514–530
11. Jaeschke P, Oberweis A, Stucky W (1993) Extending ER model clustering by relationship clustering. In: Proceedings of the 12th international conference on the entity-relationship approach, Berlin, pp 451–462
12. Kerlinger F (1986) Foundation of behavioral research. Holt, Rinehart and Winston, Orlando
13. Kim Y, March S (1995) Comparing data modeling formalisms. Commun ACM 38(6):103–115
14. Moody D (1996) Graphical entity relationship models: toward more user understanding representation of data. In: Proceedings of the 15th international conference on conceptual modeling. Cottbus, pp 227–244
15. Moody D (1997) A multi-level architecture for representing enterprise data models. In: Proceedings of the 8th international database workshop. Springer, Singapore, pp 42–61
16. Moody D (1999) A methodology for clustering entity relationship models – a human information processing approach. In: Proceedings of the 18th international conference on entity-relationship approach, pp 114–130
17. Najarian SE (1981) Organizational factors in human memory: implications for library organization and access systems. Libr Q 51(3):269–291
18. Rauh O, Stickel E (1992) Entity tree clustering – a method for simplifying ER design. In: Proceedings of the 11th international conference on the entity-relationship approach, pp 62–78
19. Shoval P, Frumermann I (1994) OO and EER conceptual schemas: a comparison of user comprehension. J Database Manag 5(4):28–38
20. Shoval P, Shiran S (1997) Entity-Relationship and object-oriented data modeling – an experimental comparison of design quality. Data Knowl Eng 21:297–315
21. Teorey T, Wei G, Bolton D, Koenig J (1989) ER model clustering as an aid for user communication and documentation in database design. Commun ACM 32(8):975–987
22. Topi H, Ramesh V (2002) Human factors research on data modeling: a review of prior research, an extended framework and future research directions. J Database Manag 13(2):3–19
