Int. J. Human-Computer Studies (2001) 54, 25-51. doi:10.1006/ijhc.2000.0406. Available online at http://www.idealibrary.com
Consulting support during conceptual database design in the presence of redundancy in requirements specifications: an empirical study

DINESH BATRA, College of Business Administration, Florida International University, University Park, Miami, FL 33199, USA. email: [email protected]

SOLOMON R. ANTONY, Area of ISQS, College of Business Administration, Texas Tech University, Lubbock, TX 79409, USA. email: [email protected]
(Received 15 July 1999 and accepted in revised form 6 June 2000)

This study examines the efficacy of a consulting system for designing conceptual databases in reducing data modelling errors. Seventy-two subjects participated in an experiment requiring modelling of two tasks using the consulting system. About half the subjects used the treatment version and the other half used the control version. The control version resembled the treatment version in the look and feel of the interface; however, it did not embed the rules and heuristics that were included in the treatment version. Research findings suggest that subjects using the treatment version significantly outscored their control version counterparts. There was an interaction effect between system and prior knowledge: subjects who scored low in a pre-test benefited the most from the treatment version. This study has demonstrated that a consulting system can significantly reduce the incidence of errors committed by designers engaged in conceptual database modelling. Further, the system is robust and can prevent errors even in the presence of redundancy in user requirements. © 2001 Academic Press

KEYWORDS: Consulting system; expert system; database design; redundancy; requirements specifications; problem-solving.
1. Introduction

Enormous growth in the number and importance of database applications has occurred in the past two decades. Currently, there is a critical shortage of skills in areas like database analysis, design, and administration (McFadden, Hoffer & Prescott, 1998). Thus, the database area is a good candidate for employing knowledge-based systems (e.g. consulting systems), which can be used by novices to improve their performance (Turban, 1992). In this paper, we focus on advanced beginners or novices, who are likely to benefit from knowledge-based systems. This paper does not consider pre-novices, who are also referred to as naive designers and do not have the skill base to benefit as much from such a knowledge-based system (Antony, Batra & Santhanam, 1999).
Several systems that provide consulting support for novice designers during conceptual database design have been reported in the literature (see e.g. Reiner, 1992; Storey & Goldstein, 1993). These systems have introduced many new ideas for approaching the modelling problem, and significant progress has been made in the area. However, none of these systems has been empirically examined for effectiveness. Further, the design of these systems has not been based on empirical studies of the error behavior of novices. This paper reports on a study that addresses these limitations. The study used a tool, COnsulting system for DAtabase design (CODA), that embeds rules and heuristics from several sources to address the general types of errors committed by novices, and is based on a context-specific guidance strategy (Silver, 1990). Its efficacy was compared in a laboratory experiment with a control system that has the same look and feel. In the next section, we discuss various issues pertaining to conceptual data modelling in the presence of redundancy in requirements specifications. This is followed by a brief survey of existing consulting systems and a discussion of their shortcomings. Next, we present a theoretical framework called GEMS, which provides the basis for discussing novice errors in data modelling and for developing hypotheses. This is followed by a description of the laboratory experiment and the results. The discussion of the results, and of the theoretical and practical implications, spells out the contributions of the study.
2. Modelling database requirements

This study focuses on the modelling of database requirements. One of the main objectives of database design is to minimize redundancy (McFadden, Hoffer & Prescott, 1998). In the relational model (Codd, 1970), this is achieved by designing a solution that is normalized and minimal (Maier, 1988). Other database design objectives relate to integrity, response time, turnaround time, efficiency of storage, etc. However, at the logical design stage, minimum redundancy, along with elimination of anomalies and prevention of derived relationships and derived attributes, are the most important objectives. The novice designer is likely to make errors (Fitts, 1964; Storey, Thompson & Ram, 1995), and erroneous designs can lead to significant costs (Davis, 1993; Vessey & Conger, 1994). This is especially true today with the ever-increasing use of data not only via traditional access to systems but via the Internet, intranets and client-server links to data warehouses (Korth & Silberschatz, 1997). Entity relationship (ER) modelling (Chen, 1976) is typically the first step in modelling database requirements. Among the conceptual modelling techniques, the ER representation has been shown to lead to superior performance in many experiments (e.g. Juhn & Naumann, 1985; Jarvenpaa & Machesky, 1989; Batra, Hoffer & Bostrom, 1990; Bock & Ryan, 1993). A representation in the ER format is easily and automatically translated into a popular format that facilitates implementation (e.g. the relational model) using a set of rules (Teorey, Yang & Fry, 1985; Ram, 1995). However, the ER diagram must be precise and unambiguous to preserve the normal forms and, consequently, minimize the redundancy of the stored data. The ER diagram cannot be developed ad hoc or at random. Given that there is likely to be much redundancy in the presentation of the requirements, a literal translation of the requirements to an ER diagram would lead to a redundant, and perhaps wrong, relational representation.
The quality of an ER diagram depends on the accuracy of the entities (including sub-types), the relationships, and the attributes represented. Further, the accuracy in modelling requirements will depend on the designer's experiences and characteristics, and the complexity of the task. Empirical studies have indicated that novice designers do not run into much difficulty in modelling entities and attributes. However, they encounter many problems modelling relationships. Although the common degree of a relationship is binary, some applications require the ternary relationship concept. Although it is not the main aspect of the study, we accept and employ the ternary relationship concept (UML Notation Guide, 1997, p. 61) since, at the logical design stage, relational databases are generally normalized to fourth normal form. The very definition of the fourth normal form implicitly refers to the existence of ternary relationships (Jones & Song, 1996). In simple terms, the fourth normal form states that two multivalued facts that are not independent should be treated as one fact. As an example, if the assignment of books to courses is not independent of the assignment of courses to instructors (i.e. instructors choose books for courses), in ER terms there exists a ternary relationship among the entities instructor, book and course. The ternary relationship has been studied or discussed in a number of articles and books (e.g. Hammer & McLeod, 1981; Teorey et al., 1986; Scheer, 1989; Teorey, 1990; Bock & Ryan, 1993; Shoval & Shiran, 1997; McFadden et al., 1998).

The modelling of relationships is a difficult problem for novices (Batra & Antony, 1994). One strategy to reduce the incidence of errors is to provide the designer with a consulting tool, which can assist during conceptual database modeling. Such a tool must be robust in the presence of redundantly stated requirements. Managing redundancy is an innate problem of database design even in simple problems. As an example, consider a typical statement (adapted from the casebook by McFadden, Hoffer & Srinivasan, 1991) to be modeled: "A sales order involves the sale of a new vehicle along with certain options by a salesperson to a customer." It is assumed that a salesperson as well as a customer is involved in many sales, a new vehicle is sold only once, and an option (e.g. automatic transmission) can be installed on many vehicles. Consider also two reports: the first report shows a list of sales orders and the corresponding customers, and the second report shows a list of salespersons and the corresponding customers to whom sales have been made. Assume that each sales order involves one customer, and each sales order involves one salesperson. In developing the data model (Figure 1), the first report does result in a relationship that is not derived and needs to be modeled, but the second report results in a derived relationship and is not modeled. (Note that a many side in a relationship is represented by a shaded triangle; a one side is represented by an unshaded triangle.) If the second relationship is indeed modeled, then according to the translation rules specified in Teorey et al. (1986) and in Ram (1995), the resulting relations will not be minimal. The two reports have striking similarities, yet are treated differently in data modelling. Note that the relational translation of the ER diagram in Figure 1 is:

Customer (CNum, Name, Phone)
Salesman (SMCode, Name, Title)
Vehicle (VIN, Make, Model, Year)
SalesOrder (Order#, Date, CNum, SMCode, VIN)
Option (Opt#, Descr, Price)
VehicleOptions (VIN, Opt#)
FIGURE 1. ER model for vehicle problem.
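The derivability of the second report can be made concrete with a small sketch. The data and names below are illustrative only; the point is that the salesperson-customer pairs fall out of SalesOrder by simple projection, so storing them as a separate relationship would be redundant.

```python
# Illustrative SalesOrder tuples: (order_no, date, cnum, smcode, vin).
sales_orders = [
    ("O1", "2000-06-01", "C1", "S1", "V1"),
    ("O2", "2000-06-02", "C2", "S1", "V2"),
    ("O3", "2000-06-03", "C1", "S2", "V3"),
]

# Report 1: sales orders and their customers -- a direct, stored fact.
report1 = {(order, cnum) for order, _, cnum, _, _ in sales_orders}

# Report 2: salespersons and their customers -- derivable by projecting
# SalesOrder, so it should not be modeled as a separate relationship.
report2 = {(smcode, cnum) for _, _, cnum, smcode, _ in sales_orders}

print(sorted(report2))  # [('S1', 'C1'), ('S1', 'C2'), ('S2', 'C1')]
```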
Even if redundancy is ignored, there is still a serious problem, from a novice's viewpoint, with this seemingly simple application. How exactly does one model the statement "A sales order involves the sale of a new vehicle along with certain options by a salesperson to a customer"? The customer purchases options, so is there a relationship between the two entities? Or is there a (ternary) relationship between customer, vehicle and option? Or is there a five-way relationship among order, salesperson, customer, vehicle and option? Which statements need to be modeled directly, and which statements are derived? Note that any of the statements corresponding to the relationships (e.g. a customer purchases a vehicle along with options) are perfectly valid statements. From a requirements perspective, none of them is redundant. A closer look indicates that there could be a large number of relationships among a relatively small number of entities. This problem essentially requires the determination of possible combinations, given a specified number of entities. With five entities, there are C(5,2) = 10 binary relationships, C(5,3) = 10 ternary relationships, C(5,4) = 5 four-way relationships and C(5,5) = 1 five-way relationship. Each relationship can have several connectivities. Although four- and five-way relationships are very rare in practical situations, the number of possible binary and ternary relationships by itself is quite large (120 to be exact, once connectivities are counted) because of the combinatorial growth in the number of relationships with respect to the number of entities. Among this large set of relationships, only 3 or 4 will be in the correct solution. It is apparent that novice designers can be affected by the combinatorial complexity of the problem. A protocol study of novice designers seems to confirm this observation (Batra & Antony, 1994): for a problem involving 5 entities, 31 subjects came up with 26 distinct solutions. A large number of errors were caused by "literal translation" of a portion of the requirements into database structures. In other words, the subjects included many derived relationships in their data models.
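The figure of 120 can be reproduced under one plausible counting scheme (our assumption, since the text does not spell it out): four connectivity variants for a binary relationship (1:1, 1:N, N:1, M:N) and two choices (one or many) for each side of a ternary.

```python
from math import comb

entities = 5
# 10 entity pairs and 10 entity triples among five entities.
binary_sets, ternary_sets = comb(entities, 2), comb(entities, 3)

# Assumed connectivity counts: 4 variants per binary (1:1, 1:N, N:1, M:N)
# and 2**3 = 8 variants per ternary (each side 'one' or 'many').
total = binary_sets * 4 + ternary_sets * 8
print(binary_sets, ternary_sets, total)  # 10 10 120
```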
In the sales order and vehicles example, a typical case of literal translation could arise as follows: a subject visualizes a customer purchasing a vehicle with options and (erroneously) models a ternary relationship among customer, vehicle and option. Another major cause of errors identified by Batra and Antony (1994) was the "anchoring" bias. For example, if a subject starts with the erroneous ternary relationship among customer, vehicle and option, additional information may not help correct the design. The designer may be so anchored to an initial solution that it becomes difficult to extricate from the erroneous design. Thus, a designer might realize that since an option naturally belongs to a vehicle, it would be prudent to decompose the ternary into a binary; however, the designer would then probably end up (erroneously) relating vehicle with customer, instead of relating the two entities separately to a transaction entity like invoice. The subject would start with an initial solution, but be unable to revise the representation sufficiently to reach the correct solution. According to Batra and Antony (1994), cognitive biases accounted for the majority of the errors. These authors did not introduce redundancy in the statements; yet, a significant number of relationships modeled by subjects were derived relationships. Thus, redundancy in requirements specifications is not the sole cause of derived relationship errors. However, redundancy can easily bias a novice by triggering the literal translation and anchoring phenomena, eventually leading to derived relationships.

The problem of redundancy is well recognized in the theory of relational databases (Maier, 1988). Algorithms have been devised to come up with a minimal relational solution, given a set of functional dependencies. A functional dependency is generally represented as A → B, and implies that for a given value of A, there is at most one value of B. The problem with these approaches is coming up with the complete set of functional dependencies, which are defined among attributes instead of among entities. Since even simple applications will have a large number of attributes, the task of defining a complete set of dependencies is tedious. Unlike the ER model, where the relationships are defined among abstract objects called entities, the relational model does not provide an abstraction mechanism, and, therefore, does not afford a practical solution.
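The algorithms alluded to here rest on the standard attribute-closure computation. A minimal sketch, with functional dependencies written as (left-hand side, right-hand side) pairs of attribute sets and illustrative attribute names from the vehicle example:

```python
def closure(attrs, fds):
    """Compute all attributes derivable from `attrs` under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"Order#"}, {"CNum", "SMCode", "VIN"}),  # Order# -> CNum, SMCode, VIN
       ({"VIN"}, {"Make", "Model", "Year"})]     # VIN -> Make, Model, Year
print(sorted(closure({"Order#"}, fds)))
# ['CNum', 'Make', 'Model', 'Order#', 'SMCode', 'VIN', 'Year']
```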
3. Review of consulting systems for database design

An extensive list of expert tools for database design is reported in the survey by Storey and Goldstein (1993). There is also some coverage of such tools in Reiner (1992). Based on the analysis of these systems, Storey and Goldstein (1993) concluded that "although a great deal of work has been done in this area, the existing systems are generally at quite an early stage of development" and "they have been exposed to very little real-world, or even laboratory testing". Since the Storey and Goldstein (1993) paper, there has not been much progress in enhancing such tools, or in evaluating them in an empirical test. Among the software mentioned in the survey, the most promising expert tool for database design seems to be the View Creation System (Storey & Goldstein, 1988). The tool engages the designer in a dialog to determine entities, attributes and relationships. However, no empirical study has shown the viability of the tool. In fact, none of the tools mentioned in the Storey and Goldstein (1993) survey has been tested for efficacy in a laboratory or field environment. Other consulting systems are EDDS (Choobineh,
Konsynski, Mannino & Nunamaker, 1988), CARS (Demo & Tilli, 1986), Modeller (Tauzovich, 1989), Intelligent Interview Systems (Kawaguchi, Taoka, Mizoguchi, Yamaguchi & Kakusho, 1986), and SECSI (Bouzeghoub, Gardarin & Metais, 1985). There are five main shortcomings of these tools: (1) there is no guarantee that the conceptual data models will translate to normalized relations, (2) there is no mechanism to prevent derived relationships from being modeled, (3) the tools address binary relationships only (and could violate the fourth normal form), (4) the tools are not based on findings from empirical studies of novice designers and (5) there is no empirical validation. The development of CODA and the consequent empirical study are intended to alleviate these shortcomings.

There are systems that do not have the first three shortcomings but that have not been empirically evaluated. Ram and Curran (1989) proposed Synthesizer, which accepts the defined functional dependencies, determines a minimal cover, and determines relations in the fourth normal form. Ram (1995) has proposed FDEXPERT, which can infer functional dependencies from ER diagrams. Thus, FDEXPERT can act as a front-end to Synthesizer. This is a promising line of research. If reasonably accurate ER diagrams can be developed, these tools should lead to rigorous solutions.

We developed the tool CODA based on many sources, including findings from a study that examined novice error behavior (Batra & Antony, 1994). After studying these errors, Batra and Zanakis (1994) devised rules and heuristics to prevent novice errors in data modeling. Our tool incorporates these rules and heuristics. The notation, basic concepts like entity, degree and connectivity of a relationship, and the translation rules (from ER to relational) are based on the paper by Teorey et al. (1986). The translation rules are also captured in a paper by Song, Jones and Park (1993). The translation rules set the stage for a "correct" ER diagram, since one would expect the translation to be normalized and minimal. To allow self-monitoring, the modeling of entities and of the connectivity of a relationship is facilitated by providing feedback in natural language (Batra & Sein, 1994). Basic concepts like primary key and attributes are described in popular textbooks. A detailed description of the tool is given in Appendix D.
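The flavor of these ER-to-relational translation rules can be conveyed by a simplified sketch (our own reduction of the rules; the published versions also handle 1:1 connectivities, weak entities and ternary key selection):

```python
def translate(entities, relationships):
    """entities: {name: [attrs]}, first attribute is the primary key.
    relationships: list of (participants, connectivities), where
    connectivities[i] is '1' or 'N' for participants[i]."""
    relations = {name: list(attrs) for name, attrs in entities.items()}
    for participants, conn in relationships:
        if len(participants) == 2 and "1" in conn:
            # 1:N binary: migrate the key of the 'one' side to the 'many' side.
            one = participants[conn.index("1")]
            many = participants[1 - conn.index("1")]
            relations[many].append(entities[one][0])
        else:
            # M:N binary or any ternary: create a new relation whose key is
            # composed of the participants' keys.
            relations["_".join(participants)] = [entities[p][0] for p in participants]
    return relations

ents = {"Customer": ["CNum", "Name"], "SalesOrder": ["Order#", "Date"]}
print(translate(ents, [(("Customer", "SalesOrder"), ("1", "N"))]))
# {'Customer': ['CNum', 'Name'], 'SalesOrder': ['Order#', 'Date', 'CNum']}
```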
4. Research framework

It is well accepted that performance in problem-solving is affected by the limitations of the cognitive resources that can be devoted to the task (Norman & Bobrow, 1975). With external assistance, users can perform better and reduce errors (Anderson, Boyle & Reiser, 1985; Carroll & Kellogg, 1989). The type of external assistance needed can be determined by using a research model of human errors and by determining the nature of human errors in a given context. This paper employs the Generic Error-Modeling System (GEMS) model (Reason, 1990), which is an extension of the human error models proposed by Rasmussen (1986) and Rouse (1981). The model divides human cognitive activity into three performance levels: skill-based, rule-based and knowledge-based; and, correspondingly, human errors into three types: slips (and lapses), rule-based (RB) mistakes and knowledge-based (KB) mistakes. The connection between the three levels is shown in Figure 2 (adapted from Reason, 1990).
FIGURE 2. The Generic error modeling system (GEMS) framework.
Slips correspond to routine actions, while rule- and knowledge-based errors correspond to problem-solving activities. Both the slip and rule-based levels are governed by automatic processors in the form of schemata and stored rules, while the knowledge-based level requires limited, conscious processes. Slips and rule-based errors are largely predictable; knowledge-based errors are variable. The detection of an error in the case of slips is usually rapid and effective, but in the case of rule- and knowledge-based errors it is difficult, and often only achieved through external intervention. The problem-solving elements of GEMS are based (Reason, 1990) upon a recurrent theme in the psychological literature, namely that "humans, if given a choice, would prefer to act as context-specific pattern recognizers rather than attempting to calculate or optimize" (Rouse, 1981). The key feature of GEMS is that, when confronted with a problem, human beings are strongly biased to search for and find a prepackaged solution at the RB level before resorting to the far more effortful KB level, even when the latter is demanded at the outset. In relation to Figure 2, this means that they are inclined to exit from the decision box (Is the pattern familiar?) along the affirmative route.
Only when humans become aware that successive cycling around this rule-based route is failing to offer a satisfactory solution will the move down to the KB level take place. The factors determining the transition are less clear-cut (Reason, 1990), although they are likely to depend on a complex interaction between subject uncertainty and concern. In the database-modeling context, this presents a significant problem, since the goal itself may not be clear-cut: after all, a novice does not know if the solution is correct. A conceptual database design by itself does not provide any feedback or cues to indicate to the designer any cause for concern. This is unlike, say, a query language situation, where the result of running a query provides cues to the correctness of the query. A designer engaged in data modeling may continue to work at the rule-based level when a switch to the knowledge-based level is advantageous. Thus, an external consulting source is required to inform the designer if the situation is a cause for concern and if there is a need for more analysis. Whenever the consulting source can detect the possibility of an error, it should guide the designer. The external consulting source would be more beneficial if it could reduce the complexity at the outset so that rule-based behavior can be used effectively. This is to reduce information overload. The novice finds it difficult to detect countersigns if there is an abundance of information (Billman, 1983). Further, information overload is likely to lead to selectivity (Evans, 1983), that is, giving attention to the wrong features, or not giving attention to the right features. The consulting system CODA is intended to provide external help to the novice designer by reducing the information overload, and by providing guidance, whenever possible, if it detects that the designer is moving in the wrong direction.

Given the GEMS model, one can now explain the expected effectiveness of CODA by the following line of reasoning: (1) conceptual and logical data modeling is a complex task for novices; (2) the ER and relational models do not provide a step-by-step approach to manage the complexity; (3) when novices engaged in data modeling encounter complexity, they resort to simplistic rules and heuristics that minimize cognitive strain but lead to erroneous solutions; (4) a consulting tool should provide guidance to help the novice reduce the complexity by dividing the problem into manageable steps, and by preventing erroneous choices through feedback whenever possible. These arguments are detailed below.

(1) Perceived complexity of the task: empirical studies have shown that novice designers commit a number of errors in the conceptual and logical modeling phases (Juhn & Naumann, 1985; Jarvenpaa & Machesky, 1989; Bock & Ryan, 1993). As discussed earlier, combinatorial complexity arises because, given a limited number of entities, a large number of relationships are possible (Dullea & Song, 1997). The modeling of even simple statements like "an order involves a salesperson who sells to a customer" can cause considerable difficulty for novices, who may not be able to discern whether the relationship is ternary, or whether there are two (or three) binary relationships (Batra & Antony, 1994).

(2) The ER and relational models do not provide a step-by-step approach to manage complexity: although the ER model presents definitions for entity, attribute, and degree and
connectivity of a relationship, it does not provide a step-by-step approach to help a novice with conceptual data modeling. The use of basic definitions and operations often leads to simplistic approaches. The relational model provides the normalization rules. However, normalization depends on stipulating dependencies, which is itself an effortful exercise. There is no evidence that novice designers define these dependencies before attempting normalization. There is evidence (Batra et al., 1990; Juhn & Naumann, 1985; Jarvenpaa & Machesky, 1989; Bock & Ryan, 1993), however, that novice designers commit significantly more errors using the relational model than using the ER model, suggesting that an abstraction-based approach (like ER) is more likely to lead to promising results.

(3) Novices resort to simplistic rules and heuristics that minimize cognitive strain but lead to erroneous solutions: the variety of errors observed in Batra and Antony (1994) seems to indicate that the novices were attempting to minimize cognitive strain. This error behavior can be explained by the GEMS model shown in Figure 2. In modeling relationships, the subjects should have been working at the KB level, but it seems they were primarily working at the RB level. Moreover, they seemed to be using "strong but wrong" strategies, e.g. literal translation. Novices are generally aware of simple relationship rules only. People have a marked tendency to think in linear sequences instead of in causal nets (Rasmussen, 1986). If a subject observes that two entities are associated, there is a pattern match for modeling a relationship (RB level). Instead of focusing on whether the correct entities are related (KB level), the relationship is assumed and the focus shifts to less important but more manageable issues like the connectivity (e.g. whether the relationship is one-many or many-many). There is no motivation to analyse the relationship in the complete context and determine if it is derived. As long as all entities are somehow linked, the overall goal seems to have been met. However, as has been argued earlier, the mere fact that two entities are associated does not imply that a relationship should be modeled. Thus, there is a need for a set of rules and heuristics that prevent such errors. Further, whenever the system can determine the possibility of an error, the novice should be warned so that he or she can switch to the KB level and attempt to resolve the situation.

(4) Supplanting declarative knowledge by practical rules and heuristics: Senders and Moray (1991) recommend improving design to minimize the impact of errors, as well as providing feedback so that it is possible for the operator to recognize that an error has occurred. Batra and Zanakis (1994) devised rules and heuristics that could be applied in developing ER diagrams that would lead to normalized and minimal solutions. These rules and heuristics reduce the search space for the designer, and attempt to detect errors and provide feedback. They justified the rules based on theoretical constructs (e.g. Armstrong's axioms) presented in Maier (1988), and employed the notational scheme, the definitions of entity and relationship, the dependencies implied by the degree and connectivity configuration of a given relationship, and the translation rules from ER to relational presented by Teorey et al. (1994). Note that similar translation rules are also available in the papers by Song et al. (1993) and Ram (1995).
The approach indicates to the designer which kind of relationship to model first, and, if a certain relationship has been modeled, what kinds of relationships are precluded, thereby reducing the search space. The rules and heuristics garnered from various sources provided a practical
approach to tackling such a problem. The rules included in CODA are presented in Appendix D. One may question the need for the tool, since the rules and heuristics can alternatively be taught to the designer. Although the rules and heuristics can be taught, the designer may not be able to recall them at will while solving the design problem. So, the system steps in with reminders and help messages. The designer is relieved of the burden of remembering the rules or looking them up in a manual. This is especially the case when the designer has made an error and the system can detect it. A warning message can prevent the short-circuit from "Is the pattern familiar?" to "Apply stored rule" in Figure 2. Thus, the tool can provide constant testing of the evolving design, which has been shown to lead to more accurate data representations (Srinivasan & Te'eni, 1995).

The four-step argument detailed so far is the basis for the main hypothesis in the study. It is natural to believe that if a system is redesigned to include error-reducing features, then the number and frequency of errors will decrease. That indeed should be the case, but only with respect to the old errors (Senders & Moray, 1991). A redesigned system introduces the possibility that new and different errors may occur. Hence, the new and existing systems must be variables in an experimental comparison for efficacy. The most important independent variable in the study is the type of system; a second independent variable is task. The dependent variable is designer performance measured by modeling correctness. Subjects' prior knowledge was measured as a covariate; other designer characteristics were randomized. Two versions of the system were considered: the consulting system CODA (treatment) and control. Two versions of task were considered: one in which a ternary relationship was included and redundant information was provided by binary statements (Type A or Task A), and the other in which only binary relationships were included and redundant information was provided by ternary statements (Type B or Task B). Appendix A describes the tasks and Appendix B shows the respective solutions.

The main hypothesis tested the efficacy of the consulting system CODA. The first hypothesis, expressed as an alternate hypothesis, was:

H1: Modelling correctness will be higher for the consulting system CODA group as compared to the control group.

The second main effect was due to the Task variable. This was not the main question under scrutiny, but still one of interest, since the problems involved redundant information. Previous studies (e.g. Batra & Antony, 1994) have indicated two types of serious errors: the modeling of a ternary (or higher degree) relationship as binary relationships, and the modeling of binary relationships as a ternary (or higher degree) relationship. A consulting tool in this domain should be robust to both types of errors. The issue was: will the presence of redundant binary statements, when the actual relationship is ternary (Task type A), result in better designer performance as compared to the presence of redundant ternary statements when the actual relationships are binary (Task type B)? Task type A was exemplified by the SSI problem, and Task type B by the SODA problem (see Appendix A).

It was expected that it is more difficult to provide consulting support for tasks of Type B. This can be explained by the frequent prevalence of literal translation errors in novice designers' solutions (Batra & Antony, 1994). One of the problems in database
design is the mismatch between the number of entities in a statement and the degree of the resulting relationship. This can become a problem in distinguishing binary and ternary relationships. A ternary relationship requires a ternary statement, but a ternary statement does not necessarily imply a ternary relationship. The presence of a ternary statement can lead to a pattern match (see Figure 2) at the rule-based level, the consequent modeling of a ternary relationship, and the prevention of a switch to the knowledge-based level. In relational terms, the issue pertains to the abuse of the fourth normal form, which allows a single relation between two facts only if they are multivalued and the facts are not independent. These are the same considerations for the use of a ternary relationship. In Task A, there is a ternary statement, "A programmer may work on many platforms and on many projects, but for a specific project, a programmer works on only one platform", and it does imply a ternary relationship. In Task B, there is a ternary statement, "To attend a seminar, a member must submit at least one paper", but the ternary relationship is precluded by the presence of a one-many relationship presented earlier: "A paper can be presented in one seminar only". However, the literal translation tendency of novice designers can bias the subjects to commit more errors in the case of Task B. This led to the following hypothesis:

H2: Modelling correctness using the tool (treatment or control version) will be higher for Task type A as compared to Task type B.
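As we understand the heuristics described above, the core safeguard can be sketched as follows: once a binary one-many relationship links two entities, a later ternary relationship covering both should draw a warning. The function and data names are ours, not CODA's.

```python
def warn_on_ternary(proposed_entities, one_many_pairs):
    """Warn if a proposed ternary includes a pair of entities that already
    participate in a binary one-many relationship (sketch of the heuristic)."""
    for a, b in one_many_pairs:
        if a in proposed_entities and b in proposed_entities:
            return (f"Warning: {a}-{b} already have a one-many relationship; "
                    f"a ternary over {sorted(proposed_entities)} is precluded.")
    return "No conflict detected."

# Task B situation: SEMINAR (one) to ARTICLE (many) is modeled first.
one_many = [("SEMINAR", "ARTICLE")]
print(warn_on_ternary({"MEMBER", "ARTICLE", "SEMINAR"}, one_many))
```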
5. Method

A laboratory experiment was conducted using a treatment and a control version of CODA. A pilot study had been conducted, and the tasks and training scripts were consequently refined. The profile of subjects, a description of the pre-test, the schedule of experiment sessions, the procedure followed during the experiment, the differences in training between the treatment and control groups, the differences in the two tasks, and the scheme for grading are presented in this section.

5.1. SUBJECTS
Seventy-two students (excluding those who dropped out during the experiment) enrolled in a required database course for undergraduate MIS majors completed the study. These students came from two sections of about 40 students each. The subjects were clearly novices, since they had typically taken one sophomore-level computer skills course and one junior-level introductory information systems course. According to Lord and Maher (1991), novices have no more than a few hundred hours of experience in an area, and pre-novices have only a few hours of experience. One class for one section was held on Monday evenings and another for the second section on Tuesday afternoons. A student could earn up to 10 extra points in the mid-term exam, which constituted 30% of the overall grade. In other words, they could increase their overall grade by up to 3%. Although they could earn these points by doing an alternative project, most students consented to participate. They were told that they would be trained to use a software tool during the experiment. A few subjects were lost for one of the following reasons: not showing up (2 students), quitting before the
completion of the experiment (1 student), or being asked to quit (4 students) because they showed considerably slow progress and had not completed even one task by the estimated time, which was 3 h after the starting time of the experiment. A typical session was complete in about 3.5 h (which included filling in questionnaires, training, practice time and the tasks).

5.2. PRE-TEST
The experiment was held immediately after the mid-term exam. Five class meetings had been held before the mid-term. In a typical two-and-one-half-hour class during the period before the mid-term, about a third of the meeting time was devoted to conceptual and logical modeling; the rest of the time was devoted to non-procedural database languages and corresponding hands-on experience in the lab. Students had been lectured on developing ER diagrams and on relational concepts like normalization. The mid-term exam had two questions on conceptual modeling: the first question had three binary relationships, and the second involved a ternary relationship. The total score on these two questions was taken as the pre-test score. The performance in the pre-test was quite comparable between the Monday and Tuesday groups and between the treatment and control groups.

5.3. PROCEDURE
Four sessions were held in a computer lab: two each for the Monday and the Tuesday class. Two of the four sessions were held on two consecutive Saturdays, and the other two on regular class days sandwiched between the two Saturdays. During the experiment, the graded mid-term exams were returned to the students along with the correct solutions to the questions. Subjects were asked to fill in a "Background Questionnaire". They were then trained in the software assigned for the session. A training script was provided. A routine customer-order-product (two binary relationships) example was taken up to illustrate how the software worked. Next, a ternary relationship example, which was the second problem from the pre-test (i.e. from the mid-term exam), was represented using the software. Once the training was over, each subject was provided with a disk containing the software. The subject was instructed to model any two entities and a binary relationship between them from the first problem in the mid-term exam. This ensured that each subject knew how to use the software. The subject was then handed one of the two problems: SSI (for Task A) or SODA (for Task B). The assignment was not random, but based on minimizing the chances of copying given the layout of students in the lab. The experimenter proctored the subjects. About half the subjects started with the SSI problem and the remaining subjects with the SODA problem. When a subject finished an exercise, s/he was given the second exercise. Subjects returned the materials handed out before they left.

5.4. TRAINING DIFFERENCES
The two versions of the software, treatment and control, were virtually identical in look and feel. The treatment version included the rules and heuristics described in
Appendix D. Four screenshots from the treatment version are also illustrated in the Appendix. A limited period of extra training (about 20-25 min) was provided to the treatment group to demonstrate the use of the rules and heuristics. Two short examples seemed adequate. It was required to show that a binary one-many relationship needed to precede a ternary relationship (of any connectivity), and that a ternary relationship needed to precede a binary many-many relationship. Each demonstration took about 10-15 min. The subjects were provided just enough training that they would be aware that a sequence of relationships existed, and that the sequencing had been implemented in the system. The subjects were requested to make use of the advice given by the system.

The control group had to be provided an alternative means for distinguishing between the binary and the ternary relationships. A conventional approach was followed for this group. A number of examples illustrated during the semester had made this distinction. This had been revisited when discussing the fourth normal form. Further, a number of students had committed errors like showing an extra binary relationship in addition to the ternary relationship (as in the second problem in the exam). This was discussed in the training session when modeling the problem using the software. The control system did not formulate, or provide help in formulating, questions for the subject.

5.5. TASKS
Two tasks, A and B, named SSI and SODA in this study, were selected for the experiment (see Appendix A). The SSI task (Type A) involved a binary one-many, a binary many-many, and a ternary relationship. The other problem, SODA (Type B), had three binary relationships. Only one of the six relationships in the two tasks was ternary, since such relationships are relatively infrequent.

5.6. GRADING OF THE ER DIAGRAMS
Two graders evaluated each subject's ER diagrams independently. Each entity in the ER diagram can be assigned one of six grade codes and each relationship one of three grade codes, as shown in the grading scheme (see Appendix C). Hence, it was possible to compute Cohen's (1960) kappa as a measure of inter-rater reliability. The overall kappa (combining entities and relationships) for the SODA problem grading is 0.89, and the measure of inter-rater agreement for relationships alone is 0.95. Both statistics are significant at α = 0.01. The measures for the SSI task are 0.89 (overall) and 0.93 (relationships only), and both are significant at α = 0.01. For the analysis, the average of the two scores is used as the value of the dependent variable.
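For reference, Cohen's kappa corrects observed agreement for chance agreement; a minimal sketch with made-up grade codes:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's (1960) kappa for two raters assigning nominal codes."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[code] * c2[code] for code in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical grade codes assigned by the two graders to six relationships.
print(round(cohens_kappa(list("AABCBA"), list("AABCCA")), 2))  # 0.74
```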
6. Results

The main purpose of the study was to identify the effects of the rules and heuristics on task performance. The independent variables were the system type (consulting system CODA or control) and the task (Type A or B). Subjects' performance in the mid-term exam was the measure of their pre-test data modeling knowledge. The dependent variable for measuring designer performance was the model correctness score.
6.1. PRE-TEST DATA MODELING KNOWLEDGE LEVEL (KL)
Since some of the variation in the dependent variable can be explained by subjects' data modeling knowledge, it is desirable to remove such variation. The measure of subjects' prior knowledge is based on their performance in a data-modeling task in the mid-term exam. Although their score on the mid-term questions can be considered a continuous variable, the number of distinct score values is small. It was therefore decided to use this score as the basis for stratification of the data set instead of using it as a covariate. Subjects who scored above the average were classified as high KL, and subjects who scored below the average were classified as low KL. The stratification strategy was considered preferable to covariance analysis because our principal interest was in reducing the error variance rather than removing bias from the estimates of treatment effects (Kirk, 1995), and because Feldt (1958) found that the randomized block design is superior when the correlation is less than 0.4 and the analysis of covariance is better when the correlation is greater than 0.6. The correlation in our study is 0.25.

The summary statistics for the dependent variable grouped by high and low KL are shown in Table 1. The mean represents the average of scores, which could range from 0 to 100. The label n represents the count of subjects for a specific combination of levels of factors. "High KL" represents high knowledge-level subjects and "low KL" represents low knowledge-level subjects. For example, in the treatment group, 19 subjects were classified as high KL and 17 as low KL, with mean scores of 90.4 and 78.3, respectively, for the SSI task.
TABLE 1
Data model correctness score: summary statistics

                                Treatment                     Control
  Task               High KL   Low KL    All      High KL   Low KL    All
  (A) SSI     Mean     90.4      78.3    84.7       83.4      56.4    69.1
              n          19        17      36         17        19      36
              S.D.     15.3      21.3    19.1       27.8      35.7    34.6
  (B) SODA    Mean     52.3      61.3    56.6       52.6      42.5    47.2
              n          19        17      36         17        19      36
              S.D.     11.0      27.8    20.9       18.1      11.9    15.8
  Both tasks  Mean     71.4      69.8    70.6       68.0      49.5    58.2
              n          19        17      36         17        19      36
              S.D.     23.3      25.9    24.4       27.9      27.2    28.9
TABLE 2
Effects of system, task and knowledge level on performance: analysis of variance

  Source of variation          SS      DF     MSS       F      p value
  Knowledge level             4119      1     4119     8.31    0.0052
  System                      5064      1     5064    10.22    0.0021
  Knowledge level × System    2585      1     2585     5.22    0.0254
  Task                       22464      1    22464    45.33    0.0001
  Subject (System)           37843     68      556     1.12    0.3154
  System × Task                352      1      352     0.71    0.4023
The treatment group performed substantively better than the control group in both tasks. It can also be noted that the Task B (SODA) scores (56.6 and 47.2) were lower than the Task A (SSI) scores (84.7 and 69.1) for both groups. The high KL subjects performed consistently better than the low KL subjects, except in the case of high KL subjects in the treatment group solving the SODA task; there, the low KL subjects actually did better than the high KL subjects (61.3 vs. 52.3). When we compare performance between groups, the treatment group subjects consistently outperformed the control group subjects, except in the case of high KL subjects solving the SODA problem; there, the treatment and control groups performed about equally (52.3 and 52.6).

It may be noted that almost all the subjects modeled the entities correctly. The overall score for entities was 24.31 out of 25 in the first task and 24.46 out of 25 in the second, that is, almost perfect scores. Thus, the differences arose because of the lower scores of the control group in modeling relationships, which was the main aspect of data modeling under scrutiny.

Given that the sample size was reasonably large, and the standard deviations did not vary appreciably, the F test was assumed robust with respect to possible violations of normality or homoscedasticity. Since each subject solved two tasks using one of the two software systems, the analysis of variance was done using a mixed-factor design. In Table 2, system is a between-groups factor and task is a repeated-measures factor. KL represents the knowledge-level factor, and Subject (System) refers to the nesting of subjects within system. The details of the ANOVA are shown in Table 2. The F-values, when significant, are usually quite large, and render any effect of non-normality or heteroscedasticity on p-values an academic exercise.

The effect of system on performance was significant (p < 0.05). This supports the basic hypothesis of the study (H1) that the use of the knowledge-based system CODA reduces the incidence of designer error. Further, as is evident from both tables, performance for either group was significantly higher for Task A than for Task B (H2). There was no interaction between task and system. Although the knowledge-based version led to a somewhat better improvement for Task A (about 16 points) than for Task B (about 10 points), the analysis indicated (p = 0.40) that task did not interact with the system variable.
FIGURE 3. Interaction of system and knowledge levels: low vs. high KL.
However, an interesting finding of this analysis is the presence of an interaction of system with knowledge level (p = 0.025). A plot of the mean scores is shown in Figure 3. The plot indicates that the low knowledge-level subjects benefit more from the system than the high knowledge-level subjects.
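The shape of Figure 3 can be reproduced from the "Both tasks" means in Table 1; a minimal matplotlib sketch:

```python
import matplotlib.pyplot as plt

systems = ["Control", "Treatment"]
low_kl = [49.5, 69.8]    # mean scores over both tasks, from Table 1
high_kl = [68.0, 71.4]

plt.plot(systems, low_kl, "o-", label="Low KL")
plt.plot(systems, high_kl, "^-", label="High KL")
plt.ylabel("Mean data model correctness score")
plt.legend()
plt.show()
```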
6.2. ERROR ANALYSIS
We were interested in finding out whether the knowledge-based system was effective in preventing errors in modelling relationships. To accomplish this, each relationship in the subjects' solutions was classified as (1) correct (i.e. part of the correct solution) or (2) incorrect. Each incorrect relationship was further classified as (a) a connectivity error, (b) a high degree relationship modeled as low degree, (c) a low degree relationship modeled as high degree, or (d) a derivable relationship. The results are shown in Table 3. The entries represent the percentage of cases with respect to the total number of relationships for each task and control/treatment combination.

It may be observed that the total number of relationships modelled by the control group is larger than that of the treatment group for both tasks. In the Task B (SODA problem) case, this is caused by a larger number of extra relationships, while in the Task A (SSI problem) case, it is caused by the decomposition of the ternary into binary relationships. In either case, the cause seems to be the same: the control group seemed to be biased by the redundancy in the task description. A comparison of subject performance in the tasks suggests that the consulting tool is more effective for Task A type problems. There is a substantive difference between the treatment and the control groups in the number of connectivity errors and degree errors for such problems.
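Our reconstruction of the classification logic behind Table 3, comparing each modeled relationship against the reference solution (the actual grading scheme is in Appendix C; names here are illustrative):

```python
def classify(modeled, solution):
    """Classify a modeled relationship: (entity_set, connectivity) pairs."""
    ents, conn = modeled
    for sol_ents, sol_conn in solution:
        if ents == sol_ents:
            return "correct" if conn == sol_conn else "connectivity error"
        if ents < sol_ents:
            return "high degree as low degree"  # e.g. ternary split into binaries
        if ents > sol_ents:
            return "low degree as high degree"  # e.g. binaries fused into ternary
    return "extra derived/derivable relationship"

solution = [(frozenset({"MEMBER", "ARTICLE"}), "1:N"),
            (frozenset({"SEMINAR", "ARTICLE"}), "1:N")]
modeled = (frozenset({"MEMBER", "ARTICLE", "SEMINAR"}), "N:N:N")
print(classify(modeled, solution))  # low degree as high degree
```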
TABLE 3
Types of relationship errors

                                                 Task A = SSI          Task B = SODA
  Relationship error type                      Control  Treatment    Control  Treatment
  No error                                       65%       87%         30%       31%
  Connectivity error                             12%        5%         30%       34%
  Low degree as high degree error                 2%        2%         30%       33%
  High degree as low degree error                22%        4%          -         -
  Extra derived/derivable relationship error      0%        2%          9%        2%
  Total no. of relationships                     129       112          96        86
7. Discussion of results and implications of the study

The results clearly show that the consulting tool CODA leads to an improvement in the performance of novice designers engaged in database design. This is the first knowledge-based database design tool that has been empirically demonstrated to improve non-expert designer performance in the presence of redundantly stated application requirements. The fact that the study was conducted in a controlled laboratory environment lends it high internal validity. Further, the presence of redundant statements in the problems provides a fair degree of external validity, since it is consistent with everyday experience. Presenting only the most relevant statements tends to "give the solution away" and deals with the database design problem only at a superficial level.

The most interesting result was the interaction between the variables pre-test knowledge and experiment score, as depicted in Figure 3. The tool benefited the low knowledge (low KL) subjects considerably more than the high knowledge (high KL) subjects. In fact, both the high and the low KL treatment groups seem to converge to similar performance. There was no such convergence for the control groups. Thus, CODA seems to be most useful to those who need it the most. This interaction effect does not mitigate the significance of the difference between the treatment and control group scores.

The improvement in score was more pronounced in Task A (SSI problem) as compared to Task B (SODA problem). An analysis of the Task B solutions revealed that many subjects from both the treatment and control groups had modeled a ternary relationship among MEMBER, ARTICLE and SEMINAR based on the statement "A member may submit several articles and attend many seminars", although according to the heuristics, the binary one-many relationship between ARTICLE and SEMINAR precluded a ternary relationship. There were 29 such errors (30%) in the control group and 28 (33%) in the treatment group (see Table 3). The treatment subjects, despite being
furnished with a sequence for tackling the relationships, were biased by the presence of this statement and jumped to the ternary solution. This appears to be another instance of cursory "pattern matching", with a corresponding movement from the decision box (Is the pattern familiar?) along the affirmative route (see the GEMS model depicted in Figure 2). The following statement, as was hypothesized, can explain the relatively inferior performance in Task B: a ternary relationship requires a ternary statement, but a ternary statement does not necessarily imply a ternary relationship. In Task A, the ternary statement "A programmer may work on many platforms and on many projects, but for a specific project, a programmer works on only one platform" implied a ternary relationship. In Task B, as explained before, the ternary relationship is precluded by the presence of a one-many relationship, but the presence of a ternary statement did bias a large number of subjects. The consulting tool was designed so that if a one-many relationship is specified, the system will warn the designer if a ternary relationship involving the same entities is then specified. Subjects had been instructed to model binary one-many relationships before attempting a ternary relationship. Yet, the overriding tendency to map a ternary statement to a ternary relationship made the subjects anchor to an incorrect solution, preempting help from the consulting system.

The practical implication of the study is that the consulting tool CODA can improve the data modeling performance of novice designers. The beneficiaries would also include end-user programmers, who are certainly novices, and who use the largest variety of software and perform the largest number of tasks (Schiffman, Meile & Igbaria, 1992). The tool can even be used as a validation aid by more experienced designers. The demand for database designers is rising steeply (see e.g. the February 1, 1999 Newsweek article "Your next job"), and in the absence of any apparent sign of concomitant supply, there is a good likelihood that the market will encounter a large number of novice designers.

The study by Batra and Antony (1994) indicated that although novice designers should work at the knowledge-based level, the need to minimize cognitive strain leads to rule-based behavior based on superficial resemblance of patterns, and to "strong but wrong" rules, thus resulting in a variety of errors. The consulting system CODA attempts to minimize cognitive strain by reducing the search space, and encourages the shift to the knowledge-based level by pointing out errors whenever it can detect them. This study was not designed to analyse such dynamics, but the results seem encouraging enough to warrant detailed scrutiny in the area. A consulting tool can only be built after an understanding of its domain. Since novices commit errors, we suggest that the quest for a consulting tool for novice database designers requires at least some understanding of human error behavior.
8. Conclusions

The study reports an experiment demonstrating that a consulting system for database design (called CODA) does improve novice designer performance. This study provides a systematic way to evaluate knowledge-based systems for database design. However, this is only one study on one consulting tool. A number of such tools have been reported in the literature. It is essential that these tools be empirically evaluated so that their strong features can eventually be integrated and a highly effective consulting tool can emerge. Since the purpose of these tools is to be eventually adopted by database
designers, it is pointless to propose the next such tool without a research study that illustrates the strengths and weaknesses of the tool. The use of a tool like CODA can provide benefits in several areas of systems development and use. Training is certainly one area in which the tool can find practical application. One would also expect a significant decrease in the costs, albeit difficult to quantify, that result from preventing incorrect design. Incorrect information is a natural consequence of incorrect design, and the costs of incorrect information are invariably high. CODA is not an automated device for producing correct logical database designs; however, more research on the discovery phase of database design could eventually lead to an enhanced tool that engages a designer or user in a dialogue leading to more accurate solutions.
References

ANDERSON, J. R., BOYLE, C. F. & REISER, B. J. (1985). Intelligent tutoring systems. Science, 228, 456-462.
ANTONY, S., BATRA, D. & SANTHANAM, R. (1999). Empirical validation of knowledge-based systems for conceptual database design. In Proceedings of the Americas Conference on Information Systems. Milwaukee, WI.
BATRA, D. & ANTONY, S. (1994). Novice errors in database design. European Journal of Information Systems, 3, 57-69.
BATRA, D., HOFFER, J. A. & BOSTROM, R. P. (1990). Comparing representations with the relational and extended entity relationship models. Communications of the ACM, 33, 126-139.
BATRA, D. & SEIN, M. (1994). Improving conceptual database design through feedback. International Journal of Man-Machine Studies, 40, 653-676.
BATRA, D. & ZANAKIS, S. H. (1994). A conceptual database design methodology based on rules and heuristics. European Journal of Information Systems, 3, 228-239.
BILLMAN, D. (1983). Inductive learning of syntactic categories. Doctoral dissertation, University of Michigan, MI, USA.
BOCK, D. B. & RYAN, T. (1993). Accuracy in modelling with extended entity relationship and object oriented data models. Journal of Database Management, 4, 30-39.
BOUZEGHOUB, M., GARDARIN, G. & METAIS, E. (1985). Database design tools: an expert system approach. In A. PIROTTE & Y. VASSILIOU, Eds. Proceedings of the 11th International Conference on Very Large Databases, pp. 82-95. San Mateo, CA: Morgan Kaufmann.
CARROLL, J. M. & KELLOGG, W. A. (1989). Artifact as theory-nexus: hermeneutics meets theory-based design. In Proceedings of CHI '89, Austin, TX.
CHEN, P. P. (1976). The entity-relationship model: toward a unified view of data. ACM Transactions on Database Systems, 1, 9-36.
CHOOBINEH, J., KONSYNSKI, B. R., MANNINO, M. V. & NUNAMAKER, J. F. (1988). An expert system based on forms. IEEE Transactions on Software Engineering, 14, 242-253.
CODD, E. F. (1970). A relational model of data for large shared data banks. Communications of the ACM, 13, 377-387.
COHEN, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
DAVIS, A. M. (1993). Software Requirements: Objects, Functions, and States. Englewood Cliffs, NJ: Prentice-Hall.
DEMO, B. & TILLI, M. (1986). Expert system functionalities for database design tools. In D. SRIRAM & R. ADEY, Eds. Applications of Artificial Intelligence in Engineering Problems: Proceedings of the 1st International Conference, pp. 1073-1082. Berlin: Springer-Verlag.
DULLEA, J. & SONG, I.-Y. (1997). An analysis of cardinality constraints in redundant relationships. In Proceedings of the International Conference on Information and Knowledge Management, Las Vegas, NV.