Exploring Conceptual Schema Evolution

Exploring Conceptual Schema Evolution

Dissertation submitted for the degree of doctor at the Technische Universiteit Delft, by authority of the Rector Magnificus, Prof.dr.ir. J.T. Fokkema, chairman of the Board for Doctorates (College voor Promoties), to be defended in public on Tuesday 24 September 2002 at 13:30 by Lex WEDEMEIJER, doctorandus in Mathematics and Natural Sciences, born in Velsen.

This dissertation has been approved by the promotor: Prof.Dr.-Ing. habil W. Gerhardt-Hackl

Composition of the doctoral committee:
Rector Magnificus, chairman
Prof.Dr.-Ing. habil W. Gerhardt-Hackl, Technische Universiteit Delft, promotor
Prof.dr. H. Koppelaar, Technische Universiteit Delft
Prof.dr.ir. J.L.G. Dietz, Technische Universiteit Delft
Prof.dr.ir. T.P. van der Weide, Katholieke Universiteit Nijmegen
Prof.dr. H.A. Proper, Katholieke Universiteit Nijmegen
Prof.dr. R.A. Stegwee, Technische Universiteit Twente
Dr. E.O. de Brock, Rijks Universiteit Groningen

ISBN 90-5681-142-8


CONTENTS

PREFACE

CHAPTER 1. INTRODUCTION
1.1 Design the flexibility, maintain the stability
1.2 Research of long-term CS evolution
1.3 Longitudinal case studies
1.4 Outline

THEORETICAL TRACK

CHAPTER 2. STATE OF THE ART
2.1 Introduction
2.2 The static CS
2.3 Data model theory
2.4 The dynamic CS
2.5 Information system dynamics
2.6 Longitudinal research of evolution
2.7 Conclusions

CHAPTER 3. FRAMEWORK FOR FLEXIBILITY
3.1 Introduction
3.2 Dimensions and guidelines of the framework
3.3 Survey of active strategies
3.4 Survey of passive strategies
3.5 Survey of abstraction strategies
3.6 Analysis results
3.7 Conclusions

CHAPTER 4. DEFINING METRICS FOR EVOLUTION
4.1 Introduction
4.2 Metrics for Conceptual Schema evolution
4.3 Operationalizing the metrics
4.4 Reliability of the metrics
4.5 Conclusions

PRACTICAL TRACK

CHAPTER 5. FOUR CASE STUDIES
5.1 Case study: Benefit Administration
5.2 Case study: Settlement of Pension Benefits
5.3 Case study: Sites Registration
5.4 Case study: Franchiser Compensation

CHAPTER 6. SHORT-TERM PATTERNS & PRACTICES
6.1 Introduction
6.2 The change process
6.3 Business practices in change coordination
6.4 Semantic Change Patterns
6.5 Conclusions

CHAPTER 7. LONG-TERM TRENDS IN EVOLUTION
7.1 Introduction
7.2 Measurements of CS evolution
7.3 The aging CS
7.4 Derived data decrease stability of the CS
7.5 Best practices for CS evolution
7.6 Conclusions

SYNTHESIS

CHAPTER 8. SYNTHESIS
8.1 Introduction
8.2 Validity of the exploration results
8.3 Reviewing the framework for flexibility
8.4 Design strategies revisited
8.5 Laws of Conceptual Schema evolution
8.6 Conclusions

CHAPTER 9. CONCLUSIONS
9.1 Exploring Conceptual Schema evolution
9.2 Research objectives
9.3 A Learning Cycle approach

APPENDICES
A: Data Model Theory
B: Catalogue of Semantic Change Patterns

REFERENCES
PUBLICATIONS BY THE AUTHOR
NEDERLANDSE SAMENVATTING (Summary in Dutch)
SUMMARY
INDEX
CURRICULUM VITAE


PREFACE

Why would anyone embark on nine years of academic research in his spare time? Well, things add up.

In 1993, a colleague at Postkantoren BV in Groningen asked me to audit a large Conceptual Schema design and to confirm its high quality. He particularly wanted confirmation that the design was exceptionally flexible and would prove to be stable over the expected system lifetime of more than ten years. The diagram of the CS (i.e., Conceptual Schema) appeared attractively simple, but intricate complexities were built into it that I objected to. This led to heated discussions about the quality and future flexibility of the CS design, and as a result, the audit was a disaster. Instead of contributing to overall quality, it caused delays, dissension, and distress. We could not agree on many points: what the flexibility requirement should be, how to check the CS design for compliance, and which best practices should be applied to ensure future flexibility of the CS design. Discussions were endless and relentless, as we always found each other's arguments to be shallow, subjective, and short of the mark.

In the same period, I found myself leafing through university curricula looking for an appropriate course in information modeling methods and theories. However, none satisfied my curiosity. Finally, I came to see that I was not looking for a standard course. What I really wanted was a doctor's degree, and I set my mind to it.

I needed a topic for research. With the audit still fresh in mind, the topic was easy to find: I would investigate the long-term evolution of conceptual schemas. This kind of research calls for a data 'archeologist' to collect and amass documentation. It calls for an accurate analyst to get the data organized. And it needs a theoretician to make some sense of it all. In short, this research needed me.

Having decided on the topic, I needed a university supervisor.
With nothing to show yet, I approached professor Gerhardt, and she immediately agreed to guide me in the research. It has taken me nine years of research, years that I have enjoyed. I hope that you will enjoy the final result of those nine years as well. Here it is: my dissertation.


Acknowledgements

For her encouragement and many inspiring conversations, I thank professor Waltraud Gerhardt. Every visit to Delft was a long journey, but it was always rewarding. She also introduced me to professor Theo van der Weide, whom I thank for bringing some organization and direction into my research efforts.

I thank Postkantoren B.V. and Stichting Pensioenfonds ABP for allowing me to conduct research on their proprietary information systems, for supplying me with ample documentation on their evolving schemas, and for granting me permission to publish the findings, all the more so as my research takes the long-term view and no short-term benefits for the companies themselves were to be expected. Of course, the companies have moved forward in their business domains since the evolving CSs were analyzed for flexibility, and my conclusions about the evolution of the four investigated information systems are not intended as criticism.

Many colleagues supported me in the research with valuable comments, suggestions, and encouragement, and I thank them all. In particular, I thank my friends of the 'breakfast club': Fimke de Jong, Jan Gastkemper, and Benny Lankhorst. But I dedicate this dissertation to Helma Janssen, my partner, for moral support, all the way.

Lex Wedemeijer Maastricht May 5th, 2002


CHAPTER 1. INTRODUCTION

'The audit was a disaster. Instead of contributing to overall quality, it caused delays, dissension, and distress. We could not agree on many points: what the flexibility requirement should be, how to check the CS design for compliance, and which best practices should be applied to ensure future flexibility of the CS design.'
From the preface

1.1 DESIGN THE FLEXIBILITY, MAINTAIN THE STABILITY

1.1.1 As the environment changes

1. Introduction
Over time, every enterprise and its ways of doing business will change, technological capabilities will continue to improve, and user demands will change. Hence, to safeguard business investments, enterprises need information systems that are flexible. In particular, the Conceptual Schema, as a core component of such information systems, must be flexible, i.e., it must prove to be a sound basis for long-term business support and for a long system life span. Figure 1 represents the ANSI/X3/sparc 3-Schema Architecture, dating back to 1975.

[Figure 1. 3-Schema Architecture: the Conceptual Schema (CS) models the Universe of Discourse as it evolves over time; External Schemas (ES) and Internal Schemas (IS) map onto the CS, with the IS connected to the data store.]


The 3-Schema Architecture introduced the notion of the Conceptual Schema, 'perhaps one of the most essential components of the architecture', by defining it as 'the "real world" view of the enterprise being modeled in the database. It represents the current "best model" or "way of doing business" for the enterprise. It provides an unconstrained, long-term view of the enterprise' [Yormark'76] (p.3). The architecture introduces two other types of schema: the External Schema and the Internal Schema, abbreviated to ES and IS, respectively. Chapter 2 will discuss these and other fundamental notions in the conceptual modeling of relational databases.

A Conceptual Schema that is well designed according to the architecture satisfies many quality requirements [Kesh'95], [Moody, Shanks'98]. Whereas all other quality aspects of a Conceptual Schema can be ascertained at design time, flexibility is the exception. To consider the flexibility of the Conceptual Schema, we need to examine the behavior of the Conceptual Schema as it evolves over time in the business environment.

2. Flexibility of a Conceptual Schema
Intuitively, flexibility means adaptability: responsiveness to changes in the environment. And 'more' flexibility means a smaller impact of change. Our working definition of flexibility is: the potential of the Conceptual Schema to accommodate changes in the information structure of the Universe of Discourse, within an acceptable period of time.

This definition seems attractive, as it is both intuitive and appropriate. The idea is that of joint evolution of the UoD (Universe of Discourse) and the CS (Conceptual Schema) that models it (figure 2). The definition assumes a simple cause-and-effect relation between structural changes in the Universe of Discourse and ensuing changes in the Conceptual Schema. By restricting the relevant environment from which changes stem to the Universe of Discourse, the definition also prevents inappropriate demands on CS flexibility.
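The separation of the three schema levels can be pictured with a small sketch. The entity name, the billing view, and the storage details below are hypothetical examples, not taken from this dissertation; the point is only that the ES and IS can change independently while the CS stays stable.

```python
# Hypothetical illustration of the three schema levels; 'Customer',
# the billing view, and the storage details are invented examples.

# Conceptual Schema: the enterprise-wide model of the UoD.
conceptual_schema = {"Customer": ["customer_id", "name", "address"]}

# External Schema: a user-group view derived from the CS.
def billing_view(rows):
    """The billing department sees only id and name, not the address."""
    return [{"customer_id": r["customer_id"], "name": r["name"]} for r in rows]

# Internal Schema: physical storage choices, invisible at the conceptual level.
internal_schema = {"Customer": {"file": "cust.dat", "index_on": "customer_id"}}

rows = [{"customer_id": 1, "name": "Jansen", "address": "Delft"}]
print(billing_view(rows))  # [{'customer_id': 1, 'name': 'Jansen'}]
```

Replacing the storage file or adding another view leaves `conceptual_schema` untouched, which is exactly the stability the architecture intends for the CS.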

[Figure 2. CS and UoD should display joint evolution: change drivers in the Universe of Discourse, as it evolves over time, are accommodated as changes in the Conceptual Schema.]


This notion of flexibility has severe drawbacks that become apparent when one tries to assess the level of flexibility of a CS design:
- flexibility is established on the fly. There is no way to verify that a CS has enough flexibility, or to discover beforehand that more flexibility is needed. Moreover, it acts as a kind of dissatisfier: a lack of flexibility is harmful, but an excess goes unnoticed,
- it leaves open what time span is acceptable to accommodate a change,
- it is not a persistent property. A CS that has adapted to many changes in the past may prove to be inflexible in the future.
Nor does this notion of flexibility help us to understand the evolution of the CS. The main problem is its dependence on future events. What we need are sound criteria to investigate and learn about CS behavior from long-term evolution; criteria that do not depend on future developments.

3. Stability of the operational CS
Flexibility of a CS cannot be assessed in a straightforward manner, as it depends on future events. Instead, we propose to look at past events and investigate the stability of the CS as evidenced in the past. Generally speaking, while flexibility is a demand that must be met by the designer, stability is proof that it has been delivered. Stability is one of the most important quality aspects of a CS in its operational life.

To assess the notion of stability in an operational CS, we develop a framework consisting of three dimensions in chapter 3. Chapter 4 develops a set of operational metrics for the framework that we use in the case studies described in chapter 5. Perhaps the most important metric in the framework is justified change. A change in the CS is justified if it is associated with a corresponding change driver in the information structure of the UoD. A change in the CS that occurs for any other reason is unjustified. A basic assumption in this and other metrics is that single CS changes can be isolated and studied separately.

Still, it can be difficult to determine whether a given change is justified, as it must be traced back to a corresponding change in the information structure of the UoD. The following complications may arise:
- lag time between the change in the CS and the change in the UoD,
- update anomaly, if a single CS change is the compound of several UoD changes,
- parts of the CS may be invalid or unused; changes to those parts are irrelevant,
- user perceptions of the UoD may change without there being a material change in the way of doing business.
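The intent of the justified-change metric can be sketched as follows. The construct names, the day numbering, and the one-year lag tolerance are illustrative assumptions of this sketch, not definitions from the dissertation, which operationalizes its metrics in chapter 4.

```python
# Hedged sketch of the 'justified change' metric: a CS change counts as
# justified only if a corresponding change driver exists in the UoD
# information structure. Data and lag tolerance are invented for illustration.

def is_justified(cs_change, uod_drivers, max_lag_days=365):
    """True if some UoD change driver concerns the same construct and
    falls within the accepted lag time of the CS change."""
    return any(
        driver["construct"] == cs_change["construct"]
        and abs(cs_change["day"] - driver["day"]) <= max_lag_days
        for driver in uod_drivers
    )

drivers = [{"construct": "Contract", "day": 100}]   # changes observed in the UoD
justified = {"construct": "Contract", "day": 160}   # traceable to a driver
unjustified = {"construct": "Product", "day": 160}  # no corresponding driver

print(is_justified(justified, drivers))    # True
print(is_justified(unjustified, drivers))  # False
```

Note how the `max_lag_days` parameter makes the first complication (lag time) explicit, while the other complications, such as compound changes and unused schema parts, would require additional bookkeeping that this sketch omits.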

1.1.2 The need for understanding long-term CS evolution

1. The scientific angle
It is a common assumption that a good CS design will remain stable over its operational life and at the same time be flexible enough to adapt to changing requirements [Vreven, Looijen'97]. Many state-of-the-art design strategies claim to deliver a high-quality CS, but fail to demonstrate how a delivered CS does indeed show a graceful evolution once it has gone operational. From a scientific point of view, the evolution of an operational CS in response to change in the UoD information structure is neither well understood nor well investigated by academic researchers. We consider it significant that reports on the flexibility of CSs in live business environments are scarce in the literature. In contrast, allegations of CS inflexibility are not hard to find [Boogaard'94], [Li, Looijen'99]. However, very few of these reports provide insight into the alleged shortcomings of CS flexibility [Winans, Davis'91].

2. The business angle
From a business point of view, it is important to know in advance whether a given CS is likely to adapt well to future changes in its environment. Many of today's companies are vitally dependent upon efficient information systems to handle large data volumes. A high-quality CS, as core component of these systems, is assumed to be a solid basis for the information system and to provide long-term support for the ongoing business. At the same time, stakeholders will naturally oppose any change that is not in their immediate interest, and this must not be taken as a sign of inflexibility on the part of the CS. As the dependence on information systems continues to increase, engineers are challenged not only to develop high-quality CS designs today, but even more so to ensure the effectiveness and efficiency of their designs in the business environment of tomorrow [Nelson, Ghods'98].

1.1.3 Research motivation
On the one hand, the operational CS is expected to remain stable. On the other hand, the CS must be flexible enough to adapt to changing requirements and allow a graceful evolution of the information system. But how do engineers succeed in meeting these conflicting demands in operational CSs? The two main questions that have motivated our research are:
- what is the natural evolution of an operational CS in the business environment? and
- how can we assess the level of flexibility of a CS, in view of that natural evolution?
These questions can only be researched by investigating the past evolution of CSs in their natural environment, the operational business. And there is a strong need to do so. In our opinion, awareness of the evolution over the operational life span of the CS, and of what may cause the operational CS to change, are important yet generally underrated topics in the field of data modeling. This kind of research encounters serious obstacles, as it is very difficult to observe CS changes in real business environments. Nevertheless, it is only through an understanding of the evolution that we may hope to improve current practices in the maintenance and design of CSs.


1.2 RESEARCH OF LONG-TERM CS EVOLUTION

1.2.1 Research objective
The importance of the CS within the 3-Schema Architecture is that it should provide a point of stability. [ANSI/X3/sparc'75] stipulates that 'generally, changes to the CS permeate the database environment causing schemas, mappings, programs, and the database itself to be modified. More frequently, however, the information usage of the enterprise as synthesized in the conceptual model may remain relatively stable while the internal and external environments require change' (p.III-12). Our aim is to investigate and understand changes as the CS evolves, and to investigate questions like: what causes change, what constitutes change in the CS, and how does it affect the overall CS in the long term?

The objective of our research is twofold:
- to develop a framework for CS flexibility that captures relevant aspects of flexibility in the business environment, and demonstrate its relevance, and
- to explore the long-term evolution of CSs in the ongoing business, in order to disclose operational practices and experiences and outline the implications for flexibility of CS designs.

CS flexibility in the ongoing business environment is today a terra incognita. We aim for a qualitative understanding of CS evolution, and for the discovery of characteristic phenomena that landmark the area. We do this by conducting four case studies of evolving CSs that we selected for diversity, not for representativeness. We measure the semantic changes in the CSs by a set of operational metrics, and we establish some important trends in their long-term evolution.

1.2.2 Research subject

1. The operational CS
Our area of research is information systems used on a daily basis to support information transaction processes in data-intensive administrative enterprises. Characteristics of such information systems are high transaction processing rates, large data volumes, and low costs per unit. A further characteristic is that the business proposition to customers, as well as the way of conducting the business, develops gradually and gracefully over time.

Our research subject is the operational CS as a formal description of the information structure of the UoD, implemented as such in the enterprise's running information systems [Date'00]. The CS according to the 3-Schema Architecture is uniquely defined, isolated from any other information system component, and its evolution can be studied over time. We further restrict the subject by focusing on the overall construction of entities and references, as depicted in CS diagrams, thus excluding conceptual attributes and constraints [Hartmann'00]. This is not a severe restriction for our research. On the one hand, changes in attributes are generally well supported by current database management systems and therefore less interesting. On the other hand, changes in conceptual constraints (other than the two basic rules of entity and referential integrity) are excluded because of the general lack of reliable documentation. An effort to recover complete documentation on all conceptual constraints in every CS version would have far exceeded our research capacity.
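The restriction to entities and references can be made concrete with a minimal sketch. The entity names are invented for illustration; the sketch reduces a CS to an entity set plus a reference set, the level of detail at which the case studies track changes, and checks that every reference connects existing entities.

```python
# Hypothetical sketch of the restricted research subject: a CS reduced to
# entities and the references between them, as depicted in CS diagrams.
# Entity names are invented examples.

cs = {
    "entities": {"Customer", "Contract", "Payment"},
    "references": {("Contract", "Customer"), ("Payment", "Contract")},
}

def dangling_references(schema):
    """Return references whose source or target entity is absent from the schema."""
    entities = schema["entities"]
    return sorted(
        ref for ref in schema["references"]
        if ref[0] not in entities or ref[1] not in entities
    )

print(dangling_references(cs))  # [] : every reference connects existing entities

cs["entities"].discard("Customer")  # eliminating an entity breaks a reference
print(dangling_references(cs))  # [('Contract', 'Customer')]
```

Attributes and most conceptual constraints are deliberately absent from this representation, mirroring the restriction the text motivates.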


2. Alternative research subjects

Research of the UoD, its evolution, and its impact on CS evolution.
Our subject for research into flexibility is the CS. An alternative would be to use the UoD as research subject. This would mean a tremendous switch. The CS is a well-defined, formal model that can be studied in isolation. In contrast, the UoD is known only informally: a common understanding of what is 'out there' in the business domain that resists being caught in formal definitions. Indeed, any attempt at formalizing the UoD amounts to creating a CS. Some attempts in this direction can be found in [Giacommazzi, Panella, Percini, Sansoni'97]. This line of research is more akin to the scientific areas of Business Administration and Scientific Management. We decline this approach, as it far exceeds our research motivation of understanding the evolutionary behavior of the CS. Moreover, working from the UoD would yield biased results: if we restricted the investigation to UoD changes only, we would observe only the justified CS changes.

Research of data model theory, its taxonomy, and its application to CS evolution.
A common approach in the study of a changing CS is to use elementary changes that each add, modify, or eliminate a single construct or construction in the CS. A taxonomy is defined as the set of all elementary changes. By this definition, every CS can be transformed into any other CS by a chain of elementary changes [de Troyer'93], [Ewald'95]. But this is syntax only. A taxonomy is useful to deconstruct a CS change to the utmost, but this operation is devoid of semantics and therefore insufficient to meet our research goals. There is a reason for, an interpretation of, and a coherence in the semantic change that cannot be captured by the individual syntactic changes.
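The syntax-only character of such a taxonomy can be illustrated with a small sketch. Both schema versions and the diff function below are invented for this purpose: the resulting chain records which constructs were added or eliminated, but not why, nor that the steps cohere into one semantic change.

```python
# Hedged sketch of a purely syntactic taxonomy: any CS version can be
# transformed into any other by a chain of elementary 'add' and 'eliminate'
# steps. Schema versions are invented examples; for simplicity a CS version
# is represented as a bare set of construct names.

def elementary_chain(old_cs, new_cs):
    """Naive diff: one 'eliminate' per construct dropped, one 'add' per
    construct introduced."""
    chain = [("eliminate", c) for c in sorted(old_cs - new_cs)]
    chain += [("add", c) for c in sorted(new_cs - old_cs)]
    return chain

v1 = {"Customer", "Contract"}
v2 = {"Customer", "Agreement", "Clause"}  # suppose 'Contract' was split up

print(elementary_chain(v1, v2))
# [('eliminate', 'Contract'), ('add', 'Agreement'), ('add', 'Clause')]
```

The chain is complete as syntax, yet it cannot express that eliminating Contract and adding Agreement and Clause together constitute one coherent semantic change, which is precisely the shortcoming the text points out.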

1.2.3 Research approach
The work that we present here consists of two parts. We employ different approaches for each part, in keeping with the two research objectives.

1. Survey of literature
To develop our framework for CS flexibility, we conducted a survey of the current literature. Some 150 papers were selected and surveyed for CS design strategies that authors claim will ensure CS flexibility, in order to elicit the (often implicit) hypotheses about evolution of the CS in the operational business environment on which authors rest their claims. Working from the flexibility features that authors alluded to in the survey, we confirmed our definition of CS flexibility and developed a comprehensive framework for CS flexibility from it.

2. Case study approach
We employ the case study approach, also known as the longitudinal field study [Lee'89], to explore the long-term evolution of CSs in the ongoing business, disclose operational practices and experiences, and outline the implications for flexibility of CS designs. The case study approach is suitable to arrive at a first qualitative understanding and to determine some general patterns of behavior. Thus, we select an operational CS, trace it over a length of time, determine changes in its composition, and match these with change drivers in its UoD. Actually, we explore the area by doing not one, but four case studies. This, however, must not be misinterpreted as an attempt at complete coverage or statistical significance of our results [Schneidewind'92].


Table 1, adapted from [Yin'88] (p.17), summarizes research approaches suitable for the study of evolving subjects such as the CS.

research approach                 | type of question                  | requires control over behavioral events? | focus on contemporary or past events?
historical approach               | how, why                          | no                                       | past
analysis of available archives    | who, what, where, when, how many  | no                                       | either
case-study approach               | how, why                          | no                                       | contemporary
field-survey approach             | who, what, where                  | no                                       | contemporary
controlled laboratory experiment  | how, why                          | yes                                      | contemporary

Table 1. Research approaches, adapted from [Yin'88] (p.17)

The alternative approaches in table 1 are considered less suitable for our purpose:
- the historical approach [Goodhue, Kirch, Quillard, Wybo'92] is rejected because it would mean reverting to subjective insights that draw heavily upon personal recollection, whereas we are looking for objective and reproducible arguments,
- an analysis of available archives is impossible, simply for the lack of archives. It would require an extensive set of complete and undisputed documentation on evolving CSs. The current state of affairs in most enterprises and in the scientific field does not meet this prerequisite,
- the field-survey approach [King, Thompson'96], [Genero, Jimenez, Piattini'01] involves interviewing participants in the field and eliciting their subjective opinions, perhaps obtaining statistically significant results. But it would not meet our research objective: to understand CS evolution and obtain objective insights into operational practices and experiences. We think it is telling that a field survey into aspects of data quality [Wang, Strong'96] had to eliminate 'flexibility' because of lack of objectivity,
- the controlled laboratory experiment [Batra, Davis'92], [Basili, Briand, Melo'96] is unsuitable for our purposes, as it takes control over the environment and imposes a severely restricted time frame. In our opinion, it is in no way certain that experimental short-term results will turn out to be valid in real, evolving business situations.

1.2.4 Research setting
The settings for our investigations are:

1. 3-Schema Architecture and relational data model theory
Longitudinal research presumes that the subject to be researched has existed for some time. Thus, we investigate CSs founded on the conventional 3-Schema Architecture, forgoing schemas built according to more recent architectures. Our work is done entirely within the context of the strict-typing paradigm and relational data model theory [Date'00]. As a large part of today's mainstream business information systems is founded on these conventions, these settings do not restrict the business relevance of our research.


We use a single data model theory to describe all CSs in a uniform way. Appendix A outlines this data model theory. We do not claim that this particular theory is good for flexibility, or better than any other data model theory. Nor does our choice of data model theory affect the validity of our results: it provides us with a language to express the CS as it evolves, but it does not determine the evolution. The CS evolution would not be much different if another suitable data model theory were used, regardless of the modeling paradigm it is based on: relational, object-relational, object-oriented, or any other.

2. Semantic change as unit of analysis
It is a fundamental assumption in our approach that CS changes are made in a deliberate, coherent, and meaningful way, and that they can be identified, isolated, and studied. We call this unit of analysis the 'semantic change'. The CS is composed of many constructs, each of which can be impacted by several changes at once. Conversely, a single semantic change can affect many constructs at once. As a result, the constructs in the operational CS do not change arbitrarily; but in a practical situation, it may be hard to identify the single semantic changes. Notice that the assumption is not that semantic change in the CS can only be driven by some change in the UoD information structure.

3. Operational life cycle phase
Our research focuses on the operational CS in its life cycle phase of usage and maintenance. Some authors use the term 'evolution' broadly to encompass both the initial development of the system and its subsequent maintenance, but we focus exclusively on the events after initial implementation [Kemerer, Slaughter'99]. A simplified CS life cycle model puts this into perspective (figure 3). The model is not very detailed, as we do not want to study the CS life cycle as such. More extensive descriptions can be found in [Teorey'94], [Elmasri, Navathe'00].

design → build, test & implement → use & maintain → replace
Figure 3. CS life cycle model

Design is the first phase in the life cycle of any Conceptual Schema. Design quality is an important issue: the better the outcome of this initial step, the better the prospects for the operational life span of the CS [Proper'98]. To minimize modeling errors, every CS design proposal ought to be validated by users. In practice, user validation proves to be rather difficult, as most users are unfamiliar with data modeling and cannot be expected to recognize the strong points or shortcomings of a CS design. Although we survey CS design strategies in the theoretical track of our research, our focus is on CS flexibility in the operational life cycle phase. We investigate the arguments that proponents of design strategies put forward to underpin claims of flexibility as supposedly effectuated in the operational life cycle phase of their CS designs.

Introduction


Build, test and implement is the next life cycle phase. The CS design of the preceding phase now serves as the baseline document for building, testing, and implementing the database [Zamperoni, Löhr-Richter'93]. Substantial changes to the CS are quite common in this phase, to correct omissions and errors, to improve design flaws, etc. The adjustments reflect either a progressive understanding of requirements or an improvement in the way of modeling. We consider such adjustments to be part of schema development, not schema evolution. The amount and characteristics of these adjustments are a hallmark of the designer's ability and experience in modeling, rather than an expression of real changes in the UoD and corresponding evolution of the operational CS.

Use and maintain is the operational life cycle phase, on which we focus. It is important to realize that in this phase, the CS is the formalized description of the UoD information structure, and vice versa. By definition, the information structure of the UoD equals the CS; the notion of modeling error, so important in the previous phases, has simply ceased to exist in this phase.

The terminal life cycle phase is to replace the CS. This phase is about quality deficit rather than quality enhancement, and it is beyond the scope of our research. Unlike [Aiken, Girling'98], we think that a CS that has reached the replacement phase is still valuable as a valid description of recorded business data. In particular, many efforts to replace legacy information systems start with extracting the valuable knowledge contained in the CS [Gray, Wikramanayake, Fiddian'94], [Kotter'95], [Davis'96], [Vissagio'97], [Bisbal, Lawless, Wu, Grimson'99], [Li, Looijen'99]. If the old CS structure can be salvaged, adapted and reused as a core component in the newly developed information system, that is proof of good flexibility rather than poor quality on the part of the CS.

1.2.5 Contributions of the research
Our research offers the following contributions to the scientific field:
- a theoretical framework for understanding flexibility of a CS (chapter 3), which we use in an analysis of current CS design strategies,
- a set of metrics that serves to gauge change characteristics in an evolving CS (chapter 4),
- in-depth analyses of four evolving CSs (chapter 5),
- recognition of Semantic Change Patterns as an instrument to ease CS evolution (chapter 6),
- best-practice recommendations expected to improve flexibility as the CS evolves in the business environment (chapter 7), based on operational experience instead of theoretical hypotheses, and
- five conjectured laws of CS evolution founded on objective observations (chapter 8).
Researchers and practitioners alike have long been aware of the need to understand and exploit CS flexibility [Yormark'76], [Davis'90]. Even so, our research approach to this subject is new. We explore the evolution of operational CSs using the longitudinal case study approach. It discloses a new source of practical experience which, up until now, has remained virtually inaccessible to theoretical examination. To the best of our knowledge, we are the first to present case studies of long-term evolution of operational CSs, and to conduct an in-depth analysis of CS evolution by means of a rigorous and objective set of metrics.



We believe that a deeper understanding of CS evolution in the long term is fundamental for the future enhancement of methods for CS maintenance and CS design. In turn, this will pave the way to increased levels of flexibility and stability of operational CSs. As remarked upon by [Perry'94]: 'we will be able to effectively understand and manage the evolutions of our systems only when we have a deep understanding of these sources (i.e. change drivers), the ways in which they interact with each other, and the ways in which they influence and direct system evolution' (p.303).



1.3 LONGITUDINAL CASE STUDIES

1.3.1 Case selection criteria
In order to select relevant, manageable and interesting cases, we used the criteria outlined below. We think these selection criteria are much what would be expected. They do not set the standard too high: we do not require that the evolving CS is of immaculate quality, nor that only best practices were followed in maintenance, nor that every CS version is perfectly documented. Similar criteria have been used by others, e.g. by [Lehman, Ramil, Wernick, Perry, Turski'97]. One of their selection criteria was simply stated as 'the preference was for medium to large systems, however defined'. Another one of their criteria was 'the availability of historical data on system evolution' (p.23).

1. Primary business function
To ensure relevance, we demand that the UoD, and the CS that models it, is part of the core business. Flexibility in primary business functions is more critical, and receives more attention from management, than in secondary business functions like payroll, project planning, etc. The criterion also prevents the inclusion of personalized software solutions like spreadsheets.

2. Moderate size
To ensure manageability, we selected CSs that contain between 3 and 30 entities. This size is common in business practice, and a single investigator can research the evolution of a moderate-size CS within a reasonable amount of time. A moderately sized CS is most often designed and maintained by a small team, but it is still simple enough to be understood in its entirety. Larger CSs are often highly integrated, supporting a variety of business functions. Such CSs are hard to understand as a whole, and a case study of their evolution would exceed our available time and resources.

3. Adequate documentation of the evolution
To quote [Darwin1859], 'the mere lapse of time by itself does nothing, either for or against natural selection' (p.85).
Therefore, to study evolution, we demand at least three different versions of the CS, with adequate (though not necessarily perfect or complete) documentation of the CS versions as well as of the dominant changes in the UoD. The availability of documentation must not be taken for granted, nor its quality. Documentation deficit is a problem that has long been recognized [Fox, Frakes'95]. We experienced in practice how valuable documentation is simply lost whenever maintenance responsibility is transferred from one engineer to the next.

1.3.2 Four selected cases
The research into operational business information systems has been conducted in cooperation with Postkantoren B.V. and Stichting Pensioenfonds ABP. The four selected CSs are embedded in information systems that were well used and maintained on a regular basis, and that were neither perfect and finished, nor accused of being legacy or obsolete:
- Benefit Administration
- Settlement of Pension Benefits
- Sites Registration
- Franchiser Compensation
Chapter 5 reports the long-term evolution of these four CSs.



1.3.3 Shared characteristics
In selecting the cases, we aimed for diversity. Representativity, however defined, was not an issue. Our selected cases represent ordinary, realistic CSs; the selection criteria do not interact in such a way that only perfect CSs happen to be selected. After selection, we found that the cases have several characteristics in common:
- the operational life span of the CSs was never less than 5 years. This indicates that the UoDs and business functions being modeled are not very volatile, and our results will have a bias towards long-lived, and hence more steady, environments.
- each CS ran in a single version at a time. Either it was implemented on a single mainframe, or it ran on multiple installations with identical hardware, software and database management platforms. Our conclusions do not readily extend to CSs that are operational in multiple versions at a time, or on heterogeneous platforms.
- no CS in any of our case studies was properly modularized. Although some were designed in parts, the partitions were never meant to encapsulate changes or contribute to flexibility in other ways. Indeed, maintenance engineers never considered the impact of changes on the separately designed parts. Therefore, we cannot draw conclusions as to the potential benefits of modularization.
We do not presume that the selected cases completely cover the area of long-term evolution and CS flexibility in any way. Nor must it be taken for granted that our conclusions on CS evolution will extend to CSs that do not share the above characteristics.

1.3.4 Case study protocol
The case study protocol (figure 4) outlines the uniform way of working towards our goal of acquiring and understanding the long-term evolution of a real CS. The protocol ensures that:
- the case studies follow the same procedures and guidelines,
- changes in personal preferences are excluded, and
- the analysis of each evolution step is conducted with equal rigor and verifiability.
Adherence to the protocol also prevents omissions and premature conclusions. Our research protocol follows rather standard conventions for longitudinal case study research [Yin'88], and other researchers engaged in case studies have used similar protocols [Sjøberg'93].

1. raw data collection → 2. organize the material → 3. analyze per CS version → 4. analyze short-term evolution → 5. apply metrics → 6. validate and report

Figure 4. Case study protocol

1. Raw data collection
The main purpose of the first steps of the case study protocol is to counter the documentation problem. We collected baseline material by being present on the spot and obtaining copies of the available documentation ('action research'). Each time a CS version went operational, we took a set of documentation, timestamped it, and safeguarded it for later analysis. The documentation mostly took the form of a complete CS description. In some instances, we had to settle for a printout of the database structure or even a backup of the entire database contents.



Moreover, we are not convinced that we succeeded in collecting a complete set of data, and some minor version changes may have escaped our notice. This, however, is not detrimental to our research: details may go unnoticed, but the long-term evolution is still captured. Information on the dynamics of the business environment and apparent change drivers in the UoD was harder to come by. We resorted to a variety of information sources, such as:
- formal project and change assignments, authorized by management
- problem reports and user change requests
- informal statements by system engineers about important design or maintenance decisions
- verbal accounts of important changes in user requirements from key people in the enterprise.
Raw data collection is a preparatory stage. It may take years, while piles of paper accumulate without any progress in research; the case study actually begins only in the next step.

2. Organize the material
First, we set the scope of the relevant UoD to be studied. We pin down the subject and its boundaries to a single coherent business area. Notice that this is not the same as selecting a single CS, as a single business area may be modeled by multiple overlapping CSs. Next, we select the time frame, i.e. we determine where and when the CS evolution starts and ends. Or rather, when our study of it starts and finishes, as a case study need not start on the first production date of the system. The determination of the time frame remains somewhat arbitrary; in the end, it is primarily decided by the availability and reliability of the documentation. It may seem unexpected to set the scope after the actual evolution has taken place, but this is customary for most research into past events. It does not affect the validity of our results, because the time frame is set prior to the analysis of case details. The main goal of organizing the material is to reconstruct the correct sequence of CS versions. This is not as easy as it sounds.
- figure 4 suggests a linear ordering of steps. This is a simplification, and most operational environments present a more complex reality.
- new CS versions are usually developed and tested while the old one is still operational. It may even be that several versions are designed and tested simultaneously, confusing both maintenance and users as to the strict chronological ordering of versions.
- some CS versions never go operational. Or, in exceptional cases, a new CS version is replaced by a previous one, either by accident or because the new version contained errors or caused serious problems for system operation.
- time stamps of the CS documentation and the CS version may differ. Sometimes the CS documentation is prepared in advance, while other material, like file printouts and database dumps, is only available in retrospect.
Getting the sequence wrong is harmful, as the semantic changes in the CS evolution will appear reversed: deletions instead of insertions, etc. The effect of missing an intermediate CS version in the time sequence is less detrimental: minor details of the changes will go undetected, but the long-term effect is still caught in its appropriate time frame. The result of this step is a set of documentation about the CS and about the UoD that it is supposed to model. The material is in sequence, reliable, relevant, and as complete as it gets.
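The sequencing task of this step can be sketched in a few lines of code. The sketch below is our own illustration, not part of the case study tooling; the record fields (label, doc_date, went_operational) and the example version labels are invented for the purpose.

```python
# Hypothetical sketch of protocol step 2: reconstructing the sequence of
# operational CS versions from timestamped documentation. Versions that were
# designed or tested but never went operational are excluded, as discussed above.
from dataclasses import dataclass
from datetime import date

@dataclass
class VersionRecord:
    label: str              # version label as found on the documentation
    doc_date: date          # timestamp of the collected documentation set
    went_operational: bool  # False for versions that never went live

def operational_sequence(records):
    """Order the versions that actually went operational by documentation date."""
    live = [r for r in records if r.went_operational]
    return [r.label for r in sorted(live, key=lambda r: r.doc_date)]

records = [
    VersionRecord("v2", date(1996, 3, 1), True),
    VersionRecord("v3-beta", date(1997, 1, 15), False),  # tested, never deployed
    VersionRecord("v1", date(1994, 6, 1), True),
    VersionRecord("v3", date(1997, 9, 1), True),
]
print(operational_sequence(records))  # ['v1', 'v2', 'v3']
```

Note that sorting by documentation date is itself a simplification: as listed above, documentation may be prepared in advance or only in retrospect, so the computed order must still be checked against informal knowledge of the system's history.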



3. Analyze per CS version
This step, called pre-evolution analysis in [Ewald'95] (p.15), prepares a description of each operational CS version by analyzing the documentation. The execution of this step depends on the type and quality of the available material, as that is rarely up to standard. It might have been revealing to keep score of documentation quality (inconsistencies, incompleteness, and downright errors), but this is not the purpose of our research. On many occasions, this step comes down to reverse engineering of the CS [Sauter'95], [Chiang, Barron, Storey'97]. A major setback compared to regular CS reverse-engineering efforts is that knowledgeable designers and maintenance personnel are no longer available; in any case, their recollection of the old CS is no longer reliable. But there is compensation, as we can consult other CS versions to help understand obscure specifications. Of course, this should be done prudently: we need a reliable description of a single CS version at a time, not a mix-up of several CS versions at once. While we want the final CS description to be free of reconstruction errors, it is well possible that the CS itself contained misconceptions of real-world objects, which we want faithfully represented. It is important not to clean up the CS in hindsight, as that would disturb the observed evolution, even if we now know that the CS construction was wrong. The step results in a reliable and complete description of each consecutive CS version, denoted in the uniform data model theory and ready for the longitudinal analysis.

4. Analyze short-term evolution
Once the static CS versions are well understood, this step looks at the CS changes. Consecutive CS versions are compared and analyzed for semantic changes. We then match semantic changes with the evolution of the UoD.
This interpretation of semantic changes requires an intimate knowledge of the information system and of the events and requirement changes in the user environment. To be able to identify and isolate the single semantic changes, we need to use a uniform data model theory, to prevent theoretical changes. And the data model theory must be essential, in order to prevent artificial changes (in 'the way of modeling') as much as possible. Chapter 2 explains these issues in some more detail. Next, the origins of the semantic changes must be traced and clarified: we want to determine what caused each change, hoping to establish a cause-and-effect relationship with changes in the UoD and user requirements. Changes in (the perception of) the UoD can only be understood by referring to informal knowledge, and we inferred the structural changes of the UoD either from informal communications or through personal knowledge of the problem domain. However, even if a semantic change is clearly outlined, it can be difficult to ascertain how it is driven by change in the UoD. Some complications encountered in the case studies are:
- update anomaly, if a UoD change drives multiple CS adaptations
- lag time, when a CS change occurs later than, or sometimes even prior to, the UoD change
- parts of the CS may be unused; it is unclear what changes to those parts mean
- personal perceptions of both users and engineers about 'the information structure of the UoD' may change, without there being a material change in the way of doing business



The result of this step is a summary of relevant changes in the UoD, a list of semantic changes in the CS, and an understanding of why each CS change was committed, be it driven by a UoD change or otherwise.

5. Apply metrics
Once the consecutive short-term changes are clear, the overall characteristics of the CS evolution can be studied in detail. Chapter 4 derives a set of operational metrics for this purpose, and we apply them to gauge the level of flexibility with respect to each hypothesis. This assesses the level of flexibility of a CS while its natural evolution is taking place; thus, we begin answering parts of our research questions.

6. Verify and report
We reviewed our case descriptions with field experts in order to validate our interpretations and conclusions. While many points in the case descriptions required further clarification, all our CS descriptions and analyses of past events were confirmed. The final reports of the four case studies make up chapter 5. The reports follow a uniform layout with four sections:
- a general introduction
- business background essential to understand the topic
- the stepwise evolution of the CS, consisting of diagrams of the consecutive CS versions, plus a summary of change drivers, semantic CS changes, and measurements of the evolving CS, and
- highlights of the case.
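The comparison of consecutive CS versions (step 4) and the gauging of change (step 5) can be sketched as follows. This is an illustration of our own making: the schema representation (a set of named constructs) and the change fraction are deliberate simplifications, not the actual metrics derived in chapter 4, and the entity names are invented.

```python
# Illustrative sketch of protocol steps 4 and 5: diff two consecutive CS
# versions at the construct level, then compute a crude amount-of-change metric.
def diff(old: set, new: set) -> dict:
    """Candidate construct-level changes between two CS versions."""
    return {"added": new - old, "deleted": old - new, "retained": old & new}

def change_fraction(old: set, new: set) -> float:
    """Share of constructs touched, relative to all constructs involved."""
    d = diff(old, new)
    touched = len(d["added"]) + len(d["deleted"])
    return touched / (touched + len(d["retained"]))

v1 = {"Client", "Policy", "Benefit"}
v2 = {"Client", "Policy", "Benefit", "Partner"}  # one entity added
print(diff(v1, v2)["added"])    # {'Partner'}
print(change_fraction(v1, v2))  # 0.25
```

A mechanical diff yields candidates only: a renamed construct shows up as a deletion plus an insertion, which is precisely why the interpretation against informal UoD knowledge, described above, remains indispensable.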

1.3.5 Reliability of the case studies
[Yin'88] notes how 'people who have been critical to case studies often point to the fact that a case study investigator fails to develop a sufficiently operational set of measures and that "subjective" judgments are used to collect the data' (p.41). This critique is overcome by our dual approach of, first, developing objective measures in the Theoretical Track and, second, collecting data in the four case studies of the Practical Track. We do not seek reliability in the statistical sense. The goal we set is simply to explore, as the field is too immature to set out for rigorously quantified or statistically significant results. As [Perry, Staudenmayer, Votta'94] remarked of their exploratory case study: 'our sample sizes are small and probably inadequate for statistical validity but "any data is better than none"' (p.37). To ensure reliability of the case studies in a methodological sense, we strictly adhered to the protocol. Adherence improves the focus of the case studies, counteracts omissions in data acquisition, and prevents jumping to conclusions in the analysis. Further arguments that underpin the reliability of our work are:
- all case studies were conducted by the same researcher, thus ensuring a uniform way of working without variations in personal modeling preferences
- the research effort, and specifically the raw data collection, did not interfere with the ongoing business or CS maintenance. This ensures that we observe the undisturbed, natural evolution of the CSs. If users or maintenance engineers are aware that their CS is being

scrutinized, this might affect their way of handling the changes. We did not arrange for any special conditions in the case studies, nor did we demand any special documentation on changes being made in the CS.
- we use each operational CS as-is. We could certainly have polished up the CSs, but this would severely detract from our purpose to study the real thing, not academic examples that tend to be unrealistic and untried. And polishing up the CS would certainly affect the constructs that we want to see evolve, and thus diminish the value of the case study. We study the CS as real as possible, retaining the modeling ambiguities, overly complex structures, etc.

However, the verifications with field experts proved to be of limited value. Our reviews were spent more on explaining the semantic changes than on verifying their interpretations. The field experts were not always able to oversee the complete evolution covering extended time frames. This problem of verification is inherent to every study of past events, so it is not surprising to find it here. We do not think that the verification problem diminishes the value of our results, the more so as we are engaged in exploratory research, not in statistical confirmation.



1.4 OUTLINE

Figure 5 shows how we approach the problem along two separate tracks that come together in the final chapters. The theoretical track discusses the state of the art in data modeling, concentrating on the maintenance of stable CSs. It sets up a framework for flexibility and uses it in a survey of theoretical approaches and current best practices advocated in the literature. The practical track explores four case studies of long-term CS evolution in their native business environments, providing an in-depth analysis and extracting some practical wisdom. Reports of such explorations have been scarce in the literature. The final chapters provide a synthesis of the results from both tracks, and draw conclusions.

1. Introduction
Theoretical Track: 2. State of the Art → 3. Framework for Flexibility → 4. Defining Metrics for Evolution
Practical Track: 5. Four Case Studies → 6. Short-term Patterns and Practices → 7. Long-term Trends in Evolution
8. Synthesis
9. Conclusions
(reading plans are indicated for practitioners, for academicians, and for others)

Figure 5. General outline and reading plan

1.4.1 Theoretical Track
Chapter 2 introduces fundamental concepts in the conceptual modeling of relational databases, such as the 3-Schema Architecture, the data model theory and its associated taxonomy, and quality aspects of the CS. It also reviews related topics in conceptual modeling.
Chapter 3 introduces our framework for flexibility of operational CSs. It is made up of three dimensions, subdivided into eight guidelines to enhance CS flexibility. We use this framework in a survey of a dozen design strategies, theories and best practices that claim to promote flexibility. We show that the claims are mostly based on theoretical considerations about CS flexibility, not on substantiated operational evidence. In our opinion, this is an unsatisfactory state of affairs. No strategy or best practice employs all guidelines for flexibility, and no single best theory emerges from the survey.



Chapter 4 inspects some common hypotheses about stability and changes in operational CSs. Working within the framework of chapter 3, it develops theoretical metrics conducive to an inquiry into the validity of these hypotheses in evolving CSs. A subset of these metrics is then operationalized for use in the four case studies of the Practical Track.

1.4.2 Practical Track
This track explores the evolution of operational CSs using the longitudinal case study approach. Readers interested only in this track can skip the theoretical track, except for section 4.3, which discusses our operational metrics, and appendix A, which outlines our data model theory.
Chapter 5 reports on four business cases of long-term evolution of CSs in their natural environment, i.e. the operational business. Reports of such time series have been, and still are, very scarce in the literature. The case studies were conducted in cooperation with two companies, who supplied two cases each. We trace, identify and analyze the semantic changes as the CSs evolve in the business environment, and we apply the metrics developed in chapter 4. Engineers with expertise in the respective business areas have kindly validated the case descriptions and CS evolutions.
Chapter 6 takes the short-term view, and extracts patterns and best practices that engineers use when changing a CS. Our findings include a comprehensive change process that both enterprises seem to follow to manage change in the CS and to ease its evolution. Another research finding is the notion of the semantic change pattern observed in the cases.
Chapter 7 extracts the practical know-how on long-term CS evolution, and makes that knowledge available to researchers and practitioners engaged in the field of information modeling. We analyze the measurements for each of the operational metrics, extracting their trends and tendencies. We discuss the issue of derived data in the CS, and how to cope with it. Drawing on our insights into long-term evolution of the CS, we present best-practice recommendations to achieve a graceful CS evolution.

1.4.3 Synthesis
Chapter 8 presents a synthesis of the Theoretical and Practical Tracks. It draws conclusions from our exploration of long-term evolution in CSs, and discusses the validity and usability of our results. Solution strategies from theory are revisited using the newly gained insights into practical CS evolution. We argue that our research results are proof of the concepts laid down in the Framework for Flexibility developed in chapter 3. We then propose five new Laws of Conceptual Schema Evolution that may guide future researchers.
Chapter 9 concludes the dissertation.

1.4.4 Appendices
Additional material is presented in the appendices:
A. Data Model Theory
B. Catalogue of Semantic Change Patterns


THEORETICAL TRACK





CHAPTER 2. STATE OF THE ART

You can skip this chapter if you know what's in it.

2.1 INTRODUCTION

2.1.1 Chapter outline
This chapter outlines the state of the art, and some current directions in related research areas. Section 2 discusses aspects concerning the static CS. There are many interesting studies concerning the static quality of CS designs. However, our research interest lies with the operational life cycle phase, not the initial design. Moreover, it is immaterial to our research how the initial CS design was created. Section 3 discusses the importance of the data model theory for research into CS evolution. The scientific community calls it the data model, but we consistently use the term data model theory to avoid confusion in the business environment, where data model is a synonym for CS. Research topics regarding the CS as it changes in the business environment are outlined in section 4. Next, section 5 takes the broader view, and looks into research of evolution at the level of the information system. Section 6, which is rather brief, discusses other longitudinal research of evolving schemas. Section 7 concludes the chapter.



2.2 THE STATIC CS

2.2.1 Roles of the CS in the 3-Schema Architecture
The [ANSI/X3/sparc'75] working group, when introducing the 3-Schema Architecture as 'a generalized model for the description of data base management systems', defined the Conceptual Schema only by providing a list of its purposes. The two most important roles of the CS are, first, to provide a high-quality model of the perceived UoD, the whole UoD, and nothing but the UoD. It is created as such, and it must be maintained as such. The second role is to provide the abstracted view of the data stored in the database according to the Internal Schema. This is called data independence, and it 'insulates a user from adverse effects in the evolution in the data environment' [ANSI/X3/sparc'75] (p.II-29 and p.VIII-3).

"reality" perceived reality

Universe of Discourse

scientific progress

engineering abstractions

Limited Models

Mental Model

symbolic abstraction

Derived Models

Symbolic Model

Internal Schemas

Data Realm

External Schemas

Conceptual Schema

Conceptual Realm

"Best" Model

scientific abstraction

Figure 6. 3-Schema Architecture The Internal Schema describes the physical storage structure of the database, with complete details of data storage, security, access methods etc. More than one Internal Schema may be needed if several types of storage devices are used, e.g., paper records and disk drives. External Schemas, also called user-views or subschemas, of which there may be many, describe some part of the enterprise's information store that a particular group of users or software applications is interested in, hiding the remainder of the information from their view [Date'00], [Elmasri, Navathe'00]. In the design phase of the CS life cycle, no data store is yet operational and the role of the CS to provide the abstracted view of the stored data is



meaningless. Once the operational life cycle phase of the CS begins, its various roles merge: the abstracted representation of the stored data equals the formal model of the UoD. So in a formal sense, the Internal Schema can never deviate from the CS, and schema errors are impossible.
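The separation of the three schema levels can be made concrete with a toy example. The sketch below is our own illustration, not taken from the ANSI/SPARC report; the relation CLIENT and its attributes are invented. It shows how an external schema exposes a user view of the conceptual relation, while the internal schema fixes one possible physical encoding, so that view-bound applications are insulated from storage changes.

```python
# Toy illustration of the 3-Schema Architecture levels (assumed example).
conceptual = [  # conceptual level: relation CLIENT(name, birth_year, income)
    {"name": "Smith", "birth_year": 1960, "income": 42000},
    {"name": "Jones", "birth_year": 1975, "income": 38000},
]

def external_view(rows):
    """External schema: one user group sees only name and income."""
    return [{"name": r["name"], "income": r["income"]} for r in rows]

def internal_store(rows):
    """Internal schema: one possible physical encoding (sorted tuples)."""
    return sorted((r["name"], r["birth_year"], r["income"]) for r in rows)

print(external_view(conceptual)[0])  # {'name': 'Smith', 'income': 42000}
```

Replacing `internal_store` with a different encoding leaves `external_view` untouched: this is the data independence that the CS, as the middle level, is meant to provide.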

2.2.2 Quality of the static CS
It is commonly known that CSs vary widely in quality, depending on a variety of factors such as personal preferences, level of expertise, and craftsmanship of the designer [Batini, Ceri, Navathe'92], [Batra'93], [Becker, Rosemann, Schuette'95], [Chidamber, Kemerer'95], [Moody'95], [Briand, Bunse, Daly'99], [Storey, Goldstein, Ding'02].

1. Frameworks for CS quality
Frameworks for static CS quality have been proposed, for instance, by [Lindland, Sindre, Sølvberg'94], [Zamperoni, Löhr-Richter'93], [Krogstie, Lindland, Sindre'95], [Kesh'95], [Moody, Shanks'98], [Moody, Shanks, Danke'98], and [Jarke, Jeusfeld, Quix, Vassiliadis'99]. Without trying to be exhaustive, most frameworks include the following aspects:
- validity: the CS is a valid model of the UoD information structure,
- completeness and consistency: all user requirements are captured in the CS,
- correctness: the CS complies with the construction rules of the data model theory,
- data independence: the CS is the abstracted view of the current Internal Schema, while being free of implementation details and technical peculiarities, and
- simplicity: the CS is clear and simple to understand for all concerned.
In addition, most frameworks also require flexibility, or stability: 'The CS is invaluable as a stable description of the database contents' [Elmasri, Navathe'00] (p.537, original italics). However, we do not consider flexibility to be an aspect of static quality, as it can only be assessed by examining the behavior of the evolving CS.

2. Redundancy in the operational CS
The perceived UoD, and the CS that models it, may contain redundant data. For instance, an income tax schema will model the concept of income tax, even though taxation is derived from personal income data. Some authors disallow redundancy in the CS by requiring irreducibility: 'a schema is minimal if no concept can be deleted from the schema without losing some information' [Batini, Ceri, Navathe'92] (p.140).
In particular, it requires that all derived data be purged from the CS. We do not demand irreducible CSs; we consider such a demand both impractical and unacceptable in real business situations. It is unacceptable to users, because it would bar core concepts of the UoD from being represented in a CS just because they are derived. The demand is also impractical [Seljée, de Swart'99]: real CSs cover so many aspects of the ongoing business that it is next to impossible to identify all possible redundancies beforehand, and the effort to analyze and resolve them all would take far too much time. Chapter 7 discusses the effects of derived data on the long-term evolution of operational CSs.
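The redundancy discussion can be illustrated with a minimal sketch (all names, the flat tax rate, and the constraint are hypothetical, not taken from the case studies): a CS that explicitly models a derived concept also implies a derivation rule that must be maintained as a consistency constraint, which hints at the analysis effort that makes a demand for irreducibility impractical.

```python
# Hypothetical sketch: a CS models 'income_tax' explicitly, even though
# it is derivable from personal income data. Keeping the derived concept
# in the schema implies a consistency rule that must hold in the data.

TAX_RATE = 0.30  # assumed flat rate, for illustration only

def derive_income_tax(income: float) -> float:
    """Derivation rule: tax is computed from personal income data."""
    return round(income * TAX_RATE, 2)

def check_redundancy_constraint(record: dict) -> bool:
    """The stored (redundant) value must equal the derived value."""
    return record["income_tax"] == derive_income_tax(record["income"])

person = {"name": "J. Doe", "income": 40000.0, "income_tax": 12000.0}
assert check_redundancy_constraint(person)
```

The point of the sketch is that every redundant concept kept in the CS carries such a hidden rule; finding them all beforehand is exactly the effort the text argues is infeasible.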


3. Evaluation of quality of the operational CS
Regrettably, contemporary research into CS quality is mostly restricted to design experiments in isolated laboratory settings [Atkins'96], [Basili, Briand, Melo'96]. Some authors describe metrics to assess various quality aspects of an operational CS [Moody'98], [Genero, Jimenez, Piattini'00], but these metrics are not often applied to real business cases. The latter authors indicate that 'real case studies must be carried out' (p.523). [Moody'00] remarks how 'in practice, evaluation of the quality of Entity Relationship models takes place in an ad hoc manner, if at all' (p.1043). [Genero, Olivas, Piattini, Romero'01] go one step further and describe an experiment to assess whether static CS metrics can be used to predict CS maintainability. In our opinion this approach is inadequate, as it lacks an understanding of the concept of change in the CS that is essential before relevant metrics can be defined.

2.2.3 Quality of recorded data
CS quality should not be confused with data quality: an information system may support data transactions to perfect satisfaction over many years, even though its CS is regarded as inferior and inflexible. The volume of daily transactions in data-intensive information systems is a measure of data volatility, not an indication of turbulence in the UoD information structure. We do not investigate the daily business of data processing to keep the database contents accurate, up to date, and correct with respect to conceptual constraints [de Brock'00]. Research in the area of data quality is found in [Redman'96], [Wang, Strong'96], [Jensen, Böhlen'00], [Caruso, Cochinwala, Ganapathy, Lalk, Missier'00].

2.2.4 Documentation
Ideally, the CS is uniquely defined, isolated from the other information system components, and its formal description is a complete and correct composition of constructs provided by the data model theory. In practice, the amount and quality of available documentation is very diverse, ranging from no more than a simple diagram to extensive user manuals. Moreover, the documentation may be outdated, inconsistent [Jajodia, Ng, Springsteel'83], incomplete [Aiken, Muntz, Richards'94], or even totally absent. [Davis'90] remarks that 'A major difficulty in these data environments is that no one knows exactly what the data and the interrelationships really look like' (p.231).

2.2.5 Reverse Engineering
In theory, every operational information system has a single, well-defined CS at its core. In practice however, it cannot be assumed that the CS is well documented or understood by the maintenance team. Indeed, we encountered one example where the database was built from four external schema descriptions, without consistent CS documentation. Even so, a long-running system was created, so we should compliment the engineers on their achievements rather than reproach them.
By definition, the operational CS is the abstracted view of the Internal Schema. Nevertheless, a mere inspection of the Internal Schema is rarely enough to grasp all conceptual features. Moreover, data may have been stored that does not comply with the database schema definition, a problem that can only be detected by inspection of the database contents. Often, some of the semantics and design intentions have been lost, and rigorous analysis is required to retrieve the overall CS construction. [Premerlani, Blaha'94] suggest that 'in some cases it is
impossible to produce a complete, accurate model of the database, because it never existed' (p.49).
Over time, the field of database reverse engineering has grown to be a separate branch of information engineering [Chikofsky, Cross'90], [Kalman'91], [Winans, Davis'91], [Fonkam, Gray'92], [Andersson'94], [Waters, Chikofsky'94], [Sauter'95], [Aiken, Girling'98], [Hainaut, Englebert, Henrard, Hick, Roland'98], [Deursen, Kuipers'99]. In a live business situation, it is a major effort to reverse-engineer a CS from whatever material is available, such as samples of database contents, printouts of the database management system, screen or report layouts, outdated design documentation, etc. Where needed, we reverse-engineered the CS versions of the case studies using approaches similar to [Chiang, Barron, Storey'94], [Soutou'98].
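The observation that non-compliant stored data can only be found by inspecting the database contents can be sketched as follows (the table, field names, and constraint are hypothetical, not from the case studies): the declared schema rule is checked against the actual rows, surfacing legacy violations that no inspection of the Internal Schema alone would reveal.

```python
# Hypothetical sketch: scan stored rows against a declared constraint.
# The schema declares 'birth_date' mandatory, yet legacy rows may
# violate it -- only an inspection of the contents reveals this.

rows = [
    {"id": 1, "name": "Alpha", "birth_date": "1960-04-01"},
    {"id": 2, "name": "Beta", "birth_date": None},  # violates the constraint
    {"id": 3, "name": "Gamma", "birth_date": "1971-12-24"},
]

def find_violations(rows, mandatory_field="birth_date"):
    """Return the ids of rows whose stored data defies the schema definition."""
    return [r["id"] for r in rows if r[mandatory_field] is None]

print(find_violations(rows))  # -> [2]
```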

2.2.6 The Majorant Schema
In addition to reverse engineering techniques, we need a way to study differences between consecutive CS versions. But formally, consecutive CS versions are models of consecutive Universes of Discourse and are therefore incomparable. Although the UoDs are contiguous, they are separated by a time barrier and their information structures are captured in different CSs. We overcome this problem by introducing the notion of Majorant Schema that merges the CSs before and after the change into a single schema. In a more formal sense, the Majorant Schema is defined as the result of integrating CSs that model disjoint Universes of Discourse [Wedemeijer'00may]. With the same purpose in mind, [Proper'94] introduced the notion of Extra-temporal Schema, but he defined it as the union of consecutive CSs. The notion of Majorant Schema goes beyond that, as we also demand schema integration. This is important, as we want to bring out naming conflicts and structural conflicts.
The Majorant Schema also provides a suitable context to deal with retroactive events, i.e., events that take place prior to the schema change while the corresponding information reaches the database only afterwards. The integration of schemas makes the data life cycles span the time barrier of CS change. As a result, the software routines that access and update the data need not be aware of a structural distinction between data dating from before and after the CS change.
The Majorant Schema approach is a variant of the well-known schema integration approaches [Batini, Ceri, Navathe'92], [Santucci, Batini, Di Battista'93], [Bonjour, Falquet'94]. Our approach uses the same techniques, but our method differs significantly in several respects. It requires all schema conflicts to be identified, but not necessarily resolved. And if the method is employed in CS maintenance, only the proposed new CS can be adjusted in conflict resolution, never the CS that is operational.
Finally, notice that the Majorant Schema is not a conceptual schema in the ordinary sense, as we do not define a UoD that it is a model of.
We employed the notion of Majorant Schema throughout the case studies, extending it to span the entire succession of CS versions. Overall, the method was easy to apply when there were few semantic changes. We found it less suitable when major CS restructuring was undertaken. In such cases, the focus is on the quality of the new CS with respect to the new UoD, more than on reuse of the old CS or its data. Also, much of the old data will become virtually useless, and the effort to salvage that data is not worthwhile.
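The integration step can be sketched in a toy model (entity names hypothetical; the actual Majorant Schema construction in [Wedemeijer'00may] is far richer): two consecutive CS versions, each represented as entities with their sets of references, are merged into one schema, and entities whose name occurs in both versions with different structure are flagged as conflicts rather than silently unioned.

```python
# Toy sketch of building a Majorant Schema from two consecutive CS
# versions. Each CS maps entity names to their set of references.
# An entity present in both versions with different references is a
# potential naming/structural conflict: the same name may denote
# different concepts before and after the change, so it is flagged
# for analysis instead of being merged blindly.

cs_old = {"Client": {"Contract"}, "Contract": set()}
cs_new = {"Client": {"Policy"}, "Policy": set()}

def majorant(cs_a, cs_b):
    merged, conflicts = {}, []
    for name in set(cs_a) | set(cs_b):
        refs_a, refs_b = cs_a.get(name, set()), cs_b.get(name, set())
        if name in cs_a and name in cs_b and refs_a != refs_b:
            conflicts.append(name)      # same name, different structure
        merged[name] = refs_a | refs_b  # the merged schema spans both UoDs
    return merged, conflicts

schema, conflicts = majorant(cs_old, cs_new)
print(sorted(schema))  # -> ['Client', 'Contract', 'Policy']
print(conflicts)       # -> ['Client']
```

The merged schema spans both UoDs, so data life cycles can cross the time barrier of the change, as described above.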


2.3 DATA MODEL THEORY
2.3.1 Data model theory as a determinant for change
A data model theory provides the constructs and constructions that give precise and unambiguous meaning to the Conceptual Schema that models the real world as perceived by the user community. It is the language in which the CS is expressed. Many data model theories have been described in the literature. They differ in many respects, e.g. suitability for use depending on the characteristics of the business area, semantic expressibility, level of abstraction, comprehensibility, analytical power, and ease of use for the designer [Navathe'92], [Neumann'96]. However, [Kim, March'95] indicate that 'despite the various research findings, not much is known about the conceptual data model usage in IS practice to which those findings are supposed to apply' (p.111). We are not concerned with determining a best data model theory, however defined. Our focus is the long-term evolution of CSs within the framework of the conventional 3-Schema Architecture, and the data model theory that we selected for this thesis (see appendix) is an ordinary variant of the traditional relational model.
It is important to realize that the data model theory is also the language that expresses CS change: a real-world feature that cannot be expressed by the data model theory cannot be seen to evolve either. It would require another, more expressive data model theory to study the evolution of that particular feature over time [Bernstein'00], [Koch, Kovacs, LeGoff, McClatchley, Petta, Solomonides'00], [Su, Claypool, Rundensteiner'00].

2.3.2 CS differences are not always semantic changes
A CS expressed in one data model theory will look and feel different from schemas that model the same UoD but are expressed in another data model theory [Gangopadhyay, Barsalou'91], [Steele, Zaslavsky'93], [Venable, Grundy'95], [McBrien, Poulovassilis'98]. By implication, any subtle change in the constructs of the data model theory can and will cause differences in each CS. These changes, which we call theoretic differences, do not derive from the UoD, nor do they reflect evolution in the environment [Kelly, Smolander'96], [Bézivin'00], [Terrasse'01]. We consistently use a single data model theory to avoid theoretic differences in our research.
Even if the data model theory is firmly fixed, CSs modeling the same UoD may differ [Knapp'98]. One reason for the differences is the use of a rich or extended data model theory [Batra, Hoffer, Bostrom'90], [Tauzovich'90], [Tari'93], [Rosenthal, Reiner'94], [Saiedian'97], [Lee, Kim'98]. Such theories offer more than one way to represent a particular real-world feature [Kalus, Dadam'95]. It is an artificial difference if a real-world feature is modeled first in one way and later in another. These artificial differences, more commonly known as semantic heterogeneity or discrepancy, do not reflect evolution in the environment [Effelsberg, Mannino'84], [Kent'91], [Sheth, Kashyap'92], [Saltor, Castellanos, Garcia-Solaco'92], [Tseng, Chiang, Yang'98]. We employ an essential, as opposed to rich, data model theory to avoid artificial differences.


2.3.3 Other data model theories
1. Object-Orientation
There seems to be a general opinion among researchers that 'one of the remarkable differences between object-oriented database management systems and relational database management systems is support for schema evolution' [Lee, Kim'98] (p.157). Schema evolution in object-oriented environments is intensively studied [Banerjee, Kim, Kim, Korth'87], [Nguyen, Rieu'89], [Cellary, Jomier'92], [Zicari'92], [Monk, Sommerville'93], [Claypool, Jin, Rundensteiner'98], [Ra, Rundensteiner'99b]. The rather unspecific claims about the flexibility of object-oriented CSs appear to be founded on better performance in propagating changes to other components of the information system [Poncelet, Lakhal'93], [Al-Jihar, Léonard'99]. However, the literature provides little evidence in support of this claim regarding operational CSs in on-line transaction processing systems [Ling Liu'95], [Lautemann'97]. One explanation may be that the market place is still dominated by relational database management systems, which is an additional reason why we restricted our research to the conventional relational data model theory.
2. XML
Although XML data is conceived as self-describing, many applications use DTDs (Document Type Definitions) to specify and enforce the structure of the XML data. DTDs thus assume a role similar to the schemata in relational databases [Tufte, Gang He, Shanmugasundaram, Zhang, Dewitt, Naughton'99], [Bird, Goodchild, Halpin'00], [Lee, Chu'00], [Kappel, Kapsammer, Rausch-Schott, Retschitzegger'00]. While changes to these DTDs are inevitable, 'most of the current XML management systems unfortunately do not provide enough (if any) support for these changes' [Su, Kramer, Chen, Claypool, Rundensteiner'01] (p.103).
[McBrien, Poulovassilis'01] also remark on this limitation, stating that 'whilst languages such as DTDs and XML Schema serve to structure XML documents, it is still the case that an essentially hierarchical model is being used' (p.330). A taxonomy of structural change in XML is found in [Kramer'01].
3. Temporal data model theories
Most, if not all, data in the business environment is time-varying, and users need to know which data values were recorded and valid at which moments in time. [Tsichritzis, Lochovsky'82] already pointed out that 'time is perhaps the most cumbersome aspect of data modeling' (p.7). Several extensions to the relational model have been proposed to properly account for the various notions of time implicit in user requirements. Temporal database models [Ariav'91] handle data transactions while accounting for valid-time and transaction-time. Surveys of temporally extended data model theories can be found in [Roddick, Patrick'92], [Snodgrass'97], [Zaniolo, Ceri, Faloutsos, Snodgrass, Subrahmanian, Zicari'97], [Bettini, Dyreson, Evans, Snodgrass, Sean Wang'98], [Combi, Montanari'01].


2.4 THE DYNAMIC CS
2.4.1 Change in the CS
1. Joint evolution at three levels
The operational CS is both the formalized model of the information structure of the UoD, and the abstracted representation of the stored data. As stated by [Elmasri, Navathe'94]: '(..) the database system is part of a much larger organizational information system. (..) we examine the typical life cycle of an information system and how the database system fits into this life cycle' (p.530). Consequently, there ought to be a joint evolution on the various levels of the architecture, as depicted in figure 7. It expresses how the changed CS, as a model of the old UoD, ought to be a proper model of the UoD after the change. Also, the changed CS is the abstracted model of the data store, and it must still be the abstracted model of the data store after the change. These intuitive rules can be formally expressed as correctness constraints for CS evolution, but we forgo such formalizations and refer to [de Troyer'93] (p.129 onwards), [Proper'94] (p.73), and [Ewald'95] (p.21).

(Figure 7. Joint evolution at three levels: a diagram showing the Universe of Discourse together with the CS, External Schemas, Internal Schema, and data store that model it, before and after a change, to depict their joint evolution over time.)

The implication is not that the timings of each transition must always coincide. For instance, the currency conversion to the Euro is a typical example where CSs, information systems, and many other appliances were changed in advance of an impending UoD change.


2. Versioning of the entire CS
A basic assumption in our approach is that the entire CS evolves as a single artifact: one CS version is replaced by the next CS version instantaneously and in its entirety [Roddick'95]. A partial or gradual change in the CS is impossible, although details of a CS change may go unnoticed if the CS is not properly documented. We do not consider the more fine-grained, though feasible, approaches to CS change:
- change in an individual construct of the CS (entity or object class) [Zdonik'86], [Skarra, Zdonik'87], [Monk, Sommerville'93],
- change in the individual entity instance or object [Cellary, Jomier'92], [Männistö, Sulonen'99],
- change at the level of the External Schema [Ra, Rundensteiner'99a], and
- change by multiple CS versions coexisting in the database [Lerner, Habermann'90], [Wei, Elmasri'98].
Also, notice that some approaches only consider versioning of the CS during design, but not in the operational life cycle phase [Lautemann'97].
3. Change coordination
A CS change will affect other components of the information system. Change propagation is the general term for maintenance efforts to account for the impact of change in the CS [Karahasanovic'00]. We prefer the term change coordination, to avoid the suggestion that structural change in the CS must always precede changes in other components of the information system. The terms data coercion, data conversion, or even schema evolution [Bommel'95] are used for the effort of adjusting the store of operational data to the new schema [Andany, Léonard, Palisser'91], [Ferrandina, Meyer, Zicari, Ferran, Madec'95], [Werner'99], [Türker'00]. The strict typing paradigm of the relational data model theory demands that data extensions comply with their intent at all times. Therefore, a CS change must propagate to the stored data instantaneously.
In contrast, object-oriented data model theories allow data conversion to be postponed temporarily or even indefinitely [Goralwalla, Szafron, Tamer Özsu, Peters'98]. By definition, users perceive either the old or the new UoD. The data conversion operation, however, refers to neither of these. Data conversion is extra-schematic, as it considers its relevant real world to be 'relationships among old and new objects' [Proper, Weide'94]. [Lerner, Habermann'90] remarked how 'to provide general database reorganization in a database transformer, the database administrator must describe the relationships among objects in the old version of the database and the new' (p.71). Thus, engineers charged with the data conversion effort will perceive the data in a completely different way than the users do [Sockut'85], [Broek, Walstra, Westein'94].
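This extra-schematic view can be shown in a minimal sketch (the attributes and the split rule are hypothetical; real conversions are far more involved): the conversion routine is written against a mapping between old and new structures, a relationship that neither the old nor the new CS models.

```python
# Hypothetical sketch of data conversion ('coercion') after a CS change:
# the old schema stored a single 'name' attribute, the new schema splits
# it into 'first_name' and 'last_name'. The conversion rule relates old
# objects to new ones -- a relationship outside either schema.

def convert(old_record: dict) -> dict:
    first, _, last = old_record["name"].partition(" ")
    return {"id": old_record["id"], "first_name": first, "last_name": last}

old_data = [{"id": 7, "name": "Ada Lovelace"}]
new_data = [convert(r) for r in old_data]
print(new_data)  # -> [{'id': 7, 'first_name': 'Ada', 'last_name': 'Lovelace'}]
```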

2.4.2 Semantic change in the operational CS
CSs will differ when similar but unequal abstractions of the real world are perceived. For instance, one engineer may perceive a relationship of married-to whereas another engineer perceives a more general related-to relationship. The semantic abstractions differ, and we consider such semantic changes in consecutive CS versions to be a mark of true evolution, reflecting a significant change in the understanding and modeling of the UoD [Ventrone, Heiler'91].
Even if two consecutive versions of a CS are perfectly documented, it may be hard to identify the semantic changes in the evolving CS. Isolating semantic changes in a complex aggregate like a CS is difficult, in theory:
- constructs that change must be distinguished from those that remain untouched. This requires a clear understanding of how the old constructs and constructions are related to those of the new CS [Siegel, Madnick'91].
- semantic changes must be discriminated from one another. Where does one change stop, and another begin ? The CS is made up of many constructs, and each one can be impacted by several changes at once.
In the case studies, the majority of semantic changes were identified simply by comparing CS diagrams. Differences in the diagrams correspond to changes in references and entity structures. Additionally, changes in entity intent (definitions) had to be identified. This was done by inspecting available documentation and, whenever feasible, also by inspecting the stored data.
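The diagram comparison can be sketched as a set difference over the constructs of consecutive CS versions (a toy representation with hypothetical entity names): added and dropped entities and references fall out directly, while changes in entity intent still require inspecting documentation and data.

```python
# Toy sketch: compare two CS versions, each given as a set of entities
# and a set of references (entity pairs). Differences in the diagrams
# correspond to added or dropped constructs; identical diagrams may
# still hide changes in entity intent, which this comparison misses.

v1 = {"entities": {"Person", "Marriage"},
      "refs": {("Marriage", "Person")}}
v2 = {"entities": {"Person", "Relationship"},
      "refs": {("Relationship", "Person")}}

def diagram_diff(a, b):
    return {
        "entities_added": b["entities"] - a["entities"],
        "entities_dropped": a["entities"] - b["entities"],
        "refs_added": b["refs"] - a["refs"],
        "refs_dropped": a["refs"] - b["refs"],
    }

d = diagram_diff(v1, v2)
print(d["entities_added"])    # -> {'Relationship'}
print(d["entities_dropped"])  # -> {'Marriage'}
```

Note that the sketch cannot tell whether Relationship is a renamed, generalized Marriage or a wholly new concept; that is exactly the semantic judgment the text says must come from documentation and stored data.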


2.5 INFORMATION SYSTEM DYNAMICS
2.5.1 Information systems evolution
According to the landmark study of software systems by [Lehman, Belady'85]: 'a large program that is used undergoes continuing change or becomes progressively less useful' (p.250). Generally, the flexibility of a composite artifact like an information system, or an architecture of information systems, exceeds the flexibility of its separate components [Allen, Boynton'91]. The additional flexibility is obtained by restructuring the way components interact, and by replacing components with better ones [Delen, Looijen'92], [Veldwijk'95]. Therefore, many changes that intend to enhance overall systems flexibility are not targeted at the CS, but address other system components that we are not concerned with.

2.5.2 CS change drivers
What drives a change in the CS ? Many authors indicate that changes are driven by major business events:
- [ANSI/X3/sparc'75] points out that 'the CS is sensitive to business cycles, diversification, mergers, new interests and other dynamics of the corporation',
- [King'86] noticed how the mere presence of an information system triggered increasing demands for the processing of new kinds of information,
- business process redesign and a rethinking of the business information architecture may bring a new perception of what is 'out there' in the real world [Galliers'93], [Kim, Everest'94], [Kallio, Saarinen, Salo, Tinnilä, Vepsäläinen'99], and
- [Gerard'94] suggests that the most important kind of change driver is the introduction of new types of business transaction. The new data transactions necessitate fundamental changes and additions in existing business processes [Gruhn, Pahl, Wever'95].
In our experience, the business motivation for changes is not often formalized, and consulting the stakeholders will not yield a full understanding of the practical arguments and sentiments underlying the change. Causes of semantic changes need to be determined through intimate knowledge of the CS, the business environment, and the information system. Investigations of the business decisions that act as change drivers for the CS, an area of research that belongs to the fields of business and management science, are found in [Earl'93], [Sethi, King'94], [Jarvenpaa, Ives'94], [Weber, Pliskin'95], [Banker, Slaughter'97], [Giacomazzi, Panella, Percini, Sansoni'97], [Orna'99], [Feurer, Chaharbaghi, Weber, Wargin'00].


2.6 LONGITUDINAL RESEARCH OF EVOLUTION
2.6.1 Case studies of information systems evolution
The longitudinal case study approach is not new in the area of information systems evolution:
- [Belady, Lehman'76] employed the case study approach as a basis for their well-known research of software program evolution,
- [Gill, Kemerer'91] study a number of completed software projects of a single enterprise (to exclude differences in the external environments of the projects), and analyze these for software complexity,
- [Perry, Staudenmayer, Votta'94] use independent observers to discover how developers spend their time in a case study that explores organizational and social issues of software development,
- [Ades'98] describes a single case study that explores the long-term effects of an advanced normal form in an evolving CS,
- [Swanson'99] investigates the relationship between the maintainability of an information system and the maintenance efforts that the business spends on it,
- [Burd, Bradley, Davey'00] analyze four cases of evolving software applications to elicit some software maintenance trends and best practices, and
- [Cartwright, Shepperd'00] study the use of object-oriented capabilities, particularly inheritance, and find that this capability is not well used in practical applications.

2.6.2 Evolution of the Conceptual Schema
We located very few reports in the literature that specifically study CS evolution in the operational environment.
[Marche'93] investigates the short-term evolution of seven different CSs, each going through a single evolution step. He reports an increase in the number of attributes per entity that are either 'essential' or 'relevant to understand what the entity is about'. On the other hand, the number of attributes per entity that are 'used to control and sustain processing needs' halves. However, we feel that these observations describe symptoms, rather than the essence of the evolution problem. Moreover, the idea to distinguish three classes of attributes is appealing, but it is unclear what basis it has in theory or accepted best practices.
[Sjøberg'93] measures the number of syntactic changes in a CS evolution spanning less than two years. His research spans part of the CS design phase as well as the operational life cycle phase. Evidently, his approach of counting elementary changes is restricted to the syntactic level, and it is unsuitable for gaining an understanding of the semantics of CS changes and the level of CS flexibility.


2.7 CONCLUSIONS
2.7.1 Conclusions
Although the relationships between changes in the UoD, the semantics of changes in the CS, and the actual changes as accommodated in operational CSs are not well understood, research that adopts the long-term view is scarce. Theoretical approaches towards CS flexibility mostly concentrate on initial design. Much less work has been done on the maintenance phase of the CS life cycle. In addition, what is offered by state-of-the-art theory in the area of CS flexibility has not been shown to match the flexibility that engineers need and use in their operational CSs.
There seems to be little interest in the academic field to conduct longitudinal research into the flexibility of operational CSs, and we can only speculate why:
- it takes time and requires patience to conduct this kind of longitudinal research, two factors that make it unattractive for academic research,
- perhaps the semantic CS changes in a live business situation are too difficult to detect,
- enterprises are deterred from subjecting their proprietary but ordinary CSs to a theoretical analysis, as that may expose fundamental flaws in the designs,
- the 3-Schema Architecture and the notion of Conceptual Schema are not en vogue today, or
- perhaps the importance and relevance of CS flexibility is not generally recognized.
In contrast, we are convinced that longitudinal research into the long-term evolution and flexibility of operational CSs is worthwhile to pursue.


CHAPTER 3. FRAMEWORK FOR FLEXIBILITY
'There is a school of thought that the initial planning for a data base system can include its ultimate content and usage. Many data base management systems are based on this principle (...which...) is economical in the long run only if one can accurately predict how the data base system will eventually evolve for its anticipated lifetime'
From [ANSI/X3/sparc'75] (p.VIII-2)

3.1 INTRODUCTION
3.1.1 How do design strategies deliver flexible CSs ?
Many design strategies exist that are widely accepted as 'good design practices' and that do appear to work: the delivered CSs are believed to be flexible enough to accommodate future changes, and so become a core component of operational information systems. The point, however, is that there is no good understanding of the effectiveness of any of these design strategies when used in a volatile, turbulent business environment. It is rare to find convincing arguments and practical demonstrations of how and why the delivered CS designs should prove to be flexible when exposed to the live business situation.
This chapter first develops a framework for flexibility consisting of three dimensions that are further refined into guidelines for flexibility. Next, the framework is used in an analysis of the effectiveness of current design strategies. The analysis elicits what guidelines are employed to achieve flexibility of the CS in its operational life cycle phase. The analysis was previously published in [Wedemeijer'01april].

3.1.2 Iterative approach
To develop the framework, we used an iterative approach, alternating framework development with the evaluation of design strategies. We conducted a survey of some 150 papers and textbooks, selected either for keywords such as 'Conceptual Schema', 'evolution' and 'flexibility', or based on references. These papers were scrutinized for CS design strategies and best practices that authors claim to ensure CS flexibility. We extracted the often implicit assumptions about CS evolution that authors rest their claims on. Working from features that authors alluded to, we defined, refined, extended and adjusted the framework in a series of iterations until we were satisfied that it captured the main features of CS flexibility. We then applied the framework to evaluate how the advocated design strategies are supposed to deliver flexible CSs.


3.1.3 Chapter outline
Section 2 develops the framework for flexibility that captures the relevant aspects of flexibility in the operational life cycle phase of the CS. The next sections use the framework for flexibility to analyze over a dozen design strategies. We present the analysis in three parts: active, passive, and abstraction strategies. This arrangement of strategies, partly chosen because of their common traits but mostly for our own convenience, is immaterial to the outcomes of the evaluation:
- section 3 analyzes active strategies that strive to improve the adaptability of the CS,
- section 4 analyzes passive strategies, also known as 'stabilizing the CS', that attempt to prevent the need for CS changes in advance, and
- section 5 analyzes abstraction strategies that make the CS more abstract in order to capture less information that is liable to change.
Section 6 presents the comparative review of the approaches and discusses the main findings of our analysis. Section 7 draws conclusions.


3.2 DIMENSIONS AND GUIDELINES OF THE FRAMEWORK
3.2.1 Three dimensions of CS flexibility
As the UoD is bound to change, as people will perceive the UoD in new ways, and as user requirements will change, so the CS must cope with change. We defined CS flexibility as:
the potential of the Conceptual Schema to accommodate changes in the information structure of the Universe of Discourse, within an acceptable period of time
Closer inspection of the definition reveals that it is made up of three dimensions (figure 8). We use the term 'dimension' to emphasize mutual independence: a CS can have excellent flexibility along one dimension, yet be inflexible in another.

(Figure 8. Dimensions in the framework for flexibility: the environment where changes originate, the timeliness to implement required changes, and the adaptability to accommodate required changes.)

1. Environment dimension
Environment is where changes come from. It is primarily concerned with what the requested changes are that must be accommodated into the CS. It is the responsibility of the user community to determine which events in the environment have enough relevance to act as change driver for a CS change. Of course, if the 3-Schema Architecture is strictly adhered to, then the Universe of Discourse is the only relevant environment for the CS. Yet, this must not be taken for granted, as evidenced by the phenomenon of unjustified change: when changes are accommodated in the CS that do not arise in the UoD.


2. Timeliness dimension
Timeliness is the synchronization between change drivers in the environment and the changes accommodated into the CS. Timeliness is primarily concerned with the amount of time that is actually taken to accommodate a requested change into the CS. Instantaneous accommodation of structural changes into the CS is rarely demanded or expected; some leeway is always allowed for. The user community is responsible for determining the reasonable time span for change accommodation. Thus, timeliness is relative, not absolute: a month may be too long in some cases, whereas a year may be satisfactory for other changes in the same business area.
3. Adaptability dimension
Adaptability is the ease with which CS constructions can be changed to accommodate new user requirements. It is primarily concerned with how the requested changes are to be accommodated into the CS. Of the three dimensions, this one is the responsibility of the maintenance engineer.

[Figure 9. Guidelines in the framework for flexibility. Environment: select the best scope; capture the essence of the UoD. Timeliness: minimize impact; ease change propagation. Adaptability: keep it simple; use abstraction layering; model each feature once; provide clusters]

Framework for Flexibility


3.2.2 Eight guidelines for CS flexibility
The three dimensions are further refined into guidelines (figure 9), providing us with a sound basis to evaluate the overall flexibility of CSs. The guidelines derive from intuitions and insights into the mechanisms that a designer, or design strategy, employs to enhance CS flexibility. As is appropriate in exploratory research, we outline the guidelines in a qualitative rather than rigorously formalized manner. In contrast to the dimensions, the guidelines are expected to have some interdependencies: improving a CS according to one guideline may lead to a deterioration of the CS for another. We outline the guidelines per dimension, but there is no ranking or preference of one guideline over another.

1. Select the best scope for the UoD
This guideline's intent is well described by [Tsichritzis, Lochovsky'82] as: 'One would really like a model that captures the complete meaning of the world (..) we, in fact, can never have complete knowledge. It is important therefore, to capture the appropriate amount of knowledge as related to the desired use of the data' (p.6). The selected part of the real world must be logically complete and coherent, at least in the user's opinion. It must also be appropriate to the current information needs, regardless of whether those information needs are expected to change over time [Jarvenpaa, Ives'94].

2. Capture the essence of the UoD
The essence of the UoD is captured by perceiving those abstractions of real-world objects that are most appropriate to the information needs. Indeed, abstraction is a core competence in modeling. Abstraction should be used diligently 'to capture the appropriate amount of meaning as related to the desired use of the data', so that 'the meaning captured by a data model should be adequate for the purpose required' [Tsichritzis, Lochovsky'82] (p.6).

3. Minimize impact of change
The first guideline for the timeliness dimension is to speed change by reducing its impact. A change with small impact is easier and faster to handle than a large one. This consideration is relatively unimportant in CS design, while the schema is still uncommitted and changes are 'free of charge'. However, it becomes very important as soon as the CS is operational. Enterprises want to safeguard their investments in stored data, and so keep the impact of change down to a minimum. This demand severely restrains the maintenance engineer in determining which CS features must be changed, and how. Not only must the adapted CS be a good model of the new UoD, it should at the same time be 'as close to the old CS as possible' [Giacommazzi, Panella, Percini, Sansoni'97].

4. Ease change propagation to other components
[ANSI/X3/sparc'75] states that 'the trauma of change is inevitable', and it is better to prepare for it than to be caught unawares. The guideline advocates choosing CS constructs in such a way that changes will readily propagate to other components of the information system, such as the operational database management system and the data access routines.



5. Keep the CS simple
This guideline advocates keeping the overall construction of the CS simple, however defined, because simpler, more natural schemas are easier to understand and to maintain [Moody, Shanks'98]. The guideline concerns the overall CS construction as well as its graphical representation: users find a CS simpler and more comprehensible if its diagram shows fewer symbols and fewer crossing lines [Nummennaa, Tuomi'91], [Nordbotten, Crosby'99].

6. Use layering (general to specific)
A large and detailed artifact such as the CS is easier to understand and to maintain if it consists of multiple abstraction layers. The guideline holds that the overall structure must be understood first, in general terms. The details should be deferred to lower layers of abstraction. This makes it possible to change those details without affecting the more general concepts.

7. Model each feature once
The guideline is stated by [Batini, Ceri, Navathe'92] as a demand that 'every aspect of the requirements appears only once in the schema. We can also say that a schema is minimal if no concept can be deleted from the schema without losing some information' (p.140). A similar statement is found in [Elmasri, Navathe'00]: 'guideline 2: design the base relation schemas so that no anomalies are present in the relations. If any are present, note them clearly and make sure that the programs that update the database will operate correctly' (p.472). Of course, in a practical situation it is far from easy to ascertain that the many different constructs of the CS never have an unexpected overlap in real-world semantics.

8. Provide a modular composition of the CS
Decomposing a large and complex artifact such as a CS into independent modules makes it easier to understand and to maintain, because each module can be studied separately. If modules have high internal cohesion and low external coupling with other modules, then semantic change will be encapsulated in a single module. This means that each module can be manipulated and modified with minimal impact on the remainder of the CS.



3.3 SURVEY OF ACTIVE STRATEGIES
Active strategies for adaptability improve the overall flexibility of the CS by arranging its constructs in such a way that the CS can easily adapt to changing requirements. Two fundamental assumptions underlie these approaches: first, that certain types of changes are easy to accommodate if the CS is well engineered, and second, that any necessary change in the future, regardless of its origin, will belong to one of these types of changes.

3.3.1 Schema Transformation
Schema transformation, in our perception, is the modification of one CS into another with slightly different semantics. However, the term has other meanings in related areas of engineering. It stands for forward and reverse engineering mappings between the CS and the Internal Schema [Casanova, Tucherman, Laender'93], [Bommel'95]. It can also stand for schema translation, when a CS constructed in one data model theory is mapped into a semantically equivalent CS based on another data model theory [Assenova, Johannesson'96], [Monk, Mariani, Elgalal, Campbell'96], [Fong'97], [Wai, Embley'98], [Poulovassilis, McBrien'98].

1. Taxonomy
A taxonomy is a set of atomic schema transformations, each transforming only a single CS construct or construction at a time [Shneidermann, Thomas'82], [Lerner, Habermann'90], [Batini, Di Battista, Santucci'93], [Ewald, Orlowska'93]. By definition, a taxonomy is uniquely associated with the data model theory. Nevertheless, the extent of the set may vary, depending on the rules for schema correctness and data coercion that the atomic changes are supposed to comply with. For instance, adding an entity or dropping a reference may or may not be regarded as an atomic transformation, depending on whether there is an imposed rule that the CS lattice must be connected.

2. Schema transformations as a toolkit for change
A standardized set of CS transformations, perhaps coinciding with the taxonomy, can be used as a toolkit to implement CS changes, enabling the engineer to specify CS change and to control the change impact while safeguarding existing data [de Troyer'93], [Kahn, Filer'00], [Claypool, Rundensteiner, Heinemann'00nov]. Some commercial database management systems provide such a toolkit for changes at the internal schema level [Foppema'96], [Werner'99]. Notice, however, that tools alone cannot be relied upon to maintain schema quality [Rosenthal, Reiner'94]. In addition, some types of semantic change that an engineer may want to accomplish may not be provided for by the set of schema transformations. For example, most taxonomies assume type persistence and do not provide schema transformations that violate it: 'for instance, an object type may not evolve into a method, and a constraint may not evolve into an instance (..) as a result, an application model history can be partitioned into the history of its constituent object types' [Proper, van der Weide'94] (p.342). CS changes that violate type persistence cause semantic heterogeneity between the old and new CS versions, e.g. the promotion of attribute values into attributes, or attributes into entities [Saltor, Castellanos, GarciaSolaco'92], [Sheth, Kashyap'92], [Hull'97]. We return to the issue of type persistence in chapter 7.



3. Analysis and guidelines invoked
Schema transformations and taxonomies provide us with a useful syntax of CS changes. However, it is syntax only, and semantic changes in operational CSs will generally not coincide with a single clear-cut transformation from the taxonomy. Nevertheless, an understanding of how the syntactic changes may impact operational CSs yields insight into how to minimize impact, and how to ease change propagation. The schema transformation approach affects the dimensions and guidelines of our framework for flexibility in the following ways:
- the environment dimension is not affected. The change driver is taken for granted, and the transformations have no influence on it,
- the timeliness dimension is improved. Impact of change is controlled and change propagation is facilitated by the use of standard schema transformations,
- the adaptability dimension may be negatively affected. Most schema transformations only comply with basic rules of the data model theory, but ignore or even violate the more advanced rules. Quality aspects such as CS validity, understandability and data independence must be safeguarded by the maintenance engineer.
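The toolkit idea can be made concrete in a minimal sketch. All names and rules below are hypothetical, not taken from any particular taxonomy: a CS is held as a dictionary of entities, and each atomic transformation guards a simple correctness rule (here, that no referenced entity may be dropped) before it is applied.

```python
# Hypothetical sketch of a schema-transformation toolkit.
# A CS is a dict: entity name -> {"attributes": set, "references": set}.

def add_entity(cs, name, attributes):
    """Atomic transformation: introduce a new entity with its attributes."""
    if name in cs:
        raise ValueError(f"entity {name} already exists")
    cs[name] = {"attributes": set(attributes), "references": set()}

def add_reference(cs, owner, owned):
    """Atomic transformation: add a reference from one entity to another."""
    if owner not in cs or owned not in cs:
        raise ValueError("both entities must exist before they can be linked")
    cs[owner]["references"].add(owned)

def drop_entity(cs, name):
    """Atomic transformation: remove an entity, refused while it is referenced."""
    if any(name in e["references"] for e in cs.values()):
        raise ValueError(f"entity {name} is still referenced")
    del cs[name]

# Build a toy schema through atomic steps only.
cs = {}
add_entity(cs, "PERSON", ["name", "birthdate"])
add_entity(cs, "CONTRACT", ["number"])
add_reference(cs, "CONTRACT", "PERSON")
```

Depending on the imposed correctness rules, `drop_entity` may or may not count as atomic: with the rule coded above, PERSON cannot be dropped while CONTRACT still refers to it, illustrating how the extent of the taxonomy varies with the rules.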

3.3.2 Normalization
'A random grouping of attributes (lack of cohesiveness) will make the E-R model difficult to maintain; however, the database accuracy is not seriously compromised' [Kesh'95] (p.685). A CS is normalized in order to minimize update anomalies at the level of data instances. Usually, this is formulated as the demand that the CS be in Boyce-Codd Normal Form [Teorey'94].

1. The classic technique
Normalization is founded on two ideas. First, the Universal-Relation assumption is that 'the kinds of facts about entities are predefined and are expected to remain quite stable' [Kent'79] (p.114). Second, functional dependence [Makowsy, Ravve'98], [Date'00], [Rivero, Doorn'00] is an additional construct characteristic of relational data model theories. The designer is expected to identify all relevant functional dependencies between properties in the UoD, and to disregard all other interrelationships that may exist between the data attributes. However, determining a complete set of uniquely defined attributes and dependencies in a turbulent business environment is a major effort, perhaps even prohibitively so.

2. Eliminating derived data from the schema
Derived data in the CS is deterministically dependent on other data. This can be regarded as a special case of functional dependence [Winter'93]. The elimination of derived data removes an update anomaly from the CS. This improves adaptability, as the derivation formula can now vary without affecting the stored data. Some authors even doubt that derived data are permissible in the CS at all [Kesh'95]. However, when considering actual schema evolution, things are more complicated. For instance, what is source data at one time may later reduce to derived data, or the other way around. The issue is further discussed in chapter 7 and in [Wedemeijer'00aug].

3. Analysis and guidelines invoked
Most authors seem to assume that all violations of normalization will be corrected as a matter of course. [Buelow'00], writing from personal experience, denies the assumption. Furthermore, functional dependencies can change over time, like other constructs can. For instance, users can store data with functional dependencies that the CS does not capture. Authors do not indicate how to protect the valuable investment in normalization of the CS once the job is done. Normalization is regarded as a one-time effort, and no provisions are taken to safeguard Boyce-Codd Normal Form as the CS evolves. Normalization affects the dimensions of our framework for flexibility in the following ways:
- the environment dimension is not addressed. In particular, normalization takes the Universal Relation for granted,
- the timeliness dimension of schema changes is not affected,
- the adaptability dimension is addressed only in the 'model each feature once' guideline. The common experience that normalization, which aims to decrease update anomalies at the instance level, also seems to decrease the need for updates at the structure level, can be attributed to this guideline. The other guidelines that make up this dimension, viz. layering, clustering, and simplicity of understanding, are not affected by normalization.
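The Boyce-Codd Normal Form demand can be illustrated with a small sketch. The relation and its dependencies below are invented for illustration only: the closure of an attribute set is computed under the given functional dependencies, and the relation is in BCNF exactly when every left-hand side turns out to be a superkey.

```python
# Hypothetical sketch of a BCNF check via attribute-set closures.

def closure(attrs, fds):
    """Attributes functionally determined by `attrs` under the FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(relation, fds):
    """True if the left-hand side of every FD is a superkey of the relation."""
    return all(closure(lhs, fds) >= relation for lhs, rhs in fds)

# Invented relation CONTRACT(number, product, rate)
# with FDs: number -> product, product -> rate.
relation = {"number", "product", "rate"}
fds = [({"number"}, {"product"}), ({"product"}, {"rate"})]
# product -> rate violates BCNF, because 'product' is not a superkey;
# the classic remedy is to decompose the relation along that dependency.
```

Note that the check presupposes a complete and stable set of dependencies, which is exactly the Universal-Relation assumption the surrounding text questions.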

3.3.3 Modularization of the Conceptual Schema
Modularization is the clustering of entities and relationships into more comprehensive units. It is claimed that 'clustered entity models are easy to maintain and comprehend whilst being resilient to change' [Feldman, Miller'86] (p.349). Modules are generally chosen to be disjoint, but this is not essential as long as user understanding and maintenance of the CS are improved. The process of clustering does not enrich the CS information content, nor is information lost in the process.

1. Modularization based on distance measures
A formal approach to modularization is the use of a distance or affinity measure [Batini, Lenzerini, Navathe'86], [Castano, De Antonellis, Fugini, Pernici'98]. The distance metric can be built from semantic and conceptual criteria, but other features like data access and application behavior may also be involved [Deursen, Kuipers'99]. A fine-grained distance function can be based on data manipulation of individual attribute values instead of on entire entity populations. At this level of detail, modularization and normalization become indistinguishable.

2. Packages: using the lattice structure to designate modules
Is-a and Has-a reference hierarchies in the CS can be exploited to create so-called packages [Rauh, Stickel'92], [Blaha, Premerlani'98] (p.62). A closely related modularization approach in Object-Role data model theory is described in [Campbell, Halpin, Proper'96]. Packages are created by way of transitive closure. Entities without owners are designated as major entities that 'are often characterized by having an inherent stability and an easily discernable, though often complex "life-cycle"' [Feldman, Miller'86] (p.351) while at the same time, 'major entities are very complex with involuted relationships, a number of subtypes and complex key structures' (p.359). Every other entity in the CS is then aggregated into a single package, depending on the major entities that it is connected to by following references upward.

3. Other clustering mechanisms
Apart from the modeling primitives of generalization and aggregation, other properties may be used as criteria for clustering:



- dominance clustering [Teorey'94] (p.65), where modules are organized around entities that are dominant in some way, an idea closely related to the notion of major entity,
- relationship clustering [Jaeschke, Oberweis, Stucky'93],
- constraint clustering [Pels'88],
- clustering based on organizational considerations such as business functional areas and functional data ownership [Alstyne, Brynjolfsson, Madnick'95], [Ebels, Stegwee'92].
In large CSs, multiple levels of modules are conceivable, perhaps using multiple clustering criteria [Teorey, Wei, Bolton, Koenig'89]. Leveling must be done prudently, as multi-leveled CSs are felt to be rather complicated, frustrating the very purpose of improved understanding.

4. Module as a construct in the data model theory
Modularization adds a level of abstraction on top of the otherwise flat CS, and conveys knowledge that is not readily extractable from the flat CS. For this reason, some authors consider the notion of module to be a high-level modeling construct that should be part of the data model theory [Papazoglou'95], [Marcos, Vela, Cavero, Caceres'01]. [Dittrich, Gotthard, Lockemann'86] discuss a notion of complex entity as 'the representation of one miniworld entity - whatever its composition out of other entities may be - by exactly one database construct' (p.422). Regrettably, they also state that a complex entity is 'simply a boundary line drawn around a set of object and relationship types in the schema' (p.425). Our data model theory does not include the module construct, and the modularization of evolving CSs is not investigated in our research.

5. Analysis and guidelines invoked
Modules are supposed to cluster those constructs that will evolve coherently later on. [Feldman, Miller'86] write how 'change is usually confined to one information area' and 'the clusters minimize the impact of change' (p.359). The CS changes are encapsulated, and this improves both the timeliness and the adaptability of change. However, it is not evident that all of the above clustering criteria are suited to the purpose. Modules are created in design without regard for future developments, and it is unknown whether the designed modules will remain optimal as the CS evolves. As the modularization criteria and the designated modules are fixed once and for all, the effect of modularization may well be a decreased level of flexibility of the CS. Finally, notice that modules in the CS created by the designers need not correspond to clusters in the real world as perceived by the users. This mismatch may be a cause of future misunderstandings between users and designers. Modularization has the following effects on the dimensions of our framework for flexibility:
- the environment dimension is not affected,
- the timeliness dimension of flexibility is affected by the restricted impact of change,
- the adaptability dimension is affected in three ways. First, modules are provided that can be changed in isolation from the remainder of the CS. Second, the modules are an added abstraction layer on top of the otherwise 'flat' CS. Third, internal complexity of modules can be hidden from view so that the overall CS seems simpler and more understandable.
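The distance-based approach to modularization can be sketched as follows. The entities and affinity scores below are invented for illustration: entity pairs whose affinity exceeds a threshold are grouped into one module by a simple union-find pass, so that each module collects entities expected to evolve coherently.

```python
# Hypothetical sketch of affinity-based modularization via union-find.

def modularize(entities, affinity, threshold):
    """Group entities into modules: pairs with affinity >= threshold merge."""
    parent = {e: e for e in entities}

    def find(e):
        # Follow parent links to the representative of e's module.
        while parent[e] != e:
            e = parent[e]
        return e

    for (a, b), score in affinity.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    modules = {}
    for e in entities:
        modules.setdefault(find(e), set()).add(e)
    return list(modules.values())

entities = ["PERSON", "CONTRACT", "PRODUCT", "RATE"]
affinity = {("PERSON", "CONTRACT"): 0.9,   # invented affinity scores
            ("CONTRACT", "PRODUCT"): 0.2,
            ("PRODUCT", "RATE"): 0.8}
clusters = modularize(entities, affinity, threshold=0.5)
# yields two disjoint modules: {PERSON, CONTRACT} and {PRODUCT, RATE}
```

The sketch also shows the weakness discussed above: the affinity scores and the threshold are fixed at design time, so the resulting modules need not remain optimal as the CS evolves.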

3.3.4 Reflective approach
'Reflection is the ability for a system to manage information about itself and to access (or reason about) this information through the regular access primitives of the model' [Peters, Tamer Öszu'93] (p.34). The reflective approach assumes that the structural level of the CS can be queried and changed. For example, the query 'which connections exist between person X and person Y' can be answered from the stored data, whereas the variant 'what kinds of connection may possibly exist between person X and person Y' refers to the information structure, and is answerable only by inspecting schema semantics. In the context of the conventional 3-Schema Architecture, the ability to reason about something requires that something to be part of the UoD. In other words, the CS is perceived as part of the UoD (figure 10) and it becomes self-describing. Evidently, the distinction between UoD and CS is blurred in this approach. There is the threat that the reflective capability is taken too far, as users may want to reason about other kinds of information as well: data distribution and fragmentation, security, resource allocations, etc. The final CS will no longer capture conceptual features of the UoD only, and quality aspects such as simplicity, data independence, and completeness are compromised.

[Figure 10. The CS, when perceived as part of the UoD, becomes self-describing: next to the regular Universe of Discourse, an extended Universe of Discourse includes the self-describing CS]

1. Integrating the data model theory into the CS
The high road to reflectivity is to extend the UoD and consider the CS to be part of it. The challenge for the designer is to capture the CS information structure in the CS. Luckily, the information structure of the CS itself is well defined: it is the data model theory. Hence, it suffices to incorporate all concepts of the data model theory in the CS [Proper'94], [López, Olivé'00]. Structural change in the UoD drives CS changes. Because the CS is now part of the UoD, the CS changes must be handled as if they were regular changes in the UoD [McLeod'88]. This is an example of semantic discrepancy [Saltor, Castellanos, Garcia-Solaco'92] as well as update anomaly. For instance, a plain attribute 'name' of an entity PERSON is captured once in the ordinary way, and again in the 'person.name' instance of the CONCEPTUAL-ATTRIBUTE entity. This duplicity increases the impact of UoD changes and causes updates to be more tedious and time-consuming.

2. Incorporating the database storage model theory into the CS
The low road to reflectivity is to incorporate the running database, or its Internal Schema, into the UoD [Boogaard'94], [Veldwijk'95]. As the resultant CS will be fully compatible with the Internal Schema, this approach to reflectivity is attractive to database vendors, as evidenced by the [IBM'99] announcement '(..) to include full model-in-the-model capability for extending the information model specification' (p.5). A major problem is that non-conceptual features like the data fragmentation scheme or access paths will turn up in the CS, seriously compromising data independence [Bakker'93]. Change in any of those features must be considered unjustified, as it is not driven by true conceptual changes in the perceived UoD.

3. Analysis and guidelines invoked
The ability of a CS to manage information about itself and to use this information in CS evolution is attractive, yet difficult to exploit. On the one hand, structural changes in the real world can be handled just like regular changes, and the reflective capability enables quick assessment of the impact of proposed changes. On the other hand, change accommodation is hampered because structural information is captured on two levels of abstraction. The impact of changes is increased and data coercion becomes more complicated. We conclude that the dimensions of our framework for flexibility are affected by the reflective approach as follows:
- the environment dimension is affected in the 'select the best scope for UoD' guideline, as the UoD is expanded to incorporate the CS (or the Internal Schema). Engineers using the reflective approach should be aware that their perception of the UoD differs considerably from the user-perceived UoD,
- the timeliness dimension may be improved as changes can be made faster and easier, but the inevitable semantic discrepancies in the CS will have a deteriorating effect,
- the adaptability dimension is negatively affected. Features of the UoD are captured on two different levels of abstraction. The resulting CS may not be simple, readable, understandable, and maintainable in a real business situation.
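The PERSON/'person.name' duplicity can be shown in a minimal sketch. The table and column names are hypothetical: an ordinary schema is extended with meta-entities describing it, so every ordinary attribute appears twice, once as a column and once as a row of CONCEPTUAL_ATTRIBUTE, and both copies must be updated in step.

```python
# Hypothetical sketch of the reflective approach: the schema describes itself.

schema = {
    "PERSON": ["name", "birthdate"],
    # meta-entities, added so the CS can be queried at the structural level
    "CONCEPTUAL_ENTITY": ["entity_name"],
    "CONCEPTUAL_ATTRIBUTE": ["entity_name", "attribute_name"],
}

# The self-description: one row per ordinary attribute. (A fully reflective
# CS would also describe the meta-entities themselves; omitted here.)
meta_rows = [(entity, attr)
             for entity, attrs in schema.items()
             for attr in attrs
             if not entity.startswith("CONCEPTUAL_")]

# A structural query can now be answered from data, e.g.:
attributes_of_person = [a for e, a in meta_rows if e == "PERSON"]

# The update anomaly: renaming PERSON.name means changing both the
# column list of PERSON and the corresponding row in meta_rows.
```

The sketch makes the trade-off concrete: structural queries become ordinary data queries, but every UoD change now hits two abstraction levels at once.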



3.4 SURVEY OF PASSIVE STRATEGIES
The passive design strategy, sometimes referred to as 'stabilizing the CS' or 'risk reduction' [Blum'94], aims to decrease the need for, and the impact of, future change in the CS. The ultimate challenge, of course beyond reach, is to create a zero-maintenance CS, i.e. a CS that is a good model of the current UoD and that is free of change, remaining static no matter what [Boogaard'94].

3.4.1 Anticipate future developments
As phrased by [Levitin, Redman'95], this design strategy holds that 'the major avenue for strengthening a view's robustness is to anticipate changes and accommodate them into the view's original design' (p.85).

1. The basic idea
The experienced designer models not only the current Universe of Discourse requirements in the design, but also includes future ones, anticipating what might possibly change. Incorporating the extra requirements will prevent later rework [Land'82], [Orna'99].

2. Future change in relationship cardinality and other constraints
[Wilmot'84] suggests that relationship cardinalities are particularly susceptible to change: 'It is likely that "rules" set by management or other political bodies will change more frequently and quickly than inherent properties, and that rule changes will more frequently affect relationships among entities than the related entities themselves' (p.1241). The advice is to 'avoid inclusion of foreign keys in a record type in which they are not part of that record type's primary key' (p.1241) and to use an intersection record instead to prepare for future change in cardinality. [Davis, Arora'88] distinguish inherent constraints, which are determinants of the CS structure, from explicit real-world constraints, which are not embedded in the CS structure. They advocate modeling all real-world constraints as explicit constraints because 'these types of constraints are subject to change more frequently' (p.272).

3. Analysis and guidelines invoked
The idea is to expand the scope of the UoD to the foreseeable future, but what is foreseeable? Changes may be predicted that never materialize, and the effort to prepare for such changes never pays off. The dilemma is expressed in [Land'82]: 'building flexibility into systems can also be expensive, both in terms of design effort and operational performance. The designer is involved in a trade-off between the extra development and operational costs of designing a system which is adaptable and flexible - a very general system - or of designing a very specific system dedicated to the needs existing at the time of implementation' (p.67). Anticipating future changes will affect the dimensions of our framework for flexibility as follows:
- the environment dimension is addressed when one goes beyond present needs to select the best scope of the UoD through prediction and educated guesses,
- the timeliness dimension is affected because the impact of change has been reduced in advance whenever a predicted change materializes,
- the adaptability dimension ought not to be affected. However, the scope of the UoD is enlarged, and therefore the CS design will be larger and less simple, without an immediate need or user requirement to justify it.
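Wilmot's intersection-record advice can be illustrated with a small sketch. The entities and keys are hypothetical: the same 1:N facts are held either as a direct foreign key or as an intersection record, and only the latter absorbs a later shift to M:N as plain data, without structural change.

```python
# Hypothetical sketch of preparing for cardinality change.

# Variant 1: direct foreign key - each contract points to exactly one person.
contracts_fk = {"C1": "P1", "C2": "P1"}   # contract -> person

# Variant 2: intersection record - the same facts as (person, contract) pairs.
person_contract = {("P1", "C1"), ("P1", "C2")}

def persons_of(contract, links):
    """All persons linked to a contract in the intersection record."""
    return {p for p, c in links if c == contract}

# When the business rule changes so that a contract may be shared,
# the intersection record absorbs the change as ordinary data ...
person_contract.add(("P2", "C1"))
# ... whereas the foreign-key variant would need a structural change,
# since each dict key can hold only one owning person per contract.
```

This also illustrates the trade-off named in the analysis: the intersection record is the larger, more general structure, carried from the start whether or not the predicted rule change ever materializes.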



3.4.2 Schema Integration approach
Corporations generally use more than one information system. If the corresponding UoDs overlap, then an integrated CS is called for to ensure a valid, complete and consistent model of the combined UoDs [Goodhue, Wybo, Kirsch'92], [Johannesson'93], [Cheung, Hsu'96], [McBrien, Poulovassilis'98]. A comparison of schema integration methods is found in [Batini, Lenzerini, Navathe'86].

1. Integration as design method
Papers on schema integration usually describe the integration procedure as a single, massive design effort [Bonjour, Falquet'94], [Mirbel, Cavarero'96], [Hars'98], [Lehmann, Schewe'00]. A stepwise approach to schema integration is described in [Filteau, Kassicieh, Tripp'88]. The literature is unspecific as to which CSs should be integrated, and in what order and to what degree (depth) [Stickel, Hunstock, Ortmann, Ortmann'94]. The general opinion seems to be that the decision comes naturally, and that a capable designer is able to decide the issue intuitively from user demands and current business needs.

2. The Global Conceptual Schema approach
A single Conceptual Schema can be envisioned that models all the data being processed in the enterprise. We call it the Global Conceptual Schema [Batini, Ceri, Navathe'92] to emphasize that it is still a Conceptual Schema. Many other names are in use, such as Global Data Model [Brancheau, Schuster, March'89], Corporate Data Model [Feldman, Miller'86], [Shanks, Darke'99], or Enterprise Model [Aiken, Yoon, Leong-Hong'99], [Jarke, Jeusfeld, Quix, Vassiliadis'99]. [Persson, Stirna'01], in an explorative study of integrated model usage, find that two business purposes are served: either to develop the current business, or to ensure its way of dealing with information. The global CS is larger, more complex and more difficult to understand than the individual CSs, but it provides a common language for communication about the valuable data assets of the enterprise.

3. Database federation
Database federation aims to provide users with a single user interface to access the heterogeneous files and databases that are operational in the enterprise [Sheth, Larson'90], [Rusinkiewicz, Sheth, Karabatis'91], [Kent, Ahmed, Albert, Ketabchi, Shan'92], [Radeke, Scholl'94], [Leymann'99], [Josifovski, Risch'99], [Chiang, Lim, Storey'00]. A prerequisite in database federation is a common understanding of the data, as well as automatic resolution of conflicting data [Agarwal, Keller, Wiederhold, Saraswat'95], [Neiling'99]. The database federation effort may require adjustments in local CSs and stored data to enforce a single consistent perception of reality [Albert'00]. These CS adjustments must be qualified as unjustified, as there are no corresponding changes in the real world.

4. Analysis and guidelines invoked
Early authors suggest that 'the stability of the enterprise schema allows it, hopefully, to survive through changes in user views of the data and even changes in the DBMS' [Tsichritzis, Lochovsky'82] (p.175). However, experience has shown that the effort to create the global CS is an almost insurmountable task with disappointing results [Thompson'93], [Gerard'94].



[Goodhue, Kirsch, Quillard, Wybo'92] state that: 'In spite of strong conceptual arguments, (..) strong interest expressed by senior IS executives, and the use of Strategic Data Planning in many organizations, empirical research has found more evidence of problems than of success' (p.12). Creating a global CS may take years of meticulous planning [Ebels, Stegwee'92], and the result may be outdated even before it is finished, due to unremitting change in the business environment. The new interdependencies between previously independent and autonomously maintained schemas necessitate protracted change procedures to ensure the integration of the local CSs. As stated by [Crowe'92]: 'Without a significant commitment to planning and design, integration projects aimed at increasing flexibility will, in all probability, reduce flexibility; integration is by no means synonymous with flexibility' (p.33). The dimensions of our framework for flexibility are affected by the schema integration approach as follows:
- the environment dimension is involved because a single UoD is envisioned that unites a number of local UoDs,
- the timeliness dimension will be negatively affected. Many stakeholders with possibly conflicting interests must approve the changes, and this will reduce the speed of change,
- the adaptability dimension is positively affected in two guidelines. UoD features are modeled only once in the integrated CS. The integrated CS is a natural layer of abstraction over and above the local CSs. However, CS simplicity is negatively affected.

3.4.3 Repository tools
Repository systems [Ehrensberger'77], [Winkler'89] are indispensable in CS design and maintenance. A tool organizes CS documentation, making it easier to understand and more amenable to analysis.

1. Passive documentation tools
Initial data dictionary systems were intended as a passive means for storing and retrieving information about the CS [Batini, Di Battista'88], [Siegel, Madnick'91]. This view is too limited, and it has since been realized that tools, called Information Resource Dictionaries, must capture the close interaction between the CS and the other information system components [Steele, Zaslavsky'93], [Hsu'96], [Nissen, Jarke'99]. In addition, such tools have to provide excellent access to, and organization of, CS documentation [Santucci'98], [Ritter, Steiert'00]. Repository content must be kept up to date whenever a CS changes. One approach is to have designers and maintenance engineers enter the changed descriptions manually. Content may also be extracted from development tools such as CASE tools and software generators. A last resort is to extract a system description from the operational information system by using reverse-engineering techniques. This effort is generally expensive and time-consuming, and unlikely to be repeated [Aiken, Muntz, Richards'94], [Cheung, Hsu'96].

2. Analysis and guidelines invoked
Theoretically, tools ought not to affect the stability of a well-conceived CS. In practice, repository systems are indispensable. A tool can make maintenance on a bad CS manageable, but we disagree with [Navathe, Kerschberg'86] who claim that 'data dictionary systems will aid in better database design' (p.22). The mere presence of a repository tool does not force engineers to exploit the full potential for change built into the CS [Fox, Frakes'95]. We therefore feel that claims about improvements in CS flexibility need to be better substantiated. Repository tools have the following effects on the dimensions of our framework for flexibility:

- the environment dimension is unaffected by tools,
- the timeliness dimension is improved, as impact-of-change analysis is faster. Some repository tools can also generate data conversion routines and new application source code, thus easing change propagation,
- the adaptability dimension is improved. Repository tools generally make the CS documentation easier to understand and analyze.
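The impact-of-change query that speeds up the timeliness dimension can be sketched as follows. This is a minimal, hypothetical repository, not the API of any existing tool; component and element names are invented:

```python
# Minimal sketch of a repository (data dictionary) that records which system
# components depend on which CS elements, so impact-of-change analysis
# becomes a simple query. All names are illustrative.
from collections import defaultdict

class Repository:
    def __init__(self):
        self.uses = defaultdict(set)  # CS element -> components using it

    def record_use(self, component, element):
        self.uses[element].add(component)

    def impact_of_change(self, element):
        """Components that must be inspected if this CS element changes."""
        return sorted(self.uses[element])

repo = Repository()
repo.record_use("invoicing_app", "Customer.address")
repo.record_use("mailing_module", "Customer.address")
repo.record_use("invoicing_app", "Invoice.total")
print(repo.impact_of_change("Customer.address"))  # ['invoicing_app', 'mailing_module']
```

Keeping such cross-references current is exactly the maintenance burden discussed above: the query is only as trustworthy as the recorded dependencies.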

3.4.4 Proven pattern (re)use
If so many CSs have been developed by so many enterprises, why not use them as a source of robust and proven solutions that have stood the test of time? Patterns represent the experience of the past in an abstracted form [Di Battista, Kangassalo, Tamassia'89], a passive knowledge that many experienced engineers apply intuitively in their work.
1. Libraries of patterns
Several authors develop the idea of a component library of design patterns for reuse [Ahrens, Prywes'95], [Hay'95], [Kwon, Park'96], [Han, Purao, Storey'99], [Fernandez, Yuan'00]. The patterns are either created from scratch, or extracted from existing schemas by identifying good conceptual constructions that apply to a broad range of conceivable UoDs [Wohed'00may].

[Figure 11. Pattern reuse. The figure shows the reuse cycle: create the library and stock it with reusable patterns (library of patterns); understand user requirements, search the library, and select appropriate pattern(s); then adjust the pattern to fit the problem, deploy the pattern in the CS, and account for the impact of change (information system).]

2. Disclosure and deployment
A pattern library alone is not enough to achieve flexibility through pattern reuse (figure 11).


If the engineer cannot locate a satisfactory pattern quickly and easily, he or she will revert to the old ways and solve the problem from scratch. [Castano, De Antonellis, Zonta'92] describe a method to organize and classify suitable constructions once they are isolated. The library must be searchable for appropriate patterns [Wohed'00nov]. Next, the selected pattern must be extracted and tailored to fit the problem at hand. At the same time, the necessary modifications must not harm the original intent and quality of the pattern. The literature remains vague on how to execute these steps.
3. Analysis and guidelines invoked
A well-stocked component library will contribute to standardization of definitions and schema designs throughout the enterprise, saving considerable time and effort. The library will serve as a 'tool kit of spare parts', helping to reduce response time when a problem appears. This way of deploying patterns might even be considered an active strategy. Regrettably, the current state of affairs is that libraries are lacking, that methods for selection and retrieval are underdeveloped, and that pattern deployment is uncharted territory. Quality is a serious concern in pattern deployment, as it may be speculated that every pattern is good if used in its proper context. However, if there are no bad patterns, how does one select the most appropriate pattern for a particular contingency? Overall, it is not evident how [Di Battista, Kangassalo, Tamassia'89] can conclude that 'as a result the quality of applications will increase' (p.257).
Pattern (re)use affects the dimensions of our framework for flexibility as follows:
- the environment dimension is favorably affected. The use of patterns helps the designer get the real-world scope right, and to understand and capture its essential features,
- the timeliness dimension is improved when new user requirements are accommodated quickly and correctly through a proven pattern,
- the adaptability dimension generally remains unaffected, although the patterns may be expected to be well structured and to have some positive influence on each guideline.
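The 'search the library' step discussed above can be sketched as keyword matching. The patterns and keywords below are invented for illustration; real libraries need far richer classification schemes, as the cited literature argues:

```python
# Hedged sketch of pattern-library search: patterns are indexed by keywords,
# and a requirement is matched by keyword overlap. Library content is invented.

library = {
    "party": {"keywords": {"person", "organization", "customer", "supplier"}},
    "bill-of-materials": {"keywords": {"assembly", "component", "part"}},
    "event-history": {"keywords": {"state", "transition", "timestamp", "audit"}},
}

def search(requirement_terms):
    """Rank patterns by how many requirement terms their keywords cover."""
    scores = {name: len(info["keywords"] & requirement_terms)
              for name, info in library.items()}
    return [name for name, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s > 0]

print(search({"customer", "supplier", "audit"}))  # ['party', 'event-history']
```

The hard part the literature leaves open, tailoring the selected pattern without harming its intent, is deliberately absent from this sketch.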

3.4.5 Procedural approach
In our experience, designers and maintenance engineers use a multitude of intuitive procedures to carry out maintenance as fast as possible.
1. Change as little as possible
A common business practice in maintenance is to change the CS as little as possible, even to the point where existing CS constructs are deliberately misused for new purposes. For instance, an existing data field can be slightly twisted to implement a new attribute. This way of satisfying current user requirements is bound to endanger future maintenance [Premerlani, Blaha'94]. A similar practice is to persuade users not to press for change, by convincing them that their change requests are not really worth the trouble of changing the CS.
2. Follow procedures
[Ra, Rundensteiner'99a] observe that 'in a typical environment, a developer must consult with others to figure out the impact of a desired schema change on other programs. This decision process of even a small schema change can be long and tedious' (p.2). They conclude that 'schema update capabilities are hence more limited by their impact on existing programs rather than by the power of the supported schema change language'.


Proper change procedures, such as ITIL, will improve the quality and swiftness of change propagation [El Emam, Höltje, Madhavji'97]. The procedures, once in place, ensure that the full impact of change is accounted for in the CS, as well as in the other components of the information system [Ewald, Orlowska'93]. On the business level, the procedures must ensure that changes are well coordinated with user training, software application changes, system interfaces etc.
3. Analysis and guidelines invoked
Procedures intend to speed up the process of change, but should have no effect on the environment or adaptability dimensions. On the other hand, simplicity and understandability decrease whenever existing CS constructs are deliberately misused.
The consistent use of organizational procedures for CS change has the following effect on the dimensions of our framework for flexibility:
- the environment dimension is not affected,
- the timeliness dimension is affected. The procedural approach may decrease the impact of change, increase the speed of change propagation, or both,
- the adaptability dimension is generally unaffected.


3.5 SURVEY OF ABSTRACTION STRATEGIES
A third strategy is to put less information into the CS, and to abstract from the level where actual changes occur. As an abstracted CS models fewer features, there will be fewer changes, and the overall stability and flexibility of the CS improve. It is a fundamental problem to know in advance which real-world features are essential to achieve the best flexibility in the abstract CS, and which features in the UoD are irrelevant. Design decisions on those features still need to be made, but they are postponed until the Logical or Internal Schema is designed. A threat to abstraction strategies is that the abstract CS is neglected in maintenance, because the potential for change in the low-level schemas is sufficient for maintenance.

3.5.1 Ontological approach
Ontology is 'the branch of philosophy which deals with what is "out there" in the world' [Wand, Monarchi, Parsons, Woo'95] (p.287). Because 'ontology can provide a generalized (that is, non domain-dependent) semantics for conceptual modeling languages' (p.300), an ontology can be established without relying on the UoD information structure. Therefore, the ontology is time-invariant, as changes in the operational environment do not affect it.
1. Semantic classification
The main approach is to classify relevant concepts in a semantic network. The classifications can be based on a distance measure [Bonjour, Falquet'94], ontological categories [Gruber'96], or other criteria. [Storey, Ullrich, Sundaresan'97], using eight discriminators, come up with 57 categories. By design, the classification of each concept is permanently fixed: a real-world object cannot transgress the criteria over time and still be the same [Delcambre, Langston'96].
2. Natural language classification
[Hars'98] uses natural language as the mechanism for classification. To provide a common basis for information analysis, a general-purpose dictionary of some 23,000 English words is developed that does not vary over time, nor across enterprises or departments. This dictionary is implemented in a tool that detects synonyms, inconsistencies in names, and semantic ambivalences in a CS. [Overmyer, Lavoie, Rambow'01] follow a related approach when providing a tool to support conceptual modeling by linguistic analysis.
3. Analysis and guidelines invoked
[Wand, Monarchi, Parsons, Woo'95] point out that 'ontology does not provide guidance on identifying or organizing the important concepts in a certain domain or on analyzing the dynamics of the domain' (p.300). Ontology grasps at the essential features of the objects in the UoD to provide well-organized and abstract concepts, but it limits the scope of how those real-world objects, or the users' perceptions of them, may evolve over time.
The abstracted concepts constitute a level of high abstraction in the CS, which in practice may turn out to be too abstract for use.
The ontological approach affects the dimensions of our framework for flexibility as follows:
- the environment dimension is affected because ontology grasps at the essential features of real-world objects. Also, the ontology puts a limit on how entities may evolve over time,
- the timeliness dimension is not affected,
- the adaptability dimension is affected in one guideline: an ontology-based CS will have abstract ontological layers built in.
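The idea of classification by discriminators, as in the eight-discriminator scheme of [Storey, Ullrich, Sundaresan'97], can be sketched on a much smaller scale. The two discriminators and four categories below are invented for illustration only:

```python
# Illustrative sketch of semantic classification by binary discriminators.
# The discriminators (tangible? can act?) and categories are invented; a real
# scheme such as Storey et al.'s uses eight discriminators and 57 categories.

CATEGORIES = {
    (True, True): "physical agent",
    (True, False): "physical object",
    (False, True): "abstract agent",
    (False, False): "abstract concept",
}

def classify(is_tangible, can_act):
    """Map two yes/no discriminators to a permanently fixed category."""
    return CATEGORIES[(is_tangible, can_act)]

print(classify(True, False))   # e.g. a 'Warehouse' entity: physical object
print(classify(False, False))  # e.g. an 'Account' entity: abstract concept
```

The fixed mapping illustrates the point made above: once classified, a concept cannot transgress the criteria and still be the same.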


3.5.2 Abstract data model theory
A CS is made more abstract either explicitly by the designer, or implicitly because an abstract or 'semantic' data model theory is used that does not provide the constructs for undesired features. Authors often claim that a CS founded on their favorite data model theory will prove to be better in some way than if it were founded on another data model theory [ter Bekke'93], [Boogaard'94], [Hitchman'00].
1. Semantic data model theory
Semantic data model theories offer only constructs that are capable of capturing the essential, or semantic, features of the UoD [Hammer, McLeod'81], [Abiteboul, Hull'87], [Borgida'91], [Chudziak, Rybinsky, Vorbach'93], [Ram'95], [Masood, Eaglestone'98]. Technical features such as data access and storage considerations are banned. Other constructs that ordinary data model theories offer are also eligible for elimination or simplification, such as the notion of primary key, attribute storage formats, and reference cardinality constraints. For instance, [Halpin'91] claims that CSs created in Object-Role Modeling theory are more stable because a single, unified construct is used to model both attributes and relationships. Comparisons of semantic data models are reported in [Hull, King'87], [Peckham, Maryanski'88], [Laender, Flynn'93], [Saiedan'97].
2. Data model theories that account for CS evolution
Many data model theories do not account for the temporal aspects of data [Tsichritzis, Lochovsky'82], perhaps because 'if the temporal dimension of the data is not supported by the data model then the task of data management is simplified' [Ling, Bell'90] (p.217). Indeed, information processing is much more complex if temporal aspects on the level of the data structure, i.e. evolving CSs, are taken into account. Some work in this area is found in [Gal, Etzion'95], [De Castro, Grandi, Scalas'97], [Wei, Elmasri'98].
Related work in Object-Role Modeling theory is found in [Oei, Proper, Falkenberg'94], [Proper, van der Weide'94].
3. Adaptable data model theory
Taking adaptability one step further, the data model theory itself can be made to vary [Tresch, Scholl'92]. The idea is to include in the higher abstraction levels of the data model theory those constructs that show a low susceptibility to change. The data model theory on a lower level of abstraction is extended with those constructs that are more susceptible to change. [Atkinson, De Witt, Maier, Bancilhon, Dittrich'90] discuss dynamic extensions of the data model theory. [Englebert, Hainault'99] propose to use an adaptable data model theory only in the design phase, but to keep it fixed in the later life cycle phases. Currently, full adaptability of all constructs of the data model theory is rarely considered: 'the meta model definition is versioned to support its own development, but runtime support for switching meta schemas is not seen as viable' [IBM'99].
4. Analysis and guidelines invoked
A more abstract CS has fewer constructs than a lower-level one, so it is presumably simpler and less prone to change. It is assumed that abstract data model theories will deliver CSs that are relatively stable. Regrettably, there is no convincing evidence of operational CSs that have superior flexibility thanks to the potential for flexibility built in by the abstract data model theory. The dimensions of our framework for flexibility are affected as follows:

- the environment dimension is affected because abstract data model theories capture only the essential features of the UoD. A fundamental assumption is that it is known precisely which features must be abstracted from to obtain the best flexibility of the resulting CS,
- the timeliness dimension is negatively affected. Change in a CS based on an abstract data model theory requires a longer chain of translations to propagate the change to the Internal Schema and operational data,
- the adaptability dimension is affected. A more abstract data model theory causes fewer details of the UoD to be perceived. The ensuing CSs will have fewer constructs, and be smaller and easier to understand than lower-level ones.
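The idea of data model theories that account for CS evolution, recording when each construct enters and leaves the schema, can be sketched as follows. This is a minimal, hypothetical illustration; the attribute names and dates are invented:

```python
# Sketch of a schema that records its own temporal dimension: each attribute
# carries a validity interval, so the schema can be reconstructed as of any
# moment. Names and years are illustrative only.

class VersionedSchema:
    def __init__(self):
        self.attributes = {}  # name -> (valid_from, valid_until or None)

    def add(self, name, at):
        self.attributes[name] = (at, None)

    def drop(self, name, at):
        start, _ = self.attributes[name]
        self.attributes[name] = (start, at)

    def as_of(self, t):
        """Attributes that were part of the schema at time t."""
        return sorted(n for n, (start, end) in self.attributes.items()
                      if start <= t and (end is None or t < end))

cs = VersionedSchema()
cs.add("name", at=1990)
cs.add("fax", at=1992)
cs.drop("fax", at=2001)
cs.add("email", at=1998)
print(cs.as_of(1995))  # ['fax', 'name']
print(cs.as_of(2005))  # ['email', 'name']
```

As noted above, supporting this temporal dimension makes data management markedly more complex: every query and constraint must now be interpreted against the schema version in force at the relevant moment.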

3.5.3 Abstraction layers in the Conceptual Schema
When large CSs are developed, e.g. in global schema integration, the sheer volume of information becomes problematic. [Delen, Looijen'92] indicate that the burden of maintenance is largely determined by three factors: data complexity, level of abstraction, and growth rate of the enterprise's data resource. An intuitive approach to bring complexity and abstraction under control is to build layers of abstraction into the CS. Proceeding top-down, a high-level CS is created first, and the more volatile details such as constraints are added later on.
1. The basic approach
Abstraction layering is supported if the data model theory provides constructs with built-in abstraction layering, or if a series of data model theories is used that provide constructs at progressively lower levels of abstraction. The lower-level theories extend the higher-level ones by using more, or more specialized, constructs and constructions. Thus, the most abstracted layer of a CS is a view of that very CS at its lowest level of detail.
2. Abstracting from the attribute and constraint constructs
The idea of having layers of abstraction in the CS is found in [ANSI/X3/sparc'78], [Mylopoulos, Fuxman, Giorgini'00]. Simply excluding attributes and constraints from the CS creates the 'upper' conceptual layer, which assumes that entities and relationships have superior stability. [Davis'90] distinguishes three abstraction layers. One tier consists of entities and attributes only; the second tier adds the relationships and functional dependencies. Other interdependencies and constraints are captured in the model on the third layer, coinciding with the usual notion of CS. Her approach assumes that some constraints are more stable than others.
3. Analysis and guidelines invoked
Abstraction layers can be defined in a CS for any number of reasons, e.g.
to enhance clarity of CS diagrams, to represent areas of responsibility, to delineate system and subsystem boundaries, etc. They can also be chosen for stability.
The approach of employing multiple levels of abstraction in the CS affects the dimensions of our framework for flexibility as follows:
- the environment dimension is affected, as the most essential and stable features of the UoD are captured first, at the highest levels of abstraction. Lower levels of abstraction capture successively more details of the UoD,
- the timeliness dimension is improved because the impact of changes can be studied at exactly the right level of detail, with the higher abstraction levels remaining stable,
- the adaptability dimension is affected in the guideline to use abstraction layers in the CS.
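The three-tier layering attributed to [Davis'90] can be sketched as progressively richer views of one schema. The CS representation below is invented for illustration:

```python
# Sketch of three-tier abstraction layering: tier 1 holds entities and
# attributes, tier 2 adds relationships, tier 3 adds constraints.
# The schema content is invented for illustration.

cs = {
    "entities": {"Customer": ["name"], "Order": ["date"]},
    "relationships": [("Customer", "places", "Order")],
    "constraints": ["an Order is placed by exactly one Customer"],
}

def view(schema, tier):
    """Return the CS as seen at a given abstraction tier (1 = most abstract)."""
    parts = ["entities", "relationships", "constraints"][:tier]
    return {part: schema[part] for part in parts}

print(list(view(cs, 2)))  # ['entities', 'relationships']
```

A change to a constraint then touches only the tier-3 view, leaving the more abstract tiers stable, which is the claimed timeliness benefit.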


3.5.4 Data architecture
[Goodhue, Kirsch, Quillard, Wybo'92] write that: 'though many authors have argued that a goal of strategic data planning is to develop an architecture of data, the term "data architecture" is not well defined' (p.14).
1. Data architecture as synonymous with Global Schema Integration
One interpretation is that a data architecture is the gross structure of the corporate information resource. Because of its high level of abstraction, this data architecture is not susceptible to change in the corporate environment or in the way of doing business. Potential benefits of this data architecture for the operational business are:
- it sets a company-wide standard. It predefines both the scope for, and the essential UoD features, to ensure that designers all have a common understanding of the business [Adriaans'93], [Looijen, Vreven'98],
- it is a blueprint for subsequent development. It provides a starting point that is updated through incremental development [Filteau, Kassicieh, Tripp'88],
- it provides a point of reference for quality assessments of operational CSs [Gomaa'95], [Ekering, Buitenhuis, Croon'98].
2. Architectural style
Another interpretation, which we call the architectural style, grasps at implicit assumptions and perceptions hidden in the enterprise's information models. Maintenance best practices are one manifestation of style; a summary of past design decisions regarding the perception and modeling of corporate data is another. Style consistency is expected to ease the readability and understanding of CSs. [Monroe, Kompanek, Melton, Garlan'97] state that 'an architectural style serves as the "conscience" for a system as it evolves. By characterizing the crucial design assumptions, a good architectural design gives direction to the system enhancement process, indicating what aspects of the system can be easily changed without compromising the system integrity' (p.44).
3.
Analysis and guidelines invoked
Many enterprises have devised overall data architectures, with varying success [Earl'93], [Smits, Poel'96], [Chan, Huff, Copeland'97]. However, the usual approach is to capture the essence of the core business in a single, abstracted CS once and for all. Hence, these attempts suffer the same kinds of drawbacks that we discussed under the Global Schema Integration approach [Galliers'93].
The notion of architectural style is abstract, and does not contribute to schema quality in a straightforward way. Instead, it is intended to provide guidance for maintenance. As such, the guidelines in our framework may be interpreted as an architectural style with a view to the long-term flexibility of the CS.
The data architecture approach has the following consequences for the dimensions of our framework for flexibility:
- the environment dimension is affected. A data architecture, once established, fixes both the scope of the UoD and the perception of which features are essential,
- the timeliness dimension is not affected,
- the adaptability dimension is improved. A data architecture enables a better and faster understanding of schemas, and it tries to capture each feature of the UoD in only one CS in order to ensure that future CSs will integrate seamlessly.


3.6 ANALYSIS RESULTS

3.6.1 Overview
Table 2 summarizes our findings on the flexibility guidelines that the reviewed state-of-the-art design strategies invoke to enhance CS flexibility. The table shows that every guideline is relevant to at least one strategy. The results of our analysis have several useful applications: they may give designers and maintenance engineers new reasons to stick to their old ways of working, or suggest new approaches to improve their current ways of working.

[Table 2. Guidelines for flexibility invoked by the design strategies. Rows: the thirteen strategies surveyed — active strategies (1. Schema Transformation, 2. Normalization, 3. Modularization of the CS, 4. Reflective CS in the UoD approach), passive strategies (1. Anticipate developments, 2. Schema Integration approach, 3. Repository tools, 4. Proven pattern (re)use, 5. Procedural approach), and abstraction strategies (1. Ontological approach, 2. Abstract data model theory, 3. Abstraction layers in the CS, 4. Data architecture). Columns: the eight guidelines, grouped by dimension — environment: select the best scope, capture the essence; timeliness: minimize the impact, ease the propagation; adaptability: keep it simple, use abstraction layering, each feature once, provide clusters. Cells mark whether a strategy invokes a guideline ('Yes'), may degrade it ('worse', 'may be worse'), or adds an abstraction layer ('added layer').]


3.6.2 Practical implications
The results of our analysis have some practical implications for designers and maintenance engineers.
A first important implication is that there is no single best strategy. Generally speaking, every strategy has enough body to be applicable in an operational business situation. A knowledgeable engineer can use each of the strategies in developing or maintaining a CS to good result. The downside is that there are no clear guidelines that tell the engineer under what conditions a certain strategy works best, or when it should be avoided. Design strategies do not provide clear-cut arguments to underpin that a delivered CS will have adequate flexibility in its operational life, nor do they enable reviewers to pinpoint a flexibility deficit. This is a point of serious concern, as the analysis reveals that several strategies actually have a deteriorating effect on some of the guidelines for flexibility.
Table 2 shows that no design strategy employs all of the guidelines for flexibility. Thus, an engineer who focuses on a single strategy while excluding all other approaches will probably miss out on some of the useful ways to enhance flexibility. Evidently, there is some potential to combine strategies if they address different guidelines of the framework. This observation may perhaps defuse debates about which design strategy an enterprise should adopt: an interesting debate from a methodological point of view, but not very effective business-wise. The finding may also indicate new directions for method engineers to enhance existing design strategies or develop new ones.
Experienced designers will intuitively combine several strategies to obtain the best result and to counter any negative effects on flexibility. Again, it is unclear which combinations of strategies will yield the best flexibility of the CS, or what the critical determinants for success are.
For instance, anticipation of future developments frustrates normalization to some extent; an abstract data model theory does not go well with the reflective approach. We did not, in our survey of literature, come across any evidence to support claims of superior flexibility of operational CSs in real business environments. We think this is a significant finding. Proponents of a particular strategy will claim that their strategy, if applied correctly, will deliver CS designs with a good potential for change, but their claims are invariably founded on arguments from theory. And the majority of these theoretic arguments are concerned with design only, ignoring the operational life cycle phase when the superior flexibility or inflexibility will become apparent (figure 12).

[Figure 12. CS flexibility matched to the CS life cycle phases. The life cycle runs from design, via build, test & implement, to use & maintain, and finally replace. Current design practices: a once-and-for-all strategy that does not cover maintenance, with no proof of effectiveness. What flexibility is about: operational schemas and changes in the ongoing business.]

We noticed in our survey that many authors, when claiming flexibility for a particular design approach, do not explain their concept of flexibility. This omission is harmful, as it may give rise to unwarranted expectations about the flexibility of CS designs produced with that approach. Overall, we find that there is a need for a well-founded and well-accepted definition of what CS flexibility is, or should be. The lack of a common denominator prevents progress in the field: theoreticians are unable to make objective comparisons of alternative approaches, and engineers are deprived of the opportunity to select a design strategy based on rigorous, objective arguments.

3.6.3 Justification of the framework
A first important argument for the validity of the framework is how it was constructed. Our working definition of CS flexibility from chapter 1 lies at its basis. From this, we developed three dimensions and eight guidelines that are expected to promote CS flexibility. The focus is on what should be done to obtain CS flexibility, not on how or when an engineer must apply the guidelines in order to achieve flexibility. Thus, the guidelines apply equally well in the design phase as in the operational life cycle phase, when maintenance on the CS is done.
Table 2 provides an internal justification of the framework, because the three dimensions and eight guidelines are seen to be both indispensable and adequate: the guidelines cover the mechanisms embedded in the design strategies to achieve CS flexibility. Of course, this is no surprise, as we developed and enhanced the framework iteratively while evaluating the design strategies.


Table 2 demonstrates that design strategies address one or two, but never all three dimensions of flexibility:
- active strategies primarily focus on the timeliness and adaptability dimensions, with some fringe effects in the environment dimension,
- passive strategies put their main efforts into the environment and timeliness dimensions, and pay little or no attention to adaptability,
- abstraction strategies are concerned with the environment and adaptability dimensions, but generally ignore the timeliness dimension.
This observation supports our claim of mutual independence of the three dimensions.
External justification of the framework rests on the observation that many of the variations on the flexibility theme found in the literature implicitly refer to our dimensions and guidelines. We listed many relevant references in the detailed analysis. Some examples of more or less explicit references are:
- [Crowe'92], working in the area of operations management, defines a manufacturing system to have a high degree of flexibility if it 'can efficiently and in a timely manner adjust to changing (..) product requirements' (p.26). All three dimensions are referred to.
- [Boogaard'94] indicates 'the existence of several types and aspects of flexibility, and the relevance of several attributes of flexibility, like the ability to change, the pace of change, the direction of change, the ease of change, etc.' (p.109). The listed aspects are covered by our dimensions and guidelines.
- [Oei, Proper, Falkenberg'94] define: 'information systems which are able to evolve to the same extent and at the same pace as their underlying organizations are called evolving information systems' (p.354, italics added). All three dimensions are referred to in this definition.
- [Jaegers'97] distinguishes 'external flexibility', i.e. the environment dimension, and 'internal flexibility', which we subdivided into the adaptability and timeliness dimensions.
- [Moody, Shanks'98] state that 'evaluating flexibility requires identifying what requirements might change in the future, their probability of occurrence and their effect on the data model' (p.104). This is well covered by our environment and adaptability dimensions. Timeliness is hinted at when they discuss the importance of 'reducing the time required to respond to market opportunities' (p.104).
- [Furukawa'01] defines flexibility as 'the ability of a MIS to absorb the demands made on it to cope with changes originating from inside and outside a company' (p.831). This definition refers to environment and adaptability. The timeliness dimension is implicitly referred to when flexibility is evaluated in terms of time and incurred costs.


3.7 CONCLUSIONS

3.7.1 Summary
This chapter developed a framework for flexibility based on the definition of CS flexibility introduced in chapter 1. The framework consists of three dimensions:
- environment, where the need for CS changes originates,
- timeliness, i.e. the time required to alter the CS itself, and to propagate the CS changes to any other components of the information system, and
- adaptability, i.e. the potential of the CS to accommodate changes in its structure.
These three dimensions were further refined into eight guidelines for flexibility, each formulating a widely accepted assumption or best practice regarding CS flexibility. Next, we presented a comparative review of over a dozen design strategies that are claimed to enhance the flexibility of operational CSs, thereby demonstrating the relevance of the framework. Three categories of design strategies were discussed.
Active flexibility, or adaptability, is the strategy to improve the design by arranging the constructs of the CS in such a way that it is easy to modify (the 'Engineering Abstractions' arrows in figure 13, copied from figure 6). Notice that no matter how easily the CS description can be modified, it is the operational environment that will resist the implementation of change. Ignoring the environment dimension, the active strategies assume that certain types of change are easy to accommodate if the CS is well arranged, and that any future changes will be of exactly those types. Some fundamental questions regarding active flexibility are:
- what principles must be applied to select the right engineering abstractions,
- how to detect and resolve conflicts in arranging the constructs, and
- how to match the perception of reality to a CS that arranges its constructs in special ways.
Passive flexibility, or 'stabilizing the CS', aims to decrease the need for future CS changes.
The experienced designer uses this strategy when he or she incorporates more requirements into the design than those originating from the current UoD (the 'Perceived-Reality' and 'Scientific Abstraction' arrows in figure 13). The assumption is that accommodating probable requirements will reduce the need for future change; most passive strategies ignore the adaptability dimension. Some important questions to be addressed in using a passive strategy are:
- how far ahead future requirements should be anticipated, and
- which requirements are relevant, and which are irrelevant for consideration.
The flexibility by abstraction strategy puts less information into the CS design, thus making it more abstract. More features of the UoD are deemed non-conceptual, increasing data independence. Fundamental to the approach is that it is known which features must be abstracted from, and that any changes in those features will not affect the CS. Abstraction strategies reduce the need for change, but generally do not consider timeliness should the need for change arise. This approach comes under the 'Perceived Reality' arrows in figure 13. Some of the questions regarding the application of abstraction strategies are:
- how to decide on the best level of abstraction, and
- how to transform the abstract CS into workable External and Internal Schemas.

62

Exploring Conceptual Schema Evolution

[Figure 13. Roles of the CS in the 3-Schema Architecture. The diagram relates 'reality' and the perceived reality of the Universe of Discourse to the Mental Model (via scientific abstraction and scientific progress), the Symbolic Model (via symbolic abstraction), and the Conceptual Schema with its External and Internal Schemas (via engineering abstractions), spanning the Conceptual Realm and the Data Realm; Limited Models, Derived Models and the 'Best' Model are also indicated.]

Validity of our framework for CS flexibility was argued in several ways. Construct validity derives from the fact that the framework was developed from the definition of CS flexibility in chapter 1. Internal validity is based on the observation that the design strategies are well covered by the three dimensions and eight guidelines. External validity is argued by the many references in the literature that implicitly refer to our dimensions and guidelines.
This chapter has brought out that current methods for the design and maintenance of CSs are not based on insight into the actual evolution of the CS over time. The design strategies are not geared towards practical application, do not consider their impact on the operational phase of the CS life cycle, and their effects in maintenance are unknown. We conclude that there may be many well-conceived arguments to support claims of flexibility for current information modeling methods, but whether those arguments meet the needs of evolving schemas in operational environments remains an open question. We think it is significant that we could not locate references in the literature where the actual flexibility of a CS in its operational phase was demonstrated. We are convinced that it is only by investigating CSs in operational business environments that we can expect to learn about the effectiveness of the design strategies for CS flexibility. Such research must investigate changes occurring in operational CSs, and analyze and assess them by way of objectively defined metrics for CS evolution.

Defining Metrics for Evolution

63

CHAPTER 4. DEFINING METRICS FOR EVOLUTION

'as expressed by Lord Kelvin: ... when you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge but you have scarcely, in your thoughts, advanced to the state of science, whatever the matter may be ...' From [Lehman'91] (p.254)

4.1 INTRODUCTION

4.1.1 The need for metrics
It is generally believed that a well-designed, flexible Conceptual Schema will remain stable over time. However, current literature rarely addresses how the level of flexibility should be observed and measured in an operational business environment with evolving information needs and database structures. It is clear that the notion of flexibility is too general and unspecific to be of value in assessing the quality of a CS design, and it does not contribute to an understanding of the evolution of the CS. The main problems with the concept of flexibility are its dependence on future events and its lack of specificity.
We overcome the first problem, dependence on future events, by investigating past events and assessing the level of flexibility, or rather stability, achieved in the past operational life. Paraphrasing our definition of flexibility, a working definition for the concept of stability may be:
the observation that the Conceptual Schema has been accommodating changes in the information structure of the Universe of Discourse over an extended period of time
The definition of stability refers to the environment and adaptability dimensions in the same way as the definition of flexibility does. The difference is in the timeliness dimension. Flexibility demands 'acceptable' speed of change in the future; the fact that the CS was changed at all indicates that the user community did accept the required amount of time (and effort) for change accommodation.
To overcome the second problem, lack of specificity, we need sound and objective criteria that can be measured and researched. The need for such metrics was already realized by [Navathe, Kerschberg'86], who, in discussing Business Information Plans and related models, remarked that 'it would be very desirable to have some objective measure of the degree of flexibility achieved, and the causes of the lack of flexibility' (p.28).


There are as yet no generally accepted methods and metrics to measure CS flexibility. The same holds true in other areas of information management, as observed by [Swanson'99]: 'there is no commonly accepted measure of software maintainability. While maintainability is asserted to be important, most organizations do not in fact monitor it' (p.66). It is the goal of this chapter to develop metrics specifically for CS flexibility. General notions about CS flexibility expressed in the literature serve to develop hypotheses on how schema stability ought to be expressed in operational environments. With each hypothesis, we associate a metric that may be used to test the hypothesis for evolving conceptual schemas in operational businesses. Each metric is based on straightforward measurements of conceptual features, produces objective (i.e. repeatable) outcomes, and shows the desired tendency for the associated hypothesis.

4.1.2 Chapter outline
The chapter is organized as follows. Section 2 takes generally accepted hypotheses about CS stability and proposes metrics for them. This part was previously published in [Wedemeijer'00sept]. Section 3 operationalizes eight metrics for application in our longitudinal field studies of evolving CSs, with procedural instructions that cope with measurement problems encountered in the operational business environment. We also outline why the operationalization of other metrics did not succeed. Section 4 argues the validity of the set of metrics from a theoretical point of view, by associating them with the dimensions and guidelines of our framework for flexibility. Operational validity will be demonstrated by the longitudinal case studies in the practical track of this thesis. Section 5 draws conclusions and outlines directions for further research.


4.2 METRICS FOR CONCEPTUAL SCHEMA EVOLUTION

4.2.1 Justified change
By definition, a CS is a valid, complete and correct model of the information structure of the UoD, and nothing else. As long as the business activities of the organization are perceived to remain unchanged, the information needs remain the same. It follows that a coherent and meaningful change in the CS is only justified if a change in the UoD information structure is causing it. Any change in the CS that cannot be linked to some driving cause in the UoD is by definition an unjustified change, or instability. For instance, the CS should be indifferent to increasing transaction volumes or to the installation of additional infrastructure. So the first demand that must hold for a quality CS is:
Hypothesis: every change in the CS is justified
To establish whether a change is justified, we need to
- determine every single CS change, and
- associate each one with the appropriate change driver(s) from the UoD.
The metric for justified change is the ratio of single CS changes that can be associated with an appropriate change driver, over the total number of CS changes (with or without change driver). Ideally, the ratio is equal to 1.
The metric is sensitive to the definition of 'single CS change'. We use the notion of semantic change, as introduced in chapter 1. Others simply identify single changes with the elementary changes defined in a taxonomy, a simplification that may be misleading when assessing the justifiability of CS changes. Many taxonomies consider only the transformation of a single construct or construction at a time, while the actual semantics may be a single, coherent change in several schema constructs at once. For instance, [Batini, Ceri, Navathe'92] suggest the fragmentation of one entity into a number of new, unrelated entities. Such an isolated change will never be observed in an operational CS because existing references in the CS lattice will be affected as well.
The metric is also sensitive to the demarcation of the UoD. Selecting the right scope for the UoD is an important topic in design and will receive much attention from the designers. But once the design phase is finished, the scope of the UoD is fixed. From then on, the CS is by definition the complete and correct model of the UoD information structure and vice versa: the UoD is exactly what is modeled by the CS. Consider for instance an enterprise that operates an integrated customer database. If in design it was decided to exclude the internal organization of the enterprise from the UoD, then it is unjustified to change the CS in order to record regional offices, or the subsets of customers handled by the regional offices.
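The justified-change metric above can be sketched in a few lines of code. The change descriptions and change drivers below are invented for illustration, and the bookkeeping (one list of UoD change drivers per semantic CS change) is our assumption, not a prescribed format:

```python
# Hypothetical sketch of the justified-change metric: each observed semantic
# CS change is recorded with the UoD change driver(s) it can be attributed to
# (an empty list means no driving cause in the UoD was found).
def justified_change_ratio(changes):
    """Ratio of CS changes with at least one UoD change driver; ideally 1.0."""
    if not changes:
        return 1.0  # no changes at all: trivially nothing unjustified
    justified = sum(1 for drivers in changes.values() if drivers)
    return justified / len(changes)

observed = {
    "split Customer into Person and Organization": ["new legal reporting duty"],
    "add Region entity": ["regional offices introduced in the UoD"],
    "rename attribute for technical reasons": [],  # no UoD driver: unjustified
}
print(justified_change_ratio(observed))  # 2 of 3 changes justified
```

A ratio below 1 signals instabilities: CS changes for which no driving cause in the UoD could be identified.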


4.2.2 Proportional size of change
In physics, the property of stability is defined for a system in (near) equilibrium as: any disturbance in the system's state will cause a reaction that is proportional to the size of the disturbance. Analogously, we want a small change in the UoD to cause a proportionally small change in the CS, assuming the change is justified. To wit: a CS may well be called inflexible if some relatively small change in the UoD triggers an avalanche of changes in the CS. We want our metrics to be sensitive to such inflexibilities, so we conjecture:
Hypothesis: every change in the CS will be proportional in size to the change in the UoD information structure that causes it
To establish whether a change is proportional to its change driver, we need to measure:
- the size of the change in the CS, and
- the severity of the change driver.
The metric for proportional change is established as the ratio of the size of the CS change over the severity of the change driver. Ideally, the ratio should have a low upper bound.
We need to relate the size of the semantic CS change to the severity of the corresponding change in the UoD. Regrettably, there is a problem in measuring the severity of change in the real world. The concept is not easily formalized or rigorously quantified, for the same reason that the notion of information structure of the UoD cannot be formalized without referring to some conceptual representation. For instance, [Gruhn, Pahl, Wever'95] acknowledge that changes in formal models are related to change in the UoD, but they do not quantify this relation. [Banker, Davis, Slaughter'98] investigate the relationship between software quality (complexity) and maintenance productivity, but ignore the environment dimension. Hence, the relation to severity of UoD change is missing in their work. One alternative would be to let users appraise the change requirements [Kaplan, Norton'92].
But it is blatantly incorrect to let the maintenance engineer decide on this: the severity of the UoD change would then of course be judged by its impact on the CS! We decided to evade the issue and use another alternative, setting the severity of every change in the UoD equal to 1. This means that the metric reduces to a mere 'size of CS change'. The size of CS change is easily determined as the number of affected constructs. Ideally, each UoD feature is captured by a single CS construct at most, in keeping with the guideline to 'model each feature once'. [Lerner, Habermann'90] discussed how 'the approach taken by other researchers is that all desired database changes can be addressed by making changes local to individual classes. This assumes that the overall design of the database is correct (..and it..) does not allow for an overall redesign and reorganization of the database' (p.71). [Gomaa'95] pointed out how a change 'may be localized to one object. However, a larger change can have a ripple effect that affects several objects' (p.195). In practice, the number of affected constructs can and will be higher than just one, even if we do not consider changes in conceptual attributes and constraints, which we exclude from our research.


4.2.3 Proportional rate of change
Likewise, it can be said that a system constantly undergoing some kind of change is not very stable. An operational CS supporting many user applications must have an acceptably low rate of change. But what rates are acceptable, and what is not low enough? Users will generally relate the intensity of changes in the CS to the business environment that is being modeled. A turbulent environment will go through frequent changes in its information structure, and users will accept a correspondingly high rate of change for the CS that models it. That same rate will probably not be accepted in a stable environment, such as a company engaged in growing a forest. So we have:
Hypothesis: the rate of change in the CS will be proportional to the rate of change in the UoD information structure
First, one has to measure the rate of change in the CS. This derives from two measurements:
- the difference between the old and new CS, i.e. the number of changes made in creating the new CS version, and
- the lifetime of the CS versions, i.e. the elapsed time between subsequent versions.
The rate of change is then calculated as the ratio of the number of differences between versions to the version lifetimes. The CS stability expressed in this rate of change improves over time if either the duration of CS versions increases - which may also reflect rigidity - or the number of changes between versions decreases.
Next, a measure for the rate of change in the UoD must be devised that is targeted at changes in information structure. We are not concerned with changes in information handled by ordinary transactions and data updates. In a similar fashion as above, we propose the rate of UoD change to be the ratio of two numbers:
- the difference between old and new user requirements, i.e. the number of changes made in the user requirements concerning the UoD information structure, and
- the lifetime of the consecutive sets of user requirements.
The turbulence in the UoD information structure is expressed as the ratio of the number of changes in requirements to the lifetime of the requirements. Of course, this is a somewhat hypothetical measurement. When confronted with real business situations, it will be next to impossible to come up with an exact and verifiable count of differences in requirements. A first approximation is to substitute the number of change drivers for it, as discussed above for the metric for justified change.
The metric for proportional rate of change is established as the ratio of the two measurements:
- the rate of change of the CS, to
- the rate of change in the UoD.
The metric can be simplified by setting the lifetimes of user requirements equal to the lifetimes of the CS versions. It then simplifies to the ratio of:
- the difference between the old and new CS, i.e. the number of changes made in creating the new CS version, and
- the difference between the old and new user requirements, i.e. the number of changes made in the requirements deriving from the UoD.
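Under the simplifying assumptions above (requirement lifetimes set equal to CS version lifetimes, and change drivers standing in for changed requirements), the proportional-rate metric reduces to a single ratio. The sketch below uses invented figures:

```python
# Hedged sketch of the simplified proportional-rate metric: the number of CS
# changes between two consecutive versions, per changed requirement driving
# the new version. A low value is desired.
def proportional_rate(cs_changes, requirement_changes):
    """CS changes per changed requirement between consecutive CS versions."""
    if requirement_changes == 0:
        # no changed requirements: any CS change at all is unjustified
        raise ValueError("no changed requirements to relate CS changes to")
    return cs_changes / requirement_changes

# e.g. 6 construct changes in the new CS version, driven by 3 changed requirements
print(proportional_rate(6, 3))  # 2.0 CS changes per requirement change
```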


Ideally, the ratio should have a low upper bound. Too high a rate indicates an unstable system, and users and management will not tolerate it for long. On the other hand, a CS with a very low rate of change may not keep abreast of changing business requirements, and might even be too rigid to change at all. This holds especially for fragile legacy systems where engineers are afraid to touch anything because it might trigger an avalanche of unexpected side effects. Generally, there is no way to discern between schemas that are stable, which is fine, and schemas that are rigid, which is bad. This metric offers no solution, as it measures only the observed rate of change, not the desired rate.
There is a caveat: the rate of change measurement is biased. It will appear to be better for small CSs than for highly integrated CSs. If the UoD is larger, then more features of the UoD can change, so the rate of change in the CS will appear to be higher. Imagine dividing a large and complex CS into two halves: the rate of change for each part is only half the rate of the integrated CS. The hypothesis must not be interpreted as advice to fragment large CSs. Other features of a non-conceptual nature may also influence the rate of change, such as the capacity of the maintenance department, or a lack of adequate maintenance facilities. The rate will be significantly lower if the department is understaffed.
This metric assumes that the entire CS is versioned [Roddick, Craske, Richards'93]. Some approaches use other versioning mechanisms, e.g. O-O data modeling theories allow versioning per construct [Andany, Léonard, Palisser'91]. The hypothesis may still hold, but the metric, and some others to follow, will not yield useful outcomes.

4.2.4 Compatibility
User requirements change over time, and users are keen to state their desires for the new CS. An additional user requirement, often not expressed in the demand for change, is to ensure that current data and structures are minimally affected by the change. This is compatibility, the demand to keep the impact of change as small as possible. It is a natural drive towards stability, and it eases schema evolution because the need for complex data conversions is intentionally minimized. We define a new CS to be compatible with the old one if all recorded data, for any construct in the old CS, can be fitted into the new CS without the need for manual adjustments in the data. Designers will go out of their way to prevent human intervention, as it considerably increases the overall cost, time and effort of change. Thus, a CS may change in a compatible way while CS quality aspects are compromised. For instance, a familiar way to implement a CS change in a compatible way is the misuse of a conceptual attribute to serve a different purpose. So we conjecture:
Hypothesis: the rule is compatible change, the exception is incompatibility at specific places in the schema
To establish at what locations a CS change is incompatible, we must look at the general pattern of changes in data instances, and ignore for the time being changes in schema constructs. The data that needs attention must be separated from the data that can be left unchanged.


The set of data to be edited defines a temporary External View on the old CS. A measure of compatibility for CS change can be based on the relative size of that External View, so we count, per type of construct:
- the number of constructs in the 'data-to-be-edited' External View, and
- the number of constructs in the old CS.
The level of compatibility is then calculated as 1 minus the ratio of these two counts. Equivalently, it is the number of constructs in the old CS not affected by the change, divided by the total number of constructs in the old CS. Ideally, the level of compatibility equals 1, meaning that all data instances of the old schema fit seamlessly into the new CS.
Whereas the previous rate of change metric was biased towards small CSs, this compatibility metric is biased towards large CSs. If the same change is accommodated in two different CSs in the same way, then the metric produces a more favorable outcome for the larger one.
Compatibility is closely related to logical and physical data independence [ANSI/X3/sparc'75], [Date'00]. A method to improve compatibility is developed by [Jensen, Böhlen'01], but their approach is limited to changes in a single entity (or rather, relational table). Manifestations of incompatibility are data instances that must be edited during system conversion, attributes and/or instances that are moved from one entity into another, and references that must be relinked manually. Semantic discrepancy [Sheth, Kashyap'92] is a less common variant of incompatibility in CS evolution. Notice how the user charged with resolving a semantic discrepancy by editing the data must have a clear understanding of the discrepancy, while the discrepancy itself is captured in neither the old nor the new CS.
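A minimal sketch of the compatibility computation, with invented counts; the function name is ours:

```python
# Illustrative sketch of the compatibility metric, per construct type:
# 1 minus the share of old-CS constructs whose data must be edited, i.e.
# the share of old-CS constructs left untouched by the conversion.
def compatibility(edited_constructs, total_constructs_old_cs):
    """1.0 means every data instance of the old CS fits the new CS unchanged."""
    return 1.0 - edited_constructs / total_constructs_old_cs

# e.g. the data of 2 out of 20 entities must be edited during conversion
print(compatibility(2, 20))  # 0.9
```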

4.2.5 Extensibility
Several authors assume that new ways of doing business will augment, not replace, current business procedures or methods [Perry'94], [Kallio, Saarinen, Salo, Tinnalä, Vepsäläinen'99], [Franconi, Mandreoli'00]. It follows that when information requirements change, new requirements are added to those that were already accounted for in the old CS. Thus, it suffices to extend the CS with new constructs, reducing the old CS to an External View of the new CS.
Another type of extension, one that often goes unnoticed, is extension of entity intent: the entity definition is fundamentally altered, but its name and composition are left untouched. Broadening the entity intent enables the CS to capture a broader UoD and to record more data instances. An example is when a 'person' is first defined to cover current customers only, but after intent extension is also assumed to include former customers. Notice how this change is fully compatible at the data level, as discussed above. This leads us to formulate:
Hypothesis: the rule is schema extension, the exception is modification of existing constructs
To establish whether a change in the CS is an extension, we elaborate on the compatibility metric. For each type of construct in the new CS we count:
- the number of pure additions, and
- the number of constructs in the new CS that differ from the old CS in any way at all.
The metric for extension is established as the ratio of the first to the second count. Ideally, the ratio equals 1, meaning that there are only additions and no other changes.
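The extension metric likewise reduces to a simple ratio per construct type. A hedged sketch with invented counts:

```python
# Illustrative sketch of the extension metric: per construct type, the number
# of pure additions over the number of constructs in the new CS that differ
# from the old CS in any way at all; a ratio of 1.0 means additions only.
def extension_ratio(pure_additions, differing_constructs):
    if differing_constructs == 0:
        return 1.0  # nothing differs between versions: vacuously pure extension
    return pure_additions / differing_constructs

# e.g. 3 entities were added, but 1 existing entity was also modified:
# 4 entities differ from the old CS, of which 3 are pure additions
print(extension_ratio(3, 4))  # 0.75
```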


The metric is insensitive to the deletion of constructs, because a deleted construct does not show up in either count. This is unfortunate, because a CS change may appear to be a pure extension while actually the new construct is a replacement for some construction that is deleted simultaneously. Taxonomies are often based on the idea that change in a construct is achieved by combining a construct deletion and a subsequent construct addition [Lerner, Habermann'90], [Ferrandina, Meyer, Zicari, Ferran, Madec'95]. This may be true at the level of constructs, but it does not hold at the level of data instances, because data is lost as soon as a construct is deleted. [de Troyer'93] introduced the notion of lossless transformations to exemplify the importance of safeguarding relevant data while transforming the CS.

4.2.6 Complexity
It is generally agreed that complexity is a main determinant in the maintenance of any product, be it hardware, software, or a conceptual schema [Delen, Looijen'92], [Henderson'96], [Banker, Davis, Slaughter'98]. As businesses depend increasingly on information systems, and as changes to information systems are designed to augment the business operation, it can be expected that the overall size and complexity of information systems will increase. Authors point out that the complexity of a system has a negative impact on its overall quality and maintainability. As stated by [Feldman, Miller'86], '"the more relationships the less comprehension" is possibly due to the accompanying increase in complexity' (p.348).
We are not interested in complexity as such, but in the effects of complexity on CS evolution over time. The general idea is that the more complex a CS, the more difficult it is to change, as maintenance engineers will generally shy away from messing with complex and incomprehensible constructions. So we conjecture:
Hypothesis: a more complex CS will change less frequently
A metric for this hypothesis requires measures for the notions of schema complexity and frequency of change. So if we can decide on objective measures for
- the complexity of each CS version, and
- the duration of each CS version,
then their ratio is a first characterization for this hypothesis, assuming a linear dependence between the two. Next, the hypothesis should be tested by comparing these ratios for a number of CSs with equal and/or different complexities.
A prerequisite in this hypothesis is an objective indicator of CS complexity. Surprisingly, the concept of CS complexity is often discussed only intuitively. Schema size is a first indicator of complexity but, as has been observed by [Marche'93], 'this assessment of complexity ignores the number of relationships, named and unnamed, in a given model' (p.41).
Intuitively, complexity has to do with the combined effect of both a large number of entities and the interdependence between them, the result being a difficulty to understand the entire schema. Moreover, the complexity of a CS does not depend on the information structure of the UoD alone. Other factors are of perhaps greater importance, such as ease of use of the data model theory, capabilities of the designer, restrictions due to the demand for compatibility, etc. [Gill, Kemerer'91], applying McCabe's metric in seven case studies, 'suggest that maintenance productivity declines with increasing complexity density' (p.1287). This is in agreement with our hypothesis. However, closer inspection reveals that the suggestion derives from a single outlier point in their small set of case studies, so their argument is not strong.
A few remarks are in order:
- we need to consider what the minimal complexity can be, and set this equal to 0,
- we must take into account that a CS may be made up of several unconnected lattices, and
- in a more complex lattice, the number of references will exceed the number of entities, and more referential integrity constraints are required to ensure overall data consistency.
We suggest a simple measure for complexity by regarding the CS as a lattice with entities as nodes and aggregation references as edges. Our CS lattice complexity metric is defined as:
lattice complexity = number of lattices (unconnected subschemas, usually 1) - number of nodes (entities) + number of edges (references)
This metric of lattice complexity is not new. McCabe's measure of cyclomatic complexity for software code follows the same line of reasoning [McCabe'76]; it can even be traced back to the mathematician Euler (1707-1783). A simple lattice of two entities connected by a single reference has a lattice complexity equal to 0. A lattice of three entities interconnected by three references is slightly more complex, with lattice complexity 1. This number has a sound interpretation: one connection constraint may suffice to ensure data integrity in this lattice.
Our operationalization of the hypothesis on complexity focuses on lattice complexity only and ignores other aspects of complexity in the CS such as schema size, data dependencies, etc. In this respect, our metric is superior to the metrics proposed by [Genero, Jimenez, Piattini'00].
Considering the many aspects that contribute to complexity, it can be doubted that a single number suffices to express overall complexity; [Calero, Piattini, Genero'00] propose a suite of five metrics. We forego complexity due to specialization hierarchies, as our case studies do not require it. The CSs in our cases rarely show two or more levels of specialization, and overall semantics is always simple and readily understood. Attempts at understanding and clarifying specialization hierarchies have been described in [Chen, Li'86], [Jianhua Zhu, Nassif, Pankaj, Drew, Askelid'91], [Dvorak'94], [Lammari'99], [Jones, Rundensteiner'99].
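The lattice complexity measure can be computed directly from the entity nodes and reference edges. The sketch below (entity names invented) counts connected components with a small union-find and reproduces the two worked examples:

```python
# Sketch of the lattice complexity measure:
#   number of lattices (unconnected subschemas) - entities + references.
# Entity names are invented; components are found via union-find.
def lattice_complexity(entities, references):
    parent = {e: e for e in entities}

    def find(x):  # find the component representative, with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in references:  # each reference connects two entity nodes
        parent[find(a)] = find(b)
    lattices = len({find(e) for e in entities})
    return lattices - len(entities) + len(references)

# two entities connected by a single reference: complexity 0
print(lattice_complexity({"Order", "Customer"}, [("Order", "Customer")]))
# three entities interconnected by three references: complexity 1
print(lattice_complexity({"A", "B", "C"}, [("A", "B"), ("B", "C"), ("A", "C")]))
```

This is the same bookkeeping as McCabe's cyclomatic complexity, applied to the CS lattice instead of a control-flow graph.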

4.2.7 Abstraction reduces the need for change
The notion of CS abstraction is markedly similar to that of complexity. Like complexity, the level of abstraction is an important design consideration. [Delen, Looijen'92] state that the stability of a CS depends on its level of abstraction. The general idea is that a more abstract design will have superior stability, because less abstract CSs need to be changed in more places to adapt equally well to a new requirement. So we conjecture:
Hypothesis: a more abstract CS will go through fewer changes


A metric for this hypothesis should include:
- the level of abstraction of the CS, and
- the number of constructs in the CS that change over time.
Their ratio is a first characterization for the hypothesis, assuming a linear dependence between the two. In order to test the hypothesis, the ratios should be compared for a number of CSs. Like complexity, a metric for abstraction ought to build on a generally accepted and well-defined measure of abstraction, which again is found to be lacking. A few remarks are in order:
- first, it is evident that a CS with a higher level of abstraction will have fewer constructs with more instances, while a lower level of abstraction results in a CS with more constructs and fewer data instances per construct,
- second, abstraction in the CS is strongly related to the data model theory in use; some data model theories (e.g. those based on ontological principles [Wand, Monarchi, Parsons, Woo'95]) are considered to be more abstract than others, and
- third, CS designs are often documented on multiple levels of abstraction [Mistelbauer'91], and the metric ought to show consistently better outcomes on the higher levels.
Unlike with complexity, however, these remarks do not point the way to a simple metric for abstraction. As a result, we could not operationalize a suitable metric for this hypothesis.

4.2.8 Susceptibility to change
It is a common assumption that attributes of an entity will change more frequently than the entity as a whole, and that descriptive attributes will change more often than primary key attributes. In a related context, [Amikam'85] uses the term 'sensitivity'. [ANSI/X3/sparc'78] already speculated that susceptibility to change would be low for entities, while constraints would show a high ratio. Change in attribute data format or data granularity is perhaps the most frequent change of all, i.e. attributes exhibit the highest susceptibility to change. [Wilmot'84] suggests that 'it is likely that "rules" set by management or other political bodies will change more frequently and quickly than inherent properties, and that rule changes will more frequently affect relationships among entities than the related entities themselves' (p.1241). [Siegel, Madnick'91] write: 'In fact, our experience leads us to believe that changes in the semantics of data are more common than changes in structure' (p.143). As changes in conceptual constraints and attributes are beyond the scope of our research, we are unable to contest or confirm claims like the one by [King'86] that 'growth comes from increasing average logical record size rather than increasing numbers of records' (p.47), or the finding by [Sjøberg'93], who reports of conceptual attributes that 'every relation has been changed. At the beginning of the development almost all changes were additions. After the system provided a prototype and later went into production use, there was not a diminution in the number of changes, but the additions and deletions were more nearly in balance' (p.39).
The general opinion is that some types of constructs are more stable than other types, and this serves as an argument to advocate top-down design approaches. Entities and references are thought to have the best stability and hence are modeled first. Attributes and reference cardinalities are modeled in a next stage. Integrity constraints and business rules are the most volatile and are added to the schema as late as possible. So we conjecture:
Hypothesis: some types of construct in the CS are more susceptible to change than other types of construct
Obviously, metrics for this hypothesis must differentiate between the types of construct provided by the data model theory. A simple measurement will include:
- per type, the total number of constructs present in the CS, and
- the number of changed constructs of that type.
The susceptibility to change per type of construct is calculated as the ratio of the number of changed constructs to their total number in the CS. These ratios can then be compared between types.
Like the metric for compatibility, this metric is biased towards large CSs. However, we are investigating only the entity and reference constructs, and only CSs of moderate size. It so happens that the total numbers of these constructs are approximately equal, thus canceling out of the comparison. We can therefore simplify the measurements and count only the numbers of entities and references that change. This simplification is inappropriate if larger CSs are studied, or if the hypothesis is investigated for the other types of constructs, i.e. for attributes and conceptual constraints. In that case, only the ratios of changed over total numbers of constructs may be used to draw a comparison of susceptibility to change between types.
An implicit assumption in the metric is that constructs and constructions in the CS before and after the change have a recognizable relation, even if the modeled real-world feature is subjected to a drastic change in information structure. The hypothesis on susceptibility to change goes one step further: type persistence is assumed. Once a UoD feature has been captured by one particular type of construct, it will be modeled forever with that same construct type.
If a feature is modeled first with one type of construct and later with another, that change cannot be traced at all. Instead, a construct of one type disappears from the CS in a semantic change that happens to coincide with the appearance of an unrelated construct of another type.

Type persistence is often assumed, e.g. [Lerner, Habermann'90], [Lautemann'97]. [Proper, van der Weide'93] argue that 'an object type may not evolve into a method, and a constraint may not evolve into an instance' (p.357). [Proper'94] formulates this as an evolution separation axiom and deduces that 'an application model history can be partitioned into the history of its object types, its constraints, its methods, its populations, and its concretisations' (p.71). Some authors concede that a construct might change its type [Skarra, Zdonik'86], [Lerner'96], but this is not covered by our metric.

The assumption of type persistence puts a heavy burden on the designer to be infallible in choosing types. However, there is no intrinsic reason why type persistence should hold. It is up to the maintenance engineer to decide on the best way to represent a UoD feature in the new CS version, and the choice of construct can differ from one CS version to the next.


4.2.9 Preservation of entity identity

As we use a relational data model theory, the previous hypothesis can also be made to apply to the important features of candidate key and functional dependency. Preservation of (the set of) candidate keys is important. Keys provide a sound understanding of the entity, they serve to distinguish each entity instance from all other instances, and thereby keys ensure that every real-world feature is accurately captured and recorded in the database. So we conjecture:

Hypothesis: the rule is no change in candidate-key compositions; the exception is change in some candidate key

The metric should account for composite keys as well as the possibility of more than one candidate key. What needs to be established is the composition of all candidate keys per entity in the old and the new CS, and then:
- the number of candidate keys that have been changed from the old CS, and
- the total number of candidate keys for each entity in the new CS.
The ratio of keys changed to the total number of keys is an indication of the susceptibility to change of the candidate keys, and thus of the entity identity itself. If keys are stable, then none will change, and the ratio equals 0. It is to be expected that this metric is closely related to the susceptibility-to-change metric of entities: candidate keys will change only if the entity itself is observed to change.

In a live business environment, it may be difficult to establish beyond doubt what constitutes a change of entity identity. For instance, if an Employee entity is defined, is entity identity compromised when it is suddenly used to record data on temporary help? A careful count is required that detects homonyms, synonyms and other alterations in the composing attributes. The count must establish beyond doubt whether any one of the candidate keys is affected by such alterations.
On other occasions, entity intent may be unchanged while identity is not preserved for some non-conceptual reason. For instance, if more instances have to be recorded than initially expected, then the data format of a key attribute may need to be expanded. This was the case in the massive changeover of Dutch telephone numbers some years ago, an example of change in attribute format that remains outside the scope of our research. We consider the notion of primary key to be a matter of implementation, and our semantic data model theory does not provide for it. This is important in the metric because of weak key implementations in dependent entities. A change in primary key of an owner entity will cascade to the implemented weak keys, but entity identity of the dependent entities may be preserved notwithstanding.
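As a sketch of the key-preservation count (illustrative only; entity and attribute names are hypothetical), one can compare, per entity, the set of candidate keys in the old and new CS versions, where each key is a set of attribute names:

```python
# Ratio of changed candidate keys to the total number of candidate keys in the new CS.
# Each candidate key is a frozenset of attribute names; an entity may have several keys.

def key_preservation_ratio(old_keys, new_keys):
    """old_keys, new_keys: dict mapping entity name -> set of candidate keys."""
    total = changed = 0
    for entity, keys in new_keys.items():
        total += len(keys)
        # A key counts as changed if it did not exist verbatim for this entity in the old CS.
        changed += len(keys - old_keys.get(entity, set()))
    return changed / total if total else 0.0

old = {"Employee": {frozenset({"emp_nr"})},
       "Policy":   {frozenset({"policy_nr"}), frozenset({"customer_nr", "product"})}}
new = {"Employee": {frozenset({"emp_nr", "location"})},   # key composition altered
       "Policy":   {frozenset({"policy_nr"}), frozenset({"customer_nr", "product"})}}
print(key_preservation_ratio(old, new))  # 1 of 3 keys changed
```

A ratio of 0 indicates that all entities retained their candidate keys, matching the stable case described above.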

4.2.10 Change is restricted to a single module

Many authors suggest that a modular CS has better stability than a CS without modules. The modules are expected to absorb changes and to isolate other modules from the impact of change, comparable to the information-hiding property in O-O approaches. So we conjecture:

Hypothesis: a single UoD change will cause change in only a single CS module


We can apply the previous measurements to establish a metric for this localization property, once each module and its exact boundaries have been determined:
- identify each single change driver in the UoD, and
- determine the number of modules where a change is made as a result.
The metric is the ratio of the total number of change drivers to the total number of affected modules. Ideally, this ratio will be equal to 1.

The literature is vague on the definition and handling of the module construct. There is no outstanding best practice to determine good modules for a CS, nor can the soundness or optimality of a preferred modularization method be assessed. This is a serious drawback. Size (granularity), complexity, and even more so the criteria for clustering are critical issues in the design of schema modules, but it is rarely explained how the right choice of modules will enhance schema stability.

Our hypothesis is based on the assumption that a correct modularization will encapsulate changes. Thus, impact and propagation of change will be confined to a single module. Another possibility is that well-defined modules have better stability than the CS as a whole. The hypothesis would then be that changes in the CS affect not the modules themselves, but their interfaces, i.e., how the modules cooperate in the CS. Either way, it is important to define precise boundaries between modules. We feel that it is an unjustified change if some feature of the UoD is first modeled in one module, but drifts into another one later.

Further research is needed to determine which modularization methods are most beneficial to CS stability, to clarify how the benefit is achieved, and to suggest best practices for selecting a sound set of CS modules for particular modeling problems.
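The localization ratio can be sketched as follows (illustrative only; the change drivers and module names are hypothetical, not taken from the case studies):

```python
# Localization ratio: (number of UoD change drivers) / (number of module changes
# they caused). A ratio of 1 means each driver was absorbed by exactly one module.

def localization_ratio(driver_to_modules):
    """driver_to_modules: dict mapping each UoD change driver
    to the set of CS modules changed as a result."""
    drivers = len(driver_to_modules)
    affected = sum(len(mods) for mods in driver_to_modules.values())
    return drivers / affected if affected else 0.0

# Hypothetical impact: one well-localized change, one that rippled into two modules.
impact = {
    "new divorce legislation": {"Benefit"},
    "benefit-exchange facility": {"Benefit", "Participation"},
}
print(localization_ratio(impact))  # 2 drivers / 3 module changes
```

A ratio below 1 signals that at least one change driver propagated beyond a single module, i.e., the modularization failed to encapsulate that change.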

4.2.11 Modules are stable

Once it is decided to decompose a CS into a set of modules, there will be a feeling that each module has 'a life of its own'. That is, each module is a valid and complete model of an isolated part of the UoD, and satisfies all the usual quality requirements such as correctness, understandability, validity, etc. The logical implication is that each module can and will evolve as an independent unit within the CS, its evolution can be traced over time, and therefore a well-chosen modularization will contribute to CS stability. So we conjecture:

Hypothesis: modules in the CS are stable

The module does have an internal structure, but it remains hidden from outside the module. Some authors take the concept of module so far that the module is redefined as a single entity [Akoka, Comyn-Wattiau'96], [Dittrich, Gotthard, Lockemann'86], [Urtado, Oussalah'98]. This is a variant of the familiar concept of information hiding in O-O approaches. However, we feel that the idea cannot easily be extended to the relational model, because it infringes upon some of the basic axioms on which the relational data model is built. Furthermore, stability of one module does not mean that the CS as a whole is stable. Other metrics, such as rate of change and level of complexity and abstraction, can also vary greatly among the modules.

Actually, this hypothesis brings us back to where we started: measuring stability of the CS in order to understand its flexibility. Only now, the hypothesis concerns modules in the CS, and not the CS as a whole. All of the previous hypotheses and metrics can be used to study the stability of the separate CS modules.


4.3 OPERATIONALIZING THE METRICS

4.3.1 Applicability in the business environment

The metrics as developed above focus on the need to measure when faced with CS evolution, and on what should be measured. However, the metrics are still at a rather conceptual, theoretical level. To apply them to evolving CSs in their operational life cycle phase, and to obtain reproducible outcomes, we need to refine the metrics and ensure the consistency of measurements.

[Figure: the dimensions and guidelines of the framework (environment: select the best scope, capture essence of UoD; timeliness: minimize impact, ease change propagation; adaptability: keep it simple, use abstraction layering, model each feature once, provide clusters), set against the eight operationalized metrics: 1. Justified change, 2. Size of change, 3. Compatibility, 4. Extensibility, 5. Lattice complexity, 6. Susceptibility, entity, 7. Susceptibility, reference, 8. Preservation of identity]

Figure 14. Operational metrics

4.3.2 Operational metrics

The eight metrics depicted in figure 14 are applied in the four case studies of the practical track. To operationalize the metrics for use in the case studies, we make some adjustments and simplifications to circumvent practical problems, and we give procedural instructions to cope with problems of measurement that we encountered in the business environment. These procedures ensure verifiable and repeatable outcomes that are independent of the observer and the time of observation:


1. Justified change
Inspect the Universe of Discourse and check for contemporary developments that may explain the change. If there is a proper change driver in the UoD, state 'yes', otherwise 'no'.

2. Size of change (but not: proportional)
Count the number of constructs (entities plus references) involved in the CS change. Base the count on the CS as it was before the change, including the affected constructs of the CS after the change, to achieve a neutral and realistic count. In this way, insertions and eliminations have equal counts, which we think is proper. If a change involves a specialization, count either that specialization, its generalization, or the is-a reference.

3. Compatibility
Draw up the 'data-to-be-edited' External View, i.e. determine for which entities instances were manually adjusted to fit the new schema. If the view is empty, indicate that 'all' entities are compatible. If the view is not empty, indicate that 'manual' adjustments took place. Base the measurement on how the change was actually accomplished in the business situation. Do not consider hypothetical situations where a data incompatibility might have been present but did not actually occur in the business solution.

4. Extensibility
Check each construct in the new CS to see if it is identical to a former construct, similar to a former construct but with an extended intent, or a newly added construct. If the answer is always 'yes', then extensibility holds. Extensibility does not hold, stated as 'no', if at least one construct is altered, restricted in intent, or eliminated altogether.

5. Lattice complexity
Indicate by how much the overall lattice complexity increases or decreases. Do not count specializations or associated is-a references, as specialization is irrelevant in lattice complexity. As we are dealing only with coherent CSs (no unconnected parts), the change in lattice complexity is easily calculated for each separate semantic change as:
+ added references - eliminated references - added entities + eliminated entities

6. Susceptibility to change: per entity
Count all entities of the old CS that are affected in their semantics by the changeover to the new CS. Do not reckon an entity changed if some of its references change but both its intent and extent remain unaffected. If a size-related metric is desired, the total number of entities present in the old CS must also be stated.

7. Susceptibility to change: per reference
Count all references of the old CS that are affected in their semantics by the changeover to the new CS. Do not count a change of reference if only its referring or referred entity is affected. Do count a change of reference if its intent alters, or if the reference is relocated to other entities. If so desired, also state the total number of references present in the old CS.

8. Preservation of entity identity
Indicate 'yes' if all entities retain their candidate keys. Indicate 'no' otherwise.
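For a coherent CS (a connected schema graph), this signed count is consistent with treating lattice complexity as the number of references minus the number of entities plus one, akin to a cyclomatic number. A small sketch of the calculation for metric 5 (the example change is hypothetical):

```python
# Change in lattice complexity for one semantic change, following the rule above:
# + added references - eliminated references - added entities + eliminated entities.

def lattice_complexity_delta(added_refs, elim_refs, added_ents, elim_ents):
    return added_refs - elim_refs - added_ents + elim_ents

# Hypothetical change: an entity is split in two (1 entity added), a new reference
# connects the halves, and one old reference is rerouted (eliminated + re-added).
delta = lattice_complexity_delta(added_refs=2, elim_refs=1, added_ents=1, elim_ents=0)
print(delta)  # → 0: the lattice is no more and no less complex than before
```

Note that specializations and is-a references must be excluded from the four counts before applying the formula, as stated in the procedure.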


4.3.3 Inoperative metrics

Refining the metrics also brings out that some are as yet too impractical to use, for lack of theory, lack of case study data, and/or lack of comparative case study material. Some hypotheses are of a comparative nature, which calls for a comparison between different CSs that evolve in similar environments:
- a more complex CS will change less frequently
- a more abstract CS will go through fewer changes
- some types of construct in the CS are more susceptible to change than other types of construct
We succeeded in eliminating the comparative aspect from the complexity and susceptibility-to-change metrics by making some simplifying assumptions. However, as our case studies cover different UoDs, a comparison of the levels of flexibility is out of the question, and we had to drop the metric for level of abstraction.

The metrics that we could not operationalize are:

1. Proportional rate of change
For want of a sound and repeatable measure of 'turbulence in the UoD'. It might be expressed as the ratio of the number of changes in requirements to the lifetime of those user requirements, but in practice this is impractical. When confronted with real business situations, we found no way to come up with a reliable 'count' of differences in the rather informally stated requirements. We considered simplifying to an 'incidence of change' metric, or a 'life span of CS version' metric, but we did not pursue the idea when we found our measurements of the CS version life spans to be unreliable due to possibly undetected CS version changes.

2. Abstraction reduces the need for change
For want of a sound measure of 'level of abstraction in the CS'. Moreover, we explained above how this hypothesis involves a comparison between different CSs that evolve in similar environments, which is impossible in our exploratory research.

3. Susceptibility to change: for conceptual attributes and constraints
Because the attribute and constraint constructs are outside the scope of our research. We do not compare the susceptibilities to change across all types of construct as this hypothesis requires.

4. Change is restricted to a single module
For want of sound modularization. There is no rigorously defined body of theory, nor are there well-accepted business practices of how to modularize a CS to support its future flexibility.

5. Modules are stable
Again, for want of sound modularization.


4.4 RELIABILITY OF THE METRICS

4.4.1 Association with the framework for flexibility

Figure 15 is a first demonstration of soundness, as each metric can be associated with one guideline of the framework.

[Figure: the metrics placed against the dimensions and guidelines of the framework. Environment (select the best scope, capture essence of UoD): Justified change, Proportional size of change, Proportional rate of change. Timeliness (minimize impact, ease change propagation): Compatibility. Adaptability (keep it simple, use abstraction layering, model each feature once, provide clusters): Extensibility, Complexity, Abstraction, Susceptibility to change, Preservation of identity, Change per module, Modules are stable.]

Figure 15. Association of the metrics with the framework

No metrics are associated with the 'ease change propagation' guideline. The implication is not that the guideline is unimportant and should have been left out. Rather, the reason is the strong focus of our research on only the evolving CS. As the guideline refers to other, non-conceptual components of information systems, metrics for this guideline cannot be restricted to the UoD or CS only, and therefore it is beyond the scope of our research.

The 'provide a modular composition' guideline also has no associated metrics. This is partly due to the lack of a theory on how to determine sound modules. A further reason is that we found that no CS in our case studies was modularized with a view to CS maintenance. Although some documentation described the schemas part by part, the partitioning was meant to improve understanding, never to enhance CS flexibility by encapsulating change.

Although we associate each metric with one guideline only, some measurements may hold significance for other guidelines as well. It is reassuring that the metrics cover the dimensions and guidelines of the framework rather well. Coverage is not complete; however, we do not


seek complete coverage. The metrics serve our purpose and provide useful insights into the characteristic behavior of the evolving CSs of our case studies. To the best of our knowledge, a comprehensive set of metrics for schema evolution has not been reported before.

4.4.2 Validity

An important reliability test is to demonstrate that the metrics display the correct tendencies. Thus, a more flexible CS should display more favorable measurements, while a CS with inferior flexibility demonstrates worse outcomes. However, this test of reliability is not an issue now. The metrics we are dealing with are based on theoretical conjectures that have not been exposed to operational tests before, and their trends in natural schema evolution are as yet unknown. The fundamental question of our research is how to assess the level of flexibility in an operational CS in the first place.

We claim that our set of metrics targets only conceptual properties of the operational CSs. It is easily verified that the metrics do not explicitly refer to non-conceptual characteristics of the business environment or database system. Moreover, the outcomes of measurements will be independent of the observer and time of observation, insofar as any observation can be free of personal interpretations. To name a few characteristics that are irrelevant to our metrics:
- types of data access, an area covered for instance by CRUD analyses [Pels'88], [Ebels, Stegwee'92], or cohesion in methods [Etzkorn, Davis, Li'98]
- data volatility, the intensity of data access and/or number of daily update transactions, and the number of users and user applications that access the database
- data distribution, data fragmentation across multiple sites, and other implementation features of the software or hardware architecture, and
- the preferred design approach and the organizational / architectural design strategies.

In particular, our dimensions, guidelines and metrics for flexibility do not refer to schema size in a straightforward manner. Size is important in understanding a CS, but not in its flexibility and evolution over time. Indeed, our metrics are applicable to CSs covering small, medium-size, up to very broad business domains. However, some of the operationalized metrics refer to CS size in a more indirect way, which may bias their outcomes. Chapter 7 will discuss how CS size evolves in the analysis of long-term behaviors of the metrics.

The above arguments should not be read as an assertion that CS flexibility depends only upon conceptual features. Non-conceptual features and business decisions often influence CS changes in indirect ways, as we have already pointed out on several occasions. For instance, there is an implicit dependence on the personal capabilities and experience of the maintenance teams, on demands for compatibility, on imposed restrictions that limit the types, sizes, or numbers of changes in a new CS version, etc. We want such influences to be reflected in the metrics in order to gain an understanding of CS evolution in the common business environment and to learn about CS flexibility in the long run.

We do not claim that the metrics are independent of one another. Indeed, there may be overlap in the metrics, or their underlying hypotheses may be interrelated in ways that are not yet understood. Further research is needed to disclose and clarify such relationships, which may lead to enhancements of the proposed framework for flexibility.


4.5 CONCLUSIONS

4.5.1 Summary

The main contribution of this chapter lies in the realization that measuring is an essential first step towards understanding and improving current theories and best practices in CS design and maintenance. Based on generally accepted hypotheses about flexibility, we developed metrics by which to evaluate CS flexibility in the operational business environment. Or rather CS stability, as metrics can only be applied by looking backward in time, whereas flexibility is defined as a future potential. It is important to realize that, in order to come to understand CS flexibility, we must investigate not the future, but the past evolution of the CS. Verifiable and objective metrics can only assess to what extent stability has been achieved over the past operational life of the CS. Stability of a CS is evidence that any required changes have been accommodated and that flexibility has indeed been delivered. Which of course is no guarantee that it will remain stable in the near or distant future: future stability equals flexibility.

We operationalized eight of these theoretical metrics for application in an ongoing business environment, as we are conducting longitudinal field studies of evolving CSs. Validity and fitness for use of the metrics were argued from a theoretical point of view, by associating the metrics with the dimensions and guidelines of our framework for flexibility. But the true test of validity of metrics is in their application to evolving CSs. As stated by [Basili, Briand, Melo'96]: 'Empirical validation aims at demonstrating the usefulness of a measure in practice and is, therefore, a crucial activity to establish the overall validity of the measure' (p.752).

4.5.2 Concluding the Theoretical Track

Chapter 1 stated the first objective of our research as:
- to develop a framework for CS flexibility that captures relevant aspects of flexibility in the business environment, and to demonstrate its relevance.
This framework was developed in chapter 3 of the Theoretical Track, and its relevance was amply demonstrated. At this point, we feel that the theoretical notions about the flexibility of CSs in the business environment have been sufficiently covered:
- the three constituent dimensions of CS flexibility: environment, timeliness and adaptability, subdivided into eight guidelines,
- how design strategies propose to achieve CS flexibility in the operational business, and
- how we should investigate CS flexibility by measuring the semantic changes in an evolving CS and by applying the appropriate metrics.

We now proceed to the Practical Track of this thesis, chapters 5 through 7, to study four operational CSs evolving in their natural environment. Readers who are less interested in the details of the exploratory research can skip the Practical Track and turn to chapter 8, which presents the synthesis of the theoretical notions of CS flexibility developed in this Theoretical Track with the findings of the Practical Track.



PRACTICAL TRACK



CHAPTER 5. FOUR CASE STUDIES

5.1 CASE STUDY: BENEFIT ADMINISTRATION

5.1.1 Introduction

This case study outlines the evolution of a highly integrated Conceptual Schema in its business environment. The enterprise where this case study was conducted is Europe's largest pension fund, which we will refer to as the Pension Company. Our case study concerns their Benefit Administration information system. This highly integrated transaction processing system supports most of the daily business processes of the Benefit Administration department in varying degrees of automation. Our subject for investigation is the CS at the core of this information system. The system uses proven technology dominating today's marketplace, i.e. mainstream graphical user interfaces and a relational database management system.

In keeping with its high level of integration, the CS has grown to well over a hundred entities (not counting specializations), and it is still growing. Obviously, this is not a comfortable size for our research purpose. We therefore limit the scope of the case study to the pension benefit concept. We trace how this real-world concept is perceived and represented as the CS evolves.

[Figure: timeline of CS versions (january 1996, october 1996, july 1997, february 1999, september 1999, august 2000) set against the dominant business changes: pension scheme for Early_Retirement innovated; information strategy revisited; facilities for benefit-exchange extended and information strategy revisited; ongoing maintenance]

Figure 16. CS versions and dominant business changes

The time series of CS versions that we include in the case study is shown in figure 16. Design and implementation of the system and its CS began in 1994, and the system went operational at the end of 1995. The case study covers the period 1996-2000, but the system is expected to run at least until 2005. The time intervals between consecutive versions vary between half a year and one and a half years. Actually, there were some intermediate versions, but we could eliminate them from our analysis: those intermediate versions were targeted at other concepts than pension benefit; remember that we are dealing here with a highly integrated CS. A preliminary version of this case description is published in the Annals of Case studies in Information Technology [Wedemeijer'02].

5.1.2 Business background

1. The Company
Pensions provide financial coverage for old age, death, and early retirement of an employer's workforce. The Pension Company currently manages the pension benefits of over a million (former) employees and family members. The net current value of pension benefits is more than 100,000 million Euro, and the monthly paycheck to pensioners is over 500 million Euro. However interesting these financial aspects, we will concern ourselves with the data management aspect, as pensions require meticulous and complicated record keeping.

2. Business Functions
Figure 17 shows the (simplified) chain of primary business functions. Employers submit data about their workforce (usually some copy of the payroll) and pay in their financial contributions. These two inflows are carefully checked against each other. The accepted data are then transferred to Benefit Administration for further processing. Three Claims-and-Payments departments process all claims. The case study concerns the business function of Benefit Administration only; we will not study the integration of this business function with its neighboring functions.

[Figure: chain of primary business functions. The Employer submits payroll data and pays a financial contribution to the Pension company; internally, data acquisition feeds benefit administration, which feeds claims and payments, supported by accounts receivable, asset management, and accounts payable]

Figure 17. Chain of primary business functions


3. Management structure
The Pension Company is functionally organized. There is a single Data Acquisition section and a single Benefit Administration section. The business function Claims-and-Payments is performed by three 'spending' departments: Old-Age Pensions is responsible for payments to old-age pensioners, Dependents-Payments takes care of death benefits for dependent family members upon decease of the (former) employee, and Early-Retirement Payments handles all early retirements.

4. Daily operations
Responsibilities and activities of the functional departments are broadly as follows:
- Data Acquisition collects data about the participants in the pension scheme (employees and dependent family members) from external sources. The main data source is employer payrolls. Tape copies of monthly payrolls are received from all employers and matched with the Accounts Receivable department collecting the pension contributions. A second data source is the municipal registry offices ('city hall') that supply addresses and family relationships by electronic data interchange. All acquired data are first validated, then reformatted and recorded in the various Pension databases.
- Benefit Administration keeps complete records on all pension benefits. This involves recording all job switches, changes in wages or part-time percentages, changes in the types of benefits, etc. It also involves recording marriages and divorces, because a pension benefit is legally a joint property that has to be divided upon divorce. Most, but not all, of the data processing is fully automated.
- If a benefit is due, customer data are transferred from Benefit Administration to the particular Payments department. The information systems are loosely coupled, and claim processing begins by isolating and taking out a full copy of all the relevant benefit data.

5. Modeling Guidelines
The Information Management department formulated guidelines and best practices on data and information modeling to provide guidance in the design and maintenance of the information systems. Important modeling guidelines for the CS, to which we will return later, are:
- single unified view, and therefore single point of maintenance
The Benefit Administration department demands a single, highly integrated database to support all business functions in all variants. The database is not partitioned into separate modules or local databases that allow independent maintenance. It is felt that disintegration would cause problems in coordinating the local changes and their reintegration into a single consistent view. The consequence is that department-wide priority setting and maintenance deadlines are crucial management decisions that are indispensable but very time-consuming.
- high level of generalization, and therefore low maintenance
The CS should rise above the level of ad-hoc implementation features and focus on persistent properties instead. This guideline is intended to steer designers away from quick solutions (that are often not only quick but dirty as well) towards long-term, more stable solutions.


It is a typical feature of life insurance, and of pensions in particular, that future benefits are based on the past history of the policy holder / employee. Moreover, once a customer has been notified of his pension -either the current or projected amounts- the Benefit Administration department must be able to reproduce the exact same amounts later. These features lie at the base of two more modeling guidelines:
- snapshot data if possible, historical data where obligatory
Temporal capabilities are called for that most of today's databases are as yet incapable of delivering. Hence, the required temporal capabilities must be modeled explicitly in the CS, which may result in overly large and complex models. The business guideline is to try to convince users not to demand historical data wherever possible, and so to keep CS complexity down.
- representing derived data in the CS
Apart from the temporal issue addressed by the previous guideline, an important issue is the storage of base data, intermediate results, and the outcomes of calculations. The customary option is not to rely on the system's capacity to exactly reproduce the original calculations. Therefore, the outcomes and important intermediate results of the calculations are explicitly represented in the CS and secured in the database.

6. Design and maintenance
In practice, the Benefit Administration system is always under development, to accommodate the latest set of new laws and legal exception rulings, changes in system and data interfaces with adjacent business functions, etc. Due to its size, complexity, broad scope and large number of users, system maintenance has grown to be a well-established but slow managerial process. First, there is a rather informal part where new developments, change requests from users, problem reports, etc. are assembled onto what is called the Maintenance stock list. Topics on the list are periodically reviewed, and top priorities are submitted to the Pension management team that decides on priority setting, budgeting and resource allocation. Once upper management approves a change, a formal change procedure based on ITIL standards is initiated. The procedure cascades through the usual steps of information systems maintenance: design, implementation, testing, and operation. These steps are coordinated with other necessary changes, such as user training, programming of data conversion routines, and adaptation of the standard letters and forms that the system sends out to customers.

Four Case Studies


5.1.3 The CS evolution: 1996 - 2000

1. January 1996: initial production release

Figure 18. Initial production release (ER diagram; entities: INSURANCE-PRODUCT, CUSTOMER, INSURED-PARTY, RELATIONSHIP, POLICY, PARTICIPATION, PARTICIPATION-TRAIL, SUCCESSOR, TRAIL-PREMIUM/REDUCTION, BENEFIT, BENEFIT-PREMIUM/REDUCTION)

The CS at the start of the evolution has a rather simple structure. This first production release of the CS lasts for about 9 months. The core concept of the CS is BENEFIT. It records the exact amounts due for a PARTICIPATION, i.e. what is due to a beneficiary under a particular pension scheme. The POLICY entity records the insured coverage of an employee. BENEFIT amounts are computed from PARTICIPATION-TRAIL data that in turn are derived from historical information on employment and salary. Notice that the CS itself does not support the derivations as such. The applicable rules and regulations are rather complicated, and they are completely relegated to the application level.

The initial production release of the CS provided three kinds of pension benefit, recorded as occurrences of the conceptual entity PRODUCT. Old-age pension and Dependents pension make up the default insurance scheme. The third kind is Separated pension, a consequence of Dutch divorce laws: whenever a marriage ends, the accumulated pension benefits are to be divided among the ex-partners. The part due to the ex-partner can then be recorded as a separate policy for old-age benefit, with the ex-partner as insured party. Of course, the pension benefits of the other party in the divorce are reduced accordingly. Actually, a fourth variant of pension benefit was implemented in the operational database, combining the Old-age pension and Dependents pension benefits. This implementation strategy was efficient: member instances of PARTICIPATION-TRAIL, PARTICIPATION and BENEFIT were identical in the vast majority of pension benefits, and the combination lowered the amount of redundancy.


Although the information guidelines advocate a single unified view on the level of the entire enterprise, this CS does not capture early-retirement pensions. The reason is that at the time, early-retirement pensions were solely based on employment, not on accumulated benefits and participation trails. Moreover, early-retirement pensions and payments were handled through another information system. It was decided to let that be, and to exclude early-retirement pensions from the scope of the new system under design.

Lattice complexity for this CS is 3, indicating that the CS requires 3 connection constraints for overall consistency:
- the two CUSTOMERs involved in a single RELATIONSHIP occurrence must be different; the INSURED-PARTY must be known and identified as a CUSTOMER, but the spouse may be currently unknown in the database,
- the CUSTOMER who is beneficiary of a PARTICIPATION may be the INSURED-PARTY of the POLICY (as in old-age pensions), or it may be somebody else (as in dependent pensions),
- a SUCCESSOR instance is derived from one or more PARTICIPATION-TRAIL instances, with valid-time intervals that are adjacent and non-overlapping.
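Such connection constraints lie outside what the CS notation itself can express, so they must be enforced procedurally. A minimal sketch of two of the constraints above, using hypothetical record layouts:

```python
def relationship_ok(rel):
    # First connection constraint: the two CUSTOMERs in a RELATIONSHIP must
    # differ; the spouse may still be unknown (None) in the database.
    return rel["spouse_id"] is None or rel["customer_id"] != rel["spouse_id"]

def successor_trails_ok(trails):
    # Third connection constraint: a SUCCESSOR derives from PARTICIPATION-TRAIL
    # instances whose valid-time intervals are adjacent and non-overlapping.
    trails = sorted(trails, key=lambda t: t["valid_from"])
    return all(prev["valid_to"] == nxt["valid_from"]
               for prev, nxt in zip(trails, trails[1:]))

ok1 = relationship_ok({"customer_id": 7, "spouse_id": None})
ok2 = successor_trails_ok([
    {"valid_from": "1994-01-01", "valid_to": "1995-01-01"},
    {"valid_from": "1995-01-01", "valid_to": "1996-01-01"},
])
```

Each such check is one unit of lattice complexity: a consistency rule that spans multiple entities and must be maintained whenever the schema around it changes.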

2. October 1996: major extension

Figure 19. Major extension in October 1996 (ER diagram; the additions comprise EARLY-RETIREMENT-BENEFIT levels 1 to 3, PARTICIPATION-TRAIL-(EARLY-RET.-LEVEL2), TRAIL-FOR-LEVEL3, EXCHANGED-E.R.-BENEFIT and BENEFIT-OBTAINED-BY-E.R.-EXCHANGE, marked as groups A to D)

Change drivers
A major driving force in the business develops 9 months later. Completely new facilities and benefits regarding early retirement are introduced. The old ways of handling early retirement by the Pension Company, and the information system that supported those ways of doing business, became obsolete almost overnight. Two new business processes are pressed into service, and all other business process improvements are postponed for the time being:
- the administration of early-retirement benefits, and
- the administration of benefit exchange. If an employee does not retire early (or if the insured party dies prematurely), the early-retirement benefit is not cashed in. The old-age benefit (or the dependents benefit) is increased in exchange.

Changes in the CS
The CS is expanded. The CS changes are all justified by the pending changes in early-retirement pensions, and the overall effect is a considerable increase in lattice complexity, which rises from 3 to 8. We can discern four coherent groups of additions:
(A) EARLY-RETIREMENT-BENEFIT-LEVEL1. This addition is straightforward.
(B) EARLY-RETIREMENT-BENEFIT-LEVEL2, associated with PARTICIPATION-TRAIL-FOR-LEVEL2.
(C) EARLY-RETIREMENT-BENEFIT-LEVEL3, with an associated entity TRAIL-FOR-LEVEL3.
(D) EXCHANGED-EARLY-RETIREMENT-BENEFIT and BENEFIT-OBTAINED-BY-E.R.-EXCHANGE.


Measurements

change                                  justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(A) add E.R.-BENEFIT-LEVEL1             yes        2     all         yes        0           0                0               all
(B) add E.R.-BENEFIT-LEVEL2 and more    yes        6     all         yes        +3          0                0               all
(C) add E.R.-BENEFIT-LEVEL3 and more    yes        6     all         yes        +1          0                0               all
(D) add the exchange of E.R.-BENEFITs   yes        5     all         yes        +1          0                0               all

(the 'justified' column measures the environment dimension; 'size', 'compatible' and 'extension' measure timeliness; 'complexity', susceptibility and 'identity' measure adaptability)

The CS changes actually precede the real-world changes, and do not coincide with them. The new early-retirement rules and regulations were contracted in the course of 1996, but the new rules only took effect in the spring of 1997. The time lag was used to prepare business processes, to train personnel, to inform employers and employees of the innovation in pension benefits, etc. Most importantly, it was necessary to adjust the corporate information systems.

Notice that the way of modeling has now become inconsistent across the Benefit-like entities. In particular, the new entity E.R.-BENEFIT-LEVEL-2 is not merged with the semantically very close entity BENEFIT, thus compromising the guideline about a high level of integration in the CS. This is motivated by the business demand for uninterrupted services and compatibility: no changes were permitted in existing structures.

A final observation (not depicted in the diagram) concerns the PRODUCT entity, which recorded three instances in the previous CS version: Old-age pension, Dependents pension, and Separated pension. The same entity in the new CS version records a fourth instance: Early-retirement pension. Thus, the real-world change is accommodated by changes at both the instance level of data and the structural level of the schema. This must be considered an update anomaly.

3. July 1997: ongoing changes

Figure 20. Ongoing changes in July 1997 (ER diagram; additions include SEPARATION, PARTICIPATION-TRAIL-(NEW), PARTICIPATION-TRAIL-(NEW)-FOR-BENEFIT, PARTICIPATION-TRAIL-FOR-BENEFIT, BENEFIT-FOR-EX-OBTAINED-BY-E.R.-EXCHANGE and several ex-spouse trail entities, marked as groups E to H)

Change drivers
The early-retirement innovation still acts as an important driving force nine months later. The several business process changes that were postponed earlier on are now being implemented. Two of those changes affect the CS of our case study:
- apply the legal peculiarities of benefit division upon divorce to early retirement,
- introduce various elective benefit options under the pension scheme.
Again, there is a strict business demand for uninterrupted services, and the elective benefit options are to be implemented with minimal impact on existing structures.

Changes in the CS
As before, the CS expands and becomes much more complex; lattice complexity increases to 14. Four semantic changes are discerned. Two justified changes implement some of the business process innovations that were postponed earlier:
(E) to accommodate divorce regulations, SEPARATION and several associated specializations and references are intricately woven into the CS, increasing overall complexity.
(F) PARTICIPATION-TRAIL-(NEW) and PARTICIPATION-TRAIL-(NEW)-FOR-BENEFIT are added; they are highly redundant with PARTICIPATION-TRAIL and PARTICIPATION-TRAIL-FOR-BENEFIT. The new entities were envisioned to absorb the old PARTICIPATION-TRAIL entities in a future CS version. For now, the new entities only capture the elective benefit options, except for four specific benefit options that were already dealt with by using attributes (not depicted) of the BENEFIT and E.R.-BENEFIT-LEVEL-2 entities. Thus, UoD changes are accommodated in an inconsistent way by changes in the CS on separate levels of abstraction.

We consider the next change to be unjustified:
(G) the complex derived reference describing how BENEFIT is related to PARTICIPATION-TRAIL data was absent from the initial CS. It is now captured by the PARTICIPATION-TRAIL-FOR-BENEFIT entity. The purpose of capturing these derived data in the CS is to accelerate future derivations that use the same data.
A further change is not justified by change in the UoD, but reflects an improvement in the way of modeling:
(H) the cardinality of the BENEFIT↑PARTICIPATION reference is relaxed from 1:1 to N:1.

Measurements

change                                  justified  size     compatible  extension  complexity  suscept. entity  suscept. refer  identity
(E) accommodate SEPARATION              yes        12 (+1)  all         yes        +4          (+1)             (+1)            all
(F) insert ...-TRAIL-FOR-BENEFIT-(NEW)  yes        5        all         yes        +1          0                0               all
(G) add ...-TRAIL-FOR-BENEFIT           no         3        all         yes        +1          0                0               all
(H) raise reference cardinalities       no         1        all         no         0           0                1               all
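Change (H), relaxing a reference from 1:1 to N:1, is a compatible change at the CS level; at the internal level it typically amounts to dropping a uniqueness constraint. A sketch of how such a relaxation might be carried out (table and column names are hypothetical; SQLite is used only for illustration, and it requires a table rebuild because constraints cannot be dropped in place):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE participation (id INTEGER PRIMARY KEY);
    -- 1:1 reference: the UNIQUE constraint permits at most one benefit
    -- per participation.
    CREATE TABLE benefit (
        id INTEGER PRIMARY KEY,
        participation_id INTEGER UNIQUE REFERENCES participation(id));
    INSERT INTO participation VALUES (1);
    INSERT INTO benefit VALUES (10, 1);
""")
# Relaxing to N:1 means dropping the UNIQUE constraint; the table is rebuilt
# and all existing data carries over unchanged, i.e. the change is compatible.
con.executescript("""
    CREATE TABLE benefit_new (
        id INTEGER PRIMARY KEY,
        participation_id INTEGER REFERENCES participation(id));
    INSERT INTO benefit_new SELECT id, participation_id FROM benefit;
    DROP TABLE benefit;
    ALTER TABLE benefit_new RENAME TO benefit;
    INSERT INTO benefit VALUES (11, 1);  -- now legal: two benefits, one participation
""")
n = con.execute("SELECT COUNT(*) FROM benefit WHERE participation_id = 1").fetchone()[0]
```

The converse change, tightening N:1 back to 1:1, would not be compatible: existing data could violate the restored constraint, which is why such relaxations tend to be one-way.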

4. February 1999: stabilized

Figure 21. Stabilized in February 1999 (ER diagram; the generalized EXCHANGE entity and the added POLICY-ATTRIBUTE are marked as groups I and J; the lower corner of the original diagram explains entity names that are abbreviated in the schema)

Change drivers
For over a year and a half, there are no important changes in the pension company business. This does not mean that the business is at a standstill; rather, it means that the current ways of doing business are satisfactory. The relative quiet in the UoD is reflected in the CS. While several intermediate CS versions were implemented, we can ignore them because they do not concern any features of our CS. Only one change is announced, to become effective as of summer 1999:
- new legislation forces all pension companies to offer their insured parties the option to exchange various kinds of pension benefits.

Changes in the CS
The CS version of February 1999 is impacted by the upcoming legislation in the UoD.
(I) in response to the broadened concept of exchange, a generalized EXCHANGE entity is introduced. It subsumes the former EXCHANGED-E.R.-BENEFIT and affects various other entities and specializations. Notice that the EXCHANGE↑BENEFIT reference shows a 1:1 cardinality, whereas the subsumed EXCHANGED-E.R.-BENEFIT↑BENEFIT reference had N:1 cardinality.


An unjustified change in the CS is made in anticipation of a future need to differentiate pension products across market segments and employers; no clear need or user requirement drives the change now:
(J) POLICY-ATTRIBUTE is added.

Measurements

change                       justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(I) EXCHANGE is generalized  yes        11    all         yes        +1          5                6               (1)
(J) add POLICY-ATTRIBUTE     no         1     all         yes        0           1                0               all

(1) at the CS level, entity identity is preserved for the instances of the subsumed entities. At the implementation level, the specializations inherit a new artificial key from the new generalization.
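The footnote's distinction can be made concrete with a small sketch of the subsumption step. All table and column names are hypothetical; the point is that the migrated instances keep their conceptual identity (here via benefit_id) while acquiring a new artificial key from the generalization:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Old specialized entity (schematic; real attributes omitted).
    CREATE TABLE exchanged_er_benefit (benefit_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO exchanged_er_benefit VALUES (1, 100.0), (2, 250.0);
    -- New generalization: every kind of exchange gets a new artificial key.
    CREATE TABLE exchange (
        exchange_id INTEGER PRIMARY KEY AUTOINCREMENT,
        kind        TEXT NOT NULL,
        benefit_id  INTEGER,
        amount      REAL);
""")
# Subsume the old instances under the generalized entity: conceptual identity
# is carried along in benefit_id, but each row inherits a fresh exchange_id.
con.execute("""
    INSERT INTO exchange (kind, benefit_id, amount)
    SELECT 'early-retirement', benefit_id, amount FROM exchanged_er_benefit
""")
keys = [row[0] for row in con.execute("SELECT exchange_id FROM exchange ORDER BY 1")]
```

This is why the measurement records '(1)' rather than 'all' in the identity column: identity is preserved conceptually but re-keyed at the implementation level.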

5. September 1999: simplification

Figure 22. Simplification in September 1999 (ER diagram; the CONTRACT, CONTRACT-CONDITIONS and PRODUCT-IN-CONTRACT additions, the eliminations of derived-data entities, and the restructurings are marked as groups K to P)

Change drivers
Apart from the new legislation, there is only one business change in seven months, and it is internal to the enterprise: a new line of thinking at the strategic management level causes the business perception of pension products to change:
- pension products should be differentiated across market segments and employers, instead of being the same for all kinds of customers.
We also notice an important reversal in one of the modeling guidelines:
- the CS is to record only original source data, and data that are required by law. All other derived data are to be eliminated from the CS and relegated to the Internal-Schema level.

Changes in the CS
The new philosophy in product engineering has little relevance for the current way of recording benefit data, but it does have an impact on the key integrity of the POLICY entity:
(K) the concept of CONTRACT and a dependent entity CONTRACT-CONDITIONS are introduced, and the POLICY↑PRODUCT reference is redirected via an intermediate PRODUCT-IN-CONTRACT entity.
Previous CS versions recorded the derived relations between BENEFIT and PARTICIPATION-TRAIL data, but the new CS version does away with all this. While functionality and complexity now have to be accounted for on the application level, the pay-off at the CS level is a remarkable simplification, decreasing lattice complexity from 15 to 11:


(L) the derived-data entities PARTICIPATION-TRAIL-(NEW) and PARTICIPATION-TRAIL-(NEW)-FOR-BENEFIT are eliminated, as their source data (not depicted in this CS) are still available.
(M) PARTICIPATION-TRAIL-FOR-BENEFIT as well as PARTICIPATION-TRAIL-FOR-EARLY-RETIREMENT-LEVEL2 are eliminated.
The previous CS version introduced the EXCHANGE entity subsuming EXCHANGED-E.R.-BENEFIT. This CS version goes one step further:
(N) the subsumed EXCHANGED-EARLY-RETIREMENT-BENEFIT specializations are suppressed.
Finally, there is some restructuring in the CS as derived-data entities are about to be eliminated, while data in member entities had to be safeguarded. The changeover was carried out in two steps, and the restructuring in this CS version paved the way for a simplification in the next CS version:
(O) the BENEFIT-OBTAINED-BY-EXCHANGE↑PARTICIPATION-TRAIL reference is redirected to its owner, and becomes the BENEFIT-OBTAINED-BY-EXCHANGE↑POLICY reference.
(P) the PARTICIPATION-TRAIL entity is restructured. The entity intent is reduced to contain only instances for which the original source data is no longer available; this comes down to instances with a start date before January 1, 1996. We renamed the entity in our CS evolution for clarity, but the old name was retained in the operational CS. A final detail is the redirection of the TRAIL-PREMIUM/REDUCTION↑PARTICIPATION-TRAIL reference to SUCCESSOR.

Measurements

change                                     justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(K) introduce the CONTRACT concept         yes        2     all         yes        0           1                1               (2)
(L) eliminate PARTICIPATION-TRAIL-NEW      yes        5     all         no         -1          2                3               all
(M) eliminate regular ..TRAIL-FOR-BENEFIT  yes        7     all         no         -3          2                5               all
(N) suppress three specializations         no         3     all         no         0           3                0               all
(O) redirect reference to owner POLICY     no         1     all         no         0           0                1               all
(P) restructure PARTICIPATION-TRAIL        no         1     all         no         0           0                1               all

(2) POLICY becomes a member entity of PRODUCT-IN-CONTRACT, instead of PRODUCT. In the CS, this is a minor change in reference and entity identity. Not so in the operational system: there was much deliberation about how this fundamental change of key could best be accommodated in the Internal Schema, and how stored data were to be converted to the new key. The problem was that the primary key of POLICY is part of the weak-entity keys of many member entities. In the end, an intricate solution was decided upon that circumvented the extensive conversion of all member entity primary keys.
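The source does not disclose the intricate solution of footnote (2), but one common way to circumvent a cascade of weak-entity key conversions is a translation structure: member entities keep referencing the old POLICY key, and the new parent is resolved indirectly. A purely hypothetical sketch of that idea:

```python
# Hypothetical sketch: POLICY keys stay unchanged in all member entities,
# while a translation table maps each old policy key to its new
# PRODUCT-IN-CONTRACT parent. No member-entity primary key is converted.
policy_to_product_in_contract = {
    "P100": ("C1", "old-age"),      # (contract, product) pair
    "P101": ("C1", "dependents"),
}

def parent_of(policy_key):
    # Member entities keep the old policy key; the new parent is resolved
    # through the translation table instead of through a converted key.
    return policy_to_product_in_contract[policy_key]

parent = parent_of("P100")
```

Whether the company's actual solution took this form is not stated; the sketch only shows why such an indirection avoids touching the many weak-entity keys.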

6. August 2000: ongoing maintenance

Figure 23. Ongoing maintenance in August 2000 (ER diagram; the restructured PARTICIPATION-TRAIL entities are marked as group Q)

Change drivers
There are no material business changes in the next year. Such changes as there are do not concern the integrated benefit administration as we study it.

Changes in the CS
Again, the CS is simplified while a change driver is lacking. It is a deferred effect of the decision to eliminate derived data from the CS:
(Q) PARTICIPATION-TRAIL-BEFORE-'96 is further restructured. A subset of instances of the SUCCESSOR specialization, those that are timestamped at 1-1-1996, is promoted to entity. We renamed it to prevent homonyms; the original CS documentation uses the old name. The reference to POLICY is retained for the specialization but eliminated for the generalization. The reflexive reference evolves into an ordinary aggregation reference.

Measurements

change                               justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(Q) restructure PARTICIPATION-TRAIL  no         2     all         no         -1          1                2 (+1)          (3)

(3) the promoted entity can no longer inherit its key from its former generalization.


5.1.4 Highlights of the case

Having described the long-term evolution of this single CS in considerable detail, we now look at the overall picture as the CS evolves over time.

1. CS size increases
The first and foremost observation is that the CS has successfully adapted to past changes, while its overall composition has remained relatively stable over almost half a decade.

Table 3. Additions, deletions and changes of entities and references per CS version (table body not reproduced; lattice complexity per version: January 1996: 3, October 1996: 8, July 1997: 14, February 1999: 15, September 1999: 11, August 2000: 10). Numbers in parentheses in the 'entity' columns indicate specializations; numbers in parentheses in the 'reference' columns indicate the corresponding specialization-to-generalization injective references.
Table 3 summarizes the additions, deletions and changes of both entities and references in the evolving CS. A fast expansion is seen from January 1996 to July 1997, corresponding to the implementation of the Benefit Administration system and the subsequent innovation of the Early-Retirement pension scheme. In this period, size and complexity of the CS increase, while understandability and consistency decrease. Thereafter, additions and deletions are more evenly balanced. During this time of relative quiet, most business developments are accommodated with minor changes to the CS.

2. CS hierarchy is stable
When reviewing each entity of the CS, we find that a real-world concept, once it is modeled in the CS, does not evolve very much. The intent definitions of most entities remain unchanged. In addition, the position of entities relative to all other entities in the CS is stable, as evidenced by the fixed locations of all entities in the diagrams. At the same time, all entities have changed in some way or another, except CUSTOMER and BENEFIT-PREMIUM/REDUCTION. While references to owner entities rarely change, references to member entities are more volatile: an entity can aggregate one or two member entities at one time, but six or seven at another. This reflects a common practice in maintenance. When inserting a new entity, it is easier to create a new reference to an existing entity than from one, as the latter operation has immediate consequences for referential integrity and is usually avoided.
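The asymmetry between adding a reference to versus from an existing entity can be illustrated with a minimal sketch (table names are hypothetical; SQLite is used only for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    CREATE TABLE existing_entity (id INTEGER PRIMARY KEY);
    INSERT INTO existing_entity VALUES (1), (2);
""")
# Easy case: a new entity with a reference TO the existing one. Existing rows
# and applications are untouched; referential integrity holds from the start.
con.executescript("""
    CREATE TABLE new_entity (
        id INTEGER PRIMARY KEY,
        existing_id INTEGER NOT NULL REFERENCES existing_entity(id));
    INSERT INTO new_entity VALUES (10, 1);
""")
# Hard case: a reference FROM the existing entity. The added column is NULL
# for all current rows, so every existing instance needs backfilling before
# the reference could ever be made mandatory.
con.execute(
    "ALTER TABLE existing_entity ADD COLUMN new_id INTEGER REFERENCES new_entity(id)")
unfilled = con.execute(
    "SELECT COUNT(*) FROM existing_entity WHERE new_id IS NULL").fetchone()[0]
```

The backfilling burden in the second case is exactly the "immediate consequences for referential integrity" that maintenance engineers prefer to avoid.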


This underpins the importance of the compatibility requirement to safeguard prior investments in existing CS constructs and applications. This demand constitutes a strong drive towards stability. The disadvantage is that it also restrains maintenance engineers in their search for alternative solutions. They must often settle for a design solution with some flaw in it, resulting in a loss of quality in the long term.

3. Contributions of the modeling guidelines to the CS evolution
Earlier we mentioned four of the enterprise's modeling guidelines relevant for the engineers who worked on this case. We now discuss how each has affected the overall CS evolution.
- single unified view and therefore single point of maintenance:
This guideline was well adhered to. The Benefit Administration department used several other information systems based on different CSs, and we find that the various CSs are all in full agreement. While the guideline as such does not address stability, it has minimized schematic discrepancies and contributed to CS stability.
- high level of generalization and therefore low maintenance:
This guideline was not well adhered to. One reason is that long-term intangible goals such as this one are often overridden by business pressures for quick solutions. Another reason is suggested by the notion of exchange. It was first introduced for Early-Retirement benefits in the 1996 version, but it was generalized to other kinds of benefits in later CS versions. Contrary to the guideline, the initial level of generalization appears to have been too low. However, the guideline provides no guidance as to the best possible level of generalization. We therefore feel that the guideline as it stands is impractical for business usage.
- snapshot data if possible, historical data where obligatory:
This guideline's contribution to stability is uncertain on the level of the overall CS. It has primarily affected the timestamp attributes of various entities, but we did not clearly detect an impact of the guideline on entities or references.
- representing derived data in the CS:
This guideline itself changed over time. Therefore, it is not surprising that the guideline did not contribute to the long-term stability of the CS.


5.2 CASE STUDY: SETTLEMENT OF PENSION BENEFITS

5.2.1 Introduction

This case concerns the settlement of pension benefits when people break up their relationship. The series of changes begins in 1994, in response to a law that took effect as of May 1, 1995. To meet the legal deadlines, the initial CS changes were quickly implemented, recording only the benefit amounts receivable by the ex-partner. These amounts were calculated according to the rules and regulations laid down in the new law and recorded afterwards. However, the source data as well as the derivation processes were beyond the scope of the initial CS, and the amounts on record could not be reproduced from other data captured in the CS. Later CS changes brought the source data within the scope of the CS, and the final evolution step eliminated the derived-data entities. There was no instantaneous switch to a new, optimal and totally integrated CS. Rather, the CS gradually evolved through a series of consecutive changes.

Figure 24. CS versions and dominant business changes (timeline; CS versions: mid 1994, end 1994, mid 1997, spring 1999; dominant business changes: pension settlement introduced, systems federation, special-purpose system phased out)

Figure 24 shows the time series of CS versions of the case study. The intervals between consecutive versions vary between half a year and two years. To reconstruct the sequence of CS versions of the Pension Settlement system, we used the available design specifications plus incremental printouts of the database management system. A preliminary version of this case study was presented at the INTERSYMP 2000 conference [Wedemeijer'00aug].


5.2.2 Business background

1. The Company
This business case has been obtained from the same pension fund as the Benefit case.

2. Business Functions
The basic idea behind settlement is that both partners are equally entitled to a pension benefit, even if only one of them is an employee participating in the pension scheme. When a relationship is broken up, doing nothing equals granting the entire pension benefit to the employee with benefit insurance, depriving the non-working partner of his (or, in the majority of cases, her) future benefits. As this has a decidedly discriminatory effect, a law was passed to remedy it. The new law forced all life insurance companies to offer pension settlements to their customers. In our case, pension settlements began to be recorded in the course of 1994.

The new law defined two distinct methods to settle pension benefits. The majority of cases are settled by dividing the future allowance that the insured party will receive after retirement. The amount due to the ex-partner is calculated and recorded; this amount will be deducted from the insured party's pension allowance in the future. Notice that, as a side effect, a relationship between the ex-partners continues to exist through the settlement. Therefore, another, less frequently used method of settlement is offered that enables ex-partners to sever all ties. Instead of merely receiving a share, the formerly dependent partner becomes an independently insured party in the pension scheme, and a separate benefit is recorded for her or him. The mandatory contribution for this separated benefit is instantly deducted from the employee's benefit. The main difference from the previous method is that a separated benefit for the ex-partner will continue to be paid even if the insured party dies.

3. Information Technology and Modeling Guidelines
The CS of this business case is carried by two different information systems. The principal information system is the Benefit Administration system. This integrated system was under development for some time before 1994, but it did not yet incorporate the concepts of pension settlement and separation. Indeed, the timing of the new law was considered unfortunate, as a massive system (re)engineering effort was going on at the time. Therefore, a delay strategy was decided upon, and a small special-purpose information system called 'Pension Settlement' was developed. The two systems were loosely coupled at first, and no electronic interfaces were provided. Data exchange was manual, and the users were responsible for consistency checking across the two systems. Over time, the Benefit Administration system gradually expanded to cover more aspects of the pension settlement business functions. The CS evolved to capture more data, both source data and calculated data. Only once the source data were completely and accurately recorded in the Benefit Administration system were the derived-data entities removed from the CS and the special-purpose information system discontinued.


5.2.3 The CS evolution: 1994 - 1999

1. Before introduction of Settlement, mid-1994

Figure 25. Before introduction of Settlement, mid-1994 (ER diagram; entities: CUSTOMER, INSURED-PARTY, PENSION-POLICY, PARTICIPATION, RELATIONSHIP, BENEFIT)

This is the CS before the introduction of the settlement laws and regulations. A pension is modeled by the single hierarchy INSURED-PARTY, POLICY, PARTICIPATION, BENEFIT. The secondary PARTICIPATION↑CUSTOMER reference records the customer who is beneficiary of the pension benefit. This may be the INSURED-PARTY of the POLICY, as in old-age pensions, or it may be the partner, as in dependent pensions. The RELATIONSHIP entity records current partners (married or living together). The entity is asymmetric by definition: a single real-world relation will be recorded twice if both partners happen to be INSURED-PARTY. No provisions are made in the CS to record information about former relationships. The CS permitted such information to be retained, but users were not obliged to record and maintain the data. On the contrary, it was generally felt that such historical data served no purpose other than to pollute the database.
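The double recording that the asymmetric RELATIONSHIP entity allows can be made concrete with a minimal sketch. The identifiers are hypothetical; the point is that applications, not the CS, must recognize that two rows mentioning the same pair in either order denote one real-world relation:

```python
# Two rows describe the same real-world relation when both partners happen
# to be INSURED-PARTY; a third row describes a different relation.
relationship_rows = [("cust7", "cust9"), ("cust9", "cust7"), ("cust3", "cust5")]

def distinct_relations(rows):
    # Canonicalize each pair by sorting it, so (a, b) and (b, a) coincide.
    return {tuple(sorted(pair)) for pair in rows}

n_real_world = len(distinct_relations(relationship_rows))
```

Such canonicalization logic is another example of a consistency rule that the CS cannot express and that is therefore relegated to the application level.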

2. Introduction of Settlement, end of 1994

Figure 26. Introduction of Settlement, end of 1994 (ER diagram; additions: SETTLEMENT, SEPARATION, EX-PARTNER, SUM-OF-REDUCTIONS, SUM-OF-DIVIDED-BENEFITS, SEPARATED-BENEFIT, marked as groups A and B)

Change drivers
In 1994, the law on pension settlements is passed, introducing the variants of division and separation. To meet the deadline set by the new law, a special-purpose 'Pension Settlement' system is created and pressed into operation, covering only the most essential concepts.

Changes in the CS
The change impact is kept down to a minimum, compromising overall CS quality:
(A) SETTLEMENT is a new specialization of RELATIONSHIP. The intent of RELATIONSHIP is extended to cover all terminated relationships that have been settled under the new law. EX-PARTNER becomes an explicit specialization in the induced specialization PARTNER. The total allowance for an ex-partner is recorded as SUM-OF-DIVIDED-BENEFITS, thus accounting for the fact that a person may be involved in more than one settlement. On the opposite side, SUM-OF-REDUCTIONS records the total amount taken off the employee's benefit.
(B) the business function of Settlement introduces SEPARATION as a specialization of SETTLEMENT, as well as SEPARATED-BENEFIT as a specialization of BENEFIT. The two new specializations are connected by a compulsory reference.

Measurements

change                                  justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(A) introduce SETTLEMENT                yes        5     all         yes        0           2                0               all
(B) introduce SEPARATION and reference  yes        3     all         yes        +1          2                0               all


Exploring Conceptual Schema Evolution

3. Expanded schema of 1997

[Figure 27 diagram: the expanded CS of mid-1997, adding DIVIDED-BENEFIT with its dependent entity ADJUSTMENT, BENEFIT-DEDUCTION, a SETTLEMENT↑SUM-OF-DIVIDED-BENEFITS reference, and a reflexive reference on BENEFIT; changes (C) through (F) are marked.]
Figure 27. Expanded schema of 1997

Change drivers
Previously, all calculations were performed manually, without recording the relevant source and intermediate data. The CS is now extended to capture more data, both source data and calculated data. This improves the level of federation and data consistency between the two information systems carrying this CS. Although the changes affect all relevant entities, the added entities were implemented in the Benefit Administration system only; the Internal Schema of the special-purpose system was left unchanged. The changes are rather complex, and require extensive adaptations in both the user applications and the CS in order to safeguard compatibility in the face of incomplete historical data.

Changes in the CS
Although there is no material change in the UoD, the CS gradually expands. This is a belated effect of the change in legislation, and therefore we consider it unjustified:
(C) DIVIDED-BENEFIT and a dependent entity ADJUSTMENT are braided into the CS. These entities record some intermediate data while the customer request for pension settlement is being processed. Depending on customer circumstances, the ADJUSTMENT entity records special provisions that may apply.
(D) a reference SETTLEMENT↑SUM-OF-DIVIDED-BENEFITS is added to record which settlements contribute to a certain sum. It is recorded for new entries only, not for existing ones. If the ex-partners opt for separation rather than division, then the reference instance is deleted upon completion of the request processing.
(E) BENEFIT-DEDUCTION is inserted. It contains specifics for SEPARATED-BENEFIT when ex-partners opt for separation. As the specification is not always present in the database, the reference cardinality is modeled as optional-to-1. This is an example of the 'temporarily specified' situation as will be discussed in chapter 7. Moreover, BENEFIT-DEDUCTION is redundant with information contained in the SEPARATION entity.
(F) a reflexive reference is created on the BENEFIT entity to capture which regular BENEFIT is deducted by which SEPARATED-BENEFIT. It is modeled as optional for the entire BENEFIT entity, but is only applicable for the SEPARATED-BENEFIT specialization. Moreover, the new reference is equivalent to the reference path from SEPARATED-BENEFIT via BENEFIT-DEDUCTION to BENEFIT, except that the appropriate BENEFIT-DEDUCTION instances are often lacking. This is an example of the 'survives' temporal situation.

Measurements

change                                      justified  size  compat.  ext.  complex.  susc.ent  susc.ref  identity
(C) append DIVIDED-BENEFIT and ADJUSTMENT   no         6     all      yes   +2        0         0         all
(D) connect with SUM-OF by reference        no         1     all      yes   +1        1         0         all
(E) append BENEFIT-DEDUCTION                no         4     all      yes   +2        0         0         all
(F) put a reflexive reference on BENEFIT    no         1     all      yes   +1        1         0         all

4. Improved schema of 1999

[Figure 28 diagram: the CS of 1999 after elimination of the SUM-OF-DIVIDED-BENEFITS and SUM-OF-REDUCTIONS entities; changes (G) and (H) are marked.]

Figure 28. Improved schema of 1999

Change drivers
Some time later, source data for almost all of the SUM-OF-DIVIDED-BENEFITS and SUM-OF-REDUCTIONS instances have been recorded in the Benefit Administration database. As the special-purpose system is now redundant, its usage is terminated, causing the final changes in the evolution of our CS.

Changes in the CS
The two SUM entities are removed from the CS. Actually, they are moved to the level of external schemas: the data remains available to the user community, henceforth to be derived on demand. Simplicity of the CS is improved, and lattice complexity goes down from 9 to 6:
(G) SUM-OF-DIVIDED-BENEFITS is eliminated. The related PARTNER specialization reduces again from explicit to an induced specialization, just as it was prior to the introduction of the settlement law.
(H) SUM-OF-REDUCTIONS is eliminated, which has minor impact on the three related entities.

Measurements

change                               justified  size  compat.  ext.  complex.  susc.ent  susc.ref  identity
(G) eliminate SUM-OF-DIVIDED-BENEFITS  no       3     all      no    -1        3         2         all
(H) eliminate SUM-OF-REDUCTIONS        no       4     all      no    -2        4         3         all


5.2.4 Highlight of the case

The highlight of this case is the presence of derived data in the CS, and its effect on schema evolution. Chapter 7 will discuss the issues related to derived data in more detail.

The SUM-OF-REDUCTIONS and SUM-OF-DIVIDED-BENEFITS entities must be considered derived-data entities after the entities DIVIDED-BENEFIT and BENEFIT-DEDUCTION have been introduced into the CS by changes (C), (D) and (E). The two aggregate entities are only removed in the next CS version, thus providing ample time to migrate data on old pension settlements from paper records into the database.

Even so, not all redundancy is removed. For instance, the SEPARATED-BENEFIT↑SEPARATION reference is redundant with the reference induced by DIVIDED-BENEFIT. The redundancy is not eliminated for two reasons. One reason is that semantically, each BENEFIT instance is understood to be the full record of that benefit; hence, a SEPARATED-BENEFIT instance is considered incomplete if it does not refer to the appropriate SEPARATION. A more practical reason is that some BENEFIT-DEDUCTION instances are absent from the database, which is another example of the temporal 'survives' situation.
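The idea of moving an aggregate out of the CS and deriving it on demand, as happened to the two SUM entities, can be illustrated with a small sketch. The function name and data layout below are hypothetical illustrations, not the actual implementation.

```python
from collections import defaultdict

# Hypothetical sketch: once the source DIVIDED-BENEFIT instances are
# recorded, an aggregate such as SUM-OF-DIVIDED-BENEFITS need not be
# stored as an entity; it can be derived on demand, as in an
# external-schema view.

def sum_of_divided_benefits(divided_benefits):
    """divided_benefits: iterable of (ex_partner_id, amount) pairs.
    Returns the total allowance per ex-partner, accounting for the fact
    that a person may be involved in more than one settlement."""
    totals = defaultdict(float)
    for ex_partner_id, amount in divided_benefits:
        totals[ex_partner_id] += amount
    return dict(totals)
```

Deriving the total in this way removes the redundancy between the stored sum and its source instances, which is exactly the consistency problem that stored derived data raises.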


5.3 CASE STUDY: SITES REGISTRATION

5.3.1 Introduction

This case outlines the evolution of a CS from 1983 until 1999, carried by two consecutive information systems. The life cycle of the CS can be traced back to 1983, a time when databases were just being introduced in the business area. The CS kept evolving ever since, adapting to the ever-changing business needs.

Our case study focuses on the concept of site, defined as 'location where company business is conducted in the direct interest of the general public'. The remarkable feature in this case is how this concept is modeled. Initially, it was modeled in an abstracted manner, but it slowly evolved towards a rather detailed way of modeling. This tendency is the opposite of commonly advocated best practices in design. The case demonstrates how the user need for a simple, explicit and understandable CS can outweigh the theoretical advantage of an abstracted, and supposedly flexible, way of modeling.

Figure 29 shows the time series of CS versions in the case study, with time intervals between consecutive versions varying from 2 to 4 years. We reconstructed the CSs from whatever documentation we could salvage from systems maintenance, the bulk of the material being printouts from the database management system. Additionally, we used contemporary material clarifying how and why the database changed at the time.

[Figure 29 timeline: CS versions of 1983, 1986, 1989, 1993, 1996 and 1998, with the dominant business changes: state-awareness in the data, schema restructuring, focus on core business, extensions, and market orientation.]

Figure 29. CS versions and dominant business changes

A preliminary version of the case was published in the Dutch journal INFORMATIE [Wedemeijer'99oct].


5.3.2 Business background

1. The company
This business case has been obtained from the national Dutch mail company and its subsidiary company Postkantoren B.V. The company provides the public with familiar services like mail delivery, collection of mail and small packages, document handling, and basic financial transactions. The business is run through a fine-grained network of ordinary post offices. In addition, many sites perform vital functions but are not open to the public, such as the mail sorting and distribution centers, company headquarters and branch offices, the mainframe computer center, technical support units, etc.

2. Business functions
The business function considered in this case is the registration and organization of business sites. Before the introduction and implementation of this CS, every information system had to define, build, and maintain its own unique view of the organization of sites. The purpose of this CS was to provide an integrated and consistent view of all business sites in their hierarchical context, and thus to provide unique site identifications with complete, correct and up-to-date descriptive information about each site. For over a decade, the database operated as the central data store on site information for all company information systems handling site-dependent information, such as marketing systems (what is the daily sales volume per site), the personnel database (who is employed at each site), the general ledger (where was a financial transaction generated), etc.

3. Management structure
The Dutch mail company and its subsidiary Postkantoren B.V. are centrally organized. The main variable in the management structure is the business hierarchy: how does the chain of command run through the enterprise. As the management hierarchy was liable to change, the CS was required to be highly flexible, permitting changes in the management structure with minimal impact on the CS structure and stored data.

In 1983, the management structure was based on geography, and this is reflected in the CS structure, which is dominated by the geographic decomposition into so-called domains. Each site is allocated to a domain; a managing unit was associated with a domain, e.g., a province, and therefore managed all sites allocated to the enclosed domains, e.g., cities. In later CS versions, geography became less dominant and the hierarchy of units became more functionally oriented. The latest development that we were able to capture in the case study is the reorganization of the management structure according to commercial marketing formulas.

4. Daily operations
Daily operations concern updating the database, and taking care that the updated data are distributed to other systems. Although values for site attributes such as name or address rarely change, there are dozens of updates per week due to the large number of sites. Furthermore, a change of a unit at a higher management level (in practice: when units merge) results in a cascade of changes down the hierarchy of units. As the Sites system is vital for correct operation of the many information systems that access the central database for site-dependent information, data maintenance is taken care of by the central Information Systems department.


5. Information Technology and Modeling Guidelines
Over time, two consecutive information systems have carried the CS of our case study. The initial information system and database used mainframe technology: the Adabas database management system and its proprietary Cobol-like programming language Natural. It went operational in 1983 and was used well into the nineties. The primary database was downsized in 1993 to a network-oriented PC environment. It used a Dbase-like database management system and programming language under the mainstream DOS operating system. Our case description tracks the evolution of the CS as incorporated in this downsized information system. Meanwhile, the old mainframe database was kept up and running, henceforth serving as the front end to interface with the many mainframe-based applications accessing the central database. This mainframe database was phased out when all legacy mainframe applications were either terminated or downsized themselves.

As to modeling guidelines, the prevailing one is the demand for full compatibility in every evolution step. As the Sites database is the single data store accessed by many, if not all, core business applications, any structural change in the Sites database will immediately affect all those applications. Moreover, there are so many interfaces that it is virtually impossible to change all of them simultaneously. Therefore, the maintenance demand is not that change is prohibited, but that the impact of change is always kept down to the bare minimum.

6. Design and maintenance
We already mentioned how the central Information Systems department is responsible for data maintenance. It is also responsible for system maintenance, and for changes to the CS. Ordinarily, changes to information systems are requested and budgeted by line management. However, change requests for the Sites system are issued by the system engineers working on systems interfacing with the Sites database. Thus, the Information Systems department is in the unique position to issue its own maintenance assignments. Nevertheless, maintenance is done only with reluctance, to prevent any unnecessary and costly disruption of dependent systems, taking great care that changes on the CS level are well coordinated with ongoing updates of the data instances.


5.3.3 The CS evolution: 1983 - 1998

1. Initial schema in 1983

[Figure 30 diagram: the initial CS of 1983, with the generalized DOMAIN, CLASS and UNIT entities, the reflexive part-of and managed-by references, the allocated-to, old and new class-of, and co-located-with references, and the operational unit and department specializations of UNIT.]

Figure 30. Initial schema in 1983

The initial CS is simple, due to its abstract way of modeling. The generalized UNIT and DOMAIN entities, combined with reflexive references, are expected to provide a high level of versatility in this CS. Specific business constraints, implemented at the application level, ensure that all stored data comply with the current business rules for the organization hierarchy. The references in this CS diagram, and the ones to follow, are labeled for ease of reading and to express the differences in reference semantics. Notice how the managed-by reflexive reference on units is a close match, but not an exact copy, of the part-of geographic decomposition of domains. The OPERATIONAL-UNITs perform operational business functions on a daily basis, while departments, i.e., staff and support units, do not contribute to production. DEPARTMENTs are co-located with operational units.

Lattice complexity in this initial CS is 4. The schema is simple enough to write down the associated constraints. The two reflexive references part-of and managed-by are constrained to be non-cyclic hierarchies. The old- and new class-of references are not associated with a conceptual constraint, and the co-located-with reference also has no particular restriction imposed on it.
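The non-cyclic constraint on the part-of and managed-by references can be sketched as a simple application-level check. This is an illustrative sketch only; the function name and data representation are assumptions, not taken from the actual Sites system.

```python
# Hypothetical sketch of the application-level constraint that a
# reflexive reference such as part-of or managed-by must form a
# non-cyclic hierarchy: each instance refers to at most one parent,
# and following parents must never revisit an instance.

def is_acyclic_hierarchy(parent_of):
    """parent_of: dict mapping an instance key to its parent key,
    or to None for a root. Returns True iff following the reflexive
    reference from any instance never produces a cycle."""
    for start in parent_of:
        seen = {start}
        node = parent_of[start]
        while node is not None:
            if node in seen:
                return False          # cycle detected
            seen.add(node)
            node = parent_of.get(node)
    return True
```

A check like this belongs at the application level in the 1983 design, which is precisely why the CS itself could stay so simple: the schema imposes no restriction, the code does.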

2. State-awareness in 1986

[Figure 31 diagram: the state-aware CS of 1986; each of DOMAIN, CLASS and UNIT is split into a time-invariant entity and a dependent ...-PERIOD entity; change (A) is marked.]

Figure 31. State-awareness in 1986

Change drivers
The structure of the UoD changes in a single respect only: all information needs to be state-aware. It is not enough to record the current organization of the company; users also need to know the situation in the (recent) past. Consider for example the monthly sales summary by domain. Adding up sales for all sites that currently belong to the domain may produce a wrong answer: a site may have been closed down, or relocated to another domain in the course of the month, etc.

Changes in the CS
The initial CS was not designed to meet this simple, yet pervasive user demand, and an extensive change was called for:
(A) all entities are duplicated. One part records the time-invariant data, i.e., the unique key plus start and end date of its lifetime existence. All other data is time-varying and is recorded in a state-aware dependent entity.

Measurements

change                          justified  size  compat.  ext.  complex.  susc.ent  susc.ref  identity
(A) introduce state-awareness   yes        6     all      yes   0         3         0         (1)


(1) the change was implemented in a rather complicated way to minimize the impact of change. A strict compatibility demand was imposed: a data access routine was not allowed to be altered if it worked fine under the old Internal Schema. Therefore, the change in the Internal Schema did not follow the pattern of change in the CS. Instead, the original database tables were left untouched, reinterpreting them as recording only the present state of affairs. Historical data, i.e., data not currently valid, was recorded by duplicating all database tables and extending these with appropriate valid-time attributes. Primary keys for these tables were constructed by concatenating the original key with a valid-time attribute. However, the implemented software moved instances from the currently-valid table into the historical table as soon as an end date was recorded. This wrongly assumed that an instance with a known end date cannot be currently valid, making it virtually impossible to record upcoming data updates and future end dates. Querying the data histories was also hampered, as relevant instances could be located in either of the two database tables, depending on whether the instances were still currently valid.
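The flaw described in this note can be sketched as follows. Table layout and names are hypothetical, chosen only to illustrate why recording a future end date wrongly removes a still-valid instance from the current table, and why queries must then consult both tables.

```python
# Hypothetical sketch of the flawed 1986 implementation: two tables,
# one for currently-valid instances and one for history, with the
# historical primary key formed by concatenating the original key
# with a valid-time attribute.

def record_end_date(current, history, key, end_date):
    """Mimic the implemented behaviour: as soon as an end date is
    known, the instance is moved out of the currently-valid table,
    even when that end date still lies in the future."""
    row = current.pop(key)
    row["end_date"] = end_date
    history[(key, end_date)] = row    # key concatenated with valid time

def lookup(current, history, key, as_of):
    """Querying is hampered: the relevant instance may sit in either
    table, depending on whether it is still 'currently valid'."""
    if key in current:
        return current[key]
    for (k, end_date), row in history.items():
        if k == key and as_of <= end_date:
            return row
    return None
```

A bitemporal or single-table valid-time design would avoid the move entirely; the sketch deliberately reproduces the flawed split to make its consequences visible.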

3. Restructuring in 1989

[Figure 32 diagram: the restructured CS of 1989, separating MANAGERIAL-UNIT (with the LOCAL-UNIT specialization) from OPERATIONAL-UNIT, each with ...-PERIOD entities, and appending TYPE-OF-OPERATIONS with its ...-PERIOD entities; changes (B) through (F) are marked; DOMAIN is removed.]

Figure 32. Restructuring in 1989

Change drivers
The previous organizational structure of the UoD permitted a unit to be an operational unit and carry responsibility for subordinate units at the same time. The new corporate perception is that managerial and operational responsibilities ought to be differentiated. A new business rule was introduced that forbids this duplicity of roles:
- an OPERATIONAL-UNIT cannot manage another unit.
Enforcing this rule affects the running organization, and it took a while to reorganize the business, create new managerial units, and get the operational units organized in the new way. A further business decision is:
- to record the operational duties of each unit centrally.
Whereas the exact duties of each unit were previously known only locally, these are now to be recorded explicitly in the Sites system.

Changes in the CS
The CS was restructured even though restructuring was unnecessary to accommodate the UoD changes. It would have sufficed to adjust the specific business constraints implemented at the application level. Two coherent groups of changes in the new CS are associated with these business changes:


(B) the new business rule is accommodated by separating out the OPERATIONAL-UNITs. There is a change of definition, as former OPERATIONAL-UNITs were allowed to carry managerial responsibilities that are denied to them under the new business rule. MANAGERIAL-UNIT represents units that do not perform operational duties. Notice the various side effects:
- the DEPARTMENT specialization is absorbed by MANAGERIAL-UNIT,
- a new specialization LOCAL-UNIT is introduced that manages OPERATIONAL-UNITs,
- the reference to CLASS is restricted to OPERATIONAL-UNIT only; consequently, the entity intent of CLASS is restricted.
(C) TYPE-OF-OPERATIONS and the associated ...-PERIOD entities are appended as a specification of OPERATIONAL-UNIT-PERIOD.

The restructuring of the CS went further, and three more changes are seen in the CS that we could not associate with a definite UoD change driver:
(D) the DOMAIN and its associated DOMAIN-PERIOD concepts, relating to geography, are removed from the CS. Apparently, the management structure is no longer dictated by geography only. It is still guided by it, as we find that units are consistently created in such a way that coherent domains are covered.
(E) the old-class-of reference, which has been obsolete for several years, is finally dropped.
(F) while the co-located-with reference is still valid from a conceptual point of view, it is discovered to be irrelevant. No data was recorded for it, and as there were no complaints, it appeared to be unused. It is now eliminated.

Measurements

change                                  justified  size  compat.  ext.  complex.  susc.ent  susc.ref  identity
(B) operational vs. managerial          yes        4     manual   no    0         2         6         (2)
(C) insert TYPE-OF-OPERATIONS           yes        6     all      yes   0         0         0         all
(D) drop geography                      no         5     all      no    -1        2         3         all
(E) old-class-of reference dropped      no         1     all      no    -1        0         1         all
(F) co-located-with reference dropped   no         1     all      no    -1        0         1         all

(2) OPERATIONAL-UNIT and MANAGERIAL-UNIT entities now have distinct candidate keys. For reasons of compatibility, the former primary-key attribute of UNIT, which is still used by many legacy applications, is retained in both new entities.

4. Focus on core business in 1993

[Figure 33 diagram: the CS of 1993, with OUTLET (replacing OPERATIONAL-UNIT), AREA and ZONE (replacing MANAGERIAL-UNIT), the new ADMINISTRATIVE-UNIT, RECORDING-UNIT, FRANCHISER, CONTRACT-PERIOD and TRADE entities with their ...-PERIOD entities, and the reflexive previous reference on OUTLET; changes (G) through (N) are marked.]

Figure 33. Focus on core business in 1993

Change drivers
Like many other companies in the nineties, the company refocuses on core business:
- the focus is restricted to one type of operational unit only: the Post Office, as in shop outlet. Other operational units (i.e., mail sorting and distribution centers) are disposed of.
- administrative tasks like bookkeeping, logistics, and scheduling are concentrated in new ADMINISTRATIVE-UNITs, taking these back-office chores away from the operational and managerial units. The tasks are shared among a number of RECORDING-UNITs.
- the corporate management structure was standardized into a two-tiered structure. A third tier, consisting of corporate staff and support units, was not captured in the CS.
The company pursued another trend of the nineties, the downsizing of mainframe systems, resulting in various other changes in the CS:
- the Sites database must record which outlets are run by franchisers in order to feed that information into the downsized Franchiser system.
- it was found that Marketing and Operations had different views of the business. Particularly, there was a difference of opinion about what constitutes the life cycle of outlets.

Changes in the CS
The combined effects of refocusing and downsizing presented engineers with an opportunity to make major changes in the CS, decreasing the level of abstraction considerably:


(G) the concept of OPERATIONAL-UNIT is narrowed down in scope, and renamed to OUTLET.
(H) TYPE-OF-OPERATIONS and its associated entities are dropped. In the previous CS, 'outlet' was an instance of the TYPE-OF-OPERATIONS entity, one type of operation among the many that an operational unit could perform. After refocusing on core business, it is the only one and no longer needs to be captured in the CS. It is eliminated accordingly.
(I) ADMINISTRATIVE-UNIT and RECORDING-UNIT are inserted. By combining both recorded-at references from an OUTLET-PERIOD, it can be derived to which ADMINISTRATIVE-UNIT a RECORDING-UNIT belongs. While these new units are specializations of MANAGERIAL-UNIT, the recorded-at reference is not a specialization of managed-by.
(J) AREA and ZONE replace MANAGERIAL-UNIT. The reflexive managed-by reference is replaced by two references, and lattice complexity decreases. The change is unjustified because specializing the MANAGERIAL-UNIT entity could have accommodated it.
(K) an optional FRANCHISER and its associated CONTRACT-PERIOD are inserted.
(L) for each shop outlet, the kind of business trade is recorded, such as tobacconist or bakery. For outlets run by the company itself, the shop trade is recorded simply as post office.
(M) a reflexive reference labeled previous is created for the OUTLET entity to meet the extended view of the outlet life cycle held by the Marketing department.
Last, no clear change driver in the UoD could be identified for the change:
(N) the CLASS entity is dropped.

Measurements

change                                   justified  size  compat.  ext.  complex.  susc.ent  susc.ref  identity
(G) scope restricted to OUTLET           yes        3     all      no    0         2         0         all
(H) TYPE-OF-OPERATIONS dropped           yes        6     all      no    0         1         0         all
(I) ADMINISTRATIVE- and REC.UNIT added   yes        4     all      yes   0         1         0         all
(J) MANAGERIAL-UNIT replaced             no         4     manual   no    -1        3         4         (3)
(K) FRANCHISER inserted                  yes        5     all      yes   +1        1         0         all
(L) TRADE inserted                       yes        4     all      yes   0         1         0         all
(M) previous reference inserted          yes        1     all      yes   +1        1         0         all
(N) CLASS dropped                        no         4     all      no    0         1         1         all

(3) the new entities each have their own candidate keys. Again, the former artificial primary key attribute is retained because many legacy applications still use it. The demand for compatibility has by now given rise to a complicated constraint across five entities to ensure unique values of the artificial attribute.

5. Extensions in 1996

[Figure 34 diagram: the CS of 1996, adding SERVICE-RANGE with its SERVICE-RANGE-PERIOD and SERVICE-ASSORTMENT-PERIOD entities, a separate TOWN entity, and the returned CLASS entity; changes (O), (P) and (Q) are marked.]

Figure 34. Extensions in 1996

Change drivers
After the numerous changes of the previous years, a relative quiet sets in, as the information structure of the UoD does not change much. There is a clear focus on core business, which makes it more important, and easier, to record the specifics of that business:
- service ranges are differentiated. Small outlets offer only the basic services, a standard range is available in medium-sized units, and the full service range is offered only in major outlets located in large cities and shopping centers.

Changes in the CS
The general CS layout remains unchanged, and we only see minor CS changes affecting the OUTLET-PERIOD entity. One change is justified:
(O) the CS is extended to model SERVICE-RANGE and its associated dependent entities.
One change could not be associated with any definite UoD change driver:
(P) the address was, and still is, recorded in OUTLET-PERIOD. Nevertheless, TOWN becomes a separate entity. Notice how TOWN is a variant of the earlier DOMAIN entity.
A third CS change corrects an apparently premature elimination in the previous CS version:
(Q) the CLASS entity returns to the CS.


Measurements

change                               justified  size  compat.  ext.  complex.  susc.ent  susc.ref  identity
(O) SERVICE-RANGE inserted           yes        6     all      yes   0         1         0         all
(P) TOWN moved out into new entity   no         4     all      yes   0         1         0         all
(Q) CLASS entity returns             no         4     all      yes   0         1         0         all

6. Market Orientation in 1998

[Figure 35 diagram: the CS of 1998, adding the market-oriented FORMULA and REGION managerial units with their ...-PERIOD entities and directs references; change (R) is marked.]

Figure 35. Market orientation in 1998

Change drivers
The organization shifts towards a more market-oriented way of operating, and there is an awareness that large outlets require a different management approach than small ones. Therefore, an additional management hierarchy is introduced:
- new market-oriented managerial units are created that will take over the tasks and responsibilities of zones and areas.

Changes in the CS
The single change in the CS is justified by the UoD change, but notice how the abstract UNIT entity of the earlier CS versions could have absorbed the change with no impact:
(R) new managerial units called REGION and (marketing) FORMULA are inserted into the CS.

Measurements

change                            justified  size  compat.  ext.  complex.  susc.ent  susc.ref  identity
(R) FORMULA and REGION inserted   yes        5     all      yes   0         1         0         all


5.3.4 Highlights of the case

Having described the long-term evolution of this single CS in considerable detail, we now look at the overall picture as the CS evolves over time.

1. Abstract versus concrete way of modeling
The initial CS uses a single abstracted entity called UNIT. While the goal of an abstract way of modeling is to make a CS less prone to change, it is clear that this goal has not been achieved. Through a number of change steps, the UNIT entity gradually dissociates to the point where each type of unit is modeled by a separate entity. The single reflexive reference of UNIT is differentiated and embellished in conjunction with the changes in the entity. At first, it only captured the managed-by reference. In the 1993 version of the CS, there are two more types of reference between units: run-by and recorded-at. The 1998 version inserted directs as a fourth reference type.

2. Temporal data
The 1986 CS version introduced state-awareness and temporal data. One might hold that the designer of the 1983 schema did a bad job in neglecting the temporal aspect. However, the CS design was approved by management and it was successfully used for several years. In this respect, the initial CS version has proven its quality by way of its fitness for use.

The previous reference introduced by change (M) adds a new kind of life cycle. The change driver is that marketers have a different view of the business, and hence of outlet life cycles, than operational managers. For instance, an outlet may be closed down, and another outlet started some time later at a different location. The marketing view is that the new outlet is a successor to the earlier outlet. This brings out that even something as simple as historical data is in practice not straightforward, but depends on subjective views and business needs.


5.4 CASE STUDY: FRANCHISER COMPENSATION

5.4.1 Introduction

The case outlines the evolution of a CS in the last decade of the 20th century. The remarkable feature in this case is the mix of source and derived data, and how that mix changes over time. The initial CS of the case study was carried by a legacy mainframe information system. As most of its original documentation has been lost over time, we reconstructed the CS versions from available prints of record layouts.

[Figure 36 timeline: CS versions of 1990, 1991, 1993, 1994, 1997 and 2000, with the dominant business changes: minor changes, redesign, improvements, extensions, and exception handling.]

Figure 36. CS versions and dominant business changes

The information system was downsized in 1993, with new documentation providing good descriptions of the system's technical composition and software components. Regrettably, there is no clear, complete, and uncontested description of the new CS. Several External Schemas were described in the design documentation, but our integration efforts disclosed major inconsistencies and omissions. We therefore reverse-engineered the CS versions from 1993 onwards using documentation of database structure and contents.

5.4.2 Business background

1. The Company
This business case has been obtained from the same company as the Sites case, i.e., Postkantoren B.V. All post office outlets are centrally managed, but only the larger ones are owned and operated by the company itself. Some 800 small outlets are run by independent franchisers who combine the post office business with their regular trades, such as supermarkets, tobacconists, or stationery shops.

Four Case Studies


2. Business functions
The business function supported by this CS is the administration of the commercial relationship with the local franchisers. The Franchise Compensation system keeps track of transaction volumes at each franchised outlet. It calculates monthly dues made up of three types of remuneration:
- fixed fees to pay for rent, electricity, care of equipment, office janitor, etc.,
- compensations for a variety of costs such as shop alterations, travelling expenses to attend meetings, personnel training, etc. The franchiser reclaims these costs using a standard form, and the claim is paid out in one or more monthly installments, and
- variable fees depending on the amount and value of conducted business.
A fourth type is fee reduction, which is deducted from the franchiser's earnings. It applies if a franchiser has to close shop for vacation, illness, or any other reason. A Post Office employee may substitute for the franchiser to prevent shop closure and interruption of service to the public, and the incurred costs are subtracted from the monthly earnings.
Later CS versions also support the planning and control business function. This drives the need to record the annual targets that are negotiated per outlet. Fixed fees for the upcoming year are based on these targets.

3. Daily operations
Transaction records of business conducted at each franchiser outlet were submitted to a local back office center for further processing. The local back offices tallied the transactions on special summary sheets for data entry; before 1993, this was all done on paper. The summary sheets were forwarded to a national data entry center that processed the sheets and produced a single magnetic tape per month containing all summary transaction data. Finally, this tape was input for the mainframe Franchise Compensation system. The first two CS versions in our case study concern this system, which dates back to the late seventies. It used Cobol as its programming language, and a legacy file management system.
When modern technology became available in the early nineties, the system was downsized and the error-prone paperwork process was abandoned. As manual data acquisition was replaced by electronic data input, processing became faster and more reliable. All outlet transactions were recorded in detail and transmitted by file transfer to the back offices for further processing. The back offices used a downsized franchise information system operating on PCs, with a Dbase-like database management system and programming language. From the 1993 version onwards, the case study describes the CS versions at the core of this PC-based information system.


5.4.3 The CS evolution: 1990 - 2000

1. Initial CS in 1990

[Figure 37 shows the initial 1990 CS: OUTLET, FRANCHISE, the TARIFF entities (claims, fixed, absence, variable), the RATE entities (fixed, absence, variable), FIXED-AMOUNT, TYPE-OF-ARTICLE, TYPE-OF-TRANSACTION, FEE-REDUCTION, WORK-PERIOD, INSTALLMENT, ABSENCE, TRANSACTION-SUMMARY, and REPLACEMENT.]
Figure 37. Initial CS in 1990

An OUTLET is run as a single franchise at a time. To account for changes of franchiser over time, the FRANCHISE↑OUTLET reference has N:1 cardinality. The franchisers use a paper form to tally business transactions per WORK-PERIOD (i.e., morning, afternoon, and evening). These forms are submitted once a week for processing. Claims for compensation and formal requests for leave of absence go by the same mail.
The diagram shows four types of remuneration (TARIFFs and RATEs) as has been explained. TYPE-OF-TRANSACTION provides an intermediate level of aggregation. TYPE-OF-ARTICLE records the kinds of physical evidence involved in the transactions, such as stamps, bus tickets, or lottery tickets.
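To make the N:1 cardinality concrete, the FRANCHISE↑OUTLET reference can be sketched in relational form. This is a minimal sketch only: the table and column names are our own illustration, not taken from the case system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outlet (
    outlet_id INTEGER PRIMARY KEY,
    location  TEXT NOT NULL
);
-- N:1 reference: many FRANCHISE rows may point to one OUTLET over time,
-- so a change of franchiser needs no change to the outlet itself.
CREATE TABLE franchise (
    franchise_id INTEGER PRIMARY KEY,
    outlet_id    INTEGER NOT NULL REFERENCES outlet(outlet_id),
    start_date   TEXT NOT NULL,
    end_date     TEXT
);
""")
conn.execute("INSERT INTO outlet VALUES (1, 'Velsen')")
conn.execute("INSERT INTO franchise VALUES (10, 1, '1988-01-01', '1990-06-30')")
conn.execute("INSERT INTO franchise VALUES (11, 1, '1990-07-01', NULL)")

# two successive franchises for the same outlet
n, = conn.execute("SELECT COUNT(*) FROM franchise WHERE outlet_id = 1").fetchone()
print(n)  # 2
```

The open-ended end_date marks the currently running franchise, which is one common way to realize the cardinality in a stored schema.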


2. Minor changes in 1991

[Figure 38 shows the 1991 CS; change markers (A) through (F) indicate the promoted ABSENCE entity, the coarser TRANSACTION-SUMMARY, the new TAX-CLASS and OPENING-HOURS entities, the eliminated REPLACEMENT entity, and the redirected FIXED-AMOUNT reference.]
Figure 38. Minor changes in 1991

Change drivers
There are minor developments in the UoD:
- users no longer want all the detailed data on transactions per work period. This is a simplification of user perception, rather than a material change in the UoD.
- some additional information concerning outlets and franchises is required.

Changes in the CS
The changing perception of the UoD is reflected by intricate changes in the CS:
(A) the generalized notion of WORK-PERIOD is eliminated. Its specialization ABSENCE is retained and promoted to a full entity. The associated references are retained as well. The TRANSACTION-SUMMARY↑WORK-PERIOD reference is redirected to the owner entity OUTLET as well as duplicated as an optional reference to ABSENCE.
(B) the granularity of TRANSACTION-SUMMARY decreases from summaries per work period to aggregates per month. Its candidate keys are simplified accordingly.
(C) a new entity TAX-CLASS records how each FRANCHISE is classified by the IRS.
(D) attributes recording the opening hours of each OUTLET are moved into a new entity.
Two changes could not be associated to ongoing developments in the UoD:


(E) in conjunction with WORK-PERIOD, the REPLACEMENT entity is also eliminated, and a compulsory 1:1 reference replaces the former WORK-PERIOD↑FEE-REDUCTION reference.
(F) the FIXED-AMOUNT↑FRANCHISE reference is redirected to its owner OUTLET.

Measurements (metrics span the environment, timeliness, and adaptability dimensions)

change                                   justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(A) ABSENCE promotes to entity           yes        5     all         no         +1          2                3               no
(B) decrease granularity of TRANS.-SUM.  yes        3     all         no         0           1                2               no
(C) add new entity TAX-CLASS             yes        1     all         yes        0           0                0               all
(D) add entity OPENING-HOURS             yes        1     all         yes        0           0                0               all
(E) drop entity REPLACEMENT              no         3     all         no         0           1                2               all
(F) redirect FIXED-AMOUNT reference      no         1     all         no         0           0                1               all
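Change (B), the decrease in granularity of TRANSACTION-SUMMARY, is in essence an aggregation of stored instances: the candidate key shrinks from (outlet, date, work period) to (outlet, month). The sketch below illustrates the idea; the record layout and amounts are hypothetical, not taken from the case system.

```python
from collections import defaultdict

# Hypothetical pre-1991 summaries: one row per outlet, date, and work period.
work_period_summaries = [
    ("outlet-1", "1991-03-04", "morning",   120.0),
    ("outlet-1", "1991-03-04", "afternoon",  80.0),
    ("outlet-1", "1991-03-18", "evening",    40.0),
    ("outlet-2", "1991-03-05", "morning",    55.0),
]

def aggregate_per_month(rows):
    """Coerce work-period granularity to monthly granularity,
    mirroring change (B): the work-period component of the key is dropped."""
    totals = defaultdict(float)
    for outlet, date, _period, amount in rows:
        month = date[:7]                     # 'YYYY-MM'
        totals[(outlet, month)] += amount
    return dict(totals)

monthly = aggregate_per_month(work_period_summaries)
print(monthly[("outlet-1", "1991-03")])  # 240.0
```

Such a coercion is lossy by design: once the data is aggregated, the per-period detail cannot be recovered, which is exactly why the change reflects a simplification of user perception.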


3. Redesign in 1993

[Figure 39 shows the redesigned 1993 CS: TRADE, ZONE, and ADM.UNIT superimposed on OUTLET; FRANCHISER, FRANCHISE-CONTRACT, and OPENING-HOURS; TARIFF, RATE, VAT-RATE, and BOUNDS; MONTHLY-INVOICE, FIXED-AMOUNT, FIXED-FEE, VARIABLE-FEE, CLAIM, INSTALLMENT, and ABSENCE; TYPE-OF-TRANSACTION; and TRANSACTION-SUMMARY over TRANSACTION with its ACCEPTED and REJECTED specializations. Change markers (G) through (S) indicate the changes discussed below.]
Figure 39. Redesign in 1993

Change drivers
The 1993 redesign effort was undertaken in order to replace the legacy mainframe system. Old ways of conducting business were innovated, which also presented an opportunity to review the old perception of the UoD and real-world semantics:
- electronic file transfer, with disk transfer as backup mode, replaces the paper-based mode of transaction reporting. Franchisers still have to use paper forms to request a leave of absence or to claim for compensation.
- a long-standing need is met to distinguish between the original claims put in by the franchisers and the subsequent installments paid out to them. This also permits users to remunerate large claims piecemeal, and to handle claim corrections and refunds.
- the administration of franchiser data is improved, so that it now accounts for the fact that some franchisers operate multiple outlets (e.g., a chain of supermarkets).
- bookkeeping rules require that financial data is verified and stored for later retrieval.
- outlet classifications are synchronized with other information systems (particularly the Sites system) for better comparability of reports.


Changes in the CS
The redesign provided the engineers with an excellent opportunity to change the CS in many respects. The new CS differs considerably from the previous CS version, although the basic layout is still clearly discernible:
(G) TRANSACTION is introduced into the CS, reducing TRANSACTION-SUMMARY to derived-data entity status. To protect data quality, a software mechanism is provided that rejects all transactions not referring to known OUTLET or TYPE-OF-TRANSACTION instances.
(H) RATEs and TARIFFs generalize semantically similar entities.
(I) the CLAIM entity is introduced to provide better support for the claims processing paperwork. Data conversion was accomplished by simply assuming a single CLAIM instance for each INSTALLMENT instance in the database at conversion time.
(J) FRANCHISE is restructured. A set of attributes is moved out to represent FRANCHISER as an entity, subsuming the TAX-CLASS entity. We renamed the remaining part to FRANCHISE-CONTRACT for clarity.
(K) three entities are copied from the Sites system and superimposed on OUTLET.
(L) the legacy mainframe system for franchiser compensation produced monthly invoices, but it did not record the invoices as conceptual entities. MONTHLY-INVOICE is now inserted as a derived-data entity in the CS, subsuming the FEE-REDUCTION entity. Notice the reference to VAT-RATE, where a reference to TARIFF would have been more appropriate.
(M) FIXED-FEE and VARIABLE-FEE are inserted to capture monthly aggregates of data, in a way similar to FEE-REDUCTION. These entities are clearly derived data, inserted to facilitate data processing and to contribute to the safeguarding of financial data.
(N) certain variable fees are automatically checked to ascertain that the amounts fall within acceptable BOUNDS.
The engineers also took the opportunity to make more changes in the CS than what was justifiable by contemporary UoD changes:
(O) ABSENCE is restructured again. Although it is reduced to derived-entity status because absences are now recorded as special TRANSACTION instances, it is not eliminated, nor is the derived reference incorporated in the CS. The associated derived-data entity FEE-REDUCTION is eliminated, and ABSENCE inherits its reference to TARIFF. The reference to OUTLET is redirected again to FRANCHISE-CONTRACT, reversing change (A).
(P) the improved level of automation permitted a refinement of the TYPE-OF-TRANSACTION entity, increasing its granularity from a few dozen to hundreds of instances.
(Q) TYPE-OF-ARTICLE, while still a valid notion, disappears for no known reason.
(R) the TRANSACTION-SUMMARY↑TYPE-OF-TRANSACTION reference is redirected to TARIFF.
(S) VAT-RATE attributes, previously scattered among the TARIFF entities, are moved out and aggregated into one new entity.
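The data conversion for change (I) can be sketched as follows. The field names are illustrative; only the conversion rule itself — one CLAIM assumed per existing INSTALLMENT — is taken from the case.

```python
# Sketch of the change (I) conversion: at changeover, every existing
# INSTALLMENT is assumed to stem from exactly one CLAIM of the same amount.
# Field names are hypothetical, not taken from the case system.
installments = [
    {"installment_id": 1, "contract": "C-07", "amount": 150.0},
    {"installment_id": 2, "contract": "C-07", "amount": 90.0},
]

claims = []
for inst in installments:
    claim_id = len(claims) + 1
    claims.append({"claim_id": claim_id,
                   "contract": inst["contract"],
                   "claimed_amount": inst["amount"]})
    inst["claim_id"] = claim_id      # new reference INSTALLMENT -> CLAIM

print(len(claims), installments[0]["claim_id"])  # 2 1
```

The assumption is clearly a simplification: a claim that was in fact paid out in several installments ends up as several one-installment claims after conversion, but the new reference is total and the financial amounts are preserved.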


Measurements (metrics span the environment, timeliness, and adaptability dimensions)

change                                   justified  size    compatible  extension  complexity  suscept. entity  suscept. refer  identity
(G) introduce TRANSACTION                yes        3 (+1)  all         yes        +1          2 (+1)           0               all
(H) generalize RATE and TARIFF entities  yes        10      all         no         +3          7                3               no (1)
(I) recognize CLAIM entity               yes        5       all         yes        0           1                2               no
(J) move attributes out to FRANCHISER    yes        3       all         yes        0           1                0               all
(K) superimpose three entities           yes        6       all         yes        0           1                0               all
(L) add MONTHLY-INVOICE entity           yes        3       all         yes        +1          0                0               all
(M) insert FIXED-FEE and VARIABLE-FEE    yes        6       all         yes        +2          0                0               all
(N) insert entity BOUNDS                 yes        3       all         yes        +1          0                0               all
(O) remodel ABSENCE                      no         5       all         no         -1          2                3               all
(P) refine TYPE-OF-TRANSACTION           no         1       all         yes        0           1                0               no (2)
(Q) eliminate TYPE-OF-ARTICLE            no         2       all         no         0           1                1               no (2)
(R) redirect TRANS.-SUM. reference       no         0       all         no         0           1                1               all
(S) move out VAT-RATE attributes         no         3       all         yes        0           1                0               all

(1) on the CS level, the generalizations require new key identities. On the instance level, identity was preserved, as every former instance maps to exactly one new instance.
(2) entity content was completely revised, and key identity was lost.

4. Improvements in 1994

[Figure 40 shows the 1994 CS, now extended with TRANSAC.-FORECAST, MONTHLY-ESTIMATE, and RATE-BRACKET; change markers (T) through (Y) indicate the changes discussed below.]
Figure 40. Improvements in 1994

Change drivers
Now that a new system is operational, users are triggered to demand new types of information. Most demands can be met on the level of the External Schema by defining a new report or user application, having no impact on the CS. Only two changes concern the UoD information structure:
- the planning and control business function is to be supported, and
- the Post Office company sometimes decides to run a small outlet at its own expense, in order to prevent service interruption while a new franchiser contract is being negotiated.

Changes in the CS
Two changes are driven by information requirements in the UoD:
(T) the CS is extended with two entities to support forecasting of transaction volumes per OUTLET. The annual forecasts are broken down into monthly estimates. The ease with which the planning and control function is meshed into the current CS may well be considered a demonstration of CS adaptability.
(U) the FRANCHISE-CONTRACT↑FRANCHISER reference is relaxed to an optional reference.

Several CS changes could not be traced back to contemporary changes in the UoD. Apparently, maintenance engineers felt it necessary to improve the match of the CS with the perceived information structure of the UoD:
(V) the FIXED-AMOUNT↑OUTLET reference is redirected to FRANCHISE-CONTRACT, thus reversing change (F).
(W) the references from FIXED-FEE and VARIABLE-FEE to FRANCHISE-CONTRACT are redirected to the MONTHLY-INVOICE entity.
(X) a reference INSTALLMENT↑MONTHLY-INVOICE records when a certain installment is paid. This change meets the demand about safeguarding financial data; it also reduces the CLAIM↑FRANCHISE-CONTRACT reference to derived-data status. Moreover, the change is untimely: this reference ought to have been incorporated in the previous CS version.
(Y) flat rates are most common, but some rates have two or more rate brackets that the previous CS did not provide for. This is now remedied by appending the RATE-BRACKET entity.

Measurements (metrics span the environment, timeliness, and adaptability dimensions)

change                                    justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(T) append two entities                   yes        4     all         yes        0           0                0               all
(U) relax a compulsory reference          yes        1     all         yes        0           0                1               all
(V) redirect reference to member entity   no         1     all         no         0           0                1               all
(W) redirect references to member entity  no         2     all         no         0           0                2               all
(X) insert a new reference                no         2     all         yes        +1          2                0               all
(Y) append entity RATE-BRACKET            no         2     all         yes        0           0                0               all

5. Extensions in 1997

[Figure 41 shows the 1997 CS, with the new specializations ANNUAL of MONTHLY-INVOICE and SPECIAL of TYPE-OF-TRANSACTION, and the new MARGIN entity; change markers (Z), (1), and (2) indicate the changes discussed below.]
Figure 41. Extensions in 1997

Change drivers
Over the next couple of years, some minor changes in the UoD emerged that were easy to accommodate in the CS:
- to save costs, the expensive front-desk transaction equipment is taken away from small outlets. Once again, the franchisers must record business transactions on paper forms, with data entry taking place at the local back offices. This way of processing is very similar to the situation before 1993.
- to reduce the level of bureaucracy, special front desks are introduced. These desks do away with compulsory on-line transaction recording of bulk business, using summary-transaction sheets instead. Here again, the business process reverts to a paper-based way of working much like the situation before 1993.
- in addition to monthly invoices, system support is demanded for the annual finalizing of accounts.

Changes in the CS
The changes driven by the UoD result in two new specializations in the CS:
(Z) SPECIAL is introduced in TYPE-OF-TRANSACTION to account for the two types of transactions that are not inputted via the regular data input files.


(1) ANNUAL extends the semantics of MONTHLY-INVOICE. Henceforth, invoices are either per calendar month or per calendar year, invoices of the latter type being identified by the month number 13.
One CS change is not associated with any definite UoD change driver:
(2) a new MARGIN entity specifies how to calculate some, but not all, bounds per tariff. BOUNDS is thus partially reduced to derived-data entity status.

Measurements (metrics span the environment, timeliness, and adaptability dimensions)

change                                  justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(Z) create specialization SPECIAL       yes        1     all         yes        0           (+1)             0               all
(1) create specialization ANNUAL        yes        1     all         yes        0           (+1)             0               all
(2) superimpose optional entity MARGIN  no         2     all         yes        0           1                0               all

6. Exception handling in 2000

[Figure 42 shows the 2000 CS, with a new reflexive previous reference on OUTLET and the new FINALIZED specialization of INSTALLMENT; change markers (3) and (4) indicate the changes discussed below.]
Figure 42. Exception handling in 2000

Change drivers
Only one material change in the UoD is known to develop in three years:
- sales incentives for some products are offered to franchisers in an attempt to push up transaction volumes.

Changes in the CS
The CS appears to be flexible enough to accommodate the UoD change. The incentives are processed by the system like regular claims, and no CS change is required. Still, two CS changes could not be associated with contemporary changes in the UoD:
(3) a previous reflexive reference is imposed on the OUTLET entity. When a new outlet opens up, this reference makes it possible to link it to its predecessor or predecessors.
(4) a specialization FINALIZED is introduced in the INSTALLMENT entity. It corresponds to the ANNUAL specialization in MONTHLY-INVOICE. The finalized installments are calculated annually when accounts are settled and the books are closed. Notice that the compulsory reference to CLAIM is relaxed. This is because all other installment instances refer to a claim submitted by a franchiser, but these finalized ones do not.


Measurements (metrics span the environment, timeliness, and adaptability dimensions)

change                                   justified  size  compatible  extension  complexity  suscept. entity  suscept. refer  identity
(3) add a reflexive reference on OUTLET  no         1     all         yes        +1          1                0               all
(4) add a specialization FINALIZED       no         1     all         yes        0           (+1)             1               all


5.4.4 Highlights of the case

1. Backward and forward integration in the information chain
There is an abundance of derived data in the various CS versions, which is not surprising in view of the business functions being supported. It is even more interesting to see how the status of being source or derived data changes over time.
TRANSACTION-SUMMARY in the initial CS version was source data. Underlying details about the individual transactions were clearly beyond the scope of the UoD. However, advances in technology shifted the UoD boundary and brought full transaction details within the grasp of the CS. Change (G) then reduces TRANSACTION-SUMMARY and ABSENCE to derived-data status. Remarkably, this is reversed in change (Z). New ways of doing business are introduced that do not rely on automated systems, so that details of the new transactions are once again beyond reach of the UoD. Hence, the level of integration across the information chain decreases.
Change (L) is an example of forward integration. MONTHLY-INVOICE is clearly derived data, but its introduction into the CS is justified by the demand to safeguard financial data.

2. Structural similarities as a potential for generalization
In the initial CS versions, it is immediately obvious that the four TARIFF entities and the three RATE entities are candidates for generalization. Indeed, this change materialized in the 1993 CS version. Still, the CS contains more candidate entities for generalization. Looking for similarities in entity structures and references in the CS, we noticed how CLAIM and ABSENCE refer to FRANCHISE-CONTRACT and TARIFF, while FIXED-FEE and VARIABLE-FEE refer to MONTHLY-INVOICE, and TARIFF again. We believe that these similarities indicate a potential for generalization.
Our research is not intended to criticize an operational CS or to improve on it, but we do believe that these entities should have been united into one generalized INVOICE-ITEM entity, which would have resulted in a considerable uncluttering of the CS. Moreover, we feel that the generalization would have contributed to graceful CS evolution. Maintenance engineers might have detected the missing INSTALLMENT↑MONTHLY-INVOICE reference sooner, had the CS been simpler. The reference is now inserted belatedly in change (X).


CHAPTER 6. SHORTTERM PATTERNS & PRACTICES

'Because, you know, in the art of grammar the pupil first amasses nouns, verbs and other components of speech and then learns to link them together with congruity. So it is in the art of dancing: you must first learn a variety of separate movements and then (..), together with the tabular arrangements of movements, you will grasp it all'
From 'Orchesographie', [Thoinot Arbeau 1589] (in translation by Evans, 1967, p.84)

6.1 INTRODUCTION

6.1.1 The shortterm view
This chapter looks at shortterm changes in the CS evolutions described in the four case studies of the previous chapter. We abstract know-how from maintenance experience, and make that knowledge available to data administrators and researchers by extracting some of the patterns and best practices that are used to change an operational CS. Our findings include a comprehensive change process that both enterprises seem to follow to manage change in the CS and to ease its evolution. Another research finding is the notion of semantic change pattern observed in the cases. To our knowledge, this is the first attempt to systematically study changes that occur in operational schemas and to learn from them.

6.1.2 Chapter outline
Section 2 describes a standard process of change that was followed in all of our case studies. The four main stages of this change process are:
1. establish the need for change,
2. analyze the semantics of change, and draw up the change specification,
3. transform the CS to accommodate the change at the structural level, and
4. coerce the stored data, and adjust other components of the operational information system.
Section 3 describes some common practices for coordinating changes at the CS level with conversions at the level of the stored data. From our case studies, we noticed that many CS modifications are modeled on some underlying pattern of change. Section 4 discusses the notion of semantic change pattern and its primary benefits. Appendix B provides a first catalogue of semantic change patterns that we isolated by inspecting the 73 semantic changes in our case studies. A preliminary version of this section was published in [Wedemeijer'99nov]. The chapter is concluded by a summary.


6.2 THE CHANGE PROCESS

[Figure 43 shows the four-stage process: establish the need to change; specify the semantics of change; then, in coordination, transform the CS and coerce the data.]
Figure 43. Four-stage process of changing the CS

6.2.1 Establish the need to change
The first step in the process is to establish whether there is a need for change, and if so, why. First principles say that the CS ought to change only if the information structure of the UoD changes. Literature has it that major business events like mergers, acquisitions and diversifications drive the need for change in the CS. [McLeod'88] suggests that change is called for 'when the real world that the database models evolves, when users' views thereof evolve, and when new patterns of usage are encountered' (p.219). Our case studies bring out that this view may be too limited. Other events, such as downsizing to another database platform, or a general change in modeling guidelines, may present legitimate causes for engineers to restructure the CS.
The UoD can only be informally known, as a common understanding of what is out there. Similarly, the notion of need to change defies formalization and cannot be caught in a general definition. Nevertheless, it can be managed and controlled to some extent. The enterprises in our case studies employed stock lists of problem reports and change requests to keep track of undecided change issues. Topics on the list are periodically reviewed for relevance, importance and urgency. In this way, topics may take from a couple of days to several years to mature to the top of the list, and some topics never get there at all. Top priorities on the list are submitted to management for their acknowledgement of the need to change the CS and for approval of business resources expenditure, weighing it against other pressing business needs [Broek, Walstra, Westein'94], [Orna'99].
In this first and informal stage, intangibles like organizational politics and intuitions about fit or discrepancy between UoD and CS play an important role. This stage of the change process terminates when management formally recognizes the need for change, and gives permission to proceed to the next stage in the change process.


6.2.2 Specify the semantics of change
The second stage is analysis and specification. Once the need for change is agreed, it must be made clear what has to change, moving forward from a generally stated and vague need for change to a precise specification of semantics. Although the change driver may be either internal or external to the enterprise, this second stage of the change process is always internally triggered by an explicit management decision.
User requirements are collected and analyzed. Some user requirements may be found conflicting, and such conflicts must be resolved before requirements can be finalized. Change requirements concern the new CS, but can also pertain to the change effort itself, as suggested by [Proper, Weide'93]: 'that changes to the structure can be made on-line' (p.347).
Although the 3-Schema Architecture distinguishes the structural level of the CS from the level of data instances, it is impossible to change one in complete isolation from the other. This stage must define the overall semantics of change in such a way that the correlation between the two levels is safeguarded [Ewald, Orlowska'93]. In particular, retroactive data is problematic in many operational environments: a real-world event occurs before the CS change, but the corresponding transactions to store the appropriate data take place afterwards. For instance, we paid for a book in Deutsche Mark on the last day of 2001, but the payment was recorded some days later in Euro. The engineer must consider this problem when deciding on a change strategy.
Finally, the engineer must determine what other components of the information system will be impacted by the change. Although automated design tools may provide some support in this area, performing the analysis, and detecting and resolving conflicting interests, will remain the prerogative of the maintenance engineer [Reiner'91]. This step is complete if the new CS requirements are clearly defined and agreed upon by all concerned, and if the preconditions and strategy for the changeover are outlined.
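The book payment example can be made concrete. The sketch below assumes the official conversion rate fixed at the changeover (1 EUR = 1.95583 DEM); the function and its parameters are our own illustration of how retroactive transactions might be coerced.

```python
EUR_PER_DEM = 1 / 1.95583  # official fixed rate at the euro changeover

def record_payment(amount, currency, booking_date, changeover="2002-01-01"):
    """Coerce a retroactive transaction: an event that occurred in DEM
    but is booked after the changeover is stored in EUR."""
    if currency == "DEM" and booking_date >= changeover:
        return round(amount * EUR_PER_DEM, 2), "EUR"
    return amount, currency

# paid 39.90 DEM on 2001-12-31, but recorded a few days later
amount, currency = record_payment(39.90, "DEM", "2002-01-04")
print(amount, currency)  # 20.4 EUR
```

The point of the sketch is that the coercion rule belongs to the change specification: without an agreed rule for retroactive events, the stored data and the new CS would drift apart at the changeover.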

6.2.3 Transform the CS In this third stage, the engineer defines how the CS is to be changed, by viewing the CS as a mere composition of constructs and constructions. The maintenance engineer has to design both the new CS that satisfies new user requirements, and a way to transform the old CS into the new one. Complex semantic changes are analyzed and broken down into elementary changes according to the taxonomy. This can be very complex, because a single new requirement can involve many constructs of the CS, so that one is hardly able to assess the full impact of change on the CS [Kahn, Filer'00]. It is the responsibility of the maintenance engineer to devise a sequence of elementary schema transformations in such a way that the preconditions and strategy for the changeover are met [Shneiderman, Thomas'82], [Liu, Chryssanthis, Chang'94]. This step is complete if a new CS meeting the new user requirements is implemented in the running business environment, and the changeover has been successfully coordinated with changes of data instances and the changes in any other components of the information system.
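As an illustration of such a breakdown, consider a compound change like moving the opening-hours attributes out of OUTLET (compare change (D) in the Franchiser Compensation case). The step names below are our own illustration, not the taxonomy of the thesis.

```python
# A compound semantic change broken down into an ordered sequence of
# elementary schema transformations. Order matters: the data must be
# copied before the old attribute is dropped.
elementary_steps = [
    ("add_entity",     "OPENING-HOURS"),
    ("add_reference",  "OPENING-HOURS -> OUTLET"),
    ("add_attribute",  "OPENING-HOURS.hours"),
    ("copy_data",      "OUTLET.hours => OPENING-HOURS.hours"),
    ("drop_attribute", "OUTLET.hours"),
]

log = [f"{op}: {target}" for op, target in elementary_steps]
print(len(log), "steps; last:", log[-1])
```

Devising such a sequence so that every intermediate schema remains valid, and the changeover preconditions hold at each step, is exactly the responsibility that this stage places on the maintenance engineer.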


6.2.4 Coerce the data
Whenever a CS changes, the components of the operational information system must change accordingly, in a well-coordinated fashion. Chapter 2 explained how the strict typing paradigm of relational data model theory demands that all stored data extensions comply with their intension at all times. Data coercion (figure 44) is mandatory whenever an operational CS is changed, and this need for data coercion is perhaps the most important difference between change in the operational life cycle phase and CS change in the design phase.

[Figure 44 shows the Conceptual Schema and Internal Schema with their data store before and after the change; transforming the CS is coordinated with coercing the data.]
Figure 44. Data coercion coordinated with change in the operational CS

The business need is to minimize the disruption of information system services and to avoid operational problems in data handling. A prearranged strategy ought to be followed to achieve these goals. It is often assumed that CS change must always precede changes in the data. However, precedence need not always apply, as will be illustrated in the next section by several examples from the case studies. The case studies show a variety of solutions for coordinating changes at the structural level of the CS with changes in the stored data, solutions that can and will affect the way in which the CS is actually being changed.
This stage in the process is completed once the new CS is populated by the new data, and all other information system components are adapted to operate in compliance with the new CS.
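A minimal sketch of one coordination strategy, using SQLite; the table names are our own assumptions, not taken from the case systems. The new structure is created, the data is coerced into it, the instance counts are verified, and only then does the old structure disappear — all within one unit of work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary_old (outlet TEXT, period TEXT, amount REAL)")
conn.executemany("INSERT INTO summary_old VALUES (?,?,?)",
                 [("o1", "1993-01", 10.0), ("o1", "1993-02", 12.5)])

with conn:  # one transaction: transform the CS and coerce the data together
    conn.execute("CREATE TABLE summary_new "
                 "(outlet TEXT, period TEXT, amount REAL, accepted INTEGER)")
    conn.execute("INSERT INTO summary_new "
                 "SELECT outlet, period, amount, 1 FROM summary_old")
    old, = conn.execute("SELECT COUNT(*) FROM summary_old").fetchone()
    new, = conn.execute("SELECT COUNT(*) FROM summary_new").fetchone()
    assert old == new, "coercion must not lose instances"
    conn.execute("DROP TABLE summary_old")

print(new)  # 2
```

If the verification fails, the whole unit of work is rolled back and the old schema and data remain in force, which is one way to keep extension and intension in step during the changeover.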


6.2.5 Business value of the change process

The change process, as we describe it, gives new insight into the interrelationships of major aspects of semantic change as it occurs in the operational business. To the best of our knowledge, an integral view of the way of working when the CS is adjusted to changed user requirements has not been described before. An understanding of this four-stage process is important to both practitioners and researchers, as we speculate that many organizations follow this change process, whether explicitly or implicitly. The four stages of the change process address all three dimensions of our framework for flexibility:
- the first and second stages focus on the scope and the essential features of the UoD, and concern the environment dimension of flexibility,
- the third stage, accommodating a structural change into the CS, concerns the adaptability dimension of flexibility,
- the fourth stage focuses on coordination between CS change and impact of change, which mainly concerns the timeliness dimension.
In our opinion, this integral view across all three dimensions is vital for understanding the stepwise evolution of any operational CS.

6.2.6 Validity and related work

The change process as we describe it emerged as the consistent way of working that was followed in all case studies. Basically, we inferred it in hindsight, by abstracting from the individual observations. We do not assume that the process of change proceeds in one uninterrupted flow. For one thing, delays are customary, for instance because of maintenance backlog. For another, not every semantic change will necessarily go through all four stages:
- the process can halt in the first stage, if management does not recognize the need for change, or considers it not urgent enough to allocate the requested time and resources,
- it can stop at the second stage, if the need for change can be met without changing the CS. Other information system components may be affected, but that is beyond our research,
- the process may skip the stage of changing the CS, as sometimes a change can be accommodated by storing data in a slightly different way. If done correctly, this is a demonstration of true flexibility on the part of the CS: the new information needs are accommodated, in a timely fashion, and with minimal schema adaptation.
As each of the four stages has been described before in the literature, we conclude that our change process has external validity. For example:
- [Barua, Ravindran'96] and [Patel'99] address the first stage. Change drivers are identified, but how these change drivers are related to specific changes in the CS of operational information systems is not investigated,
- [McKenzie, Snodgrass'90], [Proper'97], [Batini, Di Battista'88], [Ewald, Orlowska'93] conduct research into schema evolution and taxonomies. The main focus is on semantics of change, on the theoretical foundations and potential changes in the CS constructs and constructions, while other aspects of change are excluded,

- [Aiken, Muntz, Richards'94], [Meier'95], [Kudrass, Lehmbach, Buchmann'96], [Li, Looijen'99] report operational experiences in database evolution and data management, and
- [López, Olivé'00], [Karahasanovic, Sjøberg'01] develop methods and tools to support the coordination of change and the graceful evolution of the CS. These studies ignore the important stages of understanding the need for, and specifying the semantics of, change.

6.2.7 Discussion

Remarkably, while all separate stages have been described in the literature, most approaches are restricted to only one or two stages. The integral view of CS evolution appears to be lacking in the literature. A typical example is [Peters, Tamer Öszu'97], who argue that 'there are two fundamental problems to consider: Semantics of change: this refers to the effects of the schema change on the overall way in which the system organizes information (i.e., the effects on the schema), and Change propagation: this refers to the method of propagating the schema change to the underlying objects (i.e., to the existing instances)' (p.78). We think that it is an important advantage of the change process that it unites previously unrelated viewpoints into an integral view:
- the business angle, where the need for change arises,
- the systems engineering angle, which accommodates structural change in the CS, and
- the angle of daily operations, which must account for the impact of change on stored data and other information system components.
From this integral perspective, it can also be understood why structural improvements in the CS are not always implemented forthwith. A business may be aware of design flaws or redundancy in the CS, but this alone does not establish sufficient cause to trigger the tedious process of change. The change process as we describe it bridges the gap between approaches to information change that are either business-oriented or data-centric. It outlines how enterprises may succeed in evolving a CS in the most favorable way by combining negotiated compromises with the user community on the one hand with good maintenance practices on the other.


6.3 BUSINESS PRACTICES IN CHANGE COORDINATION

6.3.1 Avoid immediate data conversion

We mentioned how the strict typing paradigm of relational data model theory demands that data extensions comply with their intent at all times. Hence, it is natural to expect each CS change to be followed by a data conversion effort in order to ensure compliance [Peters, Tamer Öszu'97]. However, we observed several work-arounds in the case studies that coordinated data conversion with the corresponding CS changes in such a way that instantaneous conversion of data was avoided. This is also why the previous section argued that data conversion is coordinated with, but not always preceded by, change in the CS.

1. Defer data conversion
This kind of work-around involves redundant constructions in the CS. One conceptual construction is the new structure that models the UoD in the best possible way. Another structure stores a legacy of old data. Only when an instance stored in the old format needs to be updated is it converted into the new format. In this way, data is slowly migrated and the legacy of unconverted data gradually depleted. [Meier, Dippold, Mercerat, Muriset, Untersinger, Eckerlin, Ferrara'94] describe a similar situation involving two separate database environments. Our examples are simpler in that only a single database environment is involved:
- change (F) of the Benefit case introduced two entities, PARTICIPATION-TRAIL-(NEW) and PARTICIPATION-TRAIL-(NEW)-FOR-BENEFIT, that would provide greater flexibility than the existing PARTICIPATION-TRAIL entities. The idea was to gradually migrate the data to the new entities, either manually or by a yet-to-be-built conversion routine. Interestingly, this strategy was aborted when it was decided to eliminate derived data from the CS.
- change (A) in the Settlement case introduced two derived-data entities in the database, SUM-OF-REDUCTIONS and SUM-OF-DIVIDED-BENEFITS. The corresponding source data were still recorded only on paper. Later, the CS was extended to also hold these source data, but there was no instantaneous data conversion from paperwork to database. Instead, old constructs were retained while source data instances were slowly added to the database whenever new settlements were recorded. Migration progressed slowly until all source data were on file. Changes (G) and (H) eliminated the derived-data entities.

2. Postpone elimination
Although an entity has lost its relevance for users, it may remain in the CS. The advantage of postponed elimination is that it prevents the need for sorting out and updating legacy software accessing the entity, thus taking the urgency out of the change request. Instead, users simply ignore all data pertaining to the entity, and the quality of operational data may drop to a low level. Letting data standards slip is a simple and effective way to handle the, perhaps temporary, reduction in information needs. Examples in the case studies are:
- the old-class-of reference in the Sites case is retained for five years until it is eliminated in change (E). The reference was already obsolete in the initial CS version!
- the OPENING-HOURS entity in the Franchiser case. The user community ignores the stored data, known to be unreliable. Still, the entity is not eliminated from the CS.
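The convert-on-update tactic behind deferred data conversion can be sketched as follows; the table names, the legacy text format, and the update routine are all invented for illustration and are much simpler than the case studies' actual schemas:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- legacy structure: period stored as one text field 'YYYY-YYYY'
    CREATE TABLE trail_old (person_id INTEGER PRIMARY KEY, period TEXT);
    -- new structure: explicit start and end years
    CREATE TABLE trail_new (person_id INTEGER PRIMARY KEY,
                            start_year INTEGER, end_year INTEGER);
    INSERT INTO trail_old VALUES (1, '1990-1995'), (2, '1993-1999');
""")

def update_end_year(person_id, end_year):
    """Convert-on-update: migrate the row out of the legacy table first."""
    row = con.execute("SELECT period FROM trail_old WHERE person_id=?",
                      (person_id,)).fetchone()
    if row:  # row still in legacy format: convert it, then delete the original
        start, end = (int(y) for y in row[0].split("-"))
        con.execute("INSERT INTO trail_new VALUES (?,?,?)",
                    (person_id, start, end))
        con.execute("DELETE FROM trail_old WHERE person_id=?", (person_id,))
    con.execute("UPDATE trail_new SET end_year=? WHERE person_id=?",
                (end_year, person_id))

update_end_year(1, 1996)  # person 1 is migrated and then updated
print(con.execute("SELECT * FROM trail_new").fetchall())          # person 1 only
print(con.execute("SELECT person_id FROM trail_old").fetchall())  # person 2 only
```

Rows that are never updated simply remain in the legacy structure until a batch routine, or their eventual update, depletes it.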


3. Prevent a cascade of changes due to weak-entity keys
Many databases use implementations based on weak-entity keys: the primary key of an owner entity is incorporated in the primary key of a member entity, thus hard-coding the corresponding reference into the stored data. In this way, the primary key of an owner entity high up in the CS hierarchy can be cascaded into many member entities. This implementation is fine, on two conditions: member-to-owner references are compulsory forever, and primary keys remain fixed forever. If either condition fails, then weak-entity keys must be adapted, which must be considered an update anomaly [Date'00] on the CS level. For instance, a weak-entity key prevents the corresponding reference from being relaxed to optional cardinality, and prevents the elimination of the owner entity. Thus, weak-entity keys enlarge the impact of change and increase the resistance to the accommodation of changes. Change (K) of the Benefit case underpins our argument of enlarged impact of change. It also illustrates a work-around that avoids the update anomaly on the level of the CS and prevents the otherwise massive data conversions. Before the change, POLICY used a weak-entity key based on its reference to PRODUCT; entities depending on POLICY also used weak-entity keys. Change (K) reified the POLICY↑PRODUCT reference into the PRODUCT-IN-CONTRACT entity, and these keys ought to have changed accordingly. The huge impact of change was avoided by providing PRODUCT with a new primary key, and by outfitting PRODUCT-IN-CONTRACT with an artificial attribute that resembled the former key of PRODUCT. All reference attributes in POLICY and all dependent entities could be left unchanged, now being based on the reference to PRODUCT-IN-CONTRACT instead of the former reference to PRODUCT.
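A generic sketch of the cascade problem (the product/policy/coverage hierarchy and all names are invented, deliberately simpler than the Benefit case's actual schema): because every member's primary key embeds its owner's key, changing one owner key would force a rewrite of every dependent row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    -- Weak-entity keys: each member's primary key embeds its owner's key,
    -- so the owner's key is hard-coded all the way down the hierarchy.
    CREATE TABLE product  (prod_no TEXT PRIMARY KEY);
    CREATE TABLE policy   (prod_no TEXT, pol_no INTEGER,
                           PRIMARY KEY (prod_no, pol_no),
                           FOREIGN KEY (prod_no) REFERENCES product);
    CREATE TABLE coverage (prod_no TEXT, pol_no INTEGER, cov_no INTEGER,
                           PRIMARY KEY (prod_no, pol_no, cov_no),
                           FOREIGN KEY (prod_no, pol_no) REFERENCES policy);
""")
con.execute("INSERT INTO product VALUES ('P1')")
con.execute("INSERT INTO policy VALUES ('P1', 1)")
con.execute("INSERT INTO coverage VALUES ('P1', 1, 1)")

# Renaming product 'P1' would touch every dependent row: the impact of a
# change high up in the hierarchy cascades through all weak-entity keys.
affected = sum(con.execute(f"SELECT count(*) FROM {t} WHERE prod_no='P1'")
               .fetchone()[0] for t in ("policy", "coverage"))
print(affected, "dependent rows would need rewriting")
```

With realistic data volumes the affected-row count makes the enlarged impact of change tangible; the work-around of change (K), a new surrogate key plus an artificial attribute, avoids exactly this rewrite.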

6.3.2 Change Successions

Analyzing the evolving CSs, we noticed the phenomenon of CS change successions, which we understand to be sets of coherent and premeditated changes in different versions of the CS. Some clear examples of change successions meeting a single semantic need for change are:
- trimming down the derived-data entity PARTICIPATION-TRAIL. The PARTICIPATION-TRAIL entity in the Benefit case is restructured first by changes (O) and (P) in the September'99 version, then by change (Q) in the August 2000 version. This change succession was motivated by the modeling guideline to eliminate all derived data from the CS. While PARTICIPATION-TRAIL was created as a derived-data entity, analysis brought out that some of the underlying source data were no longer available. These data, to be called the PARTICIPATION-TRAIL-AS-OF-1-1-'96 specialization, resisted elimination. It is a fine example of the 'survives' situation of derived data, to be discussed in chapter 7.
- subsuming three EXCHANGED-E.R.-BENEFIT entities by generalizations. Change (I) in the Benefit case introduced three generalized EXCHANGE entities that subsume the EXCHANGED-EARLY-RETIREMENT-BENEFIT specializations. The next CS version eliminated the three specializations in change (N). Of course, the information is still available, but the constructs are no longer separately shown on the level of the CS.
The decision to employ a succession of CS changes is motivated by internal and implementation considerations. By our definitions, a large and complicated CS change can be justified by a single UoD change. It is unjustifiable, however, if a single UoD change is accommodated by multiple semantic changes that change the CS first in one way, and then in another.


6.3.3 Change Reversal

It is natural to assume that a change to the CS is never revoked. If a CS is improved by making a certain change, how can the CS later be further improved by undoing that change? Nevertheless, some semantic changes were reversed in all four cases:
- in the Benefit case: changes (L) and (M) reverse changes (F) and (G), respectively. The previous changes intended to capture intermediate results of derivations in the CS. These changes were reversed as a result of the new modeling guideline to eliminate all derived data from the CS.
- in the Settlement case: changes (G) and (H) together reverse change (A). Change (A) introduced two aggregate entities. By a succession of changes, these entities were first reduced to derived-data status, and then eliminated. The introduction of the two aggregate entities was suboptimal, buying time to gradually evolve to a final CS in which the temporary entities have no place.
- in the Sites case: change (Q) reverses change (N). Apparently, the entity elimination in change (N) was premature and had to be corrected.
- in the Franchiser case: change (Z) reverses change (G), and (V) reverses (F). Change (Z) was motivated by a change in the way of doing business that deliberately counteracted the intentions of change (G). We have no clear explanation for the latter change reversal.

6.3.4 Discussion

The common denominator in all examples is graceful CS evolution, reflecting a conscious and careful approach to CS maintenance. There are no instantaneous and massive data conversions, no software applications go to waste overnight, and there is no sudden loss of user expertise. From the users' point of view, a CS change does not mean that stored data suddenly become invalid. Of course, some data may become irrelevant or incomplete under the new schema, but we never saw real-world data suddenly become inconsistent or invalid. The recorded data were, and still are, facts about the UoD, regardless of which CS the data is stored under. Accuracy and validity of a data value are determined by how that data item is defined and measured in the real world. After that, it is immaterial how the data item is modeled in the CS and stored in the database. Therefore, data coercion is not about data validity. It is about how data is structured and stored, and about the relevance of old data under the new requirements. We think the importance of coordinating CS changes with data conversions is generally underrated, and the consequences for CS evolution underestimated. The business need to minimize the impact of change on operational data can and will affect how engineers actually change the CS, and practical problems in change coordination are always solved in the business environment. In this respect, we feel that the handling of retroactive data is a litmus test for the capability of the enterprise to handle CS evolution.


6.4 SEMANTIC CHANGE PATTERNS

6.4.1 The notion of Semantic Change Pattern

Inspecting the case studies, we observed that many CS modifications are not new and innovative, but appear to be modeled on some basic 'model to be copied' whenever appropriate. A pattern, as commonly understood in information systems engineering, is an abstracted, generalized solution that solves a range of problems with shared characteristics. [Gamma, Helm, Johnson, Vlissides'95] have it that 'each pattern describes a problem (..), and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without doing it the same way twice' (p.2). We understand the notion of semantic change pattern to be any coherent set of the four stages of the process of changing the CS that can be accommodated in a single maintenance effort. Semantic change patterns provide the maintenance engineer with a proven way to resolve a need for change in the CS, whatever the reason for change: we do not require that semantic change patterns correspond to justified changes only. An easy example is the Append-an-Entity pattern. It extends the current CS with a new entity, with a new reference to a pre-existing one, and with new referential integrity constraints. It is immediately clear how the pattern will meet a user demand to record more detailed data. And while any taxonomy accounts for the three elementary changes, their combination to cover a specific semantic need is not often discussed.
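Rendered in relational terms (our own sketch; the thesis does not prescribe a concrete notation, and the `customer`/`phone` tables are invented), the three elementary changes of Append-an-Entity might look like this:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO customer VALUES (1, 'Smith')")

# Append-an-Entity combines three elementary changes in one step:
# (1) a new entity, (2) a new reference to a pre-existing entity,
# (3) a new referential integrity constraint.
con.execute("""CREATE TABLE phone (
                   phone_id INTEGER PRIMARY KEY,
                   number   TEXT NOT NULL,
                   cust_id  INTEGER NOT NULL REFERENCES customer(cust_id))""")

con.execute("INSERT INTO phone VALUES (10, '555-0100', 1)")  # valid reference
try:
    con.execute("INSERT INTO phone VALUES (11, '555-0101', 99)")  # no such owner
except sqlite3.IntegrityError:
    print("orphan row rejected by the new referential integrity constraint")
```

Because the new entity only extends the CS, existing data and applications are untouched, which is why this pattern meets a demand for more detailed data so cheaply.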

6.4.2 A catalogue of Semantic Change Patterns

Appendix B presents our current catalogue of semantic change patterns. We recognize a semantic change pattern if we clearly observe at least three occurrences in at least two of the four case studies. Of course, this is somewhat arbitrary, but definite guidelines for deciding whether a certain phenomenon is modeled on a pattern are non-existent. [Fernandez, Yuan'00] state that reasoning by analogy and generalization are the dominant means to detect patterns. [Coplien'97] simply declares a pattern to be any '"solution to a problem in a context" which is (..) definition of a pattern' (p.338). We describe the patterns in the catalogue according to the main stages of the change process:
- the general need for change, deriving from whatever source,
- the semantics of change, outlining typical user requirements for this pattern, and how the pattern meets these demands,
- the syntax of the pattern, i.e. how the change in the CS is composed from elementary schema changes; some patterns allow minor variations in composition,
- the impact of change on the data, and occasionally on data manipulation routines as well.
Our current catalogue captures a dozen patterns that cover 56 (77%) of the 73 observed semantic changes. It is bound to be an incomplete set; however, completeness was not our aim. We intend to outline realistic patterns of semantic change for the engineer to learn and copy from. The catalogue can and will be extended in the future, as ever more patterns are detected in operational environments and described by researchers and practitioners.


6.4.3 Practical benefits

The main benefit of semantic change patterns is in the operational phase, as the patterns cover all four stages of the change process:
- in deciding on the need to change, the business value of a pattern is to give insight into what may drive the CS change, into the effects it will have on the future quality of the CS, and into the impact it may have on the information system. This knowledge may guide the stakeholders in their decision-making and budget allocation.
- in determining the semantics of change, patterns are valuable for a number of reasons. The patterns help to decide on the correct scope of change, they provide a broader range of change alternatives than would otherwise be considered, and the proposed solutions are realistic and well understood.
- in accommodating a change in the CS, the transformation is eased by the use of the pattern, in one of its variants. The maintenance engineer can take the pattern, verify whether some special preconditions apply, and tailor the pattern to the case at hand.
- in accounting for the impact of change and data conversions, the pattern is valuable in indicating which data may be impacted, when to expect the impact, and what data conversion practices may be applied.
These properties help the engineer, when faced with a general desire for improvement, to select the most appropriate pattern, to justify that choice to stakeholders, and to execute the change in a correct way. There are potential benefits in other life cycle phases as well:
- semantic change patterns can be used in the design phase of the CS life cycle. A designer can experiment with various kinds of changes to the design proposal, and get a feeling for how the proposed CS will react to likely changes in requirements. This may uncover flaws in the design that the designer can correct by applying the pattern. This application of patterns amounts to a passive strategy resembling the 'future analysis' technique proposed by [Land'82].
- semantic change patterns can be used in the final phase of the life cycle, when the legacy CS is to be reverse-engineered. Knowing and understanding the patterns of change eases the recovery of the operational CS, as deviations from the documented CS version may be explained by a change fashioned on a semantic change pattern.

6.4.4 Using Semantic Change Patterns

The semantic change patterns have yet to be applied in real business cases, and therefore no practical proof of concept is currently available. However, we expect that semantic change patterns can and will assist engineering efforts in all phases of the CS life cycle. Of course, real changes in user requirements are rarely straight and simple, and we do not contend that every demand for change can be met by a single semantic change pattern. By composing partial solutions from the catalogue of patterns, the maintenance engineer can search for comprehensive solutions meeting the new user requirements in the best possible way. A compound CS change can often be achieved in more than one way, and it cannot be decided from the resulting CS how the maintenance engineer went about it. An example is the compound change (D) in the October'96 version of the Benefit case.


[Figure: before/after schemas involving the entities Participation, Participation Trail, Exchanged E.R. Benefit, E.R. Benefit level 2, and Benefit obtained by E.R. exchange]

Figure 45. Connect-by-Intersection followed by Append

Figure 45 depicts a sequence of, first, a Connect-by-Intersection pattern in its variant with two levels, followed by a simple Append-an-Entity. Figure 46 achieves the same compound change by starting with the variant of the Append pattern that appends two entities at once, followed by the Connect-by-Intersection pattern.

[Figure: the same before/after schemas, now appending the Exchanged E.R. Benefit and E.R. Benefit level 2 entities first]

Figure 46. Append followed by Connect-by-Intersection

6.4.5 Semantic Change Patterns are different

The notion of semantic change patterns is a novel approach in CS maintenance that differs from design strategies, whether active, passive or abstraction strategies. The patterns, abstractions of the practical knowledge and experience from business cases, are important as a means for understanding and improving current ways of working in CS evolution.

1. The difference from taxonomy
Formally, every CS change is decomposable into the elementary changes provided by the taxonomy. This does not imply, however, that semantic change patterns are superfluous.


[Brèche'96] remarks on the gap between 'schema changes by means of primitives closely related to the respective data model' (p.477) and 'advanced primitives', i.e. changes of a more semantic nature. We feel that our patterns represent actual changes in CSs in a far more useful way than taxonomies, because elementary syntactic changes are inadequate as a tool to meet changing demands in an operational environment. The shortcomings lie both in a lack of semantics and in ignoring the impact of change. To illustrate our point, several patterns in our catalogue insert an entity, e.g., Append-an-Entity, Connect-by-Intersection, and Superimpose-an-Entity. These patterns build upon the same elementary change of inserting an entity, but each is different. Awareness of the differences in semantics is all-important to maintenance engineers and data administrators engaged in CS evolution.

2. The difference from lossless schema transformations
Authors often assume that when a CS is changed, no data ought to be lost: 'Schema evolution is the process of allowing changes to schema without loss of data' [Goralwalla, Szafron, Tamer Öszu, Peters'98] (p.74). Some of our semantic change patterns coincide with lossless schema transformations as studied by [De Troyer'93]:
- Move-Attributes-Out, which she calls projection (p.143), and
- Promote-Specialization, called divide (p.157).
Nevertheless, semantic change patterns are not a special variant of lossless schema transformations, for two reasons. First, loss of data is acceptable in semantic changes; there are numerous examples in the case studies where data is purposefully deleted. Second, schema transformations cover only the last two stages of our change process, disregarding the business perspectives and user requirements.

3. The difference from design primitives
There is some similarity between semantic change patterns and the notion of design primitives or design patterns as found in relational database design [Batini, Ceri, Navathe'92] and object-orientation [Gamma, Helm, Johnson, Vlissides'95], [Fernandez, Yuan'00]. The Move-Attributes-Out pattern, for instance, is discussed as a top-down transformation primitive in normalization by [Atzeni, Ceri, Paraboschi, Torlone'99] (p.197), and as a pattern of entity decomposition by [Hartmann'00nov]. While design primitives propose static solutions for given design problems, our semantic change patterns are valuable in enabling the graceful evolution of operational CSs in accordance with changing user needs.

4. The difference from schema evolution operators
Semantic change patterns differ from the CS evolution operators that some database management systems may provide. Such operators perform data conversions following a change of the Internal Schema, but this is done without regard for the semantics of change. Schema evolution operators reduce the time required for CS change propagation, but there is no effect whatsoever on the environment and adaptability dimensions of CS flexibility.
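The losslessness of a projection-style transformation can be demonstrated with a small sketch (invented `emp` table; our rendering, not [De Troyer'93]'s formalism): after Move-Attributes-Out, the natural join of the two fragments reconstructs the original extension:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT, room TEXT)")
con.executemany("INSERT INTO emp VALUES (?,?,?)",
                [(1, 'Ada', 'A1'), (2, 'Bob', 'B2')])

# Move-Attributes-Out: the room attribute moves to a new entity keyed on
# the same identifier; the original table is rebuilt without it.
con.executescript("""
    CREATE TABLE emp_room AS SELECT emp_id, room FROM emp;
    CREATE TABLE emp_new  AS SELECT emp_id, name FROM emp;
    DROP TABLE emp;
    ALTER TABLE emp_new RENAME TO emp;
""")

# Losslessness: the natural join reconstructs the original extension.
rejoined = con.execute("""SELECT e.emp_id, e.name, r.room
                          FROM emp e JOIN emp_room r USING (emp_id)
                          ORDER BY e.emp_id""").fetchall()
print(rejoined)  # → [(1, 'Ada', 'A1'), (2, 'Bob', 'B2')]
```

The point of the surrounding discussion stands: such a transformation covers only the syntactic and data stages, and says nothing about why the attributes should move.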


6.4.6 Absent patterns

Several patterns are natural to expect, but we did not observe them. For instance, it might be assumed that every semantic change pattern is matched by a reverse pattern. However, the patterns in our catalogue appear to be mostly asymmetric, and counterpart patterns that reverse certain CS changes are not always in the catalogue. Of course, any pattern may be applied backward, but the point is that such reversals are not frequent enough to be patterns of their own. To name some patterns that we did not observe in the four evolving CSs:
- Eliminate a superimposed entity,
- Redirect a reference to an arbitrary entity,
- Restrict a reference, either from optional to compulsory, or from N:1 to 1:1 cardinality,
- Restructure an N:1 reference into a many-to-many connection by way of an intersection entity,
- Generalize several similar but separate entities into a single unified entity, in particular the enfolding pattern that transforms a hierarchy into a single entity with a reflexive reference [Veldwijk'95],
- Reify a reference into an entity.
Why did we not observe these patterns? In part because we explored too few cases and the patterns have not yet emerged. But perhaps we need to address another question as well: why did we expect to see the absent patterns in the first place? Expectations are guided by intuitions about what is 'natural to expect'. In turn, our intuition is guided by what current literature advocates as good design practice. However, the literature focuses on the design phase, and it cannot be assumed that patterns intended for design will meet the needs of maintenance engineers working in a turbulent business environment. Good design patterns may turn out to be inferior in maintenance because the pattern would create unacceptable incompatibilities, or because the overall impact is too large. We already mentioned how the conservative tendency to safeguard current investments in running information systems generally prevents extensive restructuring of the CS. We believe this to be an important reason why theoretically attractive patterns like generalization and reification did not show up in our cases.

6.4.7 Validity

The best argument for validity is that the semantic change patterns are based on real changes in evolving CSs. Semantic change patterns embody the implicit knowledge of CS maintenance and maintenance engineers. We find that practitioners readily recognize our patterns, and we take this as another mark that the notion of Semantic Change Pattern has practical value and significance. Paraphrasing [Gamma, Helm, Johnson, Vlissides'95]: 'once you've absorbed the patterns, your vocabulary will almost certainly change. You will speak directly in terms of the patterns. You'll find yourself saying things like, "Let's use a Connect-By-Intersection here", or, "Let's promote this specialization"' (p.352). The concept of semantic change patterns relates well to the framework for CS flexibility developed in chapter 3. All three dimensions are involved: the patterns are concerned with the scope and semantics of CS change, with adapting the CS, and with the timeliness of change.


6.4.8 Discussion

The notion of patterns has received considerable attention in the design phase of the CS. Remarkably, its application to the operational phase seems to be rare. We located only a few examples in the literature that resemble our semantic change patterns for CS evolution. [McLeod'88] proposes a small set of semantic change patterns, but only two of his examples are confirmed as patterns in our case studies. [Brèche'96] discusses a related notion of advanced change primitives in object databases, aiming to resolve 'classical problems of schema change: specify the semantics of the new primitives and accordingly, specify rules to preserve the consistency of the resulting schema' (p.477). [Proper'98] discusses an admittedly contrived example of CS change. All these examples have in common that they are contrived from theoretical arguments, not abstracted from actual CS changes observed in practice. We think that semantic change patterns offer an excellent opportunity to extend current database management systems with more intelligent data conversion solutions, and to provide maintenance engineers with easy-to-configure data conversion routines that are well coordinated with the corresponding CS changes. The patterns could be used as a basis to generate database routines performing semantically meaningful data conversions. Immediate data conversion should be performed wherever possible, and deferred whenever required. In the latter case, the patterns could support the generation of user applications that support manual data conversions and assist in monitoring the overall progress of data migration.


6.5 CONCLUSIONS 6.5.1 Summary One of our research objectives was formulated as disclosing operational practices and experiences in CS evolution. Restricting for the moment to the single transitions of CS versions, this chapter outlined important patterns and practices. First, we described a general way of working in CS change that explains how CS changes are driven by emerging business needs. The change process consists of four stages: - to recognize and understand the need for change on a business level, - to define the intentions and semantics of the required change, - to accommodate the change in the CS, and - to account for the impact of change, and more in particular, to execute data conversions. We classified this change process as a passive approach towards CS flexibility. We think that awareness, understanding and adoption of this change process in maintenance will improve the way of working when changes are being made in operational CSs. We then discussed some practices that businesses apply to coordinate changes in the stored data with the semantic change in the CS. These practices, which can be classified as an active engineering approach towards CS flexibility, attempt to evade the 'immediate' data conversion inherent to the relational database technology and 3-Schema Architecture. The business practices aim to strike a balance between the best way of adapting the CS, and the desire to minimize impact and simplify change propagation. In addition, we described change successions and change reversals, which we illustrated with some 16 semantic changes from the case studies. Although these 16 make up only some 20% of all semantic changes, we feel that change successions and reversals have enough relevance to merit further research. The more so as we could find no mention of the phenomenon in the literature. Next, we described semantic change patterns as a novel concept for CS evolution in operational business environments. 
The semantic change patterns address all three dimensions of our framework for CS flexibility:
- they help to determine the scope and essence of a planned change,
- they guide the engineer to accommodate the change into the CS in a rigorous way, and
- they improve timeliness, because the impact of change and the data conversion strategy are outlined in advance.
The patterns are abstractions of practical knowledge about why and how to evolve a CS, about the new semantics and adapting the CS, and about adjusting database content to changing requirements and information needs in the business environment.

6.5.2 The short-term view is not enough

This chapter abstracted from the isolated experiences in the case studies in order to extract patterns and practices. Like [Marche'93], we studied only the single transitions of CS versions. However, we are convinced that the short-term view alone is inadequate to achieve our research objective of disclosing general practices and experiences in CS evolution, and to come to understand CS flexibility in the long term.


CHAPTER 7. LONG-TERM TRENDS IN EVOLUTION

'The world can doubtless never be well known by theory: practice is absolutely necessary; but surely it is of great use to a young man, before he sets out for that country, full of mazes, windings, and turnings, to have at least a general map of it'

Letter to his son, dated August 30, 1749, by Philip Dormer Stanhope, 4th Earl of Chesterfield (1694-1773), British statesman and man of letters

7.1 INTRODUCTION

7.1.1 The long-term view of evolving CSs

One of our main research objectives is to explore the long-term evolution of CSs, as opposed to short-term CS changes, in order to disclose operational practices and experiences. This chapter takes the broad view, watching the CS evolve over an extended period of time. We abstract from the individual semantic changes accommodated into the CS, and focus on the long-term evolution of the operational CS. We analyze the measurements for the operational metrics developed in chapter 4, and determine their long-term trends and tendencies. We then translate the observations and experiences gained in the longitudinal analysis into a number of best practices for the business practitioner engaged in CS design or maintenance. Our summary of best practices must be read as additional to the current body of knowledge, enabling the engineer to achieve a higher level of flexibility in, and a more graceful evolution of, the CS. The merit of the recommendations is that they are based on practical experience acquired from real business cases, rather than on theoretical assumptions and hypotheses.

7.1.2 Chapter outline

Section 2 analyzes trends for the eight operationalized metrics outlined in chapter 4 by examining time series of the measurements recorded in the four case descriptions in chapter 5. In addition, we discuss trends in those guidelines of our framework for flexibility for which we did not develop metrics. Section 3 discusses the overall effect of consecutive semantic changes and evolution on the CS. By combining long-term trends in several of the guidelines, we conclude that the main effect of CS aging is in the adaptability dimension of flexibility. Section 4 discusses the effects of derived data on long-term CS evolution. We develop a taxonomy of derived data using three characterizations. This taxonomy helps to understand why the theoretical directive to ban all derived data from the CS cannot always be adhered to. A preliminary version of this section was published as [Wedemeijer'00aug]. Section 5 provides our list of best-practice recommendations for the business practitioner engaged in CS design or maintenance. Section 6 concludes this chapter, and the Practical Track of our research.


7.2 MEASUREMENTS OF CS EVOLUTION

7.2.1 Justified change

The metric for justified change, and the size-of-change metric to follow, address the environment dimension of our framework for flexibility. The hypothesis underlying the metric for justified change is that the CS is a valid model of the information structure of the UoD, and that change in the CS is therefore invariably driven by change in the information structure of the UoD. Figure 47 depicts our measurements for this metric.


Figure 47. Justified change
positive numbers represent justified changes in each CS version; negative numbers represent unjustified changes

1. Almost half of CS changes are unjustified
Of the 73 observed semantic changes, 34 (=46%) are not justified by a corresponding change in the information structure of the UoD. Some examples of changes that are not driven by contemporary real-world changes are:
- schema restructuring, as in change (P) in the Benefit case and (G), (H) in the Settlement case; these changes either accommodate derived data or eliminate it from the CS,
- changes (N) in the Benefit case, (G) in the Settlement case, and (4) in the Franchiser case; these changes reflect user needs to explicitly model certain specializations in the CS,
- changes (E) and (F) in the Sites case, which eliminate obsolete constructs, and
- changes (P) and (Q) in the Benefit case and change (J) in the Sites case, which replace a generalized entity by explicit specializations, decreasing the level of abstraction in the CS.


2. Justified changes are often driven from the internal business environment
Of the 39 justified changes, 21 are associated with external causes driving the change. However, 18 justified changes (=46%) are driven by internal business considerations rather than external causes, such as:
- the need to record historical data, which accounts for change (A) in the Sites case,
- changes in perception, such as the decision to model products with a dependency on contracts, as seen in change (K) of the Benefit case, or the decision to promote a specialization into a full entity, as seen in the Sites case in changes (B) and (J), and
- changes in the modeling guidelines about derived data.

3. Implications for theory
For justified CS changes, theory offers numerous reasons deriving from the external business environment. For instance, [ANSI/X3/sparc'75] stated how 'the CS is sensitive to business cycles, diversifications, mergers, new interests and other dynamics of the corporation' (p.III12). Remarkably, other sources of change appear to be generally overlooked. Our finding is in agreement with [Fitzgerald'90], who reports that 'enhancements have been analyzed and broadly fell into three main groups: environmental, technical, and organizational (..) it emerged that approximately 70% of the enhancements to the systems could be attributed to organizational factors. The majority of which were internally controllable by the individual organizations' (p.8). [Lederer, Salmela'96] propose to distinguish between change drivers that are external to the enterprise ('a more stable external-environment enhances stability') and changes arising from the internal business environment ('a more simple internal environment enhances stability'). We conclude that their proposal is in agreement with our finding.
Further research is needed to investigate the intricate relationships between the external business environment and the management level where the need to change the CS is decided upon [Jordan, Tricker'95], [Feurer, Chaharbaghi, Weber, Wargin'00].

4. Implications for practice
The large share of unjustified changes cannot, in our opinion, be explained as faulty scope setting, i.e., by assuming that designers fail to determine the correct scope of the UoD. For one, it is a fundamental problem to know what the essence of the UoD really is. The Benefit case illustrates this when the EXCHANGED-E.R.-BENEFIT entity is subsumed by the EXCHANGE generalization in change (I). How could anyone know beforehand that the intent of this entity would later have to be extended? Another argument is that the engineers are not always to blame for seemingly inferior scope setting. Sometimes, business decisions impose ways of modeling that prevent engineers from grasping the essence. Some examples are:
- change (F) in the Benefit case, where new entities are inserted to record elective benefit options, but four specific benefit options are dealt with by attributes (not depicted in the diagram) rather than by instances of the newly inserted entities,
- change (A) in the Settlement case, which creates a temporary solution to quickly comply with legal demands, to be followed by a train of unjustified changes to remedy the negative aspects of the initial decision and to account for the full impact of change belatedly, and
- change (B) in the Sites case: users want to distinguish sites that are either operational or managerial, but the demand could have been met without a structural change in the CS.


7.2.2 Size of change

The hypothesis underlying this metric is that if a CS is a good model of the UoD information structure, then change in the UoD will be accommodated in the CS by a change of proportional size. It was explained in chapter 4 how proportionality of change cannot be assessed, due to the lack of an objective measure for size of change in the UoD. Therefore, we simplified the metric to measure only the size of change in the CS. Figure 48 shows the measurements of this simplified size-of-change metric.


Figure 48. Size of change
separate rather than stacked columns are shown because measurements do not add up

1. As a rule, a semantic change affects 6 constructs or fewer
Most CS changes are limited in size. Of the 73 observed changes, only 4 changes (=5%) affect more than 6 constructs at once, and these are justified changes. There are 19 semantic changes (=26%) that affect 5 or 6 constructs. The majority of 50 semantic changes (=69%) affect 4 or fewer constructs.



Figure 49. CS age versus average size of the changes
CS age expressed as percentage of observed lifetime; average size of changes per CS version calculated as the sum of 'size of change' divided by the number of changes; trend line calculated by linear approximation

2. Size of semantic change does not clearly depend on CS age
Figure 49 shows CS age (horizontal) versus average size of changes per CS version (vertical). The averages vary between 1 and 6 affected constructs per semantic change. The trend line slopes downward from 4 to 3, indicating that the average size of change decreases as the CS ages. However, this downward trend is not firm: a single large change late in life will tilt the trend line the other way. Thus, there is no clear and uncontested evidence confirming the hypothesis that old CSs systematically go through smaller changes than new ones. The relationship between CS age and size of the average change remains undecided.

3. Implications for theory
The hypothesis underlying this metric is that change in the UoD ought to be accommodated by a change in the CS of proportional size. Our cases show some examples of justified semantic change where 10 or more constructs are affected: changes (E) and (I) in the Benefit case, and change (H) in the Franchiser case. Inspecting these particular changes, we find them to be related to important changes of perception regarding real-world information structures. The finding puts the practical importance of Future Analysis [Land'82] into perspective. Future analysis may certainly be beneficial, but it is unlikely to prevent future changes of large size.
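The linear trend line of figure 49 is an ordinary least-squares fit of average size of change against CS age. A minimal sketch of that computation in Python, using purely illustrative data points (the actual measurements are recorded in the case descriptions of chapter 5):

```python
def linear_trend(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data only: CS age (fraction of observed lifetime)
# versus average size of change per CS version.
ages = [0.0, 0.25, 0.50, 0.75, 1.0]
avg_sizes = [4.0, 4.0, 3.0, 3.0, 2.0]

slope, intercept = linear_trend(ages, avg_sizes)
print(slope, intercept)  # a negative slope means size of change decreases with age
```

The same least-squares idea, taken to a second-degree polynomial, yields the quadratic trend line used for figure 55 later in this chapter.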


7.2.3 Compatibility

The hypothesis underlying the compatibility metric is that enterprises will try to prevent user intervention or manual editing of data when a semantic change is committed. This metric, and the extensibility metric to be discussed next, primarily addresses the timeliness dimension of our flexibility framework.


Figure 50. Compatibility
positive numbers represent compatible changes per CS version; negative numbers represent incompatible changes

1. The rule is compatible change
Figure 50 depicts our observations on the compatibility of CS changes. The hypothesis is clearly confirmed. Incompatibility is exceptional: we see it in only 3 out of 73 semantic changes (=4%). Moreover, we find that the extent of incompatibility is kept as small as possible: manual adjustments were always limited to a single entity only.

2. Implications for practice
The demand for compatibility is rarely explicit in change requirements, but it is standard practice to provide it, and to provide it by means of an automatic data conversion whenever possible. This may pose a difficult challenge for the engineers, especially when dealing with legacy databases where the semantics and composition of the stored data have been lost in history. Still, the call for compatibility is always legitimate and must be acknowledged.


7.2.4 Extensibility

The extensibility hypothesis is that changes in user requirements are accommodated either by adding new constructs to the CS, or by broadening the intents of existing ones. Figure 51 depicts measurements for the extensibility metric.


Figure 51. Extensibility
positive numbers represent extensions in the CS version; negative numbers represent changes that do not extend the CS

1. One out of three CS changes is not an extension
The extensibility hypothesis is confirmed, but only weakly so. There are 46 extensions in 73 changes (=63%). Of the remaining semantic changes, 14 (=19%) are CS reductions (construct elimination and intent restriction). In the remaining 13 changes, the new CS version differs from the previous version in other ways, e.g., reference redirections and entity restructuring.

2. CS age does not seem to affect the share of extensions or non-extensions
This observation is based on figure 52. The two trend lines are not remarkably different, suggesting that CS age makes no difference to the share of extension versus non-extension changes.



Figure 52. CS age versus share of (non-)extensions
extensions are shown as dots with a solid trend line; non-extensions are shown as triangles with a dashed trend line; trend lines calculated by linear approximation

3. Implications for theory
Extensibility is certainly important in CS changes, but it is not the only mode of CS evolution. New information is easily captured by CS extension, e.g., by applying the Append-an-Entity or the Connect-by-Intersection pattern, but this does not mean that new requirements always call for new data. This is clearly illustrated in the Benefit case by changes (L) and (M), which eliminate derived-data entities in response to a change of modeling guideline. [King'86] observed how the introduction of a new information system triggered demands for additional information, causing extensions to outnumber other types of change. His observation may well be valid for newly introduced systems. However, it is probably an initial effect only, as we do not observe it in the long-term evolution of our operational CSs.


7.2.5 Lattice complexity

The lattice complexity metric, and the three metrics to follow, address the adaptability dimension of our framework for flexibility. The hypothesis underlying the lattice complexity metric is that complexity hampers change. By implication, CS changes are presumed to be simple. Figure 53 depicts our measurements of how much CS lattice complexity changes because of semantic change.


Figure 53. Change in CS lattice complexity
positive numbers represent an increase in lattice complexity per CS change; negative numbers represent a decrease in lattice complexity

1. Most CS changes alter lattice complexity by at most 1
Most CS changes are simple. Of 73 semantic changes, 41 leave lattice complexity unchanged (=56%). A total of 24 semantic changes raise or lower lattice complexity by exactly 1 (=33%). Only 8 semantic changes (=11%) cause a larger change in lattice complexity.

2. Lattice complexity is not a determinant of CS version life span
The basic hypothesis about complexity is that more complex CSs will change less frequently. Thus, we ought to see longer life spans for more complex CS versions. Figure 54 investigates whether lattice complexity is a determinant of version life span. The horizontal axis plots version complexity, assumed to be a determinant, and the vertical axis depicts the life span of that CS version as a percentage of total CS lifetime. We include only the CS versions that have been terminated, and exclude the final CS version of each case study, as their life spans are unknown. The trend line in the figure is almost horizontal: lattice complexity is not an important determinant of CS version life span. In other words, more complex CSs apparently do not change much faster (or slower) than simple CSs.


However, this finding must be interpreted with care. It was indicated in chapter 1 that some CS version changes may have escaped our notice, so the base data about CS version life spans in this figure cannot be fully relied upon.


Figure 54. Lattice complexity of CS version versus CS version life span
horizontal axis plots lattice complexity of CS version; vertical axis plots CS version life span (as percentage of CS life); trend line calculated by linear approximation

3. Lattice complexity increases initially, but is curbed at higher age of the CS
Figure 54 suggests that the flexibility guideline to 'keep the CS simple' has little effect on the stability of a given CS version in the short term. However, another picture emerges from figure 55 and its trend line. It depicts the evolution of CS lattice complexity over the entire CS lifetime, taking the long-term view, whereas figure 54 looks at life spans of individual CS versions. The horizontal axis in figure 55 plots CS age, expressed as a percentage of the total life span observed in the case study. Lattice complexity of CS versions is plotted vertically. The trend line, calculated by second-order approximation, clearly shows how lattice complexity rises at first due to the initial changes, but goes down at higher CS age.

4. Implications for theory
Apparently, there are counterforces at work: changes later in the CS life cycle bring lattice complexity down through specific countermeasures. Examples are seen in the elimination of derived data in the Settlement case, and the system renovation in the Sites case. Our finding is consistent with the Law of Increasing Complexity for software programs stated by [Lehman, Belady'80]: 'As an evolving program is continually changed, its complexity, reflecting deteriorating structure, increases unless work is done to maintain or reduce it' (p.412 in the '85 reprint).




Figure 55. CS age versus lattice complexity of CS version
horizontal axis plots CS age; vertical axis plots lattice complexity of CS version; trend line calculated by second-order approximation

We suspect that high lattice complexity signifies an intimate coupling between the CS and the software applications that produce derived data. The CS supports the applications by storing intermediate and final results of calculations, and also captures the references that relate the stored derived data to their source data. This way of modeling is instrumental to the processing and updating of derived data, but it enriches the CS neither in semantics nor in information content. Illustrations of this line of reasoning are found in the Benefit and Settlement cases, where derived-data entities are eliminated.

5. Implications for practice
A best-practice recommendation for practitioners is to strive for minimal change in lattice complexity. Users should decompose their complex requirements for change into simpler, more manageable requests for semantic change. This will improve the feasibility of the change process, and it eases the accommodation of change in the CS lattice. The advice is in keeping with other intuitive guidelines on how to deal with complexity, such as those proposed by [Truijens, Winterink'96]. When a CS is changed, it is not enough to account for entity integrity and referential integrity of the affected entities only. Whenever the lattice complexity of the CS is altered, and this happens in over 40% of the semantic changes, the connection constraints must be reviewed. A new connection, i.e., a pathway of references between existing entities, may be established. Care must be taken that the semantics of the associated new connection constraint is well accounted for in the existing data populations.
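The exact lattice complexity metric is the one operationalized in chapter 4 and is not restated here. Purely for illustration, the sketch below assumes complexity is measured as the cyclomatic number of the reference lattice (number of references minus number of entities plus number of connected components), so that the per-change delta of figure 53 can be computed from two consecutive CS versions; the entity names are hypothetical:

```python
def lattice_complexity(entities, references):
    """Cyclomatic number of the schema graph:
    |references| - |entities| + |connected components|.
    This is an illustrative stand-in for the chapter 4 metric."""
    # Build an undirected adjacency view of the reference lattice.
    adj = {e: set() for e in entities}
    for a, b in references:
        adj[a].add(b)
        adj[b].add(a)
    # Count connected components with a plain depth-first search.
    seen, components = set(), 0
    for start in entities:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adj[node] - seen)
    return len(references) - len(entities) + components

# Two hypothetical consecutive CS versions.
v1 = ({"CONTRACT", "PRODUCT", "CUSTOMER"},
      {("PRODUCT", "CONTRACT"), ("CONTRACT", "CUSTOMER")})
v2 = ({"CONTRACT", "PRODUCT", "CUSTOMER"},
      {("PRODUCT", "CONTRACT"), ("CONTRACT", "CUSTOMER"),
       ("PRODUCT", "CUSTOMER")})  # extra reference closes a cycle

delta = lattice_complexity(*v2) - lattice_complexity(*v1)
print(delta)  # the added reference raises lattice complexity by 1
```

On this measure, a change that only appends a leaf entity with one reference leaves the cyclomatic number unchanged, whereas a reference that closes a new pathway between existing entities raises it, which matches the connection-constraint review advised above.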


7.2.6 Susceptibility to change: per entity

The basic hypothesis underlying our susceptibility-to-change metrics is that some types of construct are more likely to change than other types. However, we cannot compare susceptibilities for all types of construct provided by the data model theory, as our research covers only the entity and reference constructs.


Figure 56. Susceptibility to change: per entity
positive numbers represent the number of affected entities per CS change; separate rather than stacked columns are shown because measurements do not add up

1. A semantic change affects no, one, or several entities, with roughly equal probability
A classification of the 73 semantic changes according to the number of affected entities results in three classes of comparable sizes. The largest class contains the 29 changes (=40%) that do not affect existing entities (i.e., straightforward extensions). The second class contains 26 changes (=36%) that affect a single entity. The smallest class, containing 18 items (=25%), is the class of semantic changes that affect several existing entities at once.

2. As a rule, no entity is eliminated if it has both member and owner entities
We only saw entities being eliminated that were either at the top or at the bottom of the CS hierarchy, for instance, eliminations (A) and (H) in the Sites case, and eliminations (L) and (M) in the Benefit case. The exception to this rule is found in changes (G) and (H) of the Settlement case. However, this underpins rather than refutes our argument, as the eliminations concern derived data, and the entities as well as the connections are fully redundant.


3. As a rule, every entity structure evolves
We defined entity structure as an entity's set of references and specializations, i.e., entity composition in terms of aggregation and generalization. We find that almost every entity had at least one reference or specialization changed in the course of the CS evolution. In the case studies, we located only four entities that remained unaffected in both references and specializations: CUSTOMER and BENEFIT-PREMIUM/REDUCTION in the Benefit case, and PENSION-POLICY and PARTICIPATION in the Settlement case. No entity remained unaffected in the two other cases.


Figure 57. Susceptibility to change: per entity, depending on hierarchic position
shown as percentages of changed versus unchanged packages; entity position in the CS hierarchy is expressed as the number of dependent entities ('package size')

4. Susceptibility to change is independent of entity position in the CS hierarchy
It may be speculated that entities high up in the CS are less prone to change than entities low in the hierarchy. Figure 57 was created to study this assumption. We calculate the relative position of an entity in the CS lattice by counting its dependent entities, using the notion of package. The package for an entity contains the entity itself, its member entities, all members of its members, etc., forming the transitive closure [Blaha, Premerlani'98]. Package size is the number of entities in the package. We came up with 122 entities of package size 1 (i.e., the package contains only the entity itself), 38 entities of size 2, etc. Next, we checked whether the entity package in that CS version happened to encapsulate a semantic change. If so, we count it in the lower range of entities in figure 57, entities that encapsulate a semantic change (shown dark). Otherwise, it is in the upper range of figure 57. As entities low in the CS hierarchy far outnumber the high-up ones, we show percentages rather than absolute numbers for comparability.


Figure 57 does not corroborate the speculation that high-up entities are more stable. Rather, the figure suggests that entities with package sizes 4 to 12 are most prone to change. The fluctuation may be accidental, but if not, we have no theoretical explanation for this tendency.

5. Implications for theory
The finding that the structure of every entity evolves is somewhat unexpected. We expected well-designed entities, whichever they are, to remain stationary, with only some entities of inferior design quality being prone to change. [Sjøberg'93] reaches a conclusion similar to ours concerning entity compositions: 'every relation has been changed. At the beginning of the development, almost all changes were additions. After the system provided a prototype and later went into production use, there was not a diminution in the number of changes, but the additions and deletions were more nearly in balance' (p.39).

6. Implications for practice
Connecting entities are persistent. With one exception, no entity with both member and owner entities was seen to be eliminated. Such entities provide connections: pathways of references connecting the instances of member entities to instances of the owner. Although the entity itself may become irrelevant, the connection remains vital in the CS structure and the entity is not eliminated.
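The package computation behind figure 57 is a straightforward transitive closure. A minimal sketch, assuming the reference lattice is represented as (member, owner) pairs and using hypothetical entity names:

```python
def package(entity, references):
    """The package of an entity contains the entity itself, its member
    entities, members of members, etc. (the transitive closure).
    `references` is a set of (member, owner) pairs."""
    # Invert the references into an owner -> members map.
    members_of = {}
    for member, owner in references:
        members_of.setdefault(owner, set()).add(member)
    # Breadth-first expansion of the closure.
    pkg, frontier = {entity}, [entity]
    while frontier:
        owner = frontier.pop()
        for member in members_of.get(owner, ()):
            if member not in pkg:
                pkg.add(member)
                frontier.append(member)
    return pkg

# Hypothetical reference lattice: member -> owner.
refs = {("ORDER-LINE", "ORDER"), ("ORDER", "CUSTOMER"),
        ("PAYMENT", "ORDER")}

print(len(package("CUSTOMER", refs)))    # top entity: itself plus all transitive members
print(len(package("ORDER-LINE", refs)))  # leaf entity: package of size 1
```

Checking whether a package encapsulates a semantic change then amounts to testing whether all constructs affected by the change fall inside the computed package set.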

7.2.7 Susceptibility to change: per reference

Figure 58 depicts measurements for susceptibility to change per reference.

1. As a rule, references are stable
No reference is involved in 44 out of 73 changes (=60%); only the remaining 29 semantic changes (=40%) affect one or more conceptual references. But there is more to it. Reference changes are often made as a consequence of some change in one of the associated entities, even though the reference construct as such, i.e., its semantics, is unaffected. If we exclude changes in references associated with entity addition or elimination, only 14 semantic changes (=19%) are found to affect a reference as an independent construct. If we also consider the fact that most CSs have more references than entities, then it must be concluded that, as a rule, references in the CS are more stable than entities.

2. Reflexive references are unstable
Reflexive references are uncommon, and we have only 7 such constructions in our cases. Their evolution is remarkable: every reflexive reference changed in the course of the CS evolution. Four reflexive references were in place in the initial CS version as evolution began, and changed into regular references in some later CS version. The three other reflexive references were not included in the initial CS version, and were inserted later.

3. References with 1:1 cardinality are unstable
Like reflexive references, the 1:1 cardinality for aggregate references is uncommon, and we count only 6 in our cases. Again, these references are unstable. Five 1:1 references were inserted in the course of the CS evolution. One reference with 1:1 cardinality was present in the initial CS of the Benefit case; change (H) relaxed it to N:1 cardinality.


Figure 58. Susceptibility to change: per reference
positive numbers represent the number of affected references per CS change; separate rather than stacked columns are shown because measurements do not add up

4. Implications for theory
An example evolution of a reflexive reference is seen in the Sites case. The reflexive reference between instances of the UNIT entity was initially labeled managed-by. Later versions of the CS add references that implement other relationships between units, e.g., directs and recorded-by. Thus, the semantics of the reference dissociates in a way that is reminiscent of the design pattern presented in [Batini, Ceri, Navathe'92] (p.59). In our experience, true 1:1 cardinality for references is an uncommon phenomenon in data modeling. We believe that 1:1 cardinality for a reference is most likely to be a design flaw: the construct should have been modeled either with N:1 cardinality, or as a generalization. The Settlement case shows an example of a missed generalization, as two separate entities are modeled, SEPARATED-BENEFIT and BENEFIT-DEDUCTION, that together represent a single concept of the UoD. An example of the former situation is the BENEFIT↑PARTICIPATION reference in the Benefit case. As advocated by [Davis'90], the preferred way of modeling is an N:1 reference, plus a separate cardinality constraint to capture the business rule. As the business rule is liable to change, evidenced in this example by change (H), this solution minimizes the future impact of change.


5. Implications for practice
Some authors suggest that CS flexibility may be partly due to the capability to independently manipulate the reference construct in the CS lattice. However, we find that many changes in references are actually a consequence of some change in the associated entities; we already noted how every entity structure evolves over time. Hence, the theoretical capability has little practical significance.

7.2.8 Preservation of identity

The hypothesis underlying this metric is that the set of candidate keys captures the essence of an entity. Figure 59 depicts our measurements for this metric.


Figure 59. Preservation of entity identity positive numbers represent identity preserving changes in each CS version negative numbers represent semantic changes that do not preserve entity identity 1. As a rule, entity identity is preserved We observed 11 changes of key identity in 73 semantic changes (=15%) so as a rule, candidate keys do not change. Once an entity is operationalized, its identity remains more or less stable, even if its structure (references and specializations), its attribute composition, or its intent change over time. Thus, preservation of identity is a stabilizing factor in CS evolution. 2. Implications for theory Chapter 6 discussed the phenomenon of weak-entity key. Weak-entity keys give rise to update anomalies when the CS is changed, and generally prohibit schema restructuring. Although we clearly observe preservation of entity identity, it is unclear whether this confirms the hypothesis of candidate key as an expression of entity essence, or whether it is merely a symptom of rigidity due to weak-entity key implementations.

Long-term Trends in Evolution


7.2.9 Qualitative trends in other flexibility guidelines
1. Proportional rate of change
The hypothesis is that the CS will evolve at a high rate if a turbulent environment is modeled, while stability in the environment will be reflected in a low rate of change in the CS. A metric for proportionality in change rates could not be established, for want of an objective measure of 'turbulence in the UoD'. Perhaps progress can be made by exploiting the change process outlined in chapter 6, in particular its initial stage that focuses on the business need to change the CS, but this is an issue for future research. Nevertheless, the inability to assess rate of change does not diminish the validity of our other observations.
2. Modules in the CS
One hypothesis regarding modules in the CS is that semantic changes are encapsulated if the modules in the CS are well chosen. A second hypothesis, outlined in chapter 4, is that modules in the CS will evolve as independent units, and therefore, that a well-chosen modularization will contribute to CS stability. Many theories exist on how to create a modular CS, but none is well accepted or widely practiced. Some CSs in the case studies were designed or even maintained by using partitions, but the partitions served documentation purposes only. The engineers never suggested that the particular partitions constituted a modularization of the CS with a view to long-term stability. Nevertheless, we attempted to investigate the hypotheses, resorting to the notion of 'package' as defined by [Blaha, Premerlani'98]. We found that 58 (=80%) out of 73 semantic changes are encapsulated by a package consisting of 6 entities or fewer. This may seem significant, but this package size actually holds for the majority of entities (see figure 57). Moreover, it cannot be assumed that these packages allow a good, permanent modularization of disjoint modules that encapsulate the changes in the CS.
Therefore, we believe that the hypotheses regarding modules in the CS have little value for practice.
3. Type persistence
As we pointed out in chapter 4, type persistence is an important assumption underlying the susceptibility-to-change metrics. We find that type persistence is certainly predominant in the CS evolution of the case studies, but there are several counterexamples:
- switch from attribute to entity, as seen in the pattern Move-Attributes-Out,
- switch from attribute to constraint. UNIT in the Sites case used its own artificial key. Gradually, this artificial attribute evolved into an artificial, i.e. semantically meaningless, uniqueness constraint across unrelated entities,
- switch from instance on the data level to construct on the schema level. Instances of the CLASS entity in the Sites case record classifications of the instances of UNIT. Later CS versions modeled these classifications explicitly by way of entity and reference names,
- switch from reference to entity, as seen in change (K) of the Benefit case.



7.2.10 Discussion
Chapter 4, in operationalizing the set of metrics, remarked on a possible bias in the metrics. There may be an indirect dependence on CS size, or, for that matter, on other features of the CS that we consider immaterial to CS flexibility. Within the framework of this exploratory research, it is impossible to ascertain whether our metrics are biased in some way. We analyzed four evolving CSs for long-term trends according to the dimensions and guidelines of our framework for flexibility, and without attempting to verify comparability beforehand, we treated the four cases on an equal basis. However, in our analysis, we did not experience a fundamental difference between the cases that would prevent us from treating them uniformly. Naturally, the cases are not equal and the evolutions differ. But we regard the differences as different manifestations of the fundamental concept of CS flexibility, a concept that is identical in all cases.



7.3 THE AGING CS
7.3.1 CS size increases
Size, the number of entities in a CS, depends on the extent of what is considered to be the relevant real world, and on the perceived differences between objects in that real world. In theory, CS size is important for overall understanding, but it bears no relation to CS flexibility or CS evolution over time. Therefore, size has no place in the framework for flexibility, and chapter 4 did not mention CS size as a relevant metric. In practice, however, we find that CS size is a relevant feature of CS evolution.

Figure 60. Change of CS size, per version of the Benefit, Settlement, Sites, and Franchiser cases. Positive numbers represent increasing size; negative numbers represent decreasing size.

Figure 60 depicts the change in the number of entities. The CS is seen to grow larger in almost every CS version. The few exceptions of decreasing size correspond to the elimination of derived-data entities, to be discussed in the next section. Apparently, CS evolution is almost synonymous with CS growth. Notice how this finding on CS versions is similar to the extensibility hypothesis that we formulated for semantic changes.

7.3.2 The number of semantic changes declines
Older CSs are often suspected to be less amenable to change. Figure 61 investigates this assumption. CS age is shown horizontally, expressed as a percentage of the total CS life observed in the case studies. The number of semantic changes is shown vertically. The figure demonstrates that a CS accommodates fewer semantic changes with increasing age. Moreover, it shows that the nature of the change driver, i.e., justified or unjustified change, bears no relation to this effect. This is shown by comparing the two trend lines in figure 61. The solid line depicts the trend for justified changes; the dashed line is the trend for unjustified changes. Both trend lines curve downwards in a similar way.

Figure 61. CS age versus number of (un)justified changes. Dots and the solid trend line depict justified changes; triangles and the dashed trend line depict unjustified changes. Trend lines are calculated by second-order approximation.
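A second-order approximation of this kind is an ordinary least-squares fit of a quadratic. The following sketch is our own illustration of such a fit, not the tool actually used to produce figure 61; it solves the 3x3 normal equations in pure Python.

```python
def quadratic_trend(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns [a, b, c]."""
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)                     # power sums of x
    sy = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))  # mixed sums
    # Normal equations A * [a, b, c]^T = rhs
    A = [[s(4), s(3), s(2)],
         [s(3), s(2), s(1)],
         [s(2), s(1), float(n)]]
    rhs = [sy(2), sy(1), sy(0)]
    # Gaussian elimination with partial pivoting
    for i in range(3):
        piv = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        rhs[i], rhs[piv] = rhs[piv], rhs[i]
        for r in range(i + 1, 3):
            factor = A[r][i] / A[i][i]
            A[r] = [arc - factor * aic for arc, aic in zip(A[r], A[i])]
            rhs[r] -= factor * rhs[i]
    # Back substitution
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        coef[i] = (rhs[i] - sum(A[i][c] * coef[c] for c in range(i + 1, 3))) / A[i][i]
    return coef
```

For example, fitting points taken from y = 2x^2 - 3x + 1 recovers the coefficients a = 2, b = -3, c = 1 up to floating-point error.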

7.3.3 Level of abstraction decreases
Level of abstraction is an important consideration in data modeling, and theory advocates generalization as a design and maintenance best practice. No metric was developed for this guideline. Nevertheless, a clear tendency towards a concrete rather than an abstracted way of modeling is evident in the case studies. Some examples are:
- the BENEFIT entity in the Benefit case could have been extended to capture E.R.-BENEFIT-LEVELS-1, -2, and -3 of changes (A), (B) and (C), but it was not,
- again in the Benefit case, change (F) inserts two entities labeled ...-(NEW). Here again, we believe that the change should have been accommodated by using entity extensions, rather than by appending new entities,
- in the Sites case, the concept of UNIT was replaced by concrete entities through changes (B), (I), and (J).
Change (H) in the Franchiser case is the only example where a deliberate effort is made to counter the tendency towards less abstraction. This change unifies semantically similar RATE and TARIFF specialization entities into a single generalized entity.

Long-term Trends in Evolution


Although change (I) in the Benefit case also amounts to generalization, this is in response to a broadened concept in the UoD. It is not a deliberate effort of maintenance engineers to use a more abstract way of modeling.
A high level of generalization in the CS is supposed to increase CS flexibility. However, there is a price to pay, as simplicity and ease of data access are reduced. A concrete design is simpler, as it captures everyday concepts in familiar and understandable entities, whereas an abstract CS has to use vague, highly generalized entities. Data access and data maintenance in an abstracted CS become more complex, as the correct specialization and entity instance must be selected from among the many irrelevant ones in the generalized entity. The demand for compatibility adds a strong drive towards concrete ways of modeling, by demanding that running software applications remain fully functional without being adapted or tested. Thus, the short-term advantages of a concrete CS outweigh the potential disadvantage of reduced flexibility in the long term.

7.3.4 Aging affects the adaptability dimension
Three combined effects are seen: size increase, a declining number of semantic changes, and a decreasing level of abstraction. However, these effects cannot simply be interpreted as a decline of CS quality. Other effects of CS aging must also be taken into account:
- the ratio of justified and unjustified changes appears to be unaffected by CS age. Thus, if unjustified change is a token of bad quality, then CS quality may be doubtful to begin with, but there is no evidence of further degradation of quality over the CS lifetime, and
- size of semantic change, another metric defined for the environment dimension of CS flexibility, shows no clear relationship between CS age and CS quality (see figure 49).
Therefore, we are inclined to believe that it is not the environment dimension of flexibility that lies at the root of the aging effect in the evolving CS. The timeliness dimension is also not involved. We defined two metrics for this dimension, and in the CS evolutions, neither shows a specific trend towards higher or lower levels of compatibility or extensibility (see figure 52). Thus, we hypothesize that the adaptability dimension of flexibility lies at the root of the aging effect in CS evolution. In other words, the aging CS grows less adaptable over time. The fact that lattice complexity initially increases, but is curbed later (see figure 55), corroborates our hypothesis. Resistance to change increases slowly as more changes are accommodated into the CS, and subsequent CS changes become ever harder to accommodate. The CS is and remains a valid model of the UoD, but slowly its adaptability decreases; in other words, the rigidity of the CS increases.

7.3.5 Discussion Several researchers engaged in reverse-engineering research have remarked upon the effect of aging and increasing rigidity [Winans, Davis'91], [Chiang, Barron, Storey'94], but it has not been pursued as a subject for research. In the area of software management, [Lehman, Belady'80] and [Burd, Bradley, Davey'00] discuss case studies that display similar trends of



deteriorating software structure. Our exploratory research is the first to firmly demonstrate the effects of aging in evolving CSs in an objective and verifiable way. However, we can only speculate why exactly rigidity increases, as our exploratory case study approach is unsuited to establish and prove causal relationships. Various speculations can be made as to the cause of rigidity:
- lack of knowledge and deep understanding of the CS, because the original designers with their intimate knowledge of the UoD and the CS are no longer available,
- lack of experience in the maintenance teams, inadequate tools, or insufficient time and budgets to conduct maintenance in the best possible way. A similar suggestion is made by [Jones'94] regarding software complexity: 'surprisingly, much of the observed complexity appears to be technically unnecessary (and) excessive schedule pressure and hasty design tend to be a common root cause' (p.100),
- an increasing number of software components has come to depend on the CS, which is perceived as a reliable and stable component, resulting in compatibility demands that become ever more restrictive. Thus, the very success of the CS, its longevity and its quality, is an indirect cause of rigidity.



7.4 DERIVED DATA DECREASE STABILITY OF THE CS
7.4.1 Introduction
This section develops a taxonomy of derived data using three characterizations. To our knowledge, this taxonomy of derived data in a CS has not been reported before. The characterizations are illustrated by examples of derived data in the four case studies. We propose a best practice in CS design for dealing with derived data in the operational business environment.

7.4.2 A taxonomy of derived data in the Conceptual Schema
A data item z in the CS is called derived data if some non-trivial function ƒ exists, and some base data {a, b, c, ...} in the CS, such that the value of z can be calculated:
z = ƒ(a, b, c, ...)
Non-trivial here means to exclude cyclic dependencies: the set of base data {a, b, c, ...} may neither contain z itself, nor be derived from z in any way. Derived data is not the same as functionally dependent data [Date'00]. A functionally dependent attribute can have only a single value for each combination of the data it depends on, but that single value can still vary to some extent. In contrast, a derived data value is inflexible, being deterministically dependent upon the derivation function and base data.
Three characterizations (figure 62) should be considered in any analysis of derived data in the CS: the construct, business process, and temporal characterizations. We do not claim that these characterizations suffice to understand all problems regarding derived data, but they do provide a good basis for common understanding.
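The definition z = ƒ(a, b, c, ...) can be made concrete with a small, hypothetical sketch (the attribute names are our own invention, not taken from the case studies): a stored value counts as derived data because a non-cyclic function over base data in the CS reproduces it exactly.

```python
# Hypothetical base data {a, b}: two attributes recorded in the CS.
base = {"gross_benefit": 1200.0, "deduction": 150.0}

def f(data):
    # The derivation function; it neither reads nor depends on
    # 'net_benefit' itself, so the dependency is non-cyclic.
    return data["gross_benefit"] - data["deduction"]

# The derived value z, as recorded in the data store. It is derived
# data because f reproduces it deterministically from the base data.
stored_net_benefit = 1050.0
assert stored_net_benefit == f(base)
```

A functionally dependent attribute, by contrast, would merely be constrained to a single value per combination of determinants; no fixed formula would pin that value down.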

Figure 62. Taxonomy of derived data in the CS:
- construct characterization: derived value; derived existence
- business process characterization: derived from external data; derived from data within the schema
- temporal characterization: is redundant; survives; precedes; is temporarily specified



1. Construct characterization of derived data
First, notice how the intuitive definition z = ƒ(a, b, c, ...) actually holds for a single attribute value at a time. Thus, values of attributes may be derived. However, data model theories provide other constructs besides attributes, and those may be subject to derivation too. For one, the existence of entity instances may be derived; a further complication may be that the existence of instances in one subset is derived in one way, while the existence of instances in another specialization is inferred by another derivation. References may also be derived, both in value and in existence. The latter situation is illustrated by compulsory reference constraints, which force a reference instantiation to exist for every instantiation of the member entity. That reference values may be derivable is illustrated by connection constraints; e.g., a connection constraint may require that the same owner entity instance is reached from a member instance regardless of which reference path is followed to make the connection.
2. Business process characterization of derived data
Some of the base data in the formula z = ƒ(a, b, c, ...) may be derived themselves. This is the root idea behind the concept of information chains [Redman'96] that propagate business data from a source to some final destination in a sequence of steps. Long information chains, i.e. a large number of derivation steps, are generally considered undesirable. Shortening the chains will show up in the CS evolution, as derived data in the CS is gradually replaced by base data. This phenomenon is illustrated in the Settlement case. Two different situations must be distinguished here. First, data in the schema may be derived from data external to the current CS. This results in a need to extend the CS in order to capture the source data that was formerly beyond the scope.
Second, data in the schema may be derived from other data that is already within the scope of the current CS. This results in a need to improve the CS in order to eliminate the derived data. Thus, a CS can be expected to change less, and be more stable, if the information chains for all data in the CS are kept short to begin with.
Pursuing this idea further, the starting point of the information chains must be determined, the challenge being to pinpoint the data that cannot be derived. Evidently, values for those data can only be observed from real-world phenomena, and such data may be called enterprise source data. Of course, data about those real-world phenomena may still contain some inherent dependencies reflecting laws of nature; e.g., partner relationships are symmetric by law. Being, or not being, enterprise source data is not a static property: boundaries between enterprises and their environment can and will shift, enterprises merge and diversify, new ways of doing business are introduced, etc. It is a misconception to think that being source data or derived data is a fixed and unchangeable property of the data.
3. Temporal characterization of derived data
The intuitive derivation formula z = ƒ(a, b, c, ...) does not include a notion of time [Roddick, Patrick'92]. Regrettably, 'it is often not clear whether momentary or permanent dependency is meant' [de Brock'95] (p.46). Indeed, it appears to be a common misconception that data that was derived at one time is always redundant [Batra, Zanakis'94] (p.235). A derivation may



hold at one moment in time while being inapplicable at other times, because derivability is frustrated if either some base data or the derivation formula is unavailable. Complications may arise because of mismatches in the time intervals during which base data and derived data are available in the data store. In addition, the derivation formula itself may change over time, frustrating derivability even further.
Let us call retention-time the time interval during which a derived data value z is available in the data store; we disregard the complication that, in general, time periods can be made up of multiple time intervals. Let T(ƒ) denote the time interval during which all base data as well as the associated derivation formula are available, i.e. the intersection of the retention-time intervals of the separate base data and of the software applications that implement the derivation. Applicability of the derivation formula then depends on two time relations: between Tfrom(z) and Tfrom(ƒ), and between Tuntil(z) and Tuntil(ƒ). Figure 63 depicts the four possible combinations of retention-time intervals:

Figure 63. Base-data and derived-data retention-time intervals. The figure relates the derived-data interval [Tfrom(z), Tuntil(z)] to the base-data interval [Tfrom(ƒ), Tuntil(ƒ)] on the time axis, yielding four combinations: is redundant, survives, precedes, is temporarily specified.

- 'redundant' is the first combination of time intervals, and only this combination is usually accounted for in the literature. The data can be completely reproduced from available base data, but the derived data is modeled in the CS for reasons of (semantic) clarity, completeness, or (technical) performance. Only when base data and derived data coexist can the derived data safely be eliminated from the CS and replaced by a conceptual query [de Brock'95] (p.135),
- 'survives' occurs frequently in operational systems. Data warehouses are an example: base data is summarized periodically, and the outcome is stored while the base data is deleted. Deletion of the base data affects not the semantics of the summarized data, but its status: the derived data becomes base data in its own right,
- 'precedes' is when derived data are known in advance. A useful application of derived data preceding its base data is in data transfer: summary data is transmitted first and then used to check correctness, completeness and consistency of the transferred data,
- 'temporarily specified' concerns base data that exists only during part of the lifetime of the derived data. This situation arises when a sound, or at least a credible, basis is needed to temporarily explain a complex data item. For instance, the calculation of a pension benefit depends on the marital status of the insured party, and the detailed information about the spouse is only temporarily needed to ensure accuracy of the marital-status data.
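The four combinations follow directly from the two endpoint comparisons. A minimal sketch, formalizing figure 63 under the simplifying assumption of single, numeric-timestamped intervals:

```python
def classify(z_from, z_until, f_from, f_until):
    """Relate the retention interval of derived data z to the interval
    T(f) in which all base data and the derivation formula are available."""
    start_ok = f_from <= z_from   # base data present when z first appears
    end_ok = z_until <= f_until   # base data still present when z is dropped
    if start_ok and end_ok:
        return "redundant"              # z is reproducible throughout its lifetime
    if start_ok:
        return "survives"               # base data is deleted first
    if end_ok:
        return "precedes"               # z is known before its base data exist
    return "temporarily specified"      # base data exist only during part of z's life
```

For example, a summary kept after its detail records are purged classifies as 'survives', and a checksum transmitted ahead of its data classifies as 'precedes'.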

7.4.3 Derived data in the case studies
Some examples of derivability in our case studies, for each of the three characterizations, are:
1. Derived data and the construct characteristic
In the Settlement case, the existence of the aggregate entities SUM-OF-REDUCTIONS and SUM-OF-DIVIDED-BENEFITS is derived from the DIVIDED-BENEFIT and BENEFIT-DEDUCTION entities once these are accommodated into the CS by changes (C), (D) and (E). Change (F) in the Settlement case inserts a derived reference. It is understood to be a reference between SEPARATED-BENEFIT and BENEFIT, but it is modeled as an optional reflexive reference on BENEFIT. Also, notice how some instances of a construct can be enterprise source data, while other instances are derived data. This is called hybrid [Olivé'01] (p.421), and the Benefit case study shows a fine example in the PARTICIPATION-TRAIL entity. Changes (P) and (Q) remove derived-data instances, and what remains is the source-data entity named PARTICIPATION-TRAIL-AS-OF-1-1-'96. Likewise, change (2) in the Franchiser case inserts the MARGIN entity, turning the BOUNDS entity into a hybrid.
2. Derived data and the business process characteristic
In the Settlement case, the entities DIVIDED-BENEFIT and BENEFIT-DEDUCTION are known to be derived from other information that was not captured in the CS. Thus, the level of derivability in the schema has not been maximally reduced, and further improvements are conceivable up to the point where all data in the CS is considered irreducible. In the Franchiser case, change (G) makes transaction details available in the CS, but the corresponding TRANSACTION-SUMMARY entity is not eliminated. In the Benefit case, the BENEFIT entity is derived by definition. From an enterprise point of view, pension benefits are always calculated from external source data such as employment, family relationships, pension scheme, etc.
It is evident in the case studies that the theoretical maxim that a CS must be free of all derived data cannot always be operationalized.



3. Derived data and the temporal characteristic
The Settlement case illustrates a temporal 'precedes' situation in the references DIVIDED-BENEFIT↑SEPARATION and SEPARATED-BENEFIT↑BENEFIT. These references are indispensable to temporarily record intermediate derivation results while the business process that should insert the corresponding base data has not yet been successfully completed. The designer should have justified the presence of such transitory data in the CS, but we feel that it was avoidable.
A combination of the 'redundant' and 'survives' situations is seen in the PARTICIPATION-TRAIL entity of the Benefit case. Most instances of its SUCCESSOR specialization are derived from related PARTICIPATION-TRAIL instances, except the SUCCESSOR instances timestamped at 1-1-1996. The base data for this specialization are missing, which is recognized in change (Q) that promotes this specialization to an independent entity. Another example is the CLAIM entity in the Franchiser case.
An example of the 'temporarily specified' situation is seen when change (E) in the Benefit case inserts the BENEFIT-DEDUCTION entity intended to record specifications of SEPARATED-BENEFITs. However, users regard the specification as optional, and it is not always entered in the database.

7.4.4 When elimination of derived data fails
The above examples illustrate the importance of the temporal characterization, as data that was derived at one time may lose its derivability later, because the source data is eliminated, the derivation formula is lost, or both. However, even if all temporal aspects are properly accounted for and a particular data item is fully redundant in the CS, it is not always eliminated instantaneously. Despite the theoretical best practice that advocates elimination, there are practical reasons and limitations why elimination fails, such as:
- legal obligations. BENEFIT is a derived entity in the Benefit case, but it resists elimination because Dutch law requires that all pension benefits be physically recorded,
- upward compatibility when a CS is changed. It prevents the need to tamper with legacy software accessing the derived-data entity, and enables a prolonged migration of source data into the changed CS,
- situational factors, when the cost and effort to eliminate derived data from a CS can exceed the expected return on investment, or
- organizational factors, when ownership of enterprise source data lies with another department and the derivation process transgresses organizational boundaries [Goodhue, Wybo, Kirch'92].

7.4.5 Recommended practice to cope with derived data
It is evident from the taxonomy that complete coverage of all aspects of the business process as well as the temporal characteristics is necessary to ensure that data that is derived at one time remains permanently redundant. Obviously, only fully redundant data can be eliminated from the CS without loss of information. Elimination of redundant data involves changing the CS while the UoD remains unaltered: definitions remain the same, and derivation formulas do not change. The only reason for change in the CS is to improve the handling of derived data. In doing this, the CS must evolve



in such a way that information chains are followed upstream, and derived data are replaced by their base data. By tracing information chains back to their ultimate beginnings, we find that we want to have the enterprise source data in our CS. At the same time, base-data retention-times must be extended so that the 'precedes' and 'survives' situations no longer arise.
Therefore, we recommend avoiding derived data from the outset by designing each CS as part of the global CS. In other words, we advocate designing each CS as an excerpt of the global CS by way of specializations. The global CS, in its turn, must be defined in such a way that the life cycles of all relevant real-world objects are entirely covered by the (extended life cycles of) enterprise source entities and specializations [Jones'82]. This broad scope for the global CS will prevent future problems with derived data in local CSs, as all the required enterprise source data will be available at the level of the global CS. The required derivation formulas should also be provided for, but we regard this more as an issue of application software engineering than of conceptual data modeling. This advice assumes that a global CS is available for this kind of use, which is not often the case. Even so, the CS design should be prepared with the above ideas in mind.
The information chain shows how the data is processed, and how the information system currently obtains and processes data. Figure 64 illustrates how the demarcations of information processing chains and the CS boundary can be improved without any material change in data semantics. The UoD and the scope of the information system should be demarcated in such a way as to prevent preprocessing and other gaps in the information chain:
- preprocessing takes place on external data before it is entered into the CS. This causes data definitions in the CS to deviate from the enterprise source data.
The Franchiser case shows a clear example where the UoD boundary shifts back and forth. TRANSACTION data are beyond the scope of the initial UoD, but an advance in technology brings that data within scope. The UoD boundary shifts back again due to a management decision in a later version.
- data is extracted from the information system, processed externally, and the results are entered again. The derivation process is performed beyond the scope of the CS, causing uncontrollable constraints in the data. The Settlement case shows a clear example. SETTLEMENT data are recorded in detail, but at first only SUM-OF-DIVIDED-BENEFITS derived-data instances are recorded. The corresponding source data entities are lacking in the 1994 CS version, to be appended in later CS versions.
To follow these recommendations, a rigorous process of data analysis is called for that involves at least the following activities:
- analyze for each construct in the CS how its data is obtained: either from outside the enterprise (this is fine), or by derivation from other data already available (this should be remedied). This step addresses the construct characterization,
- if derived data has been located, determine its base data and derivation formula. Iterate this step for all base data, following the information chains upstream and tracing the data back to their origin, i.e. to the enterprise source data [Moody'96]. This step determines the most natural and stable UoD boundary, and reduces the length of information chains, which is relevant in the business-process characterization,

- correlate the designated enterprise source data with the global CS, and identify the entity of which each data item is a special case. If a mismatch is discovered, e.g. in data granularity, then the CSs should be adjusted to ensure semantic homogeneity,
- finally, resolve temporal mismatches by extending the lifetimes of source data instances in the global CS, so that the derived data becomes redundant at all times.
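The iterative tracing step can be sketched as a walk over a derivation graph. The graph below is a simplified, hypothetical rendering of the Settlement case (the item names and dependencies are ours, for illustration only):

```python
# Each derived item maps to the base data it is computed from;
# enterprise source data have no entry in the graph.
derivations = {
    "sum_of_divided_benefits": ["divided_benefit"],
    "divided_benefit": ["settlement", "benefit"],
}

def enterprise_sources(item, graph):
    """Follow the information chain upstream to the enterprise
    source data that an item ultimately depends on."""
    if item not in graph:          # not derivable from anything: source data
        return {item}
    sources = set()
    for base in graph[item]:
        sources |= enterprise_sources(base, graph)
    return sources
```

Applied to the aggregate item, the walk ends at the items with no derivation entry, which mark the most natural UoD boundary for the CS.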

Figure 64. Cover the full information chain. The information chain runs from data entry through preprocessing, derivation, and external processing; a too-small UoD scope covers only part of it, whereas the maximal UoD scope coincides with the enterprise boundary.

Following the recommended practice and data analysis approach has many advantages for CS design and maintenance:
- CS stability and data independence are increased: the CS depends less on current derivation formulas and on implementation choices in business process information chains,
- changes in derivation formulas do not affect the CS, other than that new enterprise source data may be added to the CS,
- scope-setting for the UoD and information system boundaries is done independently from deciding how to implement the information chains, and
- the quality of the global CS and the potential for shared use of enterprise source data are both enhanced.

7.4.6 Resistance to change
Important business considerations may counteract the advocated best practice of avoiding derived data in the operational CS. To name some:
- data ownership and organizational responsibilities [Alstyne, Brynlofsson, Madnick'95],
- a global CS that is either unavailable or too abstract for use,
- cost of data acquisition. If enterprise source data are fragmented, have low data quality, or are otherwise unavailable for transaction processing, then the business manager may stick to some derived data rather than spend resources to improve the quality of the source data,
- lack of source data. However, to deal with the lack of data, we feel that provisions ought to be made at the global level, not locally. This will prevent later misunderstandings between departments that have compensated for the lack of data in different ways.

A further disadvantage of the advocated best practice is that the time needed for initial data analysis will increase substantially. This is generally the case if there is little insight into the overall structure of the data resources in the enterprise, e.g. if many legacy systems are in use. In such cases, the advocated design practice requires part of the global CS to be analyzed and documented, a huge task that can vastly exceed the time and budget limitations of the assignment.

7.4.7 Discussion
Many authors contend that derived data in the CS is unacceptable, for various theoretical reasons. The reasoning may be legitimate, but it overlooks the important difference between data items that are permanently redundant, and data items that are obtained by derivation where the derivation cannot be reproduced later.

Our case studies show that the long-term stability of CSs is degraded by derived data. Surely, if CSs could be designed free of derived data, the threat of decreased stability would not exist. In particular, temporal database theories seem promising to alleviate problems of derived data in the CS [Jensen, Snodgrass'96]. However, the cases also show some important practical limitations in eliminating derived data. We must accept that derived data is a fact of life in business practice that engineers must cope with.

From our involvement in the case studies, we learned that whether an entity or attribute is derived, and whether it is dispensable in the CS, can be heavily debated. However, we believe that the time and effort spent in the initial analysis will be more than compensated for when subsequent improvements become necessary. The trade-off is therefore to do rigorous data analysis at design time, when knowledge is still fresh in mind, or to postpone it and be forced to do it later in maintenance, when that knowledge has eroded and the impact of change is greatly increased.


7.5 BEST-PRACTICES FOR CS EVOLUTION

7.5.1 Expanding the current state of the art
This section advocates best practices for the business engineer engaged in maintenance of an operational CS. When faced with structurally new requirements, engineers often start by creating a new Internal Schema that meets the new requirements. Regrettably, the issue of converting the available store of data into the new schema is rarely addressed at this time. Even less attention is paid to formulating the change demands at the CS level, and to deciding on the best possible way to accommodate the change into the CS so as to ensure a graceful CS evolution.

To provide some assistance in the task of accommodating a semantic change into the CS, we present a list of best practices founded on our experiences from the case studies. The list is intended to add to the current body of knowledge, not to replace it. Our aim is to enable the maintenance engineer to achieve a higher level of flexibility in, and a more graceful evolution of, the evolving CS. As we want to provide the engineer with clear instructions on how to enhance flexibility in a CS, the best practices are phrased as simple do's and don'ts.

7.5.2 A dozen best practices
1. Provide maximum compatibility
- Change the CS by evolving it, rather than by designing it from scratch. Evolution makes for fewer design errors and superior compatibility with the previous CS.
- Aim for compatibility from the start; do not address data conversion as an afterthought.
- Use a methodical approach to ease schema evolution [Wedemeijer'00may].
- Select the most appropriate semantic change pattern that meets the user requirements, and deploy it in the CS.
- Provide for compatibility by means of automatic data conversion. If manual conversion cannot be avoided, limit the scope of incompatibility to a single entity. Reduce the trauma of change by providing some practical work-around (see chapter 6).
- Finally, do not let compatibility, which has precedence in the short term, prevail over the importance of correct modeling in the long term.

2. Model entities at a high level of abstraction
- Model comprehensive generalizations, and aim to capture the essence of data. A more abstract CS has fewer constructs and constructions, so less can change, and it is simpler and easier to understand in its entirety.
- Capture the entire life cycle of the real-world objects, assuming that entity instances will have eternal life. Consider restrictions on instance life cycles to be an implementation issue.
- Specify the current status of an entity through specializations and conceptual references.
- Model new constructs at an equally high level of abstraction as the existing ones.

3. Extend entity intent rather than append a new entity
- Attempt to use extensions of entity intents first, thus utilizing the potential for change that the available constructions in the CS have to offer.
- Do not confuse extensibility with flexibility. Prevent unnecessary growth of the CS. Insert new constructs only when modeling a new way of doing business.
- Do not model a specialization of an existing generalization by appending an entity to the CS. It may be easier to do so, but it diminishes the level of abstraction and increases future resistance to change. Only consider modeling a specialization as an independent entity if it is both dominant and permanent. This may prevent a later Promote-a-Specialization pattern.
- When in doubt, include the entity history in the CS. If an entity is time-aware, then its member entities are probably time-aware as well.

4. Safeguard the level of abstraction in the CS
- Work to counter the tendency towards more concrete ways of modeling. A lower level of abstraction results in a CS that is larger, more fragmented, less understandable, and harder to maintain.
- Be consistent in the use of keys. In specializations, use the primary key that is implemented for the entire generalization.
- Regard the information system as an indivisible unit that engages in data exchange with its environment.
- Do not model the components of the information system or user organization in the CS. The separate components are often immaterial to the information structure of the UoD. Modeling these components will drive unjustified change in the CS.

5. Include all objects in the UoD that may act as owner entity in the CS
- Consider all real-world objects in the current UoD, and analyze which other objects may act as a determinant, classifier, hierarchical superior, etc.
- Ascertain the best and most natural demarcation for the UoD.
- Prevent a later need to superimpose an entity, or to move attributes out into a superimposed entity.

6. Model enterprise source data, not the derived data or the information chain
- Determine the appropriate enterprise source data and capture it in the CS.
- Do not foster entities or references in the CS to model derived data or to support derivation.
- Work towards eliminating derived data by reducing the information chains to their minimum length, preferably to zero length.
- Analyze information chains by considering the construct, business-process, and temporal characterizations of derived data.
- Model the enterprise source data in the CS, but not the derivation process.
- For each existing derived-data construct, establish under what conditions it may be superseded by its corresponding source data. Then, work towards fulfilling those conditions.

7. Minimize multiple connections; then capture the connection constraint
- Capture and document the semantics of each connection constraint extensively, whenever an entity connects with other entities by multiple connection paths (transitive closure).
- Beware of implicit redundancy: can each construct instantiation change independently from all other instances?
- Analyze whether some connection is redundant, and eliminate it to prevent data retrieval errors, the so-called 'connection traps' [Date'00].
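Detecting entity pairs connected by multiple reference paths can be automated with a simple path count over the CS reference graph. The sketch below is illustrative only; the schema and all entity names are invented for the example.

```python
# Illustrative sketch: flag entity pairs connected by more than one
# reference path. Schema and entity names are invented; references run
# from member entity to owner entity, and the graph is assumed acyclic
# (no reflexive references).
REFERENCES = {
    "ORDER-LINE": ["ORDER", "PRODUCT"],
    "ORDER":      ["CUSTOMER"],
    "PRODUCT":    ["CUSTOMER"],
}

def count_paths(member, owner, refs):
    """Count the distinct reference paths leading from member to owner."""
    if member == owner:
        return 1
    return sum(count_paths(nxt, owner, refs) for nxt in refs.get(member, []))

# ORDER-LINE reaches CUSTOMER both via ORDER and via PRODUCT: two paths,
# so the semantics of the connection constraint should be documented.
print(count_paths("ORDER-LINE", "CUSTOMER", REFERENCES))  # prints 2
```

Any pair with a path count above 1 is a candidate for either a documented connection constraint or elimination of a redundant connection.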

8. Do not model 1:1 cardinality for references
- Restructure a 1:1 reference either as an is-a reference, in particular if the reference is optional, or as an N:1 reference that is currently constrained by some business rule.
- Look for a generalization if a real-world object is seen to migrate, i.e., it is represented first as an instance of one entity, and then as an instance of another entity. Include the missing generalization in the CS.
- If a reference has N:1 cardinality that is currently constrained to 1:1, model it as N:1 in the CS. Have the cardinality constraint implemented as a software routine. This will minimize the impact of change in case of a future Relax-a-Reference pattern.
- When in doubt, direct a reference to the lowest owner in the CS hierarchy. The Redirect-a-Reference-to-an-Owner pattern is a compatible change, but to redirect a reference to a member entity is an incompatible change, as every link has to be reestablished.
- Do not eliminate an entity if it has both member and owner entities. The entity itself may be irrelevant, but it provides a connection that may still be vital in the CS structure.

9. Avoid reflexive references
- Attempt to use a PERIOD-OF or RELATIONSHIP-BETWEEN entity to capture the semantics of the reflexive relation in the real world. A reflexive reference is not necessarily wrong, but it is more likely to change than a regular reference.
- Examine core entities for the existence of a reflexive relationship, as a relation between the modeled real-world objects is often known and noted by the users even if the CS design does not provide for it.

10. Keep change in lattice complexity small
- Strive for semantic changes in the CS that alter lattice complexity by at most 1.
- Have complex user requirements decomposed into separate semantic change requests. This makes the change process more realistic and more manageable. The structural changes will be easier to accommodate into the CS lattice.
- If lattice complexity increases, define the semantics of the new connection constraint, and account for its impact of change on existing data.
- If lattice complexity decreases, an existing constraint disappears. Beware that this may permit the recording of previously excluded instances.

11. Avoid temporary solutions and change successions
- Do not accept a temporary solution in the CS. It calls for a subsequent permanent solution, causing subsequent unjustified change by way of change successions (see chapter 6).
- Do not let restrictive demands, for compatibility or otherwise, prevail over correct modeling in the long term. Temporary solutions reduce CS understandability.
- Permit redundancy in the CS only if properly controlled by business rules and constraints. Prevent the 'survives', 'precedes', and 'temporarily specified' situations from arising.


12. Take care of documentation
- Depict the CS diagram according to the CS hierarchy. Arrange the entities first in one direction, according to the CS hierarchy, then in the other direction, so that there are as few line crossings as possible. We prefer a top-down presentation [Möller, Wiese'96], although others prefer left-to-right [Hay'95] or bottom-up arrangements [Ter Bekke'93].
- Beware of unnoticed changes in entity intents, in particular if constraints are dropped. A change of entity intent turns its name into a homonym, as it comes to stand for different concepts.
- Consider renaming a changed entity to avoid misunderstandings. An example of confusing names is change (Q) in the Benefit case.
- Document the conceptual constraints extensively. If the set of constraints is not adequately documented, it cannot be checked for completeness and consistency, which may threaten the quality of the static CS, and its adaptability in the long term.
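Practice 10 refers to lattice complexity. One plausible formalization, sketched below, takes it as the number of independent cycles in the undirected schema graph, the cyclomatic number E - V + C; this is an assumption for illustration, since the thesis develops its own metric in chapter 4.

```python
# Hedged sketch: lattice complexity taken as the cyclomatic number of the
# undirected schema graph (edges - entities + connected components). The
# formalization is an assumption; chapter 4 defines the metric precisely.
def lattice_complexity(entities, references):
    # references are (member, owner) pairs; self-references are ignored
    edges = {frozenset(r) for r in references if r[0] != r[1]}
    seen, components = set(), 0
    for entity in entities:              # count connected components
        if entity in seen:
            continue
        components += 1
        stack = [entity]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            for edge in edges:
                if node in edge:
                    stack.extend(n for n in edge if n != node)
    return len(edges) - len(entities) + components

# A tree-shaped CS has complexity 0; one extra reference closing a cycle
# raises it by exactly 1, in line with practice 10.
tree = [("A", "B"), ("B", "C")]
print(lattice_complexity(["A", "B", "C"], tree))                 # prints 0
print(lattice_complexity(["A", "B", "C"], tree + [("A", "C")]))  # prints 1
```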

7.5.3 Balancing the dimensions
Our framework for flexibility is made up of three dimensions: environment, timeliness, and adaptability. At first glance, the dimensions appear to be unrelated, which means that efforts to achieve a flexible CS must address each of the dimensions:
- the environment dimension, to produce the best possible CS,
- the timeliness dimension, to produce a CS that can be changed as fast as possible,
- the adaptability dimension, to produce a CS that requires the least effort to be changed.

On closer inspection, however, we suspect that the dimensions are interrelated. In the course of the Sites and Franchiser case studies, we saw the enterprise modernize its project management methods and its database technology. Consequently, the ease of propagating CS changes to other information system components was significantly improved. At the same time, we saw the levels of abstraction in the CS decrease in these cases. The lower levels of abstraction make for larger, but simpler, CSs that are easier to understand for maintenance engineers. In the Benefit case, a gradual rise in the turbulence of the external environment was experienced. At the same time, the size and complexity of the CS were significantly reduced by a deliberate change in modeling guidelines, resulting in an improved level of CS adaptability.

Apparently, a reduction of CS flexibility in one dimension is compensated for by improvements in other dimensions. The other way around, enhancements in one dimension imply that the engineers can be lax in another. Business practitioners are pressed to create quick and correct solutions, but they are less compelled to follow theoretical guidelines if there is no immediate return on investment in adhering to them. Of course, this means risking that some potential long-term benefit may remain beyond reach, but this concern is overridden in the short term.
A typical remark underpinning our observation is made by [Aiken, Yoon, Leong-Hong'99] who write that to 'ensure high levels of quality in DoD data, the department needs to adopt a long-term perspective to data engineering and not allow pressure for short-term results to undermine the value of the effort' (p.166). Thus, we speculate that the goal of CS maintenance is not to maximize all three dimensions of flexibility as a potential for future change. Rather, there is a trade-off between the dimensions, and the engineer must strike a balance between the three dimensions.


7.5.4 Discussion
Modeling guidelines are general directives on how the CS should be modeled and maintained to ensure its consistency and quality over time [Rosenthal, Reiner'94]. Listings of advocated best practices, intended to improve and safeguard the quality of CS designs, can be found in, for instance, [Batini, Ceri, Navathe'92], [Wiederhold'95], [Moody, Shanks'98], [Schuette, Rotthowe'98], and [Thalheim'00]. Nevertheless, there exists no comprehensive list of accepted best practices on how to evolve a CS when a change is commanded. Moreover, it must be realized that methods that support the design, build, test, and implementation of novel CS designs need not be appropriate to support the maintenance of a CS in its operational life cycle phase.

Evidently, modeling guidelines are not part of the CS. Even so, the CS itself and its evolution will be affected by the guidelines, and even more so by changes in the guidelines. Examples of the detrimental effect of changing guidelines on CS evolution are seen in the Benefit case, where guidelines are explicitly changed to eliminate derived data and curb the rise in CS complexity. An implicit change in modeling guidelines is seen in the Sites case, where the high level of abstraction is gradually replaced by a concrete way of modeling. Maintenance engineers are expected to follow the advocated guidelines, which may be either implicit or explicit, and which have a more or less formal status in the business environment.

[Premerlani, Blaha'94] contend that 'database designers, even the experts, occasionally violate rules of good database design and often employ unusual constructs' (p.49). We are convinced, however, that current modeling guidelines are an insufficient means to achieve the best quality. The current state of CS design and maintenance as an engineering discipline has much in common with the creative arts, and good craftsmanship is to be preferred above strict but unimaginative adherence to whatever set of rules [Moody'95].


7.6 CONCLUSIONS

7.6.1 Summary
This chapter analyzed long-term trends in the four evolving CSs in our case studies, using the operational metrics developed in chapter 4. Important findings are:
- almost half of the changes are unjustified; and of the justified changes, most are driven by internal causes,
- most, but not all, semantic changes are limited in size, affecting 6 constructs or fewer,
- with few exceptions, all new CS versions are compatible with the previous CS version,
- a small majority of semantic changes are extensions, i.e., they add new constructs or extend existing ones; just under half of the changes modify the CS in other ways,
- most changes alter the complexity of the CS lattice by at most 1,
- of every 3 semantic changes, one has no impact on pre-existing entities, one affects only a single entity, and one impacts several entities at once,
- type preservation is predominant in CS changes, but not exclusively so,
- references in the CS appear to be more stable than the entities themselves, although the less common reflexive references and references with 1:1 cardinality are less stable,
- the rule is no change in candidate keys; the exception is a change in the composition of keys.

We investigated the effect of CS aging, and concluded that the main effect of age is in the adaptability dimension of flexibility: the CS becomes less adaptable over time, or, put differently, its rigidity increases. We reached this conclusion by combining trends in CS evolution: increasing CS size, the declining number of semantic changes being accommodated, the decreasing level of abstraction in the CS, and the initial increase in lattice complexity that is curbed later.

Next, we analyzed how the notion of derived data evolves in the CSs. We presented a taxonomy for derived data that, to the best of our knowledge, has not been described before. As a best practice, we recommend designing and maintaining every CS as part of the global CS.
The experiences, trends and tendencies in the four evolving CSs are translated into a dozen operational practices that maintenance engineers can readily apply. The advocated practices are founded on our experiences of evolving CSs, rather than on theoretical opinions.

7.6.2 Concluding the Practical Track
Chapter 1 stated the second objective of our research as:
- to explore long-term evolution of CSs in the ongoing business, in order to disclose operational practices and experiences, and to outline the implications for flexibility of CS designs.

This research objective rephrases the two main issues that motivated our research: first, to observe and understand the natural evolution of real CSs, and then, to assess the level of CS flexibility in view of that natural evolution. The Practical Track answers these issues. Long-term evolution of CSs in the ongoing business was explored in chapter 5 by way of four case studies. Chapter 6 extracted some of the patterns and operational practices characteristic of short-term CS evolution. Chapter 7 analyzed long-term trends in CS evolution by means of the eight operationalized metrics developed in chapter 4. Additionally, we drew on our newly gained insights into long-term CS evolution to present best-practice recommendations for achieving a graceful CS evolution.


SYNTHESIS


CHAPTER 8. SYNTHESIS

'(..) it occurred to me, in 1837, that something might perhaps be made out on this question (i.e., of the origin of species) by patiently accumulating and reflecting on all sorts of facts which could possibly have any bearing on it. After five years' work I allowed myself to speculate on the subject'
From 'The Origin of Species', [Darwin1859], Introduction

8.1 INTRODUCTION

8.1.1 Introduction
This chapter presents a synthesis of the theoretical concepts of CS flexibility discussed in the Theoretical Track, and the experiences of long-term CS evolution gained in the four exploratory case studies described in the Practical Track. We draw conclusions from our exploration of long-term evolution in CSs, and discuss the validity and usability of our results.

Maybe you read all previous chapters in chronological order. Or perhaps you read the chapters of the Theoretical Track, but skipped the Practical Track. Or you ignored the theory, and explored the exploratory results instead. If, however, you started reading from here, without even reading the introduction, you had better turn back to chapter 1, as we will not repeat the introduction here.

8.1.2 Chapter outline
The Practical Track produced numerous results and insights into CS evolution. Section 2 discusses the important topic of validity: why we expect that similar longitudinal research will yield similar results. Section 3 reviews the framework for flexibility that we developed in chapter 3. This framework for flexibility is a valuable contribution to the state of the art in information modeling, and we argue that adequate proof of concept has been delivered. Section 4 revisits the design strategies discussed in chapter 3 by inspecting the evidence on CS evolution analyzed in chapters 6 and 7. Section 5 conjectures five Laws of Conceptual Schema Evolution that summarize our experiences with long-term evolution. Section 6 concludes the chapter and indicates directions for further research.


8.2 VALIDITY OF THE EXPLORATION RESULTS

8.2.1 Empirical research
Our case study approach is an empirical inquiry into Conceptual Schema evolution within its real-life context. While the notion of Conceptual Schema has a clear and well-accepted definition in theory, its long-term evolution in the business environment has not been well researched. Our research contributes insights into characteristic phenomena of CS evolution, laying the groundwork for formulating hypotheses on CS flexibility. However, we do not presume to deliver statistical proof of some hypothesis on CS flexibility that generalizes to all kinds of operational schemas.

In selecting the cases, we aimed for diversity. Representativity, however defined, was not an issue. The four cases that we selected represent ordinary, realistic CSs; the selection criteria do not interact in such a way that only immaculately designed CSs happen to be selected. The CSs in our case studies share a number of characteristics:
- the CSs support transaction-based, data-intensive administrative enterprises,
- each CS models a single business function UoD only,
- all CSs are founded on relational data model theory and the 3-Schema Architecture,
- the CSs have moderate sizes, with between 3 and 30 entities, and
- the CSs always ran in a single version at a time.

We focus on the operational life cycle phase, rarely venturing into other life cycle phases. The operational life span of the CSs in our case studies was never less than 5 years, and the UoDs and business functions being modeled are not remarkably turbulent. Therefore, our results will have a bias towards long-lived, and hence steadier, environments. To some extent, the CS evolutions are evidence of the levels of expertise and experience of the maintenance teams. However, as team members come and go, we expect this effect to cancel out.
Although we cannot claim statistical significance due to our small sample size, we do believe that our results show realistic trends in CS evolution. We consider it likely that studies of similar CSs evolving in similar enterprises will show similar results. It is important to realize that validity of our results is limited to CSs that share the above characteristics. Further research is required to extend these conclusions to other kinds of evolving CSs.

8.2.2 Threats to validity
Several threats to validity can be identified for our case study research:

1. Case verification
We collected raw documents 'on the scene', and had to deal with all of their shortcomings. In particular, we encountered several ambiguities when attempting to reverse-engineer CSs from the original Internal Schema documentation. We resolved these interpretation problems by relying on our intimate knowledge of the enterprises. To prevent misinterpretations, we verified the case descriptions by consulting knowledgeable business engineers involved in maintenance of the various CSs. Their comments corroborated the CS evolutions as we describe them. Interestingly, on many occasions we found that more time was spent on explaining the history of the evolving CSs than on discussing and validating it.

2. Intermediate CS versions
In collecting documentation, we may have overlooked some intermediate CS version. We took meticulous care to reconstruct the overall sequence of CS versions in its correct order. Therefore, short-term details of changes may have gone unnoticed, but such an oversight does not affect the long-term trends and tendencies in CS evolution. Overlooking an intermediate CS version means that wrong measurements for the life spans of CS versions will be reported. This is a threat to the validity of a rate-of-change metric. However, we did not develop this metric, nor study its characteristics. We did refer to version life spans in figure 54, but this did not result in a definite conclusion.

3. Lack of homogeneity
It may be claimed that we investigated CSs that are fundamentally different with respect to flexibility or long-term evolution. However, this objection can be made in retrospect only. Naturally, the cases are not the same and the evolutions differ. Nevertheless, the differences do not signify that the enterprises, or the business information systems that we investigated, are incomparable. The notion of CS flexibility, and the goals and intentions of the CS evolutions, are alike in all cases.

4. Reconstruction of the CS
We conduct our research within the 3-Schema Architecture, which demands a clear distinction between Universe of Discourse, Conceptual Schema, and Internal Schema. To ensure a clear distinction between these separate units in the architecture, we employed a single data model theory throughout the cases. This essential theory (see appendix A) permits only conceptual constructs to be modeled.
We spent an inordinate amount of time getting the documentation organized, and reconstructing and verifying each separate CS version so that it is free of implementation details. Other researchers engaged in reverse engineering also mention this as a sideline [Andersson'94].

8.2.3 Lessons learned
1. Documentation deficit
There is no such thing as a fully documented and immaculately implemented CS. Documentation deficit is a problem that has long been recognized [Fox, Frakes'95]. To reconstruct the CS at the correct conceptual level, we used whatever material was at hand. We were always faced with documentation varying widely in completeness, conceptual level, and clarity. Summarizing our experiences, we learned that:
- no matter how much time you spend on understanding the documentation, there is always one more real-world aspect that you have not yet understood,
- many reports that are not meant to document the CS do a good job of clarifying it; many documents that are intended to document the CS do a poor job of it, and the use of a state-of-the-art repository tool is no guarantee of good documentation, and
- poor documentation does not signify inferior quality of the CS, nor does good documentation guarantee high quality of the CS.

196

Exploring Conceptual Schema Evolution

2. There is more to the CS than the documentation
There is no such thing as a high-quality, flawless CS. We had to grasp the ideas behind the written documentation to understand the evolution of the average CS, and we learned that:
- no matter how much effort is spent on making the best possible CS, there are always some constructions that are not up to standard, and that are liable to be improved in some future CS version,
- no matter how strict the modeling guidelines are, there are always other ways of modeling [Moody'95],
- conceptual constraints are never captured in full detail. Paraphrasing [Premerlani, Blaha'94], we suspect that in some cases it is impossible to describe a complete and consistent set of constraints, because it never existed, and
- enterprises rely on the few experts who still know and understand the CS and all its intricacies, more than on the formal documentation.

3. There is more to CS evolution than theory contends
Overall, we experienced a gap between the theory of data modeling and the practical application of modeling techniques. The academic point of view is that practitioners often deviate from good modeling practices and thereby cause quality degradation of the CS. The practitioners' view is that theoretical guidelines are fine for simple, straightforward models, but inadequate to handle real business needs and user requirements. Others have ventured similar opinions, e.g. [Reiner'91], [Halassy'91], [Gerard'94], [Hitchman'95].


8.3 REVIEWING THE FRAMEWORK FOR FLEXIBILITY

8.3.1 Introduction
This section discusses the implications of the long-term trends in operational CSs for the dimensions of the framework introduced in chapter 3. We also draw conclusions that may have consequences for current theories and best practices in information modeling. Chapter 4 elaborated the framework by formulating hypotheses about CS evolution and developing metrics to assess CS evolution (see figure 65, copied from figure 14). These metrics were applied in the four case studies of CS evolution outlined in chapter 5. Chapter 7 elicited long-term trends in the metrics. The analysis produced a result, either affirmative or negative, for some of the hypotheses. We could not study all hypotheses in detail, for the reasons explained in chapter 4.

8.3.2 Overview of outcomes on hypotheses

hypothesis                                                    outcome

environment dimension
1. every change in the CS is justified                        not confirmed
2. every CS change will be proportional in size to the        undecided; no measure of
   change in UoD information structure that causes it         'severity of change'
3. the rate of change in the CS will be proportional to       no metric; no measure of
   the rate of change in the UoD information structure        'turbulence in UoD'

timeliness dimension
4. the rule is compatible change, the exception is            confirmed
   incompatibility at specific places in the schema
5. the rule is schema extension, the exception is             not confirmed
   modification of existing constructs

adaptability dimension
6. a more complex CS will change less frequently              not confirmed
7. a more abstract CS will go through fewer changes           no metric; no measure of
                                                              'level of abstraction'
8. some types of construct in the CS are more susceptible     partially confirmed; no data
   to change than other types of construct                    on attribute or constraint
9. the rule is no change in candidate-key compositions,       confirmed
   the exception is change in some candidate key
10. a single UoD change will cause change in only a           no metric, for want of
    single CS module                                          sound modularization
11. modules in the CS are stable                              no metric, for want of
                                                              sound modularization

Table 4.


Exploring Conceptual Schema Evolution

We were unable to develop a metric for hypothesis 7 due to the absence of a reliable measure for 'level of abstraction' in the CS. However, based on a qualitative discussion, we believe the hypothesis to be not confirmed. Hypothesis 8, on susceptibility to change, was only investigated for the entity and reference constructs. The attribute and constraint constructs remained beyond the scope of our research, and we have no data on the evolution of these constructs. In the comparison of the entity and reference constructs, we found that the reference construct is less susceptible to change than the entity construct, which is a partial confirmation of the hypothesis.

[Figure 65. Operational metrics associated to the guidelines in the framework for flexibility. The figure maps each dimension and its guidelines to the operationalized metrics: environment ('select the best scope', 'capture essence of UoD') to the metrics 1. justified change and 2. size of change; timeliness ('minimize impact', 'ease change propagation') to 3. compatibility and 4. extensibility; adaptability ('keep it simple', 'use abstraction layering', 'model each feature once', 'provide clusters') to 5. lattice complexity, 6. susceptibility per entity, 7. susceptibility per reference, and 8. preservation of identity.]

8.3.3 Environment dimension

Theory contends that the CS should model the information structure of the UoD, the whole UoD, and nothing but the UoD. Unjustified change, i.e., a semantic change in the CS that is not driven by a corresponding change in the UoD, is therefore regarded as a mark of inferior design. Semantic changes that affect many CS constructs at once are also regarded as indicators of bad design. We associated two operational metrics to the guidelines involved in the environment dimension: the metrics for justified change and for size of change, respectively. The analysis in the Practical Track brings out that almost half of the CS changes are unjustified, with unjustified changes occurring in all cases. We also find that the vast majority of CS

Synthesis


changes are limited in size. Only some 5% of semantic CS changes affect more than 6 constructs, but these are all justified changes. Are we then forced to conclude that we investigated only inferior CS designs? We think not: the findings cannot be explained away by allegations of bad design. Rather, we believe it is the underlying hypothesis that falls short of the mark. Apart from the information structure of the UoD, the CS captures other features as well. Those other features are liable to change, and such changes are the main cause of unjustified changes.
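To make the bookkeeping concrete, a 'size of change' metric in the spirit of the one discussed above could be sketched as follows. The set-based schema representation and the construct names are our own illustration, not taken from the case studies:

```python
def size_of_change(before, after):
    """Count the entity/reference constructs affected by a semantic change.

    before, after: sets of construct identifiers of two consecutive CS
    versions. Constructs present in only one of the two sets were added or
    deleted; in-place modifications would need separate bookkeeping.
    """
    return len(set(before) ^ set(after))  # symmetric difference

# Hypothetical change: the reference POLICY->CLIENT is replaced by a new
# PARTICIPATION entity with its own reference; three constructs affected.
v1 = {"POLICY", "CLIENT", "POLICY->CLIENT"}
v2 = {"POLICY", "CLIENT", "PARTICIPATION", "POLICY->PARTICIPATION"}
```

In this sketch, `size_of_change(v1, v2)` yields 3, which would classify the change as limited in size under the 6-construct threshold mentioned above.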

8.3.4 Timeliness dimension

The timeliness dimension is concerned with the synchronization of change drivers in the environment and the changes accommodated into the CS. The user community rarely demands that CS changes coincide exactly with changes in the UoD, and some lag time is always allowed for. Thus, synchronization is a relative concept, not an absolute one. The timeliness dimension is subdivided into two guidelines. However, the guideline to ease change propagation to other components of information systems is beyond the scope of our research, and we can only discuss our findings regarding the guideline to minimize the impact of change.
The analysis in the Practical Track brings out that minimizing the impact of change is a predominant consideration in the accommodation of CS changes. Only some 5% of semantic changes require manual adjustments to the stored data, and these adjustments concerned a single entity only. In addition, over 60% of changes extend the CS, either by adding new constructs to the CS lattice or by extending the intent of existing ones. Moreover, we found that CS changes never affected the validity of stored data. Validity of a stored data item depends on how the value is defined and measured in the real world, but it is clearly independent of how it is modeled in the CS and stored in the database. Thus, constructs in the CS are abstractions of the real-world concepts, and not the other way around, with CS constructs determining what users ought to perceive 'out there'. From these observations, we conclude that minimizing the impact of change is a predominant business consideration in the accommodation of CS changes, and maintenance engineers should be well aware of this.

8.3.5 Adaptability dimension

Adaptability is the ease with which CS constructions can be changed to accommodate new user requirements. It is concerned with how the requested changes are to be accommodated into the CS. Of the three dimensions, this one is almost exclusively the responsibility of the maintenance engineer. We associated four operational metrics to the guidelines in the adaptability dimension: lattice complexity, susceptibility to change per entity and per reference, and preservation of identity, respectively.
We developed the metric of lattice complexity to investigate the effects of complexity on CS evolution. Essentially, the metric counts the excess number of references in the CS lattice over the number of entities. The metric does not capture other aspects that CS complexity may depend upon. The Practical Track shows that some 90% of semantic changes have


little or no effect on lattice complexity. A second important finding is that lattice complexity is not an important determinant of CS version life span; in other words, a simple CS is not changed more or less often than a more complex CS. By implication, the complexity of the semantic change itself is more important for maintenance than the lattice complexity of the entire CS. This also reflects the way of working when semantic changes are accommodated into the CS: a change is made locally, to the affected constructs only, and it is indifferent to the size, quality or complexity of the entire CS. Literature on the art of data modeling emphasizes the importance of creating simple and understandable CS designs. For the operational life cycle phase however, we conclude that the emphasis should be on making the semantic changes simple and understandable.
A large and detailed artifact such as the CS is easier to understand and to maintain if it consists of several layers of abstraction. As our research is restricted to the single abstraction layer composed of the entity and reference constructs, we cannot investigate CS flexibility across different abstraction layers. An implicit assumption regarding abstraction layering is type persistence: once a real-world feature is modeled by one type of construct, the same type of construct will model it forever. The Practical Track demonstrated several counterexamples to type persistence. Therefore, to meet realistic expectations, theories on CS evolution should not assume type persistence.
Some modeling methods assume that a single construct or construction in the CS captures one relevant feature of the UoD. The corollary is that a change in that single CS construct will accommodate a change in the real-world feature. This appears to be incorrect because, of the 73 semantic changes, approximately 25% affect more than one entity.
Thus, it can be attempted to model each feature once, but it cannot be presumed that the attempt will be entirely successful. Finally, for the guideline to 'provide clustering in the CS', no metric was developed, mainly because of the lack of generally accepted and implemented methods to modularize the CS. However, we believe that maintenance on the CS is not concerned with modules at all. Changes are made locally to all the affected constructs in order to accommodate a semantic change, and engineers do not care for a modular decomposition in doing maintenance.
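A minimal sketch of the lattice complexity metric as characterized above, counting the excess of references over entities in the CS lattice; the schema representation and entity names are our own illustration:

```python
def lattice_complexity(entities, references):
    """Excess number of references in the CS lattice over the number of
    entities, a rough indicator of how far the lattice departs from a
    simple hierarchy. Other aspects of CS complexity are not captured.
    """
    return len(references) - len(entities)

# Four entities joined by six aggregation references give complexity 2.
entities = ["ORDER", "CLIENT", "PRODUCT", "INVOICE"]
references = [("ORDER", "CLIENT"), ("ORDER", "PRODUCT"),
              ("INVOICE", "ORDER"), ("INVOICE", "CLIENT"),
              ("INVOICE", "PRODUCT"), ("PRODUCT", "CLIENT")]
```

On this sketch, a change that adds one entity together with one reference leaves the value unchanged, which matches the observation that most semantic changes have little or no effect on lattice complexity.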


8.3.6 Proof of concept of the framework for flexibility

In this thesis, we proposed to use a simple and intuitive working definition of flexibility:

    the potential of the Conceptual Schema to accommodate changes in the information structure of the Universe of Discourse, within an acceptable period of time

A fundamental problem of this definition is that we must look towards future events in order to assess the level of flexibility in a CS. Our approach to overcome this fundamental problem was to study past events instead. Based on the working definition of flexibility, we introduced a novel framework for flexibility consisting of three dimensions and eight guidelines. This framework enabled us to assess the main characteristics of CS flexibility. We consistently applied the framework for flexibility in both the Theoretical Track and the Practical Track of this research. By doing so, we have shown that the framework is a valuable and appropriate instrument to assess the flexibility of operational CSs:
- the framework is based on the working definition of flexibility, and we derived useful operational metrics for its dimensions and guidelines,
- the framework captures the basic ideas on flexibility inherent in today's design strategies in a satisfactory way, and
- the framework is instrumental in understanding and interpreting the evolutionary behavior of operational CSs.
In our opinion, we have delivered adequate proof of concept for the framework for flexibility. The three dimensions and eight guidelines provide us with appropriate and potent instruments to explore and understand Conceptual Schema evolution. By implication, the adequacy of our working definition of flexibility is demonstrated.

8.3.7 Outlook

The framework offers a sound basis for future work. For theoreticians, the outlook is to ascertain the theoretical foundations that the dimensions and guidelines are based on, and to find evidence for causal relationships between these foundations and the level of flexibility that is achieved in an operational CS. For business practice however, the framework as it stands is by no means finished. The outlook is therefore to enhance it and to develop from it a tool to assess the level of flexibility in an operational CS in hindsight, or even better, in foresight.
1. Develop metrics for all guidelines
We defined several metrics, but we did not develop a comprehensive set of instruments that enables us to investigate all aspects of Conceptual Schema evolution. In particular, we were unable to operationalize metrics for several flexibility hypotheses of a comparative nature. For example, we hypothesized that a more abstract CS will undergo less change. A confirmation of such hypotheses requires a longitudinal study of CSs modeling the same UoD but differing in their levels of abstraction, however defined. Similarly, we studied the guideline to 'keep it simple' only by the lattice complexity metric. Other notions of complexity are discussed in the literature that we did not develop metrics for,


e.g., complexity regarding specializations and is-a inclusion hierarchies. Theoretical as well as practical research is required to operationalize metrics that will cover all guidelines of the framework in a satisfactory way.
2. Enhance the framework
Apart from the information structure of the UoD, the CS captures other features as well. Those other features are liable to change, and such changes are the main cause of unjustified changes. In our research, we did not ascertain what these features might be. We also noticed that some CS changes have a surprisingly large impact, which is in need of explanation. It calls for further research to clarify the underlying mechanisms, and to firmly establish cause-and-effect relationships between these characteristics and the guidelines of the framework. Further explorations may also indicate the need to expand the framework. Additional guidelines or dimensions may be identified that our research did not disclose.
3. Potential interdependencies
The three dimensions of the framework for flexibility are considered independent. However, their mutual independence follows from theoretical considerations only. We speculated in chapter 7 that it is a business practice to balance the levels of flexibility in the various dimensions: deterioration in one dimension is compensated for by improvements in another. The implication is that correlations and interdependencies will exist between the dimensions and guidelines in the natural evolution of operational CSs. For instance, one may conjecture that efforts to reduce CS complexity are a cause of unjustified changes in the CS. Further research is required to investigate these kinds of conjectures and to determine the statistically significant correlations.


8.4 DESIGN STRATEGIES REVISITED

8.4.1 Indirect proof only

Chapter 3 discussed how authors often claim that a design strategy will deliver CS designs that are flexible, but rarely demonstrate how that flexibility will favorably affect the evolution of the CS in its operational life cycle phase. Evidently, the effectiveness of CS design strategies ought to be demonstrated by evidence gathered in the business environment. And it is almost as evident that such evidence is hard to come by. Some obstacles that researchers will encounter in field investigations of evolving CSs in business environments are:
- it is rarely known which design strategy was employed in the design of a particular CS, as engineers do not record this kind of meta-information,
- mixed design strategies are used, because most engineers do not adhere to a single design strategy but proceed by intuition and experience, and
- the guidelines and best practices are often so vague that it is next to impossible to check for strict adherence to the strategy in complex and confusing situations.
These and other obstacles make it next to impossible to produce clear and uncontested proof of superior CS flexibility that is due to the successful application of a particular CS design strategy. Does this mean that we are back where we started: flexibility of a CS design is to be argued from theoretic principles only? We think not. Some progress is possible. The Practical Track of our research provides us with unique, albeit circumstantial evidence of long-term trends and tendencies of evolving CSs. This enables us to assess the claims towards CS flexibility underlying the various design strategies, and thus, to discuss their effectiveness.

8.4.2 Active strategies

Chapter 3 discussed active strategies that attempt to arrange the constructs in the CS in such a way that the CS can easily adapt to changing requirements. The subsequent analysis in chapter 3 brought out that claims of active strategies towards improved CS flexibility are founded on improvements in the timeliness and adaptability dimensions. The environment dimension is minimally addressed. We considered four active strategies:
1. Schema transformation
This range of CS design strategies addresses the timeliness dimension, which is subdivided into two guidelines. No metric was defined for one of the guidelines, 'ease change propagation', and we cannot draw any conclusions for it. We operationalized the metrics of compatibility and extensibility for the other guideline, 'minimize change impact'. Our measurements convincingly demonstrate that as a rule, semantic changes in the CS are compatible. In addition, the majority of CS changes are indeed extensions. As the schema transformation approach supports compatibility and extensibility, its claim to flexibility is confirmed. To enhance the contributions of this approach to graceful CS evolution, we recommend expanding the range of schema transformations that are available to the maintenance engineer.


2. Normalization
Normalization primarily addresses the flexibility guideline to 'model each feature only once'. This guideline is addressed by the preservation of identity metric. It shows that identity is preserved in 85% of the semantic changes, thus supporting the claim that normalization does achieve the goal of this guideline. However, there is conflicting evidence. For one, there is the semantic change pattern to 'move attributes out into a new entity', which may be taken as a form of belated normalization. Moreover, the 'size of change' metric points out that over 60% (45 out of 73 changes) affect more than 2 constructs (entity or reference) at a time. Therefore, it is undecided whether normalization does indeed achieve the goal to 'model each feature only once'. A similar opinion is also held by [Buelow'00]: 'that normalization is not as difficult as is generally supposed, that it should not be performed as is generally understood, and that it is not as important as is generally thought' (p.37).
3. Modularization of the Conceptual Schema
Modularization may enhance the flexibility of operational CSs through a number of guidelines. Regrettably, the case studies are unsuited to discuss modularization, as modularization was not employed in any case as a means to enhance long-term flexibility of the CS. Moreover, we have no metric to measure the benefit of modularization towards the level of flexibility in the CS. An attempt to employ the notions of logical-horizon and package to investigate the potential contributions of modularization failed, as we found these notions inappropriate:
- in the Sites case, the OUTLET-PERIOD entity at the bottom of the CS hierarchy is the single most important entity where most, if not all, changes take effect,
- POLICY and PARTICIPATION in the Benefit case reside in the middle of the CS hierarchy, but these entities are clearly the important ones for encapsulating changes.
These particular entities differ from the other entities by their large number of references. In other words, the number of references is an important determinant, and we conjecture that modules should be organized around these core entities to achieve a graceful CS evolution and good flexibility. Perhaps object-orientation or other data model theories are better suited to investigate modularization as a strategy to enhance flexibility, but this is beyond the scope of our research.
4. Reflective approach
The reflective modeling approach aims to boost the flexibility of an operational CS by making meta-data readily available for change. Table 4 suggests that compatibility, extensibility, lattice complexity, and preservation of identity may be relevant. But we feel that these metrics are inappropriate to grasp the potential benefits of reflective modeling. Instead, we can turn to the Franchiser case study where the reflective approach has been attempted. Overall, the long-term benefits of the reflective approach have been disappointing in this case. Of course, one case does not disprove the reflective approach's potential to provide long-term flexibility in general. Nonetheless, we recommend avoiding the reflective approach, applying it only if adequate precautions are taken to ensure that the envisioned benefits will be reaped in the long term.


8.4.3 Passive strategies

Passive strategies focus primarily on the environment and timeliness dimensions. We considered five passive strategies:
1. Anticipating future developments
The guidelines addressed by this approach are 'select the best scope', 'minimize change impact', and perhaps 'keep the CS simple'. The relevant metrics are justified change and compatibility. However, our research setup is unsuited to investigate this particular strategy. This is because a well-anticipated change goes unnoticed: when the anticipated change materializes, no CS change is needed to accommodate it. Several changes in the case studies have been successfully anticipated. However, the strategy is not failsafe. The PARTICIPATION-TRAIL-(NEW) entity in the Benefit case is an example where an anticipated change did not materialize, and the effort spent in preparing the CS for future change never paid off. We therefore recommend this strategy, but advise applying it sparingly.
2. Schema Integration approach
This approach unites partially overlapping UoDs into a single large UoD. Engineers have to work with the broadest scope possible and come up with a single CS that covers all areas of interest. Relevant metrics are justified change and lattice complexity. Our cases bring out several problems that are related to CS integration. Size and complexity of the CS lattice increase because of the integration of multiple CSs. As more and more constructs are added, users will generally resist progressive integration, and only allow it if the impact is minimal. Consequently, the integrated CS is often suboptimal. Further change is needed to redress the CS quality and reduce complexity. Finally, integration may create more levels in the CS hierarchy; we suspect (though it is unproven by our case studies) that this is a cause of decreased CS flexibility.
Thus, we concur with the conclusion by [Crowe'92] that 'integration projects aimed at increasing flexibility will, in all probability, reduce flexibility' (p.33). There may be good business value in system integration, but the value is not in a higher level of flexibility.
3. Repository tools
Repository tools used in the operational life cycle phase of the CS are supposed to make documentation easier to understand and analyze, and to ease change propagation by providing impact-of-change analyses. Relevant guidelines for this strategy are 'minimize impact of change' and 'keep the CS simple'. Measurements of the associated metrics must be interpreted with care: we need to contrast cases of CS evolution where repository tools are used with cases that are not supported by tools. Only the Benefit case used a repository tool consistently and over its entire operational life, and so we must look for differences between Benefit and the other cases. The most conspicuous difference is in lattice complexity. It rises to values of 14 and 15 in the Benefit case, whereas lattice complexity in the other cases does not exceed 10. Does this mean that the use of repository tools allows engineers to be lax on complexity? We think not. The repository tool in the Benefit case was introduced to make design and maintenance more manageable, and it has successfully done so. It was never claimed that it would make the CS more flexible. Overall, we believe that the current tools used in the business cases are adequate for maintenance, but contribute little to CS quality or flexibility in the long term.


4. Proven pattern (re)use
The use of patterns helps the designer in getting the real-world scope right, and it improves timeliness when new user requirements are accommodated quickly and correctly. The justified change and compatibility metrics are involved in this. While we have no indication that any design patterns were used in our cases, we do believe that good design patterns may enhance CS quality in the general sense, and flexibility as well. In fact, we advocate the use of patterns, and we expect to see sets of good design patterns gradually evolve into a comprehensive catalogue of semantic change patterns. The patterns must be embedded in a reuse-oriented approach that clarifies under which circumstances a pattern may be reused, how to incorporate the pattern into the operational CS, and what benefits will derive from it.
5. Procedural approach
Change requests are handled using some change procedure to ensure a careful change effort and graceful CS evolution. Procedures will not necessarily affect the environment or adaptability of the CS, but they will affect the timeliness of the change. Compatibility is the one metric that applies, and its overriding importance is evident. We advocate the use of the structured process of changing the CS as outlined in chapter 6, by embedding it in the organizations' standard way of working.

8.4.4 Abstraction strategies

Third, we have abstraction strategies that aim to abstract from the structural level where changes might occur. The two dimensions involved are environment and adaptability. Chapter 3 discussed four abstraction strategies:
1. Ontological approach
Ontology strives to capture the essence of the UoD. Regrettably, the metrics defined in chapter 4 are incapable of assessing the contribution of the ontological approach towards flexibility. Inspecting our case studies, we note that entities in the upper parts of the CS hierarchies appear to be well matched with the ontology. In the lower parts however, concepts in the CS fit less well with the ontology. Moreover, the abstract concepts evolve into more concrete, down-to-earth concepts as explained in chapter 7. Indeed, an ontological analysis of the final CS version of the Franchiser case brings out that several entities can be unified into a generalized INVOICE-ITEM entity that is absent from the CS. We are convinced that the ontological approach may contribute to CS quality and flexibility, and the use of appropriate ontologies can help to prevent undue tendencies towards concrete ways of modeling.
2. Abstract data model theory
This strategy cannot be investigated due to our research approach. We work with a single data model theory throughout. Therefore, we cannot compare the levels of flexibility between CSs founded on different data model theories.
3. Abstraction layers in the Conceptual Schema
A top-down approach will try to capture the most essential features first. Relevant metrics are proportional change, and susceptibility to change. The measurements do not indicate whether engineers attempted to use abstraction layering in the CS. However, it was discovered that the


overall hierarchy of the CS does remain stable over time. This new discovery can lay the foundation for design and maintenance methods that produce more stable CSs. A further difficulty inherent in this guideline is that one cannot know beforehand what the best level of abstraction is to capture the real-world semantics. The Benefit case illustrates this when the EXCHANGED-E.R.BENEFIT entity is subsumed by the EXCHANGE generalization in change (I). How was one to know that the intent of the initial entity would later be extended?
4. Data architecture
Modeling guidelines set out the scope and essence of the UoD, and the simplicity of the overall schema. The data architecture also ensures that each real-world feature is modeled only once. Relevant metrics are justified change, proportional change, lattice complexity, and preservation of identity. There are many unjustified changes, and some disproportional ones as well. This indicates that the guidelines, as applied in the cases, did not work out as effectively as hoped for. We also noticed that the guidelines themselves were subject to change over time. We conclude that modeling guidelines may be beneficial to other quality aspects of the CS, but their contribution to CS flexibility in the long term is not corroborated.

8.4.5 Discussion

It is often implied that good flexibility will result simply from following the theoretical design guidelines and best practices. This line of reasoning must be rejected, as our practical experience convincingly demonstrates that CS flexibility deserves to be named as a quality aspect in its own right. It is not merely a derived aspect that devolves from some other quality aspects. Strategies to enhance CS flexibility should match well with business practice in CS evolution, thus evidencing a seamless body of knowledge. The design strategies that we studied are limited in scope, as only the design phase of the CS life cycle is considered. Consequences of design decisions later in the CS life cycle are rarely considered. Of course, a design approach may be used in maintenance, but that is not what it was created for, and its effectiveness in maintenance is uncertain. What must be done to reap the benefits of the design approach, and how does one show the merits? How does the strategy account for the natural tendencies in CS evolution, and how must an engineer cope with infractions? Although theoretical design strategies may have a good potential to enhance flexibility, the fit between theoretical design strategies and practice is generally poor, and practical guidance and support tools for the maintenance engineer are found to be lacking.

8.4.6 Outlook

We believe that a deeper understanding of CS evolution in the long term is fundamental for the future enhancement of methods for CS maintenance and CS design. In turn, this will pave the way to increased levels of flexibility and stability of operational CSs. As remarked upon by [Perry'94]: 'we will be able to effectively understand and manage the evolutions of our systems only when we have a deep understanding of these sources (i.e. change drivers), the ways in which they interact with each other, and the ways in which they influence and direct system evolution' (p.303).


By enhancing the fit between design strategies and maintenance practice, and by providing adequate guidance and support for the maintenance engineer, we hope to see current CS design strategies evolve into more mature strategies supporting the entire CS life cycle: not only the design phase, but also the operational life cycle phase which, in our opinion, is the more important one. We are convinced that the notion of semantic change patterns will contribute significantly in this respect. In fact, we expect a comprehensive catalogue of semantic change patterns to unify several design strategies, i.e., the schema transformation approach, proven pattern (re)use, repository tools, and ontological approaches. The long-term outlook is that information maintenance strategies emerge as a separate branch in the area of information modeling.


8.5 LAWS OF CONCEPTUAL SCHEMA EVOLUTION

We now rephrase our practical experiences, abstract from the four separate case studies, and generalize our findings by formulating five conjectured laws of CS evolution. We conjecture that these laws, verifiable hypotheses amenable to rigorous research in the business environment, will prove to hold for the mainstream of business information systems founded on the conventional 3-Schema Architecture and relational data model theory. By these laws, we hope to replace some of the rules of thumb and vague intuitions of today's modeling methods. It is hoped that these laws may spur, either to confirm or to contradict them, new and more comprehensive investigations of evolving schemas in the operational business environment, an area of research that is, in our opinion, unexplored yet challenging.

8.5.1 Law of Change

All our CSs evolve, which is easily verified by comparing the initial and final CSs of each case. It is not surprising though, as we selected our cases by demanding at least three different CS versions. This law of change is the equivalent of the well-known Law of Continuing Change by [Lehman, Belady'85]: 'every system that is used undergoes continuing change' (p.300). Second, every entity evolves. The entity structure, i.e., its references and specializations, is bound to change one time or another. Few entities in our cases have not shown structural change so far. These entities are not immune to change; it is just a coincidence that no change has affected their structure yet.

8.5.2 Law of Compatibility

As a rule, changes in the CS are compatible, so that data stored under the old CS can be coerced to fit the new CS. The exception is incompatibility in specific constructs. Manual intervention is maximally avoided. As intervention will considerably increase the overall cost, time and effort of change, designers will go out of their way to prevent it. There is a price to pay, as other aspects of CS quality may be compromised:
- references and conceptual constraints may be relaxed, temporarily or indefinitely,
- a succession of changes may be used to reduce the impact of change, causing unjustified follow-up changes and increased complexity as construct eliminations are postponed, or
- generalizations may devolve into multiple and less abstract entities.
Compatibility is not rigidity. It is not a dictate prohibiting all changes. Compatibility is a natural demand of the user community to avoid extra efforts, user intervention or manual editing of data. Change is allowed, and conceptual constructions may be altered, but users expect the existing data to remain available within the context of the new CS.
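The coercion of old data under a compatible, extension-only change can be sketched as follows. The record layout and attribute names are hypothetical, chosen for illustration only:

```python
# A compatible CS change lets data stored under the old CS be coerced to
# fit the new CS without manual intervention; attributes added by the
# change are filled with defaults.
OLD_ROWS = [{"policy_id": 1, "holder": "Smith"}]

def coerce(rows, new_attributes, defaults):
    """Fit rows stored under the old CS into the new CS (extension only)."""
    migrated = []
    for row in rows:
        # keep the old values, fill attributes the old CS did not have
        migrated.append({a: row.get(a, defaults.get(a))
                         for a in new_attributes})
    return migrated

NEW_ROWS = coerce(OLD_ROWS, ["policy_id", "holder", "start_date"],
                  {"start_date": None})
# each old row survives; the added attribute defaults to None
```

An incompatible change, by contrast, would be one for which no such automatic coercion exists, for instance narrowing an attribute domain so that some stored values no longer fit; that is exactly the exceptional case where manual adjustment of the stored data is required.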

8.5.3 Law of Stable Hierarchy
The top-down hierarchy in every evolving CS remains fixed. The CS lattice, determined by aggregation references, remains stable throughout the entire CS life. Entities do not slide up and down in the CS hierarchy, nor do they switch relative positions. This is effectively and convincingly demonstrated by scanning the diagrams of the evolving CSs in chapter 5. We speculate that the CS hierarchy reflects the essence of the UoD as expressed by its ontology [Storey, Ullrich, Sundaresan'97]. As the CS hierarchy remains stable over time, it is beneficial to have it clearly reflected in the CS diagram. A clear hierarchy makes for easy-to-read CS diagrams [Nordbotten, Crosby'99]. The overall hierarchy will become very familiar to users, which will ease future interactions with the user community. Maintenance engineers also benefit from this law, because the overall CS hierarchy never needs to be reconsidered during the operational life cycle phase of the CS. All semantic changes can and will be accommodated by elaborating upon the existing CS hierarchy.
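The stability claim itself can be checked mechanically. The following minimal sketch (entity names are hypothetical, not from the case study material) represents each CS version as a set of aggregation references and tests whether any pair of entities common to both versions has switched relative positions in the hierarchy.

```python
# Sketch: verify the Law of Stable Hierarchy between two CS versions.
# A reference (a, b) means entity a aggregates into (sits below) entity b.

def reachable(refs):
    """Transitive closure of the aggregation references."""
    closure = set(refs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def hierarchy_stable(refs_v1, refs_v2):
    """True if no entity pair common to both versions switched positions."""
    up1, up2 = reachable(refs_v1), reachable(refs_v2)
    common = {e for r in refs_v1 for e in r} & {e for r in refs_v2 for e in r}
    for a in common:
        for b in common:
            if (a, b) in up1 and (b, a) in up2:
                return False  # a was below b, now above: hierarchy not stable
    return True

v1 = {("OrderLine", "Order"), ("Order", "Customer")}
v2 = {("OrderLine", "Order"), ("Order", "Customer"), ("Invoice", "Customer")}
assert hierarchy_stable(v1, v2)  # new entity added, existing hierarchy intact
```

Note that adding new entities, the preferred mode of maintenance observed in the cases, never violates the check: only an inversion of existing entities would.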

8.5.4 Law of Increasing Complexity and Decreasing Abstraction
A CS grows larger, more complex and less abstract over time, unless specific work is undertaken to reduce complexity and restore abstraction. The preferred mode of maintenance is to add new entities and specializations. This way of adapting the CS is quick, understandable, and has little impact on the existing CS. However, there is a price to pay:
- there is a clear tendency to use concrete ways of modeling for the new constructs, and
- the CS grows larger in size and in lattice complexity.
These tendencies are curbed only by specific efforts undertaken to bring lattice complexity down. [Belady, Lehman'76] reported a similar finding: the unstructuredness (complexity) of a system increases with time, unless specific work is done to improve the system's structure.
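The growth tendency can be made concrete by tracking simple size measures per CS version. The sketch below uses hypothetical numbers (not the case study measurements) for entity count and reference count; the law predicts both trend upward except at a deliberate restructuring step.

```python
# Sketch: per-version size measures of an evolving CS (numbers hypothetical).
versions = [
    {"entities": 12, "references": 15},  # initial CS
    {"entities": 15, "references": 20},  # entities and specializations added
    {"entities": 17, "references": 24},  # more additive maintenance
    {"entities": 14, "references": 18},  # specific effort to reduce complexity
]

def growth(versions, key):
    """Per-step differences of a size measure over successive CS versions."""
    return [b[key] - a[key] for a, b in zip(versions, versions[1:])]

assert growth(versions, "entities") == [3, 2, -3]
assert growth(versions, "references") == [5, 4, -6]
# Most steps enlarge the CS; only the deliberate restructuring step shrinks it.
```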

8.5.5 Law of Schema Persistence
CS quality does not decline in terms of being a complete and valid model of the information structure of the UoD. CS quality does decline in terms of resistance to change: as CS size and complexity increase, and as its level of abstraction decreases, the CS becomes harder to adapt. By implication, the CS may come to be considered outdated, too rigid or hard to adapt, but as a model of the UoD it remains valuable. It can and will survive the information system of which it is a core component, if properly reverse-engineered and upgraded. Therefore, we disagree with [Elmasri, Navathe'00], who write that 'the information system life cycle is often called the macro life cycle, whereas the database system life cycle is referred to as the micro life cycle. The distinction between these two is becoming fuzzy for information systems where databases are a major integral component' (p.530). Under the law of schema persistence, we believe that it is the CS life cycle that deserves to be named the macro life cycle, as in all likelihood the CS, as a conceptual model of the UoD, survives the information system.

Synthesis

211

8.6 CONCLUSIONS
8.6.1 Research objective
Chapter 1 formulated the main objectives for our research:
- to develop a framework for CS flexibility that captures relevant aspects of flexibility in the business environment, and demonstrate its relevance, and
- to explore long-term evolution of CSs in the ongoing business, in order to disclose operational practices and experiences, and outline the implications for flexibility of CS designs.
The first objective was met by the Theoretical Track of this thesis. In the Practical Track, chapter 5 described the histories of four evolving Conceptual Schemas. Chapters 6 and 7 were devoted to fact-finding, disclosing short-term patterns and practices in chapter 6 and long-term trends of schema evolution in chapter 7. This chapter outlined the major implications for flexibility of CS designs. We discussed the implications for theory in sections 3 and 4. Section 5 formulated the predominant implications for practice as five Laws of CS evolution. CS flexibility in the ongoing business environment today is perhaps less of a terra incognita now that we have explored the area, developed a qualitative understanding of CS evolution, and disclosed some of the characteristic phenomena that landmark the area.

8.6.2 Limitations of the research
1. Exploratory case study approach
We emphasize that our findings result from an exploratory study of four cases only. We aimed for a qualitative understanding of CS evolution, and for the discovery of characteristic phenomena that landmark the area. Further investigations of evolving CSs in the business environment are required to ascertain the reliability of our findings, and particularly of our Laws of Conceptual Schema Evolution. A rigorous field study is required to produce statistical proof and to confirm these laws - or to refute them and come up with better ones. The outlook is to progress to rigorous and verifiable rules for long-term CS evolution that are based on sound experience and that provide engineers with valuable guidelines for CS maintenance in the turbulent business environment.
2. Relational data model theory, and only entities and references
We purposefully restricted our research to operational CSs founded on the conventional 3-Schema Architecture, and we employed one particular data model theory to describe all CSs in a uniform way (see appendix A). We cannot draw conclusions as to other data paradigms, or as to other variants of the relational data model theory. We restricted ourselves to entities and conceptual references only. It is often speculated that most CS changes are confined to attributes only, but this was not investigated. Conceptual constraints were not investigated, primarily because of unavailable documentation.


3. Survival bias
All four case studies covered at least five years of successful operations. This creates a bias towards long-lived and apparently successful information systems. This is in keeping with our research objective to explore long-term evolution of CSs in the ongoing business. It is beyond our goal to ascertain whether information system failures are caused by a lack of flexibility on the part of the operational CS.
4. Bias of the focus on change
The Law of Change states that all our CSs evolve. This is not surprising, as we selected our cases by demanding at least three different versions. It must be realized that this creates a bias in our results. Our research focuses on CS features that do change. Generally, we ignore all changes in real-world features and user requirements that are successfully accounted for without changing the CS. It must be considered a mark of true flexibility if a change in the UoD information structure is accommodated without affecting the CS. This serious limitation is inherent to our approach and research protocol. However, we feel that it does not invalidate our results, as long as we keep in mind that our focus is on CS changes, and not on UoD changes.

8.6.3 Directions for future research
The results of our longitudinal research are, in our opinion, promising and deserving of follow-up. Numerous directions for such research can be indicated:
1. Verify in order to confirm or refute
Additional research into the long-term evolution of CSs should be conducted to either confirm or refute our results, and to show that the results are significant and valuable. We are strongly convinced that current theoretical information modeling approaches will benefit from such empirical research in the operational business environment.
2. Improve strategic alignment
The relationships between corporate strategies, business information needs and change drivers, and semantic changes in the CS are not well understood. This area of research concerns the first stage of the change process outlined in chapter 6. It is the area of Information Management science, an area offering many research challenges [Kim, Everest'94], [Ratcliffe, Sackett'01]. A promising topic in this area is to elicit which business domain characteristics are prime determinants of CS evolution. This will enable us to develop early indicators of emerging information needs, thus improving the timeliness of change. A potential benefit is to find ways of controlling upcoming information needs, and of preventing CS changes that are harmful to overall CS quality.
3. Benefit from expert knowledge
Major advances can be made by investigating how experienced designers create and maintain high-quality designs, and how they avoid modeling pitfalls in turbulent, ever-changing business environments [Moody'95].


4. Establish theoretical foundations
Fundamental theoretical research is needed in several areas where current theoretical terms are too unspecific to be captured in objective and generally accepted measurements. This lack of specificity prohibited us from developing rigorous metrics for level of abstraction [Kent'89], CS complexity [Da Silva, Laender, Casanova'00], and modularization.
5. Develop comprehensive metrics
So far, we have followed an analytical approach, which results in objective measurements of all the separate details. This fine-grained level of detail is not convenient in a practical situation. Comprehensive metrics need to be developed that comprise all the guidelines of an entire dimension of the framework [Kitchenham, Pfleeger, Fenton'95]. The compound metric should show the correct tendency; i.e., the metric must produce a more favorable measurement if the CS has superior flexibility in that dimension. This will enable the business engineer to compare levels of flexibility between CSs: first, between CS versions before and after a change; next, between structurally different CSs that model the same UoD.
6. Unify the quality frameworks
Chapter 2 outlined several important quality aspects for the static CS. Other frameworks have been developed for data quality [Redman'96], [Strong, Lee, Wang'97], [Wang'98], and for information system quality [Delen, Looijen'92], [DeLone, McLean'92], [Wilkin, Hewett, Carr'00]. We hope to see the emergence of a single unified framework to support the maintenance engineer with clear, uncontested, and easy-to-use guidelines.
7. Enhance data model theories
Chapter 2 pointed out that the data model theory is the language that expresses CS change: a real-world feature that cannot be expressed by the data model theory cannot be seen to evolve either. Research needs to be done to determine whether the entity and reference constructs of the relational data model theory offer the best level of flexibility. Moreover, there is a need for research to clarify which type of data model theory (object-oriented, temporal, multidimensional, relational) offers the best flexibility of schemas, depending on characteristics of the business area that is to be modeled.



CHAPTER 9. CONCLUSIONS

The Road goes ever on and on
Out from the door where it began
Now far ahead the Road has gone
Let others follow it who can!
From 'The Lord of the Rings' [Tolkien'54]

9.1 EXPLORING CONCEPTUAL SCHEMA EVOLUTION
9.1.1 Research motivation
This thesis has grown out of an interest in Conceptual Schema evolution: not merely an interest to read up on state-of-the-art theory, but an interest to add to it. Two questions have motivated our research. First, what is the natural evolution of a CS that operates in the business environment? And second, how can the level of flexibility of a CS be assessed, in view of that natural evolution? We developed a framework for flexibility, and designed objective operational metrics for it. We conducted four case studies of evolving CSs and analyzed their semantic changes, adding qualitative understanding and identifying characteristic phenomena of CS evolution. Section 2 summarizes the contributions of our research endeavors. Still, CS flexibility in the ongoing business environment remains a largely unexplored area, and the work done in this thesis is only a beginning. Further research is needed to arrive at a comprehensive understanding of CS evolution, to develop new modeling and maintenance practices, and so to provide the sought-after solutions for flexibility in CS evolution. Our experiences in this exploratory research have convinced us that real schema evolution can only be understood by conducting longitudinal studies in the business environment. Section 3 outlines how future researchers can carry on in this important and challenging area of human ingenuity [Wedemeijer'01may].


9.2 RESEARCH OBJECTIVES
We formulated two objectives for our research:
- to develop a framework for CS flexibility that captures relevant aspects of flexibility in the business environment and demonstrate its relevance, and
- to explore long-term evolution of CSs in the ongoing business, in order to disclose operational practices and experiences and outline the implications for flexibility of CS designs.
These objectives have been achieved through the following contributions of our research:

9.2.1 Contributions of the research
1. Framework for flexibility
Chapter 3 developed a framework for understanding and analyzing CS flexibility. The framework is appealing and easy to understand. It consists of three dimensions, subdivided into eight flexibility guidelines, which provide a sound and objective basis to assess CS flexibility. The relevance of the framework was demonstrated both in an analysis of current CS design strategies and in the development of metrics for CS flexibility.
2. Operational metrics
Chapter 4 defined metrics for CS flexibility that derive from common hypotheses about schema evolution. Eight metrics were operationalized that gauge characteristics of semantic changes in the evolving CSs of the case studies. The long-term behaviors of these metrics, which we analyzed in chapter 7, show useful and sometimes unexpected trends. Other metrics could not be operationalized, but we elicited their long-term trends by inspecting illustrative examples in the case studies.
3. Four case studies
Chapter 5 is our pièce de résistance. It explores the long-term evolution of four operational CSs and presents in-depth analyses. To the best of our knowledge, similar examples of long-term CS evolution have not been reported in the literature. These case study descriptions are vital to a better understanding of the issues and trends in CS evolution.
4. Comprehensive Change Process
Chapter 6 outlines the comprehensive process of CS change that the businesses apply as a standard practice in operations. The change process unites several angles into an integrated view, bridging the gap between business-oriented approaches towards information changes on the one side, and data-centric approaches on the other.
5. Semantic Change Patterns
We introduced the novel concept of Semantic Change Patterns, which reflect operational practices in CS evolution. [Moody'96] urges engineers to be creative, and to think of alternative ways to solve modeling problems.
Patterns can help in this respect, offering a new instrument to ease CS evolution in the short term. The patterns can be applied reactively, when a need to change the CS materializes. And they can be applied pro-actively, to check whether a CS design proposal will evolve gracefully if such a change materializes.
6. Best practice recommendations
Chapter 7 extracted a series of best practice recommendations for CS evolution. These recommendations give guidance on how to check a CS design for flexibility. Being based on operational experience rather than theoretical hypotheses, we expect these best practices to be useful for engineers who want to ensure a graceful evolution of their CS designs.
7. Laws of CS evolution
Chapter 8 conjectures five laws of CS evolution that summarize our experiences from the four case studies. These experiences are generalized and rephrased as objective and widely applicable propositions that can be subjected to future research.

9.2.2 Influence of the research settings on the results
The longitudinal case study research was conducted in the research settings outlined in chapter 1. Our experiences with these settings were satisfactory. The settings did not pose unnatural restrictions, nor did we experience them to bias the methods of research, the way of analyzing the research material, or the presentation of the results.
1. 3-Schema Architecture and relational data model theory
We studied Conceptual Schemas in accordance with the conventional 3-Schema Architecture, and the relational data model theory described in the appendix. No particular problems with these conventions materialized in the course of our research. Other architectures and data paradigms have been proposed besides the 3-Schema Architecture, such as Object-Orientation, XML, multidimensional and OLAP architectures, temporal data models, web-enabled architectures, etc. Information systems and conceptual schemas founded on such other architectures are bound to show other evolutionary behaviors. But whatever architecture or technology is selected, change will be inevitable, and flexibility will be mandatory.
2. Semantic change as unit of analysis
Chapter 1 introduced our unit of analysis, the 'semantic change', which we stated to be the deliberate, coherent and meaningful way in which the CS is changed in response to some change driver. Although we did not define the concept in a precise and formalized way, we did not experience problems in identifying and isolating the semantic changes. The notion readily appealed to engineers in the business. Moreover, we are convinced that this notion is superior to, and should replace, theoretical notions of change such as lossless schema transforms or elementary changes in a taxonomy.
3. Operational life cycle phase of the CS
This research setting is familiar in the context of operational studies of information systems, restricted here to the CS at the core of large administrative transaction-processing systems. We were able to clearly distinguish the operational life cycle phase from the phases that precede it. Remarkably, we experienced some problems in separating the operational phase from the terminal life cycle phase of replacement. The point is that a CS survives its information system [Bisbal, Lawless, Wu, Grimson'99], in a way similar to how DNA may survive its host, the living being. An information system may be terminated, but its valuable data resources are salvaged and converted into the database of a new information system. The structure of that data is not disposed of, but merges into the new CS.
4. The four selected case studies
We conducted our research in four evolving CSs that we selected for diversity, not for representativeness or outstanding quality of their CS. Chapter 1 outlined our selection criteria for the case studies. The selected cases represent common enough examples of operational CSs, which is as it should be when researching the evolution of average CSs. Our findings may be dismissed by claiming that the four evolving CSs are not well designed according to theoretical best practices. However, such a claim is invariably based on theoretical arguments and holds no practical value: there are no perfectly designed CSs out there. Engineers in the business environment design and maintain operational CSs to meet the user requirements; meeting theoretical best practices is not of prime importance. The problem is not in the shortcomings of existing CSs. The challenge is to come to understand these shortcomings and best practices, and to begin improving them.
5. Case Study Protocol
The case study protocol described in chapter 1 was successfully applied in all four case studies. It engages in a new area of research in information management that perhaps may be called data archaeology: the exploration of the past history of an operational CS and the reconstruction of its long-term evolution.


9.3 A LEARNING CYCLE APPROACH
9.3.1 The need for flexibility
A critical step in information modeling is the creation of a CS that meets user requirements to the full. The CS is a core concept of the conventional 3-Schema Architecture, which is still the dominant information architecture in many businesses. But basically, it is a static view and does not provide for change over time. Obviously, the initial design is only the first phase in the life cycle of the information systems that will provide long-term support to business operations. Depending on the characteristics of the business area to be modeled and the volatility of user requirements, CS inflexibility can pose a serious threat to future business operations. We therefore think that the commonly stated purpose of information modeling must be reviewed. Its purpose is not simply to create a CS that is perfect and never has to change. Change is a fact of life, and any operational CS will undergo some change eventually. In the long term, it is important to protect the business investments made in data, software applications, operational procedures, etc. In our opinion, current information modeling methods should focus not only on creating a CS that meets current user requirements. One of the main goals should be to prepare the CS for future changes, in order to meet new user demands while retaining its level of quality. As the dependence on information systems continues to increase, engineers are challenged not only to develop high-quality CS designs today, but even more so to ensure the effectiveness and efficiency of their designs in the business environment of tomorrow. Thus, there is a need to improve current information modeling methods, and to develop approaches that are better geared to provide flexibility of CSs as a dynamic quality aspect.

9.3.2 To design for flexibility is not enough
The common approach to prepare a CS for future change is the linear waterfall process (figure 66, repeated from figure 12). It creates a CS design of the best possible quality, meeting current user requirements and possibly some predicted ones as well. It is assumed that if only best design practices are followed, then invariably a good design will be delivered. To create such a single best design is an expensive and time-consuming effort. But will the effort pay off? Will the designed schemas be able to accommodate the required changes in their future life cycle? In our opinion, this current state of affairs is unsatisfactory, for several reasons:
- modeling methods and best practices are based on assumptions about what may enhance future flexibility,
- the arguments are mainly theoretic in nature, not based on operational evidence,
- the arguments are not objective and measurable. A designer cannot use them to prove that a proposed CS design has adequate flexibility; nor can a reviewer use the arguments to detect a lack of flexibility, and
- the design approach to flexibility may not be the best way to prepare the CS for inevitable change.


Figure 66. The linear approach: design → build, test & implement → use & maintain → replace. Current design practices: a once-and-for-all strategy that does not cover maintenance, with no proof of effectiveness. What flexibility is about: operational schemas and changes in the ongoing business.

Quality of a design, in terms of adherence to modeling standards, does not prepare the CS for actual change. The flexibility is 'cast in concrete', as the potential for change is limited to what was envisioned at design time. Eventually, a structural change will come along that is somehow difficult to fit into the CS. By then it is too late to argue that the design is inflexible and that this was to be expected 'since good design practices cannot be assumed to have been followed' [Chiang, Barron, Storey'94] (p.109). What is needed is an understanding of flexibility as a potential to accommodate new requirements in a timely way. It requires the investigation of evolving CSs in operational environments, to verify whether the theoretical assumptions match practical experiences.

9.3.3 Inflexibility of schemas in business
In an ongoing business, every CS will be exposed to changing user requirements and urgent adaptations. Many requests have nothing to do with the CS but address other parts of the information system, so we are not concerned with them. Our interest lies with the CS, and with the unexplained phenomenon that many operational CSs appear to be lacking in flexibility, even if the CS design was judged to be of high quality. There may be well-conceived theories to support the claims of current information modeling methods towards flexibility of their CS designs. However, whether these CS designs are indeed able to meet the needs for evolution in operational environments is as yet an open question. Current information modeling methods are based on assumptions rather than insight into the evolution of CSs in operational business environments.


9.3.4 Learning cycle approach
What is needed is to move away from the current intuitive state of affairs in information modeling to a more mature status where design and maintenance practices are founded not only on theory, but even more so on practical experience. To this end, we need to learn about the flexibility of operational CSs by studying the nature of CS changes, and by understanding their effects on business operations in the long run. What is called for is a learning cycle to improve on the linear approach, as depicted in figure 67:

Figure 67. The learning cycle (round-trip engineering): 1. create a flexible CS design and implement it in the live business environment; 2. evolve the CS using available flexibility; 3. measure and evaluate the achieved 'level of flexibility'; 4. improve current methods and best practices.

Step 1 is where most information modeling methods start; and stop. Current methods go only as far as creating a once-and-for-all CS, but disregard the changes that are bound to occur at future times.
Step 2 is the life cycle phase of CS operation, when changes must be made to keep the information system aligned with business operations. This phase is when the available flexibility must actually be demonstrated: exploiting the potential to adapt the CS within an acceptable timeframe, as the structure of the Universe of Discourse is changing.
Step 3 is to measure and evaluate the CS evolution as it is observed in the business environment. It calls for extensive case study research in operational business environments, in order to understand how changes in an operational CS actually come about and in what ways flexibility as a potential for change is deployed.
Step 4 completes the cycle. By extracting and abstracting the wealth of material obtained in the previous step, our knowledge of CS evolution and the use of available flexibility is increased. From this, we can hope and expect to improve existing modeling methods and learn to devise ways of designing and maintaining CSs that cope better with change.


The learning cycle approach is about understanding the typical behavior of evolving CSs, and about assessing our achievements in the levels of CS flexibility. This calls for a new way of looking at CSs, and new ways of investigating their quality. Instead of static CS quality, we must focus on the dynamics of the CS and its quality as it evolves. The challenge is to disclose what the essential features are for a CS to be truly flexible, to ascertain which features increase the capability of the CS to accommodate changes, and to know how to exploit that potential to the full. It is only by investigating the CS in its natural environment, the running business, that the foundations can be laid for a next generation of information modeling methods. This is essential if we want to understand what makes a CS flexible in the business environment, and to develop new modeling methods that build upon those flexible features [Maciaszek'00]. Improved modeling approaches should deliver CSs that, in their operational life, have an increased potential to meet the dynamic user requirements of tomorrow [Patel'99]. The improved levels of flexibility should be manifest in a superior capability of the CS to accommodate changes in user requirements better, faster, and with less effort.

9.3.5 Concrete flexibility
The need for flexibility of the CS is beyond doubt. The usual approach is to design flexibility into the CS. This thesis reviewed some common assumptions on which design approaches base their claims to flexibility. The main conclusions are that claims and assumptions are generally based on theory, that there are no generally agreed-upon best practices to assess the level of flexibility in the CS, and that there are no reports of their effectiveness in live business environments. Thus, there is a discrepancy between design approaches towards CS flexibility and what becomes of flexibility in the operational business environment. Switching the perspective to the operational phase of the CS life cycle, we found that information modeling methods are not geared towards the needs of practitioners to achieve and maintain flexibility of the CS in the long run. Indeed, [March'00] argues that 'while conceptual modeling is routinely taught in systems analysis and design course and described in basic textbooks, very little research has investigated how they should be taught or the impact of conceptual modeling in the real world' (p.24). In our opinion, the root of the problem is the emphasis on designing flexibility once and for all. Relying on design quality alone is casting flexibility into concrete, instead of preparing for the inevitable change and building up experience from it. What current information modeling approaches are lacking is round-trip engineering in order to learn from past experience. For future research, we advocate a learning-cycle approach to propel us from design approaches based on theoretical yet unproven assumptions towards a more mature state of the art in information modeling, where methods are founded on sound and uncontested qualifications of CS flexibility. We did not yet close the learning cycle, but we did show the viability of the approach by executing major steps towards that goal.
The long-term perspective is to progress to a situation where true flexibility of a CS can be understood in hindsight, be predicted, or better still, be actively managed and prepared. This will bring the concrete benefits of flexibility within the grasp of maintenance engineers who today are hard put to find it in existing information modeling methods.


APPENDICES


A : DATA MODEL THEORY

A.1 Essential Variant of the Relational Model

A data model theory provides the constructs and constructions that give the Conceptual Schema precise and unambiguous meaning as a model of the real world as perceived by the user community. The scientific community calls it the data model, but we consistently call it the data model theory, to avoid confusion in the business environment where the term data model is a synonym for CS. Our data model theory, rather similar to [ter Bekke'93], is an 'essential', as opposed to 'rich', variant of the well-known Relational Model. Its constructs and constructions are readily explained to end users, and the constructs are not too far removed from how conventional information systems handle their data. We outline the major features of the data model theory, omitting details on the attribute and conceptual constraint constructs, as these are excluded from our study of evolving CSs.

A.2 Constructs and constructions

A.2.1 Entity

Entity intent defines what the relevant real-world objects in the UoD are. By definition, an entity is an aggregation of attributes. The attributes serve to capture and record the properties of relevant objects in entity instances. The entity extent, or population, is the current set of instances recorded in the database. The structure of an entity is its references and specializations, as defined by the modeling primitives of entity aggregation and generalization. Every entity has a name, but names have no conceptual significance: a name change does not constitute a CS change.

A.2.2 Rule 1: Entity integrity

Entity integrity, also called key integrity or primary-key integrity, requires that two entity instances can always be discriminated from each other by a difference in at least one of their attribute values. Instance identification enables one to recognize and retrieve a designated instance amidst the set of all entity instances at some later time. Identification is more than just discrimination: cows in a herd are physically distinct, but most people are unable to recognize the same cow on two different days. A candidate key is any combination of attributes that guarantees instance identification. Entity integrity enforces the entire set of attributes to be a candidate key. We require candidate keys to exist, not that one is selected as a primary key.

A.2.3 Attribute

An attribute is a unique, well-defined recipe that specifies how to observe the value for a certain property (or fact type) of a real-world object. An attribute value, or fact, is the result of an observation on the corresponding real-world object. No particular data format is enforced for attribute values: the choice of data format and representation belongs to the implementation level of the Internal Schema. Thus, a format change (recast, or 'domain mismatch' [Kent'91]) is not a relevant change on the semantic level.
Likewise, attributes have names for clarity, but a mere change of name is considered irrelevant for CS evolution.


An attribute is optional if the corresponding property may be lacking for the real-world object. It is compulsory otherwise. The 'null' value signifies that the result of observation is not on record; this value is distinguishable from all other attribute values. In addition, we define:
- a derived attribute is a recipe, unique and well-defined, to calculate the correct value from available information within the database. No real-world observation is required.
- an artificial attribute is a recipe, unique and well-defined, that records a property value that cannot be observed in the real-world object as such. For instance, a car license plate number is an ordinary attribute in most applications, but it is artificial for the Motor Vehicle department that must pass out the plate numbers. Artificial attributes are mostly used to assist in (but never to replace) entity identification.
For practical purposes, attributes are not included in our research of evolving CSs.

A.2.4 Reference

A relation of a real-world object is its property of being related to one other real-world object within the UoD. The relation in the real world is captured as a reference in the CS. A reference attribute is a special attribute that, for a given entity, specifies how to observe (the identification of) the related object. The entity wherein the reference attribute is aggregated is called the referring, or member, entity. The entity modeling the related object is called the referred, or owner, entity. Most references have N:1 cardinality; i.e., an arbitrary number of member instances may refer to the same owner instance. Some references have 1:1 cardinality; i.e., each owner instance is referred to by at most one member instance. The real-world relation may be lacking for some of the objects; the associated reference is then called optional. A 'none' value for the reference attribute signifies that the observation of the relation resulted in no other object.
A reference is compulsory if the real-world relation is perceived to exist for all concerned objects.

A.2.5 Rule 2: Referential integrity

Referential integrity requires correct instance identification by any reference attribute. The reference attribute values may only hold currently existing instance identifications of the referred entity. We do not require that a particular candidate key be used in the reference.

A.2.6 Entity specialization / generalization

An entity is a specialization of another entity if every object that meets the specialization's intent automatically meets the defined intent of the generalized entity. Thus, an injective is-a reference exists from the specialization to the generalization, because the real-world object is the same, whether it is viewed as an instance of the specialization or as an instance of the generalized entity. We do not require permanent membership of the specialization: an instance of the generalization can participate in the specialization at one time, and not at another [Olivé, Costal, Sancho'99]. Set theory, when applied to entity populations, induces various specializations that require no further definition:
- a reference induces in the owner entity a specialization of instances being referred to. For instance, car owners are special persons.
- an optional reference induces the set of instances with meaningful reference (i.e., not 'none') as a specialization in the member entity. For instance, the Irish are special persons.
- a specialization of the owner entity induces a similar specialization in the member entity. For instance, company cars are special cars where the car owner is a company.
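Rules 1 and 2 can be made concrete as executable checks. The following sketch is our own illustration, not part of the data model theory proper: entity extents are rendered as lists of attribute dicts, and all entity and attribute names are invented.

```python
# Illustrative sketch only: Rule 1 (entity integrity) and Rule 2
# (referential integrity) as checks over in-memory entity extents.

def entity_integrity(extent):
    """Rule 1: any two instances differ in at least one attribute value,
    i.e. the full attribute set acts as a candidate key."""
    seen = set()
    for instance in extent:
        row = tuple(sorted(instance.items()))
        if row in seen:
            return False
        seen.add(row)
    return True

def referential_integrity(members, ref_attr, owners, key_attr):
    """Rule 2: every reference value that is not 'none' (here: None) must
    identify a currently existing instance of the referred owner entity."""
    owner_keys = {o[key_attr] for o in owners}
    return all(m[ref_attr] is None or m[ref_attr] in owner_keys
               for m in members)

# Hypothetical extents: cars referring to persons via an optional reference.
persons = [{"pid": 1}, {"pid": 2}]
cars = [{"plate": "AB-01", "owner": 1}, {"plate": "CD-02", "owner": None}]
```

Note that the check for Rule 1 compares whole instances, in line with the requirement that candidate keys exist rather than that a primary key is selected.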


A.2.7 Connection constraint

Apart from entity integrity and referential integrity, a third rule is important to ensure data integrity in the lattice. A connection constraint applies whenever one entity connects with another by more than one reference or chain of references. Starting out from an instance in the lowest member entity, more than one reference path can be followed upwards in the CS hierarchy to the owner entity. The connection constraint then specifies whether the separate pathways must lead up to the same instance, to different instances, or whether some other rule applies.
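The 'same instance' case of a connection constraint can be sketched as follows. This is our own hedged illustration: a path is a list of steps (reference attribute, owner entity, key attribute) followed upwards from a member instance, and all extents and names are invented.

```python
# Illustrative sketch of a 'same instance' connection constraint over two
# reference paths leading up from the same member entity.

def follow_path(instance, path, extents):
    """Follow a chain of references upward; return the final owner instance."""
    current = instance
    for ref_attr, owner_name, key_attr in path:
        key = current[ref_attr]
        current = next(o for o in extents[owner_name] if o[key_attr] == key)
    return current

def same_instance(members, path_a, path_b, extents):
    """Both pathways must lead up to the very same owner instance."""
    return all(follow_path(m, path_a, extents) is follow_path(m, path_b, extents)
               for m in members)

# Hypothetical diamond: OUTLET-PERIOD refers to FRANCHISER both directly
# and via OUTLET.
extents = {
    "FRANCHISER": [{"fid": 1}, {"fid": 2}],
    "OUTLET": [{"oid": 10, "run_by": 1}],
}
periods = [{"pid": 100, "outlet": 10, "franchiser": 1}]
via_outlet = [("outlet", "OUTLET", "oid"), ("run_by", "FRANCHISER", "fid")]
direct = [("franchiser", "FRANCHISER", "fid")]
```

The 'different instances' and other variants of the rule would replace the `is` comparison accordingly.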

A.3 Graphical conventions

A rectangle in the CS diagram depicts an entity. Specialization is depicted as an enclosed rectangle, a diagramming technique borrowed from Venn diagrams in set theory. The is-a injective reference is evident from the inclusion. A line connecting rectangles depicts a reference:
- an arrow pointing upward from member to owner entity indicates N:1 cardinality; the notation is suggestive of the reference attribute located in the member entity
- a plain line with no arrowhead indicates the (rare) 1:1 cardinality
- a line starting in the interior of the member entity represents an optional reference; it suggests that it is connected to the enclosed specialization induced by the optional reference
- a line starting out from the boundary of the member entity is a compulsory reference: all member instances must participate in the reference
As usual, the attributes and conceptual constraints are not depicted in the CS diagrams. These graphical conventions make for compact, easy-to-read diagrams with a hierarchic, top-down structure. As we focus on the conceptual entities and references, all that we need to know about the sequence of CS versions in the case studies is conveyed in the CS diagrams.

A.4 Taxonomy

Elementary changes per entity:
- addition
- deletion
- change of intent

Elementary changes per attribute:
- addition
- deletion
- change of definition, i.e., how values are observed in the real-world objects
- relax to optional attribute
- restrict to compulsory attribute
- promotion from ordinary attribute to reference attribute
- degradation from reference attribute to ordinary attribute

Elementary changes per reference:
- promotion from ordinary attribute to reference attribute
- degradation from reference attribute to ordinary attribute
- change of intent
- change of cardinality
- relax to optional reference
- restrict to compulsory reference
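When logging or classifying CS maintenance, the taxonomy above lends itself to a simple machine-readable encoding. The following enumeration is our own sketch, not part of the thesis' apparatus; the class and member names are invented.

```python
# A possible encoding of the taxonomy of elementary changes.
from enum import Enum

class EntityChange(Enum):
    ADDITION = "addition"
    DELETION = "deletion"
    CHANGE_OF_INTENT = "change of intent"

class AttributeChange(Enum):
    ADDITION = "addition"
    DELETION = "deletion"
    CHANGE_OF_DEFINITION = "change of definition"
    RELAX_TO_OPTIONAL = "relax to optional attribute"
    RESTRICT_TO_COMPULSORY = "restrict to compulsory attribute"
    PROMOTE_TO_REFERENCE = "promotion from ordinary to reference attribute"
    DEGRADE_FROM_REFERENCE = "degradation from reference to ordinary attribute"

class ReferenceChange(Enum):
    PROMOTE_FROM_ATTRIBUTE = "promotion from ordinary attribute"
    DEGRADE_TO_ATTRIBUTE = "degradation to ordinary attribute"
    CHANGE_OF_INTENT = "change of intent"
    CHANGE_OF_CARDINALITY = "change of cardinality"
    RELAX_TO_OPTIONAL = "relax to optional reference"
    RESTRICT_TO_COMPULSORY = "restrict to compulsory reference"
```

Each semantic change pattern in Appendix B can then be described as a small sequence of such elementary changes.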


B : CATALOGUE OF SEMANTIC CHANGE PATTERNS

B.1 Append an entity

This semantic change pattern is perhaps the most familiar one. It clearly demonstrates schema extensibility that authors often allude to.

Figure B.1. Append an entity

Need: To capture information on newly relevant objects as the UoD extends, or as the perception of the UoD extends, e.g., due to the integration of different conceptual views (perceptions) of the same business area.

Semantics: There are new user requirements. The requirements may concern totally new objects to be associated with existing entities, or the new requirements may refine the known information on existing objects. A new entity is defined as a coherent set of new attributes and added into the CS.

Accommodation of change: The change is accomplished by two elementary changes:
- insert an entity, and
- insert a compulsory reference to an existing entity.
CS size increases, but lattice complexity remains the same.

Propagation of change: No changes in existing data are generally needed. The update and delete routines of the existing entity referred to must be adjusted to ensure referential integrity.

Applicability: The pattern is applicable for all straightforward extensions of information needs, without special complications or constraints. The new reference induces a specialization that may, or may not, be modeled explicitly. A variant of this pattern appends a second entity to the first one, actually repeating the two elementary changes.

Considerations: Users often do not wait for the new construct to become available, and start storing the new information in dossiers, spreadsheets or other convenient means. Such temporary solutions ought to be remedied as soon as possible, to prevent different ways of storing the same data.

Observations regarding best practices: Append an entity is an easy change, and the pattern is sometimes applied where another pattern should have been more appropriate. For instance, it is easier to append a specialization as a new entity than to merge it into an existing one by applying the Extend-Entity-Intent pattern.

Examples:
- Settlement (A): SUM-OF-DIVIDED-BENEFITS is appended to CUSTOMER; induces a specialization within a previously implicit specialization
- Benefit (A): EARLY-RETIREMENT-BENEFIT-LEVEL-1 is appended to PARTICIPATION
- Benefit (E): PARTICIPATION-TRAIL-OF-EX-SPOUSE is appended to PARTICIPATION; the PARTICIPATION-TRAIL-OF-EX-SPOUSE-PREMIUM-/-REDUCTION entity is then appended from it; variant appending two new entities at once
- Sites (C): an entity cluster modeling an outlet's OPERATIONAL-TYPE is appended to OUTLET-PERIOD
- Sites (O): an entity cluster modeling an outlet's SERVICE-ASSORTMENT is appended to OUTLET
- Franchise (D): OPENING-HOURS is appended to OUTLET
- Franchise (T): TRANSACTION-FORECAST is appended to OUTLET
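The two elementary changes of this pattern can be rendered on a toy schema representation. This sketch is purely illustrative: a schema maps each entity name to its list of references (reference attribute, owner entity, compulsory). The entity names follow the Settlement (A) example; the attribute name 'has-a' is hypothetical.

```python
# Illustrative sketch of the Append-an-Entity pattern as two elementary
# changes on a toy schema: insert an entity, insert a compulsory reference.

def append_entity(schema, new_entity, owner, ref_attr):
    """Insert a new entity plus one compulsory reference to an existing
    entity. No existing entity definition changes, so no data conversion
    is needed -- only the owner's update/delete routines are affected."""
    assert owner in schema and new_entity not in schema
    schema[new_entity] = [(ref_attr, owner, True)]  # True = compulsory

schema = {"CUSTOMER": []}
append_entity(schema, "SUM-OF-DIVIDED-BENEFITS", "CUSTOMER", "has-a")
```

CS size grows by one entity while every pre-existing definition is untouched, which is what makes this the most benign of the patterns.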


B.2 Superimpose an entity

This semantic change pattern is another demonstration of 'extensibility'. It resembles the previous one syntactically, but it has a larger impact of change and is rarer. The important difference in semantics is that the referencing attribute (foreign key) goes into an existing entity, which becomes dependent on the new one.

Figure B.2. Superimpose an entity

Need: There is a change in perception of the information structure of the UoD. New objects are recognized that are essential to group and organize the existing objects.

Semantics: The new user requirement is that information on the newly recognized objects is recorded in such a way that the existing data populations are organized (grouped and sorted) by it.

Accommodation of change: The change is accomplished by two or sometimes three elementary changes:
- insert an entity, and
- insert a reference from an existing entity to the new entity, either by inserting a new referencing attribute, or by changing an existing attribute into a referencing attribute.
This is the usual case, when key integrity remains unchanged. A variant is where the new reference contributes to key integrity, seen in the example where CONTRACT is superimposed on PRODUCT-IN-CONTRACT. The size of the CS increases, but lattice complexity remains the same.

Propagation of change: On the application level, there is little difference between superimposing and appending an entity. Referential integrity must be ensured, so the existing insert, update and delete routines must be adjusted. Impact on the data level is limited if the reference is optional. The main impact on data is for a compulsory reference: it demands that a (temporary) data population is created for the new entity, and all instances of the existing entity are updated to record the correct reference. The impact is even larger if the new referencing attribute is included in the primary key, as this may cause a cascade of foreign-key changes in dependent entities.

Applicability: We saw the pattern mainly applied as a result of view integration. The UoD is extended to cover new objects that were previously beyond the scope, and that provide a new view on the (hierarchic) organization of the existing ones. If the new reference contributes to key integrity, this may indicate that the intent of the existing entity is broadened, but this cannot be determined from the CS alone. For example, if cities in a CS are identified not only by name but also by country, then different cities may now carry the same name. But it remains uncertain whether the semantic concept of city is broadened.

Considerations: Business reorganization and view integration may have a large impact on the overall CS, as a wider view of reality may call for new objects to be superimposed on existing ones. To limit the impact of change, the common business practice is to implement the new reference as optional, even though semantically it is compulsory.

Observations regarding best practices: Ascertain the most natural UoD demarcation in an early stage by analyzing all objects that may act as determinant, classifier, or (hierarchic) organizer for other objects now, or at some later time. On the implementation level, prefer artificial identifiers over weak-entity keys if ever an entity might be superimposed that is crucial for entity identity.

Examples:
- Sites (K): FRANCHISER is superimposed on OUTLET-PERIOD
- Benefit (K): CONTRACT is superimposed on PRODUCT-IN-CONTRACT
- Sites (I): ADMINISTRATIVE-UNIT is superimposed on OUTLET-PERIOD
- Sites (R): FORMULA and REGION are superimposed on OUTLET-PERIOD
- Franchise (K): ZONE is superimposed on OUTLET
- Franchise (2): MARGIN is superimposed on BOUNDS

Variants: a superimposed entity having a dependent entity of its own; a stack of two entities being superimposed.
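The data propagation step for a compulsory superimposed reference can be sketched as follows. This is our own illustration of one common tactic, under the assumption that a single placeholder owner instance is created so that every existing member instance can be updated at once; all names are invented.

```python
# Illustrative sketch of data propagation for Superimpose-an-Entity with a
# compulsory reference: create a (temporary) owner population and point all
# existing member instances at the placeholder owner.

def superimpose(members, ref_attr, placeholder_key):
    """Give every pre-existing member instance a value for the new
    compulsory reference, preserving referential integrity."""
    owners = [{"key": placeholder_key, "name": "TO-BE-CLASSIFIED"}]
    for m in members:
        m[ref_attr] = placeholder_key
    return owners

outlet_periods = [{"oid": 1}, {"oid": 2}]   # hypothetical member extent
franchisers = superimpose(outlet_periods, "run_by", 0)
```

The alternative, relaxing the reference to optional, avoids this conversion altogether, which is why it is the common business practice noted above.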


B.3 Move attributes out into a new entity

In design, the creation of a new entity from a coherent set of attributes is a common normalization procedure. The same change in maintenance may be called 'belated normalization'. While this semantic change pattern resembles the previous ones in syntax, the difference is that there is no extension here, only restructuring.

Figure B.3. Move attributes out into a new entity

Need: Users experience that 'something is wrong' with updating the data in the CS. The updates on one set of attributes should be independent of updates to another set of attributes. Under the current model, users are forced to copy all attribute values even if only a few values actually change.

Semantics: The overall information content in the CS remains unchanged, but the layout of the CS needs to be changed. Apparently, functional dependencies are misrepresented in the CS and users experience the negative effects of update anomalies. A new entity is created that will hold the one set of attributes. The 'really important' attributes stay behind in the main entity. The entity definition (intent) is refined in the process.

Accommodation of change: The change is accomplished by three elementary changes:
- delete the set of attributes from the old entity,
- produce a new entity from the attribute set, and define a key-integrity constraint for it (if necessary, add extra attributes to the composite), and
- insert a reference between the old and new entity.
The main pattern pushes attributes down, thus resembling the 'append' pattern, as the newly formed entity is dependent on the main entity. A variant is where attributes are pushed up and the new entity is superimposed on the main entity. CS size increases, but lattice complexity remains the same.

Propagation of change: The extent of the former entity is revised completely. Nevertheless, the change can be propagated in a fully automatic way, as the data only needs to be moved about. It is extracted from the old entity, reformatted (this may require some manual intervention) and restored in the new entity. The change may result in a considerable decrease of stored data volume, confirming users in their feeling that indeed something was wrong.

Applicability: The aim of the change is that a less important set of attributes is separated out. The position of the main entity in the CS lattice is not affected, and references and key-integrity properties remain unchanged.

Considerations: The main entity ought to be renamed to reflect that its definition (intent) is adjusted. But more often than not, it is easier to leave names unchanged, so that the same name covers several meanings over time.

Observations regarding best practices: The pattern matches the well-known design pattern of normalization, and its main features are well understood.

Examples:
- Benefit (J): POLICY-ATTRIBUTE is moved out of POLICY
- Sites (P): TOWN and TOWN-PERIOD are moved out of OUTLET-PERIOD; a temporal variant
- Franchise (J): FRANCHISER is moved out of FRANCHISE; the variant where attributes move into a superimposed entity
- Franchise (S): VAT-RATE is moved out of TARIFF; another new entity, inserted at the same time, refers to the new entity
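The data-level propagation of this pattern can be sketched in a few lines. This is our own illustration with hypothetical attribute names: the separated attribute values move into the new entity, only a reference stays behind, and duplicate value combinations collapse, which is why stored data volume may decrease noticeably.

```python
# Illustrative sketch of propagating Move-Attributes-Out on the data level.

def move_attributes_out(extent, attrs, ref_attr):
    """Extract `attrs` from every instance into a new entity extent,
    leaving a reference attribute `ref_attr` behind; duplicate value
    combinations are stored only once in the new entity."""
    new_extent, index = [], {}
    for row in extent:
        values = tuple(row.pop(a) for a in attrs)
        if values not in index:
            index[values] = len(new_extent) + 1
            new_extent.append({"key": index[values], **dict(zip(attrs, values))})
        row[ref_attr] = index[values]
    return new_extent

policies = [{"pol": 1, "rate": 4, "fund": "A"},
            {"pol": 2, "rate": 4, "fund": "A"}]
policy_attribute = move_attributes_out(policies, ["rate", "fund"], "attr_ref")
```

The generated surrogate key is an implementation shortcut of this sketch; in the pattern proper, a key-integrity constraint is defined over the moved attribute set itself.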


B.4 Connect by reference

This semantic change pattern exactly matches an elementary change in any taxonomy. Even so, the change is not easy as the impact of change on existing data needs to be accounted for.

Figure B.4. Connect by reference

Need: Users perceive that existing objects in the UoD are somehow related. Hence, the users demand that information about those objects is coordinated at the level of the CS. The diagram shows an example where the need for change is driven by a new UoD feature (here, 'settlement') that relates existing real-world objects.

Semantics: The information content in the CS about the UoD is marginally extended; only a single reference is added. In the variant, existing entities are first extended with new specializations, and next these are connected.

Accommodation of change: The change on the CS level is accomplished by one elementary change: a single reference is added between two existing entities. Overall complexity of the CS lattice increases by 1.

Propagation of change: If the reference is compulsory, then immediate data conversion is required for all instances of the referring entity. This is acceptable for small numbers of instances, or if data coercion is fully automated. Otherwise, the impact is avoided by leaving the new reference optional, even if real-world semantics demand it to be compulsory. As referential integrity must be ensured, data transaction routines for both the referring and the referred entity must be adjusted. As lattice complexity increases by 1, a new connection constraint must also be enforced.

Applicability: Although the pattern is attractively simple, the impact of change may be large, and the pattern is not often applied.

Considerations: The new constraint may have retroactive impact, as previously recorded data may violate the new connection constraint. This is undesirable and therefore generally avoided. A common solution is to relax the new reference to optional, leaving the existing data as-is, and to enforce the new constraint only when new data is manipulated.

Observations regarding best practices: If the impact of change of this pattern on the stored data and running applications is thought to be too large, consider reducing it by using the Connect-by-Intersection pattern as an alternative.

Examples:
- Settlement (B): BENEFIT is extended with the SEPARATED-BENEFIT specialization that refers to particular instances of RELATIONSHIP
- Benefit (I): a new reference is inserted to record that a regular BENEFIT is involved in an EXCHANGE of pension
- Settlement (D): SETTLEMENT is connected to SUM-OF-DIVIDED-BENEFITS; semantically, the reference is mandatory, but it is implemented as optional
- Settlement (F): a reference from SEPARATED-BENEFIT to its generalization BENEFIT is inserted, duplicating the connection established by DEDUCTED-BENEFIT; the reference ought to apply for SEPARATED-BENEFITs only, but it is implemented as an optional reference on the entire generalization
- Franchise (X): the INSTALLMENT↑MONTHLY-INVOICE reference records when a particular installment was paid
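The retroactive-impact check behind this consideration can be sketched as a pre-flight query. This is our own illustration with invented entity and attribute names: before a new reference is enforced as compulsory, the instances that would violate it are counted.

```python
# Illustrative sketch: find existing instances that would violate a new
# compulsory reference before the constraint is enforced.

def violations_if_compulsory(members, ref_attr):
    """Instances whose new reference value is still 'none' (here: None
    or absent)."""
    return [m for m in members if m.get(ref_attr) is None]

# Hypothetical data after adding a 'settlement' reference to BENEFIT.
benefits = [{"bid": 1, "settlement": 7}, {"bid": 2, "settlement": None}]
offenders = violations_if_compulsory(benefits, "settlement")
```

If the offender list is non-empty, the data is either converted first, or (the common solution seen in the case studies) the reference is implemented as optional and the constraint is enforced only on newly manipulated data.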


B.5 Impose a reflexive reference

This semantic change pattern resembles the connect-by-reference pattern, but it differs in semantics and applicability.

Figure B.5. Impose a reflexive reference

Need: Users become aware that important associations between semantically similar real-world objects are not captured in the CS.

Semantics: A reflexive reference is called for, to record that instances of an existing entity are somehow interrelated.

Accommodation of change: The change matches a single change from the taxonomy:
- create a reference from the entity to itself.
In our cases, cardinality of the new reflexive reference was always N:1 and optional. Lattice complexity increases by 1.

Propagation of change: As the cardinality constraints of the new reference are always maximally relaxed, the impact of change is small, and data conversion is not needed. A connection constraint must be specified to prevent semantically meaningless associations such as self-reference or cyclic references.

Applicability: In our case studies, the pattern always involved a core entity of the CS. Reflexive associations on other entities may perhaps exist, but we suspect that these will be captured by ordinary (i.e., non-reflexive) references.

Considerations: We noticed two reasons for the insertion of reflexive references in our cases:
- to trace the life cycle of a real-world object even if its key-identity in the CS changes, and
- to record associations between real-world objects not captured in the formal structure of the CS.
Reflexive references may capture other kinds of semantics, e.g., a hierarchy of departments, or symmetric (marital) relationships. However, designers are generally more aware of these kinds of relationships, and they are captured before the CS goes operational. Hence, there is no need to insert them in the CS later.

Observations regarding best practices: Self-referring links between instances are often known and noted by users even if the designers have not included the reflexive reference in their CS designs. Therefore, core entities should be examined for the existence of a reflexive relationship.

Examples:
- Franchise (3): a reflexive reference is imposed on OUTLET to record its predecessor
- Settlement (F): a reflexive reference on BENEFIT is imposed to capture the regular benefit instance that a separated benefit is deducted from
- Sites (M): a reflexive reference is imposed on OUTLET to record its predecessor; similar in appearance to the change Franchise (3), but meeting a different user need
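The connection constraint that rejects semantically meaningless reflexive links can be sketched as a cycle check. This is our own illustration: the 'pred' attribute name is hypothetical (modeled on the predecessor references in the cases), and a self-reference counts as the shortest possible cycle.

```python
# Illustrative sketch: no self-reference, no cyclic chains over a reflexive
# N:1 optional reference.

def acyclic(extent, key_attr, ref_attr):
    """True iff following the reflexive reference never revisits an
    instance (None terminates a chain)."""
    links = {m[key_attr]: m[ref_attr] for m in extent}
    for start in links:
        seen, current = set(), start
        while links.get(current) is not None:
            if current in seen:
                return False  # self-reference or longer cycle
            seen.add(current)
            current = links[current]
    return True

outlets = [{"oid": 1, "pred": None}, {"oid": 2, "pred": 1}]  # invented data
```

Because the reference is optional and maximally relaxed, this constraint is the only data-level check the pattern requires.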


B.6 Connect by intersection

We borrow our terminology from the concept of 'intersection-record' in classical networked databases. Such records served a rather technical purpose of establishing a many-to-many link in the database. This pattern is named after it because of the striking similarity in appearance.

Figure B.6. Connect by intersection

Need: A new way of manipulating objects that were already relevant in the UoD is discovered, calling for a way to coordinate the information about those objects and to record the way of manipulating them.

Semantics: The new reference is specified by the users. The specifications must also cover the new additional connection constraint in the CS.

Accommodation of change: The change is achieved in one step, but consists of three elementary changes:
- insert one new entity, and
- insert two compulsory references to existing entities.
The figure shows a variant of the pattern that uses two entities. It first appends an entity, then executes the simple pattern. The combined effect is that the existing entities are connected by references emanating from two different entities. CS size increases, while lattice complexity goes up by 1.

Propagation of change: Existing entities must have their update and delete routines, but not their insertion routines, adjusted to ensure overall referential integrity. The new connection also calls for a connection constraint to be enforced.

Applicability: Our case studies show several examples where the connect-by-intersection entity expresses a derivation process, implementing some business rule(s). The intersection entity serves to record intermediate and/or final results of the derivation.

Considerations: As the new connection calls for a connection constraint, it may be discovered that existing data violate the constraint. In other words, the new constraint may have retroactive impact on previously recorded data. This is undesirable and therefore avoided, generally by opting for deferred data conversion or by relaxing some cardinality constraint to be optional.

Observations regarding best practices: If the connect-by-intersection entity expresses derived data, then it is better to avoid it, and look to improve the CS by extending retention times of source data and derivation applications instead.

Examples:
- Benefit (D): EXCHANGED-E.R.BENEFIT-LEVEL-2 connects with PARTICIPATION, while a dependent entity BENEFIT-OBTAINED-BY-EXCHANGE is inserted to connect with PARTICIPATION-TRAIL; the variant where the connection is made through two entities
- Benefit (E): PARTICIPATION-TRAIL-OF-EX-SPOUSE-LEVEL-2 has a compulsory reference to PARTICIPATION, and an optional reference to EARLY-RETIREMENT-BENEFIT-LEVEL-2; the lack of historical data is resolved by relaxing cardinality to optional
- Benefit (G): PARTICIPATION-TRAIL-FOR-BENEFIT connects PARTICIPATION-TRAIL with BENEFIT
- Settlement (C): DIVIDED-BENEFIT connects BENEFIT with SETTLEMENT and SUM-OF-REDUCTIONS; the intersection entity has an additional reference to a third entity
- Franchise (G): the TRANSACTION entity connects with OUTLET and TYPE-OF-TRANSACTION via its ACCEPTED specialization
- Franchise (N): BOUNDS connects OUTLET with TARIFF; semantically, the connection is compulsory, but for implementation reasons it is modeled as optional


B.7 Eliminate a dependent entity (cluster)

We find that entity eliminations are not always committed as soon as possible. Instead, eliminations are often postponed until maintenance is done for some other reason.

Type-ofoperations Operational Unit Period performs

Operational Unit Period type-of

Operational type Period Figure B.7. eliminate a dependent entity (cluster) Need

Some UoD concept, or rather a closely related group of concepts, becomes so much less important that it is no longer worthwhile to represent it in the CS. And the cost of continuous upkeep of the data exceeds cost to dispose of the cluster of entities, and adjust the data access and update routines. Semantics Entities become obsolete if the real-world concepts that are modeled, no longer really matter. The new user requirement is that the corresponding entity cluster in the CS is no longer needed. It does not mean that the concepts as such cease to exist in the UoD. It may be that users just perceive the information to be no longer relevant enough to merit recording. Or the information may be relevant, but available through other means, e.g. when the entity is found to record derived data only Accommoda- The main variant is where a single reference connects the obsolete entity or tion of change entities to the remainder of the CS lattice. Elimination proceeds in two steps: - drop the reference and its associated referential integrity constraint, and - remove the entire cluster of entities A variant is elimination as a reversal of the Connect-by-Intersection pattern. The semantic change for this variant is more complex: - drop both references that connect the two different entities, and - remove the connecting entity or entities Overall size of the CS reduces. In the main variant, lattice complexity is unaffected. In the variant, lattice complexity decreases by 1. Propagation of Updating the remaining entities is simplified, as the referential integrity change constraints induced by the eliminated entities are no longer relevant.

Appendix B : Catalogue of Semantic Change Patterns

241

Applicability
On the CS level, elimination is the reversal of the Append pattern. On the data level, the patterns are not each other's opposite. Elimination of source data may cause derived data to acquire the 'survives' status. Indeed, we found that derived data eliminations are often delayed to permit the corresponding source data to be entered in the database. Prior to elimination, the maintenance engineer should exclude the possibility that the remaining data in the database are affected by some retroactive event acting upon the entity that is earmarked for elimination.

Considerations
There are numerous reasons why information in the CS may turn obsolete, but obsolescence alone does not constitute a need to change the CS. Although it seems evident that any obsolete entity should be eliminated, doing so requires an effort with no immediate business advantages, and there is often no business sponsor willing to pay for the change effort! Thus, the single fact of being obsolete does not constitute a sufficient need to change the CS. The more so as extinction is forever, and there is always a chance that the data somehow become relevant again, as in change (Q) of the Sites case.

Observations regarding best practices
For entities involved in only 1 or 2 references, check that the entity (cluster) is relevant to enough users in the organization, now and in the near future. For derived data (i.e. calculated from other data by derivation), establish under what conditions the entity may be eliminated in the future, and work to fulfill those conditions.

Examples
- Sites (H): drop the entity cluster consisting of TYPE-OF-OPERATIONS, OPERATIONAL-TYPE (and TYPE-OF-OPERATIONS-PERIOD)
- Benefit (L): PARTICIPATION-TRAIL and its associated PARTICIPATION-TRAIL-FOR-BENEFIT are eliminated — this is the reverse of the Connect-by-Intersection pattern variant that employs two entities
- Benefit (M): PARTICIPATION-TRAIL-FOR-BENEFIT is eliminated — this variant is the reverse of the Connect-by-Intersection pattern
- Franchise (Q): TYPE-OF-ARTICLE is dropped
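The two-step accommodation described above can be sketched at the implementation level. The following is a minimal illustration in Python with SQLite, using hypothetical table names loosely following the Franchise (Q) example; since SQLite cannot drop a referential constraint in place, the referring table is rebuilt without it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE type_of_article (type_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE article (
    article_id INTEGER PRIMARY KEY,
    type_id    INTEGER REFERENCES type_of_article(type_id)
);
INSERT INTO type_of_article VALUES (1, 'standard');
INSERT INTO article VALUES (10, 1);
""")

# Step 1: drop the reference and its referential integrity constraint.
# SQLite cannot drop a constraint in place, so the referring table is rebuilt.
con.executescript("""
CREATE TABLE article_new (article_id INTEGER PRIMARY KEY);
INSERT INTO article_new SELECT article_id FROM article;
DROP TABLE article;
ALTER TABLE article_new RENAME TO article;
""")

# Step 2: remove the obsolete entity (cluster).
con.execute("DROP TABLE type_of_article")

tables = {r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")}
print(sorted(tables))  # only 'article' remains
```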


B.8 Redirect a reference to owner entity

Although a reference may be redirected to any arbitrary entity, we only observed the pattern where it is redirected towards an owner. This probably has to do with implementation by weak keys: the redirection is achieved by simply restricting the foreign key to the part that already identified the owner entity instance.

Figure B.8. redirect a reference to owner entity (before: Transaction Summary references Type of Transaction; after: Transaction Summary references Tariff)

Need
A clear and unambiguous change driver from the UoD could not be established for this pattern. We believe that it is related to a change of perception, where engineers and users feel that the current way of modeling a reference can be simplified.

Semantics
The change intends to improve the CS information structure without loss of semantics. The current way of modeling certain data, i.e. as depending on a certain entity, is perceived to be wrong. Instead, the 'real' semantics of that data is that it depends on the owner of that entity; therefore, the CS becomes simpler and better if the reference is redirected to the owner entity.

Accommodation of change
Accommodating the change in the CS is simple:
- redirect the reference from an entity to an owner entity.
CS lattice complexity is not affected.

Propagation of change
Generally, there is loss of information. The new reference was derivable in the previous CS version as a transitive reference, but the old reference cannot generally be reconstructed from the new one.

Applicability
In most cases, redirection of references is undertaken only if the reference was implemented using weak-entity keys. Redirection is accomplished by restricting the foreign key. Only the part that identified the owner entity instance is retained; the remainder is simply ignored.


Considerations
Key integrity of the referring entity must remain unaffected. But if the entity's primary key incorporated the former foreign key, then the redirection pattern is frustrated, as the foreign key restriction would also affect the primary key.

Observations regarding best practices
The pattern appears to be applied in preparation for a further simplification in a later CS version, when the former owner entity is eliminated.

Examples
- Franchiser (R): the reference from TRANSACTION-SUMMARY to TYPE-OF-TRANSACTION is redirected to TARIFF
- Benefit (O): the reference from BENEFIT-OBTAINED-BY-EXCHANGE to PARTICIPATION-TRAIL is redirected to PARTICIPATION
- Benefit (P): the reference from TRAIL-PREMIUM/REDUCTION to PARTICIPATION-TRAIL is redirected to SUCCESSOR — a variant of redirection of a reflexive reference; the owner entity is a specialization in the member entity
- Franchiser (F): the reference from FIXED-AMOUNT to FRANCHISE is redirected to OUTLET
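The weak-key restriction described under Applicability can be sketched at the implementation level. This is a hypothetical Python/SQLite illustration with names taken from the Franchiser (R) example; SQLite requires a table rebuild to change a foreign key.

```python
import sqlite3

# TYPE-OF-TRANSACTION is assumed to be a weak entity of TARIFF,
# identified by the composite key (tariff_id, type_no).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tariff (tariff_id INTEGER PRIMARY KEY);
CREATE TABLE type_of_transaction (
    tariff_id INTEGER REFERENCES tariff(tariff_id),
    type_no   INTEGER,
    PRIMARY KEY (tariff_id, type_no)
);
CREATE TABLE transaction_summary (
    summary_id INTEGER PRIMARY KEY,
    tariff_id  INTEGER,
    type_no    INTEGER,
    FOREIGN KEY (tariff_id, type_no)
        REFERENCES type_of_transaction (tariff_id, type_no)
);
INSERT INTO tariff VALUES (1);
INSERT INTO type_of_transaction VALUES (1, 7);
INSERT INTO transaction_summary VALUES (100, 1, 7);
""")

# Redirect: restrict the foreign key to the part that already identified
# the owner (tariff_id); the remainder (type_no) is simply dropped.
con.executescript("""
CREATE TABLE transaction_summary_new (
    summary_id INTEGER PRIMARY KEY,
    tariff_id  INTEGER REFERENCES tariff(tariff_id)
);
INSERT INTO transaction_summary_new
    SELECT summary_id, tariff_id FROM transaction_summary;
DROP TABLE transaction_summary;
ALTER TABLE transaction_summary_new RENAME TO transaction_summary;
""")

rows = con.execute("SELECT * FROM transaction_summary").fetchall()
print(rows)  # [(100, 1)]
```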

B.9 Extend entity intent

This pattern is a clear illustration of extensibility. The semantic change is not very evident in the CS diagram. Rather, the change lies in how a broadened set of real-world objects is captured by the extended entity.

Figure B.9. extend entity intent (before: Exchanged E.R.Benefit; after: Exchange, with Exchanged E.R.Benefit as a specialization)

Need

When the information system is used for new purposes, i.e. when its UoD extends to cover more objects that are similar to the ones previously captured.

Semantics
Definitions can be extended implicitly, and the result will be that more instances are captured. Nothing else is forced to change as well.

Accommodation of change
The overall CS diagram is not affected, because entity intents are not shown in the CS diagrams. References (i.e. foreign keys) must be extended to match the extension of the referred entity, and in doing so may be relaxed from compulsory to optional. Key constraints are generally not affected.

Propagation of change
The change of entity intent needs to be propagated only to the user applications that capture, manipulate and delete instances of the entity. And the change may show up in documentation. As all existing instances comply with the extended definition, there is no inherent need for data conversion. Although the conceptual key-integrity constraint is presumed to be unaffected, it may be discovered that the existing primary key is inadequate to identify instances after extension. It may be unavoidable to change the implemented primary key and adjust recorded data accordingly, which can be an extensive and time-consuming operation.

Applicability
Extension of entity intent is often done implicitly. But often the domains (i.e. constraints on conceptual attribute values) are extended as well. Conceptual attributes of the new instances may take values outside the previous domains.

Considerations
Instances that conform to the extended scope of the entity but not to the old scope can, and often will, be viewed as a new specialization. However, it does not follow that the extension must be materialized as such in the CS.

Observations regarding best practices
The change of intent turns the entity name into a homonym: the same name covers different concepts over time. Renaming the entity may be considered to avoid misunderstandings.

Examples
- Benefit (I): EXCHANGED-E.R.BENEFIT is extended to EXCHANGE — the extended entity is renamed; the old entity name and intent persist as a specialization
- Benefit (A): PARTICIPATION is extended to cover various kinds of E.R.BENEFIT as well
- Settlement (A): RELATIONSHIP is extended to record terminated relationships, insofar as these are subjected to a SETTLEMENT or SEPARATION of pensions
- Franchise (P): TYPE-OF-TRANSACTION is refined so that many more types are discerned than the few dozen accounted for under the old CS — the entity name and its definition (semantics) remain unchanged
- Franchise (Z): TYPE-OF-TRANSACTION is extended with the specialization SPECIAL
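The Applicability remark about extended domains can be illustrated with a small sketch (hypothetical entity name and domain values; Python with SQLite, which requires rebuilding the table to relax a CHECK constraint). All existing rows already satisfy the wider constraint, so no data conversion is needed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE benefit (
    benefit_id INTEGER PRIMARY KEY,
    kind TEXT CHECK (kind IN ('exchange'))   -- old, narrow intent
);
INSERT INTO benefit VALUES (1, 'exchange');
""")

# Extend the intent: a wider domain admits more instances; existing
# rows comply as-is, so they are copied over without conversion.
con.executescript("""
CREATE TABLE benefit_new (
    benefit_id INTEGER PRIMARY KEY,
    kind TEXT CHECK (kind IN ('exchange', 'early_retirement'))
);
INSERT INTO benefit_new SELECT * FROM benefit;
DROP TABLE benefit;
ALTER TABLE benefit_new RENAME TO benefit;
INSERT INTO benefit VALUES (2, 'early_retirement');  -- newly admissible
""")

n = con.execute("SELECT COUNT(*) FROM benefit").fetchone()[0]
print(n)  # 2
```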


B.10 Restrict entity intent

This pattern is the opposite of the previous pattern, which extends entity intent.

Figure B.10. restrict entity intent (before: Operational Unit has-a Operational Unit Period; after: Outlet has-a Outlet Period)

Need

A real-world concept that is modeled by the entity is redefined, either implicitly or explicitly. But it drives CS change only if users are aware of the limitation in scope and intent, and demand that the CS validly model the restricted real-world concept.

Semantics
Intent restriction results in fewer instances being captured, amounting to a certain specialization being no longer recorded by the entity. Restriction need not affect the attributes, although in most cases one or more attributes, often the optional ones, are dropped.

Accommodation of change
The overall CS diagram is not affected, because the exact entity intents are not depicted in the diagrams. Lattice complexity and key, referential, or connection constraints are generally not impacted either. Sometimes an entity name is adjusted to fit the new definition.

Propagation of change
Existing instances are dropped, or even an entire specialization that does not comply with the new definition. And the change of entity intent shows up in user applications, which capture, manipulate and delete fewer entity instances.

Applicability
Key identity is unaffected, and the existing primary key will suffice to identify instances after restriction.

Considerations
Restriction of entity intent is often done implicitly and is not recorded in conceptual documentation. Later, people using the outdated documentation are surprised to learn that certain domain values are no longer used.

Observations regarding best practices
Entity definitions are an important part of conceptual documentation and should be updated when restricting an entity.


Examples
- Sites (G): OPERATIONAL-UNIT is restricted to OUTLET — the associated co-located-with reference is also eliminated from the CS
- Sites (C): CLASS is restricted to record class types only for OPERATIONAL-UNIT
- Sites (F): DEPARTMENTS still exist, but they are no longer recorded in the CS as instances of MANAGERIAL-UNIT
- Benefit (P): TRAIL-PREMIUM/REDUCTION records only instances referring to an instance of PARTICIPATION-TRAIL-AS-OF-1-1-'96 — the restriction is the consequence of the reference being redirected
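A minimal sketch of the accommodation, assuming a marker attribute that distinguishes the retained specialization (names are hypothetical, loosely after the Sites (G) example): instances outside the restricted intent are dropped, and the entity is renamed to fit the new definition.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE operational_unit (
    unit_id INTEGER PRIMARY KEY,
    is_outlet INTEGER   -- marker for the specialization that survives
);
INSERT INTO operational_unit VALUES (1, 1), (2, 0), (3, 1);
""")

# Restrict the intent: rows that violate the new definition are deleted,
# and the entity name is adjusted to the narrower concept.
con.execute("DELETE FROM operational_unit WHERE is_outlet = 0")
con.execute("ALTER TABLE operational_unit RENAME TO outlet")

n = con.execute("SELECT COUNT(*) FROM outlet").fetchone()[0]
print(n)  # 2
```

In practice the marker attribute itself would often be dropped as well, since it no longer discriminates anything.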


B.11 Promote a specialization

This semantic change pattern resembles 'horizontal fragmentation' or 'table divide' on the implementation level. It takes a generalized entity, divides it into two (or perhaps more) independent entities, and relocates data instances to exactly one of the new entities. The former generalization does not become invalid; it just becomes less important in the CS.

Figure B.11. promote a specialization (before: Unit with a managed-by reference and has-a Unit Period; after: Managerial Unit and Operational Unit, each with its own period entity and managed-by reference)

Need
Although the existing CS captures the perceived information structure of the UoD, users have a need for change because the generalized way of modeling does not match their real-world view. A particular concept is perceived to be so much more important than, and different from, the others that it must be made explicit in the CS. Users demand easy access to its data, without being encumbered by the irrelevant information of other instances. Moreover, specific data attributes and/or references are relevant only for the specialization. Users perceive the CS after the change as a better model because it is more explicit and easier to understand.

Semantics
The concept that was previously modeled as a specialization is now recognized as an entity in its own right. The same number of real-world objects is represented in the CS, and the overall semantics of entities, specializations and references is unchanged. Almost the same UoD information structure is captured in a slightly different CS.

Accommodation of change
A series of elementary changes accommodates the user needs:
- define a new entity whose intent is based on, but semantically different from, the generalization;
- move to the new entity all attributes from the generalization that pertain to the promoted specialization. Usually, very few attributes pertain to both the promoted specialization and the restricted generalization, and these will be duplicated;
- restructure the references of the former generalization. They may:
  - pertain only to the promoted entity, so they should be moved there, or
  - pertain only to the remainder of the generalization, and can remain unchanged, or
  - pertain to both entities, in which case two new references are needed;
- a reference between the new specialization and its former generalization may be needed to substitute for the former is-a reference.
The overall size of the CS increases, while lattice complexity increases only if some references must be duplicated for reasons of consistency.

Propagation of change
All data instances are relocated either to the promoted specialization or to the restricted generalization. As the number of real-world objects is unchanged, the total number of instances remains the same. Only in exceptional cases will one instance of the former generalization devolve into instances of both entities. The new entity receives a new primary key. The primary key inherited from the generalization may be retained for compatibility reasons. The software routines that accessed the former entity have to be recompiled, summary reports on the generalization redefined, screen layouts adjusted, etc.

Applicability
The important advantage of promoting is the easy access to the instances of the former specialization. The generalization may completely disappear from the CS if this pattern is applied repeatedly.

Considerations
Before the change, an instance could be a member of the specialization at one time, and not at another. However, if specialization membership is permanent, then the generalization holds no great advantage for users.

Observations regarding best practices
If one particular and permanent specialization dominates all the others, then it is probably better not to create a single generalization.

Examples
- Sites (B): the specialization OPERATIONAL-UNIT within UNIT promotes to full entity — the former generalization is renamed to MANAGERIAL-UNIT; notice how the managed-by reference is duplicated
- Benefit (Q): a subset of the specialization SUCCESSOR within PARTICIPATION-TRAIL promotes to full entity — the promoted entity was renamed for clarity
- Settlement (E): the specialization SEPARATED-BENEFIT within BENEFIT promotes to a full entity called BENEFIT-DEDUCTION — a complicated variant, as the former specialization SEPARATED-BENEFIT is retained
- Sites (J): AREA and ZONE with a hierarchic managed-by reference replace the generalization MANAGERIAL-UNIT with a reflexive managed-by reference — the former generalization is eliminated at the same time; the combined effect equals entity restriction
- Franchise (A): ABSENCE promotes to full entity
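The relocation of instances can be sketched as a horizontal fragmentation (hypothetical Python/SQLite, names loosely after the Sites (B) example; the marker column stands in for specialization membership). Each row of the generalization goes to exactly one of the new entities, so the total number of instances is unchanged.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE unit (
    unit_id INTEGER PRIMARY KEY,
    is_operational INTEGER   -- 1 marks the specialization to promote
);
INSERT INTO unit VALUES (1, 1), (2, 0), (3, 1);
""")

# Promote: divide the generalization into two independent entities and
# relocate every instance to exactly one of them.
con.executescript("""
CREATE TABLE operational_unit (unit_id INTEGER PRIMARY KEY);
CREATE TABLE managerial_unit  (unit_id INTEGER PRIMARY KEY);
INSERT INTO operational_unit SELECT unit_id FROM unit WHERE is_operational = 1;
INSERT INTO managerial_unit  SELECT unit_id FROM unit WHERE is_operational = 0;
DROP TABLE unit;
""")

n_op  = con.execute("SELECT COUNT(*) FROM operational_unit").fetchone()[0]
n_mgr = con.execute("SELECT COUNT(*) FROM managerial_unit").fetchone()[0]
print(n_op, n_mgr)  # 2 1
```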


B.12 Relax a reference

This semantic change pattern is not very frequent. The few examples that we observed in our cases are all associated with minor CS design flaws that emerged in operation.

Figure B.12. relax a reference (before and after diagrams with entities Franchise and Franchise Contract)

Need

Instances of a member entity need to be recorded that do not adhere to the references recorded in the current CS. Apparently, relationships of real-world objects are captured inadequately by the CS.

Semantics
Instances of the member entity need to be recorded that violate the restrictions imposed by a particular reference.

Accommodation of change
The change is accomplished in a single elementary change. Either a 1:1 reference is relaxed to N:1 cardinality, or an N:1 reference is relaxed to optional. Size and lattice complexity of the CS remain unaffected.

Propagation of change
Relaxing a reference cardinality does not affect the stored data, although it does affect the software applications. All data access routines must be adjusted to be capable of handling multiple referenced instances instead of only the single one recorded under the 1:1 reference. If the reference is relaxed to optional, the data access routines must be capable of handling 'none' values for the reference.

Applicability
Relaxing a reference to optional creates a key integrity problem if the dependent entity used a weak-entity key. A work-around is to insert a dummy instance in the owner entity. This, however, may cause unexpected side effects in other applications that were not designed to cope with dummy instances.

Observations regarding best practices
Question the long-term validity of the cardinality of every reference. If a 1:1 reference might ever relax to N:1, prepare all data access routines for N:1 cardinality and use a software routine to implement the business rule. If an N:1 reference might ever relax to optional, do not use a weak-entity key.

Examples
- Franchise (U): the FRANCHISE-CONTRACT↑FRANCHISE reference is relaxed from compulsory to optional
- Benefit (H): the 1:1 cardinality of the BENEFIT↑PARTICIPATION reference is relaxed to N:1
- Franchise (4): the INSTALLMENT↑CLAIM reference is relaxed to be able to record FINALIZED installments that are paid without being claimed
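The second variant, relaxing an N:1 reference to optional, can be sketched as follows (hypothetical Python/SQLite, names after the Franchise (U) example; SQLite requires a table rebuild to drop a NOT NULL constraint). Existing rows are unaffected, but as noted above, data access routines must now handle 'none' values for the reference.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE franchise (franchise_id INTEGER PRIMARY KEY);
CREATE TABLE franchise_contract (
    contract_id  INTEGER PRIMARY KEY,
    franchise_id INTEGER NOT NULL REFERENCES franchise(franchise_id)
);
INSERT INTO franchise VALUES (1);
INSERT INTO franchise_contract VALUES (10, 1);
""")

# Relax the reference to optional: rebuild without NOT NULL.
con.executescript("""
CREATE TABLE franchise_contract_new (
    contract_id  INTEGER PRIMARY KEY,
    franchise_id INTEGER REFERENCES franchise(franchise_id)  -- now optional
);
INSERT INTO franchise_contract_new SELECT * FROM franchise_contract;
DROP TABLE franchise_contract;
ALTER TABLE franchise_contract_new RENAME TO franchise_contract;
INSERT INTO franchise_contract (contract_id) VALUES (11);  -- no owner: admissible
""")

rows = con.execute("SELECT contract_id, franchise_id FROM franchise_contract "
                   "ORDER BY contract_id").fetchall()
print(rows)  # [(10, 1), (11, None)]
```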


References

253

REFERENCES Abiteboul S., Hull R. IFO: a formal Semantic Database Model ACM transactions on Database Systems vol 12 no 4 [1987 12] 525-565 Ades Y. Semantic Normal Form : Compliance [1998 10 12] Adriaans W. Winning support for your Information Strategy Long Range Planning vol 26 no 1 [1993] 45-53 Agarwal S., Keller A.M., Wiederhold G., Saraswat K. Flexible relation: an approach for integrating data from multiple, possibly inconsistent databases Proc.ICDE'95 11th Int. Conf. Data Engineering, IEEE [1995] 495-504 Ahrens J.D., Prywes N.S. Transition to a Legacy- and Re-use based Software Life Cycle IEEE Computer [1995 10] 27-36 Aiken P., Muntz A., Richards R. Reverse Engineering Data Requirements: DOD Department Of Defense legacy systems Communications of the ACM vol 37 no 5 [1994 05] 26-41 Aiken P., Girling B. Reverse Data Engineering into a distributed environment: metadata analysis Information Systems Management [1998summer] 47-55 Aiken P., Yoon Y., Leong-Hong B. Requirements-driven data engineering: case study Information & Management vol 35 no 3 [1999 03] 155-170 Akoka J., Comyn-Wattiau I. Entity-Relationship and O-O model automatic clustering Data & Knowledge Engineering vol 20 no 2 [1996] 87-117 Albert J. Theoretical Foundations of Schema Restructuring in Heterogeneous Multidatabase Systems CIKM'00 9th Int. Conf. on Information & Knowledge Management, ACM Press [2000 11] 461-470 Allen B.R., Boynton A.C. Information Architecture: in search of efficient flexibility MIS Quarterly [1991 12] 435-444 Alstyne M. van, Brynjolfsson E., Madnick S. Why not one big database ? Principles for data ownership Decision Support Systems vol 15 [1995] 267-284 Al-Jihar L., Léonard M. Transposed Storage of an Object Database to Reduce the Cost of Schema Changes ECDM'99 Advances in Conceptual Modeling LNCS 1727, Springer Verlag [1999 11] 86-97 Amikam A. On the automatic generation of internal schemata Information Systems vol 10 no 1 [1985] 37-45 Andany J., Léonard M., Palisser C. 
Management of Schema Evolution in Databases VLDB'91 Very Large Data Bases, Morgan Kaufmann, San Fransisco [1991 09] 161170 Andersson M. Extracting an Entity Relationship Schema from a Relational Database through Reverse Engineering ER'94 Entity-Relationship Approach 1994 LNCS 881, Springer Verlag [1994 12] 403-419 ANSI/X3/sparc, Special Interest Group on Management of Data Study Group on Data Base Management Systems: Interim Report ACM-SIGMOD Newsletter vol 7 no 2 [1975 02 08] ANSI/X3/sparc, Special Interest Group on Management of Data Study Group on Data Base Management Systems: Framework Report on Data Base Management Systems AFIPS conference proceedings, AFIPS Press, Montvale New Jersey [1978] Ariav G. Temporally oriented data definitions: managing schema evolution in temporally oriented databases Data & Knowledge Engineering vol 6 no 6 [1991] 451-467

254

Exploring Conceptual Schema Evolution

Assenova P., Johannesson P. Improving Quality in Conceptual Modelling by the use of Schema Transformation ER'96 Entity-Relationship Approach 1996 LNCS 1157, Springer Verlag [1996] 277-291 Atkins C. Prescription or Description: Some observations on the Conceptual Modelling Process ICSE'96 18th Int. Conf. on Software Engineering, IEEE [1996] 34-41 Atkinson M., De Witt D., Maier D., Bancilhon F., Dittrich K.R. Object-Oriented Database System Manifesto DOOD'90 1st Int. Conf. Deductive & OO Databases [1990] Atzeni P., Ceri S., Paraboschi S., Torlone R. Database systems: concepts, languages and architectures [1999] Bakker J.A. Semantic approach to enforce correctness of data distribution schemes Technical Report TU Delft 93-63 [1993] Banerjee J., Kim W., Kim H.-J., Korth H.F. Semantics and Implementation of Schema Evolution in O-O databases Communications of the ACM vol 30 no 3 [1987 05] 311-323 Banker R.D., Slaughter S. Field Study of Scale Economies in Software Maintenance Management Science vol 43 no 12 [1997 12] 1709-1725 Banker R.D., Davis G.B., Slaughter S. Software development practices, Software complexity, and software maintenance performance: a Field Study Management Science vol 44 no 4 [1998] 433-450 Barua A., Ravindran S. Reengineering information sharing behaviour in organizations J. of Information Technology vol 11 [1996] 261-272 Basili V.R., Briand L.C., Melo W.L. A Validation of Object-Oriented Metrics as Quality Indicators IEEE Transactions on Software Engineering vol 22 no 10 [1996 10] 751-761 Batini C.W., Lenzerini M., Navathe S. A comparative analysis of methodologies for database schema integration ACM computing surveys vol 18 no 4 [1986 12] 323-364 Batini C.W., Di Battista G. A Methodology for Conceptual Documentation and Maintenance Information Systems vol 13 no 3 [1988] 297-318 Batini C.W., Ceri S., Navathe S.B. 
Conceptual Database Design: An Entity-Relationship Approach Benjamin/Cummings Publishing Company CA [1992] Batra D., Hoffer J.A., Bostrom R.P. Comparing representations with Relational and EER models Communications of the ACM vol 33 no 2 [1990 02] 126-139 Batra D., Davis J.G. Conceptual data modelling in database design: similarities and differences between expert and novice designers Int. J. Man-Machine Studies vol 37 [1992] 83-101 Batra D. Framework for studying human error behaviour in conceptual database modeling Information & Management vol 25 [1993] 121-131 Batra D., Zanakis S.H. A Conceptual Database Design approach based on rules and heuristics European J. of Information Systems (O.R.) vol 3 no 3 [1994] 228-239 Becker J., Rosemann M., Schuette R. Grundsaetze ordnungsmaessiger Modellierung Wirtschaftsinformatik vol 37 no 5 [1995] 435-445 (in german) Belady L.A., Lehman M.M. A model of large program development IBM Systems Journal vol 15 no 3 [1976] 225-252 Bernstein P.A. Panel: Is Generic Metadata Management Feasible VLDB'00 Very Large Data Bases Cairo Egypt, Morgan Kaufman [2000 09] 660-662 Bettini C., Dyreson C.E., Evans W.S., Snodgrass R.T., Sean Wang X. A Glossary of Time Concepts Temporal Databases: Research & Practices LNCS 1399, Springer Verlag [1998 02] 406-413

References

255

Bézevin J. New Trends in Model Engineering Proc. IRMA'00 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [2000 05] 1185-1187 Bird L., Goodchild A., Halpin T. Object Role Modelling and XML-Schema ER'00 EntityRelationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 309-322 Bisbal J., Lawless D., Wu B., Grimson J. Legacy Information Systems: issues and directions IEEE Software vol 16 no 5 [1999] 103-111 Blaha M., Premerlani W. Object-Oriented Modeling and Design for Database applications Prentice Hall, Upper Saddle River NJ [1998] Blum B. Taxonomy of Software Development Methods Communications of the ACM vol 37 no 11 [1994 11] 82-94 Bommel P. van Database Optimization: An Evolutionary Approach thesis, Catholic University of Nijmegen, the Netherlands [1995 10] Bonjour M., Falquet G. Concept bases: a support to information systems integration CAiSE'94 Advanced Information Systems Engineering LNCS 811, G.Wijers, S.Brinkkemper, T.Wasserman (eds) [1994 06] 242-255 Boogaard M. Defusing the Software Crisis: Information Systems Flexibility through Data Independence thesis, Tinbergen Research Institute&Thesis Publishers, the Netherlands [1994 06] Borgida A. Knowledge representation, Semantic Modeling: Similarities and differences ER'91 Entity-Relationship Approach 1991, Elsevier Science Publ. North-Holland [1991] 1-24 Brancheau J.C., Schuster L., March S.T. Building and implementing an Information Architecture Database [1989summer] 9-17 Brèche P. Views for Information System Design without Reorganization CAiSE'96 Advanced Information Systems Engineering LNCS 1080, Springer Verlag [1996] 496-513 Briand L.C., Bunse C., Daly J.W. A Controlled Experiment for Evaluating Quality Guidelines on the Maintainability of Object-Oriented Designs Technical Report Fraunhofer FhG IESE ISERN-99-07.pdf, Fraunhofer Institute IESE, Kaiserslautern (Dld) [1999] 43 Broek A. van de, Walstra M., Westein H. Gegevensconversie: sluitpost of basis van succes ? 
Informatie Management [1994 10] 11-17 (in dutch) Buelow R. The folklore of Normalization Journal of Database Management vol 11 no 3 [2000summer] 37-41 Burd E., Bradley S., Davey J. Studying the Process of Software Change: an analysis of software change Proc.WCRE'00 7th Work.Conf.on Reverse Engineering, IEEE [2000] 232-239 Calero C., Piattini M., Genero M. Applying Software Metrics to Databases Proc.12th Int. Conf. Intersymp2000, Int. Institute for Advanced Studies Systems Research & Cybernetics [2000 08] 93-100 Campbell L.J., Halpin T.A., Proper H.A. Conceptual schemas with abstractions: making flat conceptual schemas more comprehensible Data & Knowledge Engineering vol 20 [1996] 39-85 Cartwright M., Shepperd M. An empirical investigation of an object-oriented software system IEEE Transactions on Software Engineering vol 26 no 8 [2000] 786-796

256

Exploring Conceptual Schema Evolution

Caruso F., Cochinwala M., Ganapathy U., Lalk G., Missier P. Telcordia's Database Reconciliation and Data Quality Analysis Tool VLDB'00 Very Large Data Bases Cairo Egypt, Morgan Kaufman [2000 09] 615-618 Casanova M.A., Tucherman L., Laender A.H.F. On design and maintenance of optimized relational representations of Entity-Relationship schemas Data & Knowledge Engineering vol 11 [1993] 1-20 Castano S., de Antonellis V., Zonta B. Classifying and Reusing Conceptual Schemas ER'92 Entity-Relationship Approach 1992 LNCS 645, Springer Verlag [1992 10] 121-138 Castano S., de Antonellis V., Fugini M.G., Pernici B. Conceptual Schema Analysis: techniques and applications ACM transactions on Database Systems vol 23 no 3 [1998 09] 286-333 Cellary W., Jomier G. Consistency of Versions in Object-Oriented Databases in: Building an O-O Database System: the story of O2 Morgan Kaufmann Publishing, San Mateo CA [1992] 447-462 Chan Y.E., Huff S.L., Copeland D.G. Assessing realized information systems strategy J. of Strategic Information Systems vol 6 no 4 [1997 12] 273-298 Chen P.P.-S., Li M.-R. The lattice structure of entity sets ER'86 Entity-Relationship Approach 1986, Elsevier Science Publ. North-Holland [1986] 217-229 Cheung W., Hsu C. Model-assisted Global Query System for Multiple databases in distributed environments ACM transactions on Information Systems vol 14 no 4 [1996 10] 421-470 Chiang R.H.L., Barron T.M., Storey V.C. Reverse Engineering of relational databases: extraction of an EER model from a relational database Data & Knowledge Engineering vol 12 no 2 [1994] 107-142 Chiang R.H.L., Barron T.M., Storey V.C. Framework for the design and evaluation of Reverse Engineering methods for relational databases Data & Knowledge Engineering vol 21 [1997] 57-77 Chiang R.H.L., Lim, E.-P., Storey V.C. 
A Framework for acquiring domain semantics and knowledge for database integration Database for Advances in Information Systems vol 31 no 2 [2000spring] 46-64 Chidamber S.R., Kemerer C.F. A metrics suite for Object-Oriented Design IEEE Transactions on Software Engineering vol 20 no 6 [1994 06] 476-493 Chikofsky E.J., Cross II J.H. Reverse Engineering and design recovery: a taxonomy IEEE Software [1990 01] 13-17 Chudziak J.A., Rybinsky H., Vorbach J. Towards a Unifying Logic Formalism for Semantic Data Models ER'93 Entity-Relationship Approach 1993 LNCS 823, Springer Verlag [1993 12] 492-507 Claypool K.T., Jin J., Rundensteiner E.A. SERF: Schema evolution through an Extensible, Reusable and Flexible Framework CIKM'98 7th Int. Conf. on Information & Knowledge Management [1998 11] 314-321 Claypool K.T., Rundensteiner E.A., Heinemann G.T. ROVER: A framework for the evolution of relationships ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 409-422 Combi C., Montanari A. Data Models with Multiple Temporal Dimensions completing the picture CAiSE'01 Advanced Information Systems Engineering LNCS 2068, Springer Verlag [2001] 187-202

References

257

Coplien J.O. Idioms and Patterns as Architectural Literature IEEE Software vol 14 no 1 [1997 01] 36-42 Crowe T.J. Integration is not synonymous with flexibility Int. J. of Operations & Production Management vol 12 no 10 [1992] 26-33 Da Silva A.S., Laender A.H.F., Casanova M.A. On Relational representation of Complex Specialization Structures Information Systems vol 25 no 6 [2000] 399-415 Darwin C. the Origin of Species. Selections, abridged and introduced by Richard E.Leakey Faber & Faber Ltd, London [1859] Date C.J. An introduction to Database Systems, 7th edition, Addison-Wesley Longman Inc. [2000] Davis K.H., Arora A.K. Converting a relational database model into an ER-model ER'87 Entity-Relationship Approach 1987, Elsevier Science Publ. North-Holland [1988] 271-285 Davis K.H. Need for 'flexibility' in a Conceptual Model Information & Management vol 18 [1990] 231-241 Davis K.H. Combining a Flexible Data Model and Phase Schema Translation in Data Model Reverse engineering Proc.WCRE'96 3rd Working Conf.on Reverse Engineering [1996] 141-151 de Brock E.O. Foundations of Semantic Databases Prentice Hall NY, London [1995] de Brock E.O. A generic treatment of Dynamic Integrity Constraints Data & Knowledge Engineering vol 32 no 3 [2000] 223-246 De Castro C., Grandi F., Scalas M.R. Schema versioning for multitemporal relational databases Information Systems vol 22 no 5 [1997] 249-290 De Troyer O. On Data Schema transformations thesis, Catholic University of Brabant, the Netherlands [1993 03 26] Delcambre L., Langston J. Reusing (shrink wrap) schemas by modifying Concept Schemas Proc.ICDE'96 12th Int. Conf. Data Engineering, IEEE [1996] 326-333 Delen G.P.A.J., Looijen M. Beheer van Informatievoorziening Cap Gemini Publishing [1992] (in dutch) DeLone W.H., McLean E.R. Information System Success: the Quest for the Dependent Variable Information Systems Research vol 3 no 1 [1992 03] 60-95 Deursen A. van, Kuipers T. 
Identifying Objects using Cluster and Concept Analysis ICSE'99 Int. Conf. Software Engineering [1999] Di Battista G., Kangassalo H., Tamassia R. Definition Libraries for Conceptual Modelling ER'89 Entity-Relationship Approach 1989, Elsevier Science Publ. North-Holland [1989 10] 251-267 Dittrich K.R., Gotthard W., Lockemann P.C. Complex entities for Engineering Applications ER'86 Entity-Relationship Approach 1986, Elsevier Science Publ. North-Holland [1986] 421-440 Dvorak J. Conceptual Entropy and its Effect on Class Hierarchies Computer [1994] 59-63 Earl M.J. Experiences in Strategic Management Systems Planning MIS Quarterly [1993 03] 1-24 Ebels E.J., Stegwee R.A. A Multiple Methodology Approach towards Information Architecture Specification Proc. IRMA'92 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [1992 05] 186-193

258

Exploring Conceptual Schema Evolution

Effelsberg W., Mannino M. Attribute equivalence in global schema design for heterogeneous distributed databases Information Systems vol 9 no 3/4, Pergamon [1984] 237-240
Ehrensberger M. Data dictionary: more on the impossible dream AFIPS conference proceedings [1977 06] 9-11
Ekering C., Buitenhuis P., Croon F. Pakketten implementeren met referentiemodel: Modelleren van bedrijfsprocessen Tijdschrift voor Inkoop en Logistiek vol 14 no 4 [1998 04] 60-63 (in dutch)
El Emam K., Höltje D., Madhavji N.H. Causal Analysis of Requirements Change Process for a Large System Technical Report Fraunhofer FhG IESE ISERN-97-23.pdf, Fraunhofer Institute IESE, Kaiserslautern (Dld) [1997] 9
Elmasri R., Navathe S.B. Fundamentals of Database Systems: third edition Addison-Wesley Longman Inc. [2000]
Englebert V., Hainaut J.-L. DB-MAIN: a next-generation meta-CASE Information Systems vol 24 no 2 [1999 04] 99-112
Etzkorn L.H., Davis C.G., Li W. A Practical look at the lack of cohesion in methods metric JOOP J. of OO Programming vol 11 no 5 [1998 09] 27-34
Ewald C.A., Orlowska M.E. A Procedural Approach to Schema Evolution CAiSE'93 Advanced Information Systems Engineering LNCS 685, Springer Verlag [1993 06] 22-38
Ewald C.A. Foundations of Conceptual Schema Evolution thesis, University of Queensland, Australia [1995]
Feldman P., Miller D. Entity Model clustering: structuring a data model by abstraction The Computer Journal vol 29 no 4 [1986] 348-360
Fernandez E.B., Yuan X. Semantic Analysis Patterns ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 183-195
Ferrandina F., Meyer T., Zicari R., Ferran G., Madec J. Schema and database evolution in the O2 object database system VLDB'95 Very Large Data Bases [1995 09] 170-181
Feurer R., Chaharbaghi K., Weber M., Wargin J. Aligning strategies, processes, and IT: a case study Information Systems Management vol 17 no 1 [2000 01] 23-34
Filteau M.C., Kassicieh S.K., Tripp R.S. Evolutionary Database Design and development Information & Management vol 15 [1988] 203-212
Fitzgerald G. Achieving Flexible Information Systems: the case for improved analysis J. of Information Technology vol 5 [1990] 5-11
Fong J. Converting Relational to Object-Oriented databases SIGMOD record vol 26 no 1 [1997 03] 53-58
Fonkam M.M., Gray W.A. An approach to eliciting the semantics of Relational Databases CAiSE'92 Advanced Information Systems Engineering, Springer Verlag [1992 05] 463-480
Foppema B. Informix Schema Evolution Tool: user's guide Technical Report TU Delft [1996 09] 1-39
Fox C.J., Frakes W.B. Sixteen questions about software reuse Communications of the ACM vol 38 no 6 [1995 06] 75-112
Franconi E., Mandreoli F. Schema Evolution and Versioning: a logical and computational characterization Proc.'00 Int. Conf. Database Schema Evolution & MetaModeling LNCS 2065, Springer Verlag [2000 09] 85-99
Furukawa M. Conceptual model for MIS: Flexibility Evaluation Proc. IRMA'01 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [2001 05 20] 831-839


Gal A., Etzion O. Handling Change Management using Temporal Active Repositories ER'95 Entity-Relationship Approach 1995 LNCS 1021, Springer Verlag [1995 12] 378-387
Galliers R.D. Towards a Flexible Information Architecture: Integrating Business Strategies J. of Information Systems vol 3 no 3 [1993] 199-213
Gamma E., Helm R., Johnson R., Vlissides J. Design Patterns: Elements of Reusable Object-Oriented Software Addison-Wesley [1995]
Gangopadhyay D., Barsalou T. On the Semantic Equivalence of Heterogeneous Representations in Multimodel Multidatabase Systems SIGMOD record vol 20 no 4 [1991 12] 35-39
Genero M., Jimenez L., Piattini M. Measuring the quality of Entity Relationship Diagrams ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 513-526
Genero M., Olivas J., Piattini M., Romero F. Using metrics to predict OO information systems maintainability CAiSE'01 Advanced Information Systems Engineering LNCS 2068, Springer Verlag [2001] 388-401
Genero M., Jimenez L., Piattini M. A Prediction Model for OO Information System Quality based on Early Indicators ADBIS'01 Advances in Databases & Information Systems [2001 10] 211-224
Gerard P. Data modeling: a different point of view Datamation vol 56 [1994 04] 7-9
Giacomazzi F., Panella C., Pernici B., Sansoni M. A normative model for the managing of Information systems integration in mergers and acquisitions Information & Management vol 32 [1997] 289-302
Gill G.K., Kemerer C.F. Cyclomatic complexity density and software maintenance productivity IEEE Transactions on Software Engineering vol 17 no 12 [1991] 1284-1288
Gomaa H. Reusable software requirements and architectures for families of systems J. Systems Software vol 28 [1995] 189-202
Goodhue D.L., Kirsch L.J., Quillard J.A., Wybo M.D. Strategic Data planning: Lessons from the field MIS Quarterly [1992] 11-34
Goodhue D.L., Wybo M.D., Kirsch L.J. The Impact of Data integration on the cost and benefit of information systems MIS Quarterly [1992 09] 293-311
Goralwalla I.A., Szafron D., Tamer Özsu M., Peters R.J. A temporal approach to managing schema evolution in object database systems Data & Knowledge Engineering vol 28 no 1 [1998] 73-105
Gray W.A., Wikramanayake G.N., Fiddian N.J. Assisting Legacy Database Migration IEEE Proc. Colloquium Legacy Information Systems [1994 05 01]
Gruber T.R. A translation approach to portable ontologies Knowledge Acquisition vol 5 no 2 [1993] 199-220
Gruhn V., Pahl C., Wever M. Data Model Evolution as a basis of Business Process Management ER'95 Entity-Relationship Approach 1995 LNCS 1021, Springer Verlag [1995 12] 270-281
Hainaut J.-L., Englebert V., Henrard J., Hick J.-M., Roland D. Program understanding in Databases Reverse Engineering DEXA'98 Database and Expert Systems Applications LNCS 1460, Springer Verlag [1998 08] 70-79
Halassy B. Normal Forms and Normalization: Practical Designer's View Information & Software Technology vol 33 no 6 [1991 07/08] 451-461
Halpin T. Fact-oriented approach to schema transformation [1991] 342-356


Hammer M., McLeod D. Database description with SDM: a semantic database model ACM transactions on Database Systems vol 6 no 3 [1981 09] 351-386
Han T.-D., Purao S., Storey V.C. A methodology for building a repository of O-O Design Fragments ER'99 Entity-Relationship Approach 1999 LNCS 1728, Springer Verlag [1999 11] 203-218
Hars A. Natural Language-enabled Data Modeling: Improving Validation and Integration Journal of Database Management vol 9 no 2 [1998 spring] 17-25
Hartmann S. On interactions of Cardinality Constraints, Key, and Functional Dependencies Proc. FoIKS 1st Int. Symp. on Foundations of Information & Knowledge Systems LNCS 1762, Springer Verlag [2000 02] 136-155
Hartmann S. Decomposition by Pivoting and Path Cardinality Constraints ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 126-139
Hay D.C. Data model patterns: Conventions of Thought Dorset House Publ. NY [1995]
Henderson-Sellers B. Object-Oriented Metrics: measures of complexity Prentice-Hall, Upper Saddle River NJ [1996]
Hitchman S. Practitioner perceptions on the use of some concepts in the ER model European J. of Information Systems (O.R.) vol 4 [1995] 31-40
Hitchman S. Object-Oriented Modelling in Practice: Class Model Perceptions in the ERM context ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 397-408
Hsu C. Enterprise integration and modeling: the metadatabase approach Kluwer Academic Publishers [1996]
Hull R., King R. Semantic Database Modeling: survey, applications, and research issues ACM computing surveys vol 19 no 3 [1987 09] 201-260
Hull R. Managing Semantic Heterogeneity in Databases: A Theoretical Perspective Proc. 16th Symp. on Principles of Database Systems, ACM Special Interest Groups SIGACT-SIGMOD-SIGART [1997 05 12] 51-61
IBM Team-Connection Information Model (Meta model) Enterprise specific, modifiable Meta Model-Library from the internet [1999] 1-6
Jaegers H.P.M. Beginselen van flexibiliteit Management & Informatie vol 5 no 3 [1997] 16-25 (in dutch)
Jaeschke P., Oberweis A., Stucky W. Extending ER Model Clustering by Relationship Clustering ER'93 Entity-Relationship Approach 1993 LNCS 823, Springer Verlag [1993 12] 451-462
Jajodia S., Ng P.A., Springsteel F.N. Problem of equivalence for Entity-Relationship diagrams IEEE Transactions on Software Engineering vol 9 no 5 [1983 09] 617-630
Jarke M., Jeusfeld M., Quix C., Vassiliadis P. Architecture and Quality in Data Warehouses: an Extended Repository Approach Information Systems vol 24 no 3, Elsevier Science [1999] 229-253
Jarvenpaa S.L., Ives B. Organizational fit and flexibility: IT design principles for a globally competing firm Research in Strategic Management and IT vol 1 [1994] 1-39
Jensen C.S., Snodgrass R.T. Semantics of Time-Varying Information Information Systems vol 21 no 4 [1996] 311-352
Jensen O.G., Böhlen M.H. Evolving Relations Proc.'00 Int. Conf. Database Schema Evolution & MetaModeling LNCS 2065, Springer Verlag [2000 09] 115-132


Jianhua Zhu, Nassif R., Pankaj G., Drew P., Askelid B. Incorporating a model hierarchy into the ER paradigm ER'91 Entity-Relationship Approach 1991, Elsevier Science Publ. North-Holland [1991] 75-88
Johannesson P. Schema transformations as an aid in View Integration CAiSE'93 Advanced Information Systems Engineering LNCS 685, Springer Verlag [1993 06] 71-92
Jones C. Software Metrics: good, bad, and missing Computer vol 27 no 9 [1994 09] 98-100
Jones L.M. Defining systems boundaries in practice: some proposals and guidelines J. of Applied Systems Analysis vol 9 [1982] 41-55
Jones M.C., Rundensteiner E.A. An Object Model and Algebra for the Implicit Unfolding of Hierarchical Structures from the internet [1999] 1-19
Jordan E., Tricker B. Information Strategy alignment with organization structure J. of Strategic Information Systems vol 4 [1995 04] 357-382
Josifovski V., Risch T. Integrating Heterogeneous Overlapping Databases through Object-Oriented Transformations VLDB'99 Very Large Data Bases, Edinburgh Scotland, Morgan Kaufmann [1999 09] 435-446
Kahn H.J., Filer N.P. Supporting the Maintenance and Evolution of Information Models Proc. IRMA'00 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [2000 05] 888-890
Kallio J., Saarinen T., Salo S., Tinnalä M., Vepsäläinen A.P.J. Drivers and tracers of Business Process Changes J. of Strategic Information Systems vol 8 no 2 [1999 02] 125-142
Kalman K. Implementation and critique of an algorithm which maps a Relational Database to a Conceptual Model Advanced Information Systems Engineering [1991] 393-415
Kalus C., Dadam P. Flexible Relations: Operational Support of Variant Relational Structures VLDB'95 Very Large Data Bases [1995 09] 539-550
Kaplan R., Norton D. The Balanced Scorecard: Measures that Drive Performance Harvard Business Review vol 70 no 1 [1992 01/02] 71-79
Kappel G., Kapsammer E., Rausch-Schott S., Retschitzegger W. X-Ray: towards integrating XML and Relational Database Systems ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 339-353
Karahasanovic A., Sjøberg D. Supporting database schema evolution by impact analysis Proc. NIK'99 Norsk Informatikk Konferanse, TAPIR [1999 11 16] 303-314
Karahasanovic A. SEMT, Schema Evolution Management Tool: A tool for Finding Impacts of Schema Changes Proc. NWPER'00 Nordic Workshop on Programming Environment Research Rep. 283, University of Oslo [2000 05 28] 60-75
Kelly S., Smolander K. Evolution and issues in MetaCASE Information & Software Technology vol 38 [1996] 261-266
Kemerer C.F., Slaughter S. An empirical approach to studying software evolution IEEE Transactions on Software Engineering vol 25 no 4 [1999 07/08] 493-509
Kent W. Limitations of record-oriented information models ACM transactions on Database Systems vol 4 [1979 03] 107-131
Kent W. The Many Forms of a Single Fact, or: have you heard the one about the travelling salesman's data? IEEE Proceedings COMPCON 27-2-89 / 3-3-89 [1989 03] 1-14
Kent W. Solving domain mismatch and schema mismatch problems with an object-oriented database programming language VLDB'91 Very Large Data Bases, Morgan Kaufmann, San Francisco [1991 09] 147-160


Kent W., Ahmed R., Albert J., Ketabchi M., Ming-Chien Shan Object Identification in Multidatabases Proc. IFIP WG2.6 DS-5 1992 16-11 nov [1993] 313-330
Kesh S. Evaluating the quality of Entity Relationship models Information & Software Technology vol 37 [1995] 681-689
Kim Y.-G., Everest G.C. Building an IS architecture: Collective wisdom from the field Information & Management vol 26 [1994] 1-11
Kim Y.-G., March S.T. Comparing data modeling formalisms Communications of the ACM vol 38 no 6 [1995 06] 103-112
King P.J.H. The Database Design Process ER'86 Entity-Relationship Approach 1986, Elsevier Science Publ. North-Holland [1986] 475-489
King W.R., Thompson T.S.H. Key dimensions of facilitators and inhibitors for the strategic use of Information Technology J. of Management Information Systems vol 12 no 4 [1996 spring] 35-53
Kitchenham B., Pfleeger S., Fenton N. Towards a framework for software measurement validation IEEE Transactions on Software Engineering vol 21 no 12 [1995] 929-943
Knapp J. Uniqueness Conditions for ER Representations ER'95 Entity-Relationship Approach 1995 LNCS 1021, Springer Verlag [1995 12] 296-307
Koch C., Kovacs Z., LeGoff J.M., McClatchley R., Petta P., Solomonides T. Explicit Modeling of the Semantics of Large Multi-Layered Object-Oriented Databases ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 52-65
Kotter J.P. Leading Change: why transformation efforts fail Harvard Business Review vol 73 [1995 03/04] 59-67
Kramer D. XML Evolution Management thesis, Worcester Polytechnic Institute, USA [2001 05]
Krogstie J., Lindland O.I., Sindre G. Towards a deeper understanding of Quality in Requirements Engineering CAiSE'95 Advanced Information Systems Engineering LNCS [1995 06] 82-95
Kudrass T., Lehmbach M., Buchmann A. Tool-based re-engineering of a legacy MIS: an Experience Report CAiSE'96 Advanced Information Systems Engineering LNCS 1080, Springer Verlag [1996] 116-135
Kwon O.B., Park S.J. RMT: a modeling support system for model reuse Decision Support Systems vol 16 [1996] 131-153
Laender A.H.F., Flynn D.J. A Semantic Comparison of the Modeling Capabilities of the ER and NIAM models ER'93 Entity-Relationship Approach 1993 LNCS 823, Springer Verlag [1993 12] 242-257
Lammari N. An algorithm to extract IS-A inheritance hierarchies from a relational database ER'99 Entity-Relationship Approach 1999 LNCS 1728, Springer Verlag [1999 11] 219-232
Land F. Adapting to changing user requirements Information & Management vol 5 [1982] 59-75
Lautemann S.-E. A propagation mechanism for populated schema versions Proc. ICDE'97 13th Int. Conf. Data Engineering, IEEE [1997 04] 67-78
Lederer A., Salmela H. Towards a theory of strategic information systems planning J. of Strategic Information Systems vol 5 no 3 [1996 09] 237-253
Lee A.S. A scientific methodology for MIS case studies MIS Quarterly [1989 03]


Lee D., Chu W. Constraint-preserving transformation: from XML Document Type Definition to Relational Schema ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 323-338
Lee H., Kim T., Kim J. A Metadata Oriented Architecture for building Datawarehouses Journal of Database Management vol 12 no 4 [2001 fall] 15-25
Lehman M.M., Belady L.A. Programs, Life Cycles and Laws of Software Evolution IEEE Proceedings Special Issue Software Engineering vol 68 no 9 [1980 09] 1060-1076
Lehman M.M., Belady L.A. Program Evolution: Processes of Software Change Academic Press [1985]
Lehman M.M. Software Engineering, the Software Process, and their Support Software Engineering Journal [1991 09] 243-258
Lehman M.M., Ramil J.F., Wernick P.D., Perry D.E., Turski W.M. Metrics and Laws of Software Evolution: the Nineties View Proc. Metrics'97 4th Int. Software Metrics Symp., IEEE [1997] 20-32
Lehmann T., Schewe K.-D. A Pragmatic Method for the integration of Higher-Order ER Schemata ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 37-51
Lerner B.S., Habermann A.N. Beyond schema evolution to database reorganization Proc. OOPSLA Int. Conf. on OO Programming, Systems, Languages, and Applications vol 25 no 10, SIGPLAN notices 0362-1340 [1990 10] 67-76
Lerner B.S. A Model for Compound Type Changes encountered in Schema Evolution Technical Report Univ. Massachusetts UM-CS-96-044 [1996 06]
Levitin A.V., Redman T.C. Quality dimensions of a conceptual view Information Processing & Management vol 31 no 1 [1995] 81-88
Leymann F. A Practitioner's Approach to Data Federation Proc. 4th Workshop FDBS'99 CEUR-WS vol 25, CEUR-WS //sunsite.informatik.rwth-aachen.de/ [1999 11 26] 1-18
Li H., Looijen M. Het moderniseren van legacy-systemen IT management select vol 5 no 2 [1999 02] 94-102 (in dutch)
Lindland O.I., Sindre G., Sølvberg A. Understanding quality in conceptual modeling IEEE Software vol 11 [1994 03] 42-49
Ling D.H.O., Bell D.A. Taxonomy of time models in databases Information & Software Technology vol 32 no 3 [1990 04] 215-224
Ling Liu Adaptive Schema Design and Evaluation in an Object-Oriented Information System ER'95 Entity-Relationship Approach 1995 LNCS 1021, Springer Verlag [1995 12] 21-31
Liu Chien-Tsai, Chrysanthis P.K., Chang Shi-Kuo Database Schema Evolution through the Specification and Maintenance of Changes on Entities and Relationships ER'94 Entity-Relationship Approach 1994 LNCS 881, Springer Verlag [1994 12] 132-151
Looijen M., Vreven G. Business-architecturen: spil in de onderneming Informatie vol 40 [1998 06] 32-45 (in dutch)
López J.-R., Olivé A. A Framework for the Evolution of Temporal Conceptual Schemas of Information Systems CAiSE'00 Advanced Information Systems Engineering LNCS 1789, Springer Verlag [2000 05] 369-386
Maciaszek L.A. Process Model for Round-trip Engineering with Relational Database Proc. IRMA'00 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [2000 05] 468-472


Makowsky J.A., Ravve E.V. Dependency preserving refinements and the fundamental problem of database design Data & Knowledge Engineering vol 24 [1998] 277-312
Männistö T., Sulonen R. Evolution of Schema and Individuals of Configurable Products ECDM'99 Advances in Conceptual Modeling LNCS 1727, Springer Verlag [1999 11] 12-23
March S.T. Reflections on Computer Science and Information Systems Research ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 16-26
Marche S. Measuring the stability of data models European J. of Information Systems (O.R.) vol 2 no 1 [1993] 37-47
Marcos E., Vela B., Cavero J.M., Caceres P. Aggregation and Composition in Object-Relational Database Design ADBIS'01 Advances in Databases & Information Systems [2001 10] 195-209
Masood N., Eaglestone B. Semantics Based Schema Analysis DEXA'98 Database and Expert Systems Applications LNCS 1460, Springer Verlag [1998 08] 80-89
McBrien P., Poulovassilis A. Formalisation of semantic schema integration Information Systems vol 23 no 5 [1998 07] 307-334
McBrien P., Poulovassilis A. A semantic approach to integrating XML and Structured Data Sources CAiSE'01 Advanced Information Systems Engineering LNCS 2068, Springer Verlag [2001] 330-345
McCabe T.J. A Complexity Measure IEEE Transactions on Software Engineering vol SE-2 no 4 [1976 12] 308-320
McKenzie E., Snodgrass R.T. Schema evolution and the relational algebra Information Systems vol 15 no 2 [1990] 207-232
McLeod D. A learning-based approach to meta-data evolution in object-oriented database systems Advances in Object-Oriented Database Systems LNCS 334, Springer Verlag [1988] 219-224
Meier A., Dippold R., Mercerat J., Muriset A., Untersinger J.-C., Eckerlin R., Ferrara F. Hierarchical to Relational database migration IEEE Software [1994]
Meier A. Providing Database Migration Tools: A Practitioner's View VLDB'95 Very Large Data Bases [1995 09] 635-641
Mirbel I., Cavarero J.-L. An Integration Method for Design Schemas CAiSE'96 Advanced Information Systems Engineering LNCS 1080, Springer Verlag [1996] 457-475
Mistelbauer H. Datenmodellverdichtung: Vom Projectdatenmodell zur Unternehmens-Datenarchitectur Wirtschaftsinformatik vol 33 no 4 [1991 08] 289-299 (in german)
Möller J.-U., Wiese D. Editing Conceptual Graphs Proc. ICCS 4th Int. Conf. on Conceptual Structures LNAI 1115, Springer Verlag [1996 08] 175-187
Monk S., Sommerville I. Schema evolution in OODB using class versioning SIGMOD record vol 22 no 3 [1993] 16-22
Monk S., Mariani J.A., Elgalal B., Campbell H. Migration from relational to object-oriented databases Information & Software Technology vol 38 no 7 [1996 07] 467-475
Monroe T.J., Kompanek A., Melton R., Garlan D. Architectural styles, design patterns, and objects IEEE Software vol 14 no 1 [1997 01] 43-52
Moody D.L. The Seven Habits of highly effective Data Modellers (and Object Modellers?) ER'95 Entity-Relationship Approach 1995 LNCS 1021, Springer Verlag [1995 12] 436-437


Moody D.L. Seven habits of Highly Effective Datamodellers Database Programming & Design [1996 10] 57-64
Moody D.L., Shanks G.G. What makes a good data model? A framework for Evaluating and Improving the quality of entity relationship models Australian Computer Journal vol 30 no 3 [1998 08] 97-110
Moody D.L., Shanks G.G., Darke P. Improving the Quality of Entity Relationship Models: Experience in Research and Practice ER'98 Entity-Relationship Approach 1998 LNCS 1507, Springer Verlag [1998 11] 255-276
Moody D.L. Metrics for Evaluating the Quality of Entity Relationship Models ER'98 Entity-Relationship Approach 1998 LNCS 1507, Springer Verlag [1998 11] 209-225
Moody D.L. Strategies for improving Quality of Entity Relationship Models: A "Toolkit" for Practitioners Proc. IRMA'00 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [2000 05] 1043-1045
Mylopoulos J., Fuxman A., Giorgini P. From Entities and Relationships to Social Actors and Dependencies ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 27-36
Navathe S.B., Kerschberg L. Role of Data Dictionaries in Information Resource Management Information & Management vol 10 [1986] 21-46
Navathe S.B. Evolution of data modeling for databases Communications of the ACM vol 35 no 9 [1992 09] 112-123
Neiling M. Datenintegration durch Object-Identification Proc. 4th Workshop FDBS'99 CEUR-WS vol 25, CEUR-WS //sunsite.informatik.rwth-aachen.de/ [1999 11 26] 1-27 (in german)
Nelson K.M., Ghods M. Measuring technology flexibility European J. of Information Systems (O.R.) vol 7 no 4 [1998] 232-240
Neumann G. Reasoning about ER Models in a deductive environment Data & Knowledge Engineering vol 19 no 3 [1996 06 15] 241-266
Nguyen G.T., Rieu D. Schema Evolution in Object-Oriented Database Systems Data & Knowledge Engineering vol 4 [1989] 43-67
Nissen H.W., Jarke M. Repository support for multi-perspective requirements engineering Information Systems vol 24 no 2, Pergamon [1999] 131-158
Nordbotten J.C., Crosby M.E. The effect of graphic style on data model interpretation Information Systems Journal vol 9 no 2 [1999] 139-155
Nummenmaa J., Tuomi J. Constructing layouts for ER-diagrams from visibility-representations ER'91 Entity-Relationship Approach 1991, Elsevier Science Publ. North-Holland [1991] 303-317
Oei J.L.H., Proper H.A., Falkenberg E.D. Evolving information systems: meeting the ever-changing environment Information Systems Journal vol 4 [1994] 213-233
Olivé A., Costal D., Sancho M.-R. Entity Evolution in IS-A hierarchies ER'99 Entity-Relationship Approach 1999 LNCS 1728, Springer Verlag [1999 11] 62-80
Olivé A. Taxonomies and Derivation Rules in Conceptual Modeling CAiSE'01 Advanced Information Systems Engineering LNCS 2068, Springer Verlag [2001] 417-432
Orna L. Using knowledge and information to manage and master change Managing Information vol 6 no 1 [1999 02] 42-45


Overmyer S.P., Lavoie B., Rambow O. Conceptual Modeling through Linguistic Analysis using LIDA ICSE'01 23rd Int. Conf. Software Engineering, IEEE Computer Society & ACM SigSoft [2001 05 12] 401-410
Papazoglou M.P. Unraveling the semantics of Conceptual Schemas Communications of the ACM vol 38 no 9 [1995 09] 80-94
Patel N.V. The Spiral of Change Model for coping with changing and ongoing requirements Requirements Engineering vol 4 [1999] 77-84
Peckham J., Maryanski F. Semantic Data Models ACM computing surveys vol 20 no 3 [1988 09] 153-189
Pels H.J. Geïntegreerde informatiebanken: modulair ontwerp van het conceptuele schema Stenfert Kroese Leiden/Antwerpen, the Netherlands [1988] (in dutch)
Perry D.E. Dimensions of Software Evolution ICSM'94 Int. Conf. Software Maintenance, IEEE [1994] 296-303
Perry D.E., Staudenmayer N.A., Votta L.G. People, Organizations, and Process Improvements IEEE Software vol 11 no 4, IEEE [1994 07] 36-45
Persson A., Stirna J. Why Enterprise Modelling? An explorative study into current practice CAiSE'01 Advanced Information Systems Engineering LNCS 2068, Springer Verlag [2001] 465-468
Peters R.J., Tamer Özsu M. Reflection in a Uniform Behaviour Object Model ER'93 Entity-Relationship Approach 1993 LNCS 823, Springer Verlag [1993 11] 34-45
Peters R.J., Tamer Özsu M. An axiomatic model of dynamic schema evolution in objectbase systems ACM transactions on Database Systems vol 22 no 1 [1997 03] 75-114
Poncelet P., Lakhal L. Consistent structural updates for object database design CAiSE'93 Advanced Information Systems Engineering LNCS 685, C.Rolland, F.Bodart, C.Cauvet (eds) [1993 06] 1-21
Poulovassilis A., McBrien P. A general formal framework for schema transformation Data & Knowledge Engineering vol 28 no 1 [1998] 47-71
Premerlani W.J., Blaha M.R. An Approach for Reverse Engineering of Relational Databases Communications of the ACM vol 37 no 5 [1994 05] 42-49
Proper H.A., Weide T.P. van der Towards a general theory for the evolution of application domains Proc. ADC'93 4th Australasian Database Conference, World Scientific Singapore [1993] 346-362
Proper H.A., Weide T.P. van der EVORM: a conceptual modelling technique for evolving application domains Data & Knowledge Engineering vol 12 [1994] 313-359
Proper H.A. A Theory for Conceptual Modelling of Evolving Application Domains thesis, Catholic University of Nijmegen, the Netherlands [1994 04 28]
Proper H.A. Data schema design as a schema evolution process Data & Knowledge Engineering vol 22 no 2 [1997 04] 159-189
Proper H.A. Flexibiliteit van informatiemodellen Informatie vol 40 [1998 03] 28-33 (in dutch)
Ra Y.-G., Rundensteiner E.A. A transparent Schema Evolution System Based on Object-Oriented View Technology from the internet [1999a] 1-54
Ra Y.-G., Rundensteiner E.A. Towards Supporting Hard Schema Change in TSE from the internet [1999b] 1-6
Radeke E., Scholl M.H. Framework for Object Migration in Federated Database Systems Proc. Int. Conf. on Parallel & Distributed Information Systems [1994 09]


Ram S. Deriving functional dependencies from the Entity Relationship model Communications of the ACM vol 38 no 9 [1995 09] 95-107
Ratcliffe V., Sackett P. Information and small companies: chaos with intent AI & Society: J. of human & machine intelligence vol 15 no 1 [2001] 22-39
Rauh O., Stickel E. Entity Tree Clustering: a method for simplifying ER designs ER'92 Entity-Relationship Approach 1992 LNCS 645, Springer Verlag [1992 10] 62-78
Redman T.C. Data Quality for the Information Age Artech House Boston-London [1996]
Reiner D. Research areas related to Practical Problems in Automated Database Design SIGMOD record vol 20 no 3 [1991 09] 79-82
Ritter N., Steiert H.-P. Enforcing Modeling Guidelines in an OR DBMS-based UML-repository Proc. IRMA'00 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [2000 05] 269-273
Rivero L., Doorn J. Managing Referential Integrity and Non-Key-based Dependencies in a Denormalized Context Proc. IRMA'00 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [2000 05] 883-886
Roddick J.F., Patrick J.D. Temporal semantics in information systems: a survey Information Systems vol 17 no 3 [1992] 249-267
Roddick J.F., Craske N.G., Richards T.J. A Taxonomy for Schema Versioning Based on the Relational and Entity Relationship Models ER'93 Entity-Relationship Approach 1993 LNCS 823, Springer Verlag [1993 12] 137-148
Roddick J.F. A survey of schema versioning issues for database systems Information & Software Technology vol 37 no 7 [1995] 383-393
Rosenthal A., Reiner D. Tools and transformations for practical Database Design: rigorous and otherwise ACM transactions on Database Systems vol 19 no 2 [1994 06] 167-211
Rusinkiewicz M., Sheth A., Karabatis G. Specifying Interdatabase Dependencies in a Multidatabase Environment IEEE Computer vol 24 no 12 [1991 12] 46-53
Saiedian H. An evaluation of extended entity-relationship model Information & Software Technology vol 39 [1997] 449-462
Saltor F., Castellanos M.G., Garcia-Solaco M. Overcoming Schematic Discrepancies in Interoperable Databases Proc. IFIP WG2.6 DS-5 1992 16-11 nov [1992 11] 191-205
Santucci G., Batini C., Di Battista G. Multilevel Schema Integration ER'93 Entity-Relationship Approach 1993 LNCS 823, Springer Verlag [1993 12] 327-338
Santucci G. Semantic Schema refinements for multilevel schema integration Data & Knowledge Engineering vol 25 [1998] 301-326
Sauter C. Ein Ansatz für das Reverse Engineering relationaler Datenbanken Wirtschaftsinformatik vol 37 no 3 [1995] 242-250 (in german)
Schneidewind N.F. Methodology for Validating Software Metrics IEEE Transactions on Software Engineering vol 18 no 5 [1992 05] 410-422
Schuette R., Rotthowe T. The Guidelines of Modeling: An approach to enhance the quality in information models ER'98 Entity-Relationship Approach 1998 LNCS 1507, Springer Verlag [1998 11] 240-254
Seljée R.R., Swart H.C.M. de Three types of redundancy in integrity checking: an optimal solution Data & Knowledge Engineering vol 30 no 2 [1999] 135-151
Sethi V., King W.R. Development of measures to assess the extent to which an Information Technology Application provides competitive advantage Management Science vol 40 no 12 [1994 12] 1601-1627


Shanks G., Darke P. Understanding corporate data models Information & Management vol 35 no 1 [1999] 19-30
Sheth A.P., Larson J.A. Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases ACM computing surveys vol 22 no 3 [1990 09] 183-236
Sheth A.P., Kashyap V. So Far (schematically) Yet So Near (semantically) Proc. IFIP WG2.6 DS-5 1992 16-11 nov (Lorne, Australia) [1992 11 16] 272-301
Shneiderman B., Thomas G. Architecture for automatic relational database system conversion ACM transactions on Database Systems vol 7 no 2 [1982 06] 235-257
Siegel M., Madnick S.E. A metadata approach to resolving semantic conflicts VLDB'91 Very Large Data Bases, Morgan Kaufmann, San Francisco [1991 09] 133-145
Sjøberg D. Quantifying Schema Evolution Information & Software Technology vol 35 no 1 [1993 01] 35-44
Skarra A.H., Zdonik S.B. The Management of Changing Types in an Object-Oriented Database Proc. OOPSLA Int. Conf. on OO Programming, Systems, Languages, and Applications [1986 09] 483-495
Skarra A.H., Zdonik S.B. Type evolution in an object-oriented database Research Directions in OO Programming, MIT Press, Cambridge (Mass) [1987] 393-415
Smits M.T., Poel K.G. v.d. The practice of information strategy in 6 information intensive organizations in the Netherlands J. of Strategic Information Systems vol 5 no 2 [1996 06] 93-110
Snodgrass R.T. Temporal Databases in: Advanced Database Systems Part II Morgan Kaufmann [1997]
Sockut G.H. Framework for logical-level changes within Database Systems Computer [1985 05] 9-27
Soutou C. Inference of Aggregate Relationships through Database Reverse Engineering ER'98 Entity-Relationship Approach 1998 LNCS 1507, Springer Verlag [1998 11] 135-149
Stanhope P.D. The Letters of the Earl of Chesterfield to His Son edition by Charles Strachey [1912 06]
Steele P.M., Zaslavsky A.B. The Role of Meta Models in Federating System Modeling Techniques ER'93 Entity-Relationship Approach 1993 LNCS 823, Springer Verlag [1993 12] 315-326
Stickel E., Hunstock J., Ortmann A., Ortmann J. Data sharing: economics and requirements for integration-tool design Information Systems vol 19 no 8 [1994] 629-642
Storey V.C., Ullrich H., Sundaresan S. An Ontology for Database Design Automation ER'97 Entity-Relationship Approach 1997 LNCS 1331, Springer Verlag [1997] 2-15
Storey V.C., Goldstein R.C., Ding J. Common Sense Reasoning in Automated Database Design: An Empirical Test Journal of Database Management [2002 winter] 3-14
Strong D.M., Lee Y.W., Wang R.Y. Data Quality in context Communications of the ACM vol 40 no 5 [1997 05] 103-110
Su H., Claypool K.T., Rundensteiner E.A. Extending the OQL Object Query Language for Transparent Metadata Access Proc.'00 Int. Conf. Database Schema Evolution & MetaModeling LNCS 2065, Springer Verlag [2000 09] 182-201


Su H., Kramer D., Chen L., Claypool K., Rundensteiner E.A. XEM: Managing the Evolution of XML Documents Proc. RIDE'01 Research Issues in Data Engineering, IEEE [2001 04 02] 103-110
Swanson E.B. IS 'maintainability': should it reduce the maintenance effort? ACM SIGMIS the Data Base for Advances in information systems vol 30 no 1 [1999 winter] 65-76
Tari Z. Interoperability between Database Models Proc. IFIP WG2.6 DS-5 1992 16-11 nov [1993] 101-118
Tauzovich B. An Expert system for Conceptual Data Modelling ER'90 Entity-Relationship Approach 1990, Elsevier Science Publ. North-Holland [1990] 205-220
Teorey T.J., Wei G., Bolton D.L., Koenig J.A. ER Model Clustering as an Aid for User Communication and Documentation in Database Design Communications of the ACM vol 32 no 8 [1989 08] 975-987
Teorey T.J. Database Modeling & Design: The fundamental Principles, second edition Morgan Kaufmann Publ. Inc. [1994]
Ter Bekke J.H. Database Ontwerp Kluwer Bedrijfswetenschappen [1993] (in dutch)
Terrasse M.-N. A Modeling Approach to Meta-Evolution Proc.'00 Int. Conf. Database Schema Evolution & MetaModeling LNCS 2065, Springer Verlag [2000 09] 202-219
Thalheim B. Entity-Relationship Modeling: Foundations of Database Technology Springer Verlag [2000]
Thoinot Arbeau Orchesographie [1589], translation by Mary Stuart Evans [1967] Dover Publishing, NY
Thompson C. Living with an Enterprise Data Model Database Programming & Design [1993 03] 37-44
Tolkien J.R.R. The Lord of the Rings, part III: The Return of the King Unwin Books [1955, 1974]
Tresch M.T., Scholl M.H. Meta Object Management and its Application to Database Evolution ER'92 Entity-Relationship Approach 1992 LNCS 645, Springer Verlag [1992 10] 299-321
Truijens J., Winterink J. Complexiteitstoename van (geautomatiseerde) informatieverzorging: belangrijke ontwikkeling voor de EDP-audit EDP-auditor [1996 03] 3-14 (in dutch)
Tseng F.S.C., Chiang J.-J., Yang W.-P. Integration of relations with conflicting schema structures in heterogeneous database systems Data & Knowledge Engineering vol 27 no 2 [1998 09] 231-248
Tsichritzis D.C., Lochovsky F.H. Data Models Prentice Hall London [1982]
Tufte K., Gang He, Shanmugasundaram J., Zhang C., Dewitt D., Naughton J. Relational databases for querying XML documents: limitations and opportunities VLDB'99 Very Large Data Bases, Edinburgh Scotland, Morgan Kaufmann [1999 09] 302-314
Türker C. Schema Evolution in SQL-99 and Commercial (Object-)Relational DBMS: an extended abstract Proc.'00 Int. Conf. Database Schema Evolution & MetaModeling LNCS 2065, Springer Verlag [2001] 1-32
Urtado C., Oussalah C. Complex entity versioning at two granularity levels Information Systems vol 23 no 3/4 [1998] 197-216
Veldwijk R. Tien geboden voor goed database ontwerp Database Magazine [1995 12] (in dutch)
Venable J., Grundy J. Integrating and Supporting ER and ORM ER'95 Entity-Relationship Approach 1995 LNCS 1021, Springer Verlag [1995 12] 318-328


Ventrone V., Heiler S. Semantic Heterogeneity as a Result of Domain Evolution SIGMOD Record vol 20 no 4 [1991 12] 16-20
Visaggio G. Comprehending the knowledge stored in aged Legacy Systems to improve their qualities with a Renewal Process Technical Report Fraunhofer FhG IESE ISERN-9726.pdf, Fraunhofer Institute IESE, Kaiserslautern (Germany) [1997] 29
Vreven G., Looijen M. Strategie, systeemontwikkeling en systeembeheer Informatie vol 39 [1997 12] 13-21 (in Dutch)
Wai Y.M., Embley D.W. Using NNF to transform conceptual data models to object-oriented database designs Data & Knowledge Engineering vol 24 no 3 [1998 01] 313-336
Wand Y., Monarchi D.E., Parsons J., Woo C.C. Theoretical foundations for conceptual modelling in information systems development Decision Support Systems vol 15 [1995] 285-304
Wang R.Y., Strong D.M. Beyond (data) Accuracy: what data quality means to data consumers J. of Management Information Systems vol 12 no 4 [1996 spring] 5-34
Wang R.Y. A Product perspective on Total Data Quality Management Communications of the ACM vol 41 no 2 [1998 02] 59-65
Waters R.C., Chikofsky E. Reverse Engineering: Progress along many dimensions Communications of the ACM vol 37 no 5 [1994 05] 23-24
Weber Y., Pliskin N. The effects of information systems integration and organizational culture on a firm's effectiveness Information & Management vol 30 [1995] 81-90
Wei H.-C., Elmasri R. Study and comparison of schema versioning and database conversion techniques for bi-temporal databases Proc. 6th Int. Workshop on Temporal Representation & Reasoning [1998] 1-11
Werner M. Facilitating Schema Evolution with Automatic Program Transformations from the internet [1999 05 12] 1
Wiederhold G. Modeling and System Maintenance ER'95 Entity-Relationship Approach 1995 LNCS 1021, Springer Verlag [1995 12] 1-20
Wilkin C., Hewett B., Carr R. Exploring the Role of Expectations in defining Stakeholder's Evaluations of IS Quality Proc. IRMA'00 Int. Conf., Idea Group Publishing, Hershey Pennsylvania [2000 05] 57-61
Wilmot R.B. Foreign keys decrease adaptability of database designs Communications of the ACM vol 27 no 12 [1984 12] 1237-1243
Winans J., Davis K.H. Software Reverse Engineering: from a currently existing IMS database to an E-R model ER'91 Entity-Relationship Approach 1991, Elsevier Science Publ. North-Holland [1991] 333-348
Winkler J. The Entity Relationship approach and the Information Resource Dictionary Standard ER'88 Entity-Relationship Approach 1988, North-Holland [1989] 3-19
Winter R. Design and Implementation of Derivation Rules in Information Systems Data & Knowledge Engineering vol 26 [1998] 225-241
Wohed P. Conceptual Patterns for Reuse of Information Systems Design CAiSE'00 Advanced Information Systems Engineering LNCS 1789, Springer Verlag [2000 05] 157-175
Wohed P. Tool support for Reuse of Analysis Patterns: A case study ER'00 Entity-Relationship Approach 2000 LNCS 1920, Springer Verlag [2000 11] 196-209
Yin R.K. Case Study Research: design and methods SAGE Publications CA [1988]


Yormark B. ANSI/X3/sparc Study Group on Data Base Management Systems: Architecture conference paper, Elsevier North Holland [1976 12] 1-34
Zamperoni A., Löhr-Richter P. Enhancing the Quality of Conceptual Database Specifications Through Validation ER'93 Entity-Relationship Approach 1993 LNCS 823, Springer Verlag [1993 12] 85-98
Zaniolo C., Ceri S., Faloutsos C., Snodgrass R., Subrahmanian V., Zicari R. Advanced Database Systems Morgan Kaufmann, San Francisco [1997]
Zdonik S.B. Maintaining Consistency in a Database with Changing Types SIGPLAN Notices vol 21 no 10 [1986 10] 120-127
Zicari R. A Framework for Schema Updates in an Object-Oriented Database in: Building an O-O Database System: the story of O2 Morgan Kaufmann, San Mateo CA [1992] 146-182


PUBLICATIONS BY THE AUTHOR

Chapter 1: Design the flexibility, maintain the stability of Conceptual Schemas CAiSE'99 Conference on Advanced Information Systems Engineering, LNCS 1626, Springer Verlag [1999 06] 467-471

Chapter 2: A Method to Ease Schema Evolution Proceedings of IRMA'00 International Conference, Idea Group Publishing, Hershey Pennsylvania [2000 05] 423-425 (republished in Optimal Information Modeling Techniques, edited by C. van Slooten, Idea Group Publishing, Hershey Pennsylvania [2002] 69-77)

Chapter 3: Engineering for Conceptual Schema Flexibility Proceedings RIDE'01 Research Issues in Data Engineering, IEEE [2001 04 02] 775-778

Chapter 4: Defining metrics for Conceptual Schema Evolution Proceedings 2000 International Conference on Database Schema Evolution and MetaModeling LNCS 2065, Springer Verlag [2000 09] 220-244

Chapter 5: Long-term evolution of a Conceptual Schema at a life insurance company Annals of Case study research in IT, Idea Group Publishing [2002 01] 280-296; and: De levenscyclus van een flexibel datamodel-ontwerp Informatie vol 41, ten Hagen & Stam [1999 10] 50-55 (in Dutch)

Chapter 6: Semantical Change Patterns in the Conceptual Schema ER'99 Advances in Conceptual Modeling, the ECDM workshop on Evolution and Change in Data Management, LNCS 1727, Springer Verlag [1999 11] 122-133

Chapter 7: Derived data reduce stability of the Conceptual Schema Proceedings Intersymp2000 International Conference, International Institute for Advanced Studies in Systems Research and Cybernetics [2000 08] 101-108

Chapter 9: Concrete Flexibility Proceedings of IRMA'01 International Conference, Idea Group Publishing, Hershey Pennsylvania [2001 05] 775-778

Nederlandse samenvatting


Veranderingen in Conceptuele Schemas: een verkenning

Wat is een Conceptueel Schema
Bedrijven zijn voor hun administratie tegenwoordig erg afhankelijk van goed functionerende informatiesystemen. Centraal in dergelijke systemen staat het Conceptueel Schema, ook wel gegevensmodel of datamodel genoemd. Het Conceptuele Schema, afgekort CS, beschrijft de structuur van informatie: het legt op een eenduidige manier de betekenis en onderlinge samenhang vast tussen alle gegevens die relevant zijn in de werkelijkheid; althans in dat deel van de werkelijkheid waar het informatiesysteem en de administratie betrekking op hebben. Elk CS is opgebouwd uit 4 bouwstenen, namelijk: entiteit, referentie, attribuut, en beperkingsregel. In dit proefschrift bestuderen we wel de entiteiten en referenties, maar we laten de attributen en beperkingsregels buiten beschouwing. Wie niet vertrouwd is met de begrippen 'entiteit' en 'referentie' kan enkele diagrammen van CSs in hoofdstuk 5 bekijken. Elke rechthoek symboliseert een entiteit, en elke verbindingspijl staat voor een referentie.

Waarom is het Conceptueel Schema belangrijk
Het CS, met zijn centrale positie in het informatiesysteem, moet aan hoge kwaliteitseisen voldoen. Immers, elke fout of onduidelijkheid in de begrippen van het CS zal vroeger of later tot problemen leiden als de verschillende onderdelen van het informatiesysteem moeten samenwerken. Bij het bouwen van een informatiesysteem vindt daarom een grondige analyse plaats van de relevante begrippen, wat hun definitie precies is, en hoe die onderling samenhangen. Desondanks is het niet voldoende om een goed ontwerp voor het CS op te stellen: ontwerp is maar een eenmalige activiteit. Ook nadat het ontwerp is opgeleverd en het informatiesysteem in bedrijf is genomen kunnen er veranderingen optreden in de begrippen en in hun onderlinge samenhang.
Dit soort veranderingen, die we 'semantisch' noemen, zullen op enig moment hun weerslag moeten krijgen in het CS.

Flexibiliteit van een CS
In de loop van de tijd zijn er veranderingen in elk bedrijf, in de manier van werken, in de gebruikte hulpmiddelen, en in de eisen die de klanten en medewerkers stellen. Daarom eist men van informatiesystemen dat ze 'flexibel' zijn, en deze eis geldt in het bijzonder voor het CS als centrale component. Maar wie er de literatuur op naslaat zal ontdekken dat er geen eenduidige en algemeen aanvaarde definitie bestaat van wat flexibiliteit van een CS is, en dat er evenmin duidelijke maatstaven bestaan om flexibiliteit te beoordelen en te vergelijken. Auteurs die over dit onderwerp schrijven doen dat vaak alleen in kwalitatieve zin, en ze maken zelden duidelijk wat ze precies verstaan onder flexibiliteit. Wij hanteren in dit proefschrift als definitie van flexibiliteit: het vermogen van het CS om zich binnen een aanvaardbare tijdspanne te kunnen aanpassen aan wijzigingen in de structuur van informatie in de relevante werkelijkheid.

Waarom is het belangrijk de evolutie van het CS te onderzoeken
Flexibiliteit volgens onze definitie is een vermogen dat pas in de toekomst wordt gerealiseerd, en het is dan ook niet mogelijk om op voorhand vast te stellen of een CS 'voldoende flexibel' zal zijn. We kunnen flexibiliteit dus niet rechtstreeks observeren. Gelukkig is het mogelijk om toch inzicht te krijgen in de eigenschappen van flexibiliteit door op een indirecte manier te werk te gaan. In plaats van toekomstige veranderingen bestuderen we in dit proefschrift de concrete veranderingen in vier CSs in het recente verleden. Hiermee hopen we meer inzicht te krijgen in wat flexibiliteit in de praktijk inhoudt. Een aanvullend argument om evolutie te onderzoeken is dat er nauwelijks literatuur is over de praktische ervaringen met flexibiliteit van CSs. Hoe verandert een echt, operationeel CS in de loop van de tijd? Welke omgevingsfactoren zijn van wezenlijk belang, en wat zijn maar bijzaken? En wat betekent dat voor de lange-termijn flexibiliteit van een CS?

Het theoretisch deel
In het onderzoek bewandelen we twee verschillende routes die we in de laatste twee hoofdstukken samenbrengen. Eerst moeten we het theoretisch kader opbouwen dat we nodig hebben om praktisch onderzoek te kunnen doen naar 'flexibiliteit van een CS', en om daaruit op verantwoorde wijze conclusies te kunnen trekken. Hoofdstuk 2 bespreekt fundamentele concepten op het gebied van datamodellering en relationele databases, zoals de 3-Schema Architectuur van ANSI/X3/sparc, de notie van datamodel theorie en de eraan gerelateerde taxonomie, en de kwaliteitsattributen van een CS.

Een raamwerk om flexibiliteit te beoordelen
Hoofdstuk 3 ontwikkelt een theoretisch raamwerk om flexibiliteit van CS te kunnen duiden.
Het omvat 3 dimensies die we nader onderverdelen in 8 richtlijnen:
- de omgeving van een CS, waar de noodzaak tot verandering ontstaat, onderverdeeld in de richtlijnen 'bepaal de beste afbakening' en 'leg de essentie vast'
- de tijdigheid (snelheid) waarmee verandering in het CS kan worden doorgevoerd, nader onderverdeeld in twee richtlijnen 'beperk de impact van wijzigingen' en 'zorg voor het soepel kunnen doorvoeren van een verandering'
- de aanpasbaarheid van het CS als zodanig, onderverdeeld in de vier richtlijnen 'hou het simpel', 'gebruik gelaagdheid', 'modelleer elke eigenschap eenmaal', en 'hou bij elkaar wat bij elkaar hoort'.

Dit raamwerk passen we toe bij een analyse van ontwikkelmethoden en ontwerpstrategieën uit de literatuur waarin flexibiliteit wordt geclaimd. We onderscheiden hierbij drie groepen strategieën, en we analyseren waarop de claims van flexibiliteit zijn gebaseerd:
- actieve strategieën: deze trachten het CS zodanig vorm te geven dat aanpassingen relatief eenvoudig zijn door te voeren
- passieve strategieën: deze trachten de noodzaak voor veranderingen op voorhand te beperken door te anticiperen op mogelijke veranderingen
- abstractie-strategieën: deze streven naar een zodanig abstract CS dat er geen wijzigingen zullen optreden op dat niveau van abstractie

Een belangrijke bevinding van deze analyse is dat beweringen over flexibiliteit zijn gebaseerd op theoretische argumenten, en nooit op deugdelijk onderbouwde praktijkervaringen. Dit onderstreept het belang van ons onderzoek. Andere bevindingen uit de analyse zijn dat geen van de onderzochte strategieën gebruikmaakt van alle drie dimensies en acht richtlijnen van het raamwerk. Evenmin komt een bepaalde strategie als beste uit de bus.


Maatstaven voor CS evolutie
Hoofdstuk 4 ontwerpt een aantal maatstaven om in een evoluerend CS de mate van verandering, en dus de mate van flexibiliteit te kunnen bepalen, in overeenstemming met de drie dimensies van het raamwerk. Dit ondervangt het probleem dat bij gebrek aan bruikbare maatstaven flexibiliteit alleen in kwalitatieve zin besproken kan worden:
1. 'gerechtvaardigde wijziging': een CS is een model van de werkelijkheid, en daarom behoort het CS uitsluitend te wijzigen naar aanleiding van een verandering in die werkelijkheid
2. 'omvang van de wijziging': de grootte van een wijziging in het CS behoort in een redelijke verhouding te staan tot de omvang van de verandering in de werkelijkheid. Maar we hanteren als maatstaf de 'omvang van wijziging in het CS' zonder dit in relatie te brengen met de grootte van verandering in de werkelijkheid, omdat daar geen bruikbare maat voor is
3. 'in overeenstemming met de oude structuur': een wijziging die overeenstemt met de bestaande structuur van het CS is eenvoudiger, sneller en goedkoper door te voeren dan een wijziging waarbij duidelijk wordt afgeweken van de bestaande structuur
4. 'uitbreiding op de oude structuur': een wijziging die zuiver als toevoeging op het bestaande CS is gedefinieerd is vanzelfsprekend in overeenstemming met de oude structuren. Een verborgen aanname is echter dat bij elke wijziging het oude begrippenkader altijd van toepassing blijft
5. 'complexiteit': een complex CS is moeilijker te wijzigen dan een simpel CS. Als maat voor complexiteit hanteren we het verschil tussen het aantal referenties en het aantal entiteiten (en tellen daar 1 bij op). Een eenvoudig CS heeft dan complexiteit 0; hoe hoger het getal, hoe complexer het CS
6. 'vatbaarheid voor wijziging: per entiteit': een vaak gehoorde veronderstelling is dat van de vier bouwstenen waaruit elk CS is opgebouwd, de entiteiten het minst vatbaar zijn voor wijzigingen. Om dit te toetsen tellen we het aantal wijzigingen in entiteiten
7. 'vatbaarheid voor wijziging: per referentie': ook per referentie stellen we vast hoe vatbaar ze zijn voor wijzigingen
8. 'behoud van identiteit': de identiteit van een entiteit is cruciaal om een begrip uit de werkelijkheid goed te kunnen bevatten. Een wijziging in de identiteit van die entiteit is alleen aan de orde als het betreffende begrip fundamenteel verandert.

Hiermee beschikken we over 8 theoretisch onderbouwde én praktisch hanteerbare maatstaven. We passen ze in het praktisch deel van dit proefschrift toe bij het beoordelen van de evolutie in operationele CSs.


Het praktisch deel
De hoofdstukken 5, 6 en 7 bestuderen de lange-termijn evolutie van operationele CSs.

Vier evoluerende Conceptuele Schemas
Hoofdstuk 5 beschrijft de evolutie in vier CSs over een reeks van jaren. Dergelijk uniek onderzoeksmateriaal, afkomstig uit de bedrijfspraktijk van 2 verschillende bedrijven, is niet eerder gepubliceerd. Het gaat hier om de originele CSs, dus inclusief de ongebruikelijke structuren en onvolkomenheden. We hebben de schemas niet achteraf verfraaid of meer conceptueel gemaakt, wat immers afbreuk zou doen aan de waarde van ons onderzoek. We selecteerden deze vier CSs voor ons lange-termijn onderzoek op basis van drie criteria:
- het betreffende informatiesysteem ondersteunt de primaire administratie van het bedrijf
- het CS telt tussen de 3 en 30 entiteiten, zodat het nog door één persoon te onderzoeken is
- er zijn tenminste drie behoorlijk gedocumenteerde versies van het operationele CS

In totaal namen we 73 semantische wijzigingen waar in de evolutie van deze vier schemas.

De korte termijn
Hoofdstuk 6 bestudeert de semantische wijzigingen op de korte termijn: wat gebeurt er als een nieuwe versie van een CS operationeel wordt? De beide bedrijven in onze studie blijken bij het wijzigen van het CS een vaste procedure te volgen met vier stappen: bewustwording, specificatie, wijzigen van het CS, en aanpassen van de vastgelegde data. Na specificatie kunnen de laatste twee stappen gelijktijdig of zelfs in omgekeerde volgorde plaatsvinden. Een tweede bevinding is dat men veranderingen vaak zo vormgeeft dat nadelige gevolgen beperkt worden: door opgeslagen gegevens niet direct aan te passen aan de nieuwe structuur, door het verwijderen van verouderde gegevens uit te stellen, of door een verandering te spreiden over twee opeenvolgende releases. De derde en wellicht belangrijkste constatering is dat veranderingen in CSs vaak bepaalde vaste patronen volgen. Bijlage B beschrijft een dozijn patronen die, met enkele varianten, goed zijn voor bijna 80% van de semantische wijzigingen. We verwachten dat de toepassing van dergelijke patronen het wijzigen van een CS eenvoudiger en beter beheersbaar zal maken.

Lange-termijn trends in evolutie
Hoofdstuk 7 analyseert de wijzigingen in CSs op lange termijn. Hierbij hanteren we de maatstaven zoals we die in hoofdstuk 4 ontwikkelden. Onze bevindingen zijn als volgt:
1. 'gerechtvaardigde wijziging': in tegenstelling tot wat verwacht werd, blijkt bijna de helft van de wijzigingen in het CS niet gerechtvaardigd te worden door een verandering in de werkelijkheid waarvan het CS een model is, maar ligt de oorzaak voor de wijziging elders
2. 'omvang van de wijziging': uit ons beschikbare materiaal kunnen we geen duidelijke conclusies trekken
3. 'in overeenstemming met de oude structuur': de veronderstelling dat wijzigingen in overeenstemming zijn met de oude, bestaande structuren blijkt te worden bevestigd. Slechts bij 3 van de 73 semantische wijzigingen wijkt de nieuwe structuur op enkele punten af van de oude


4. 'uitbreiding op de oude structuur': circa tweederde van de wijzigingen breidt de bestaande structuur uit, terwijl circa een van de vijf een duidelijke inperking ervan is. Het restant van wijzigingen verandert het CS op nog een andere manier. Uitbreidbaarheid is dus een belangrijk, maar niet een afdoende antwoord op de eis van flexibiliteit
5. 'complexiteit': hier zien we dat de complexiteit van een CS in de loop van de tijd toeneemt, totdat er duidelijke tegenmaatregelen worden genomen om de complexiteit terug te dringen
6. 'vatbaarheid voor wijziging: per entiteit': er blijkt hier dat er geen een-op-een verband bestaat tussen semantische verandering en wijziging in de structuur van entiteiten. Daarnaast vinden we dat vrijwel elke entiteit in de loop van de CS evolutie een of meer wijzigingen ondergaat
7. 'vatbaarheid voor wijziging: per referentie': wijzigingen in referenties blijken duidelijk minder vaak voor te komen dan wijzigingen in entiteiten; anders gezegd, referenties zijn meer stabiel dan entiteiten. Daarentegen zijn reflexieve referenties en referenties met 1:1 cardinaliteit instabiel, getuige het feit dat al dit soort referenties in de case studies veranderden
8. 'behoud van identiteit': de veronderstelling dat bij wijzigingen meestal de identiteit van entiteiten behouden blijft wordt bevestigd; bij 15% van de semantische wijzigingen is sprake van een aanpassing in de identiteit.

Voorts signaleren we als trends in de lange-termijn ontwikkelingen dat:
- de omvang van het CS, gemeten naar het aantal entiteiten en referenties, toeneemt,
- het aantal semantische wijzigingen in de loop van de tijd afneemt,
- het CS geleidelijk steeds minder abstract wordt, en dat
- afgeleide gegevens in het CS later oorzaak worden van ongerechtvaardigde wijzigingen

Op basis van onze ervaringen met de evolutie van CSs formuleren we een groot aantal best-practice aanbevelingen. Met deze aanbevelingen beogen we hulp te bieden aan de analisten in de dagelijkse bedrijfspraktijk om een betere flexibiliteit, en een soepeler evolutie te bereiken als het CS verandert.

De synthese
Hoofdstuk 8 brengt de theorie uit het eerste deel, en de praktische inzichten uit het tweede deel bij elkaar. Zo kunnen we op basis van de praktijk concluderen dat het theoretisch raamwerk voor flexibiliteit, met zijn 3 dimensies en 8 richtlijnen, een deugdelijk, en in de praktijk toepasbaar instrument is. Ons praktijkgedeelte laat ook zien dat ontwerpstrategieën op zich niet toereikend zijn om een CS flexibel te houden nadat het operationeel is geworden. Tenslotte formuleren we vijf nieuwe wetten voor de evolutie van Conceptuele Schemas:
- elk CS evolueert,
- de bestaande structuren in een CS worden zo min mogelijk gewijzigd, en vrijwel altijd passen de oude structuren naadloos in het nieuwe CS,
- de top-down hiërarchie in een CS blijft altijd gehandhaafd,
- elk CS wordt omvangrijker, complexer, en steeds minder abstract, en

- elk CS blijft relevant als beschrijving van de werkelijkheid, ook als het informatiesysteem wordt vervangen waarvan het een onderdeel is.

Hoe verder
Dit proefschrift is een eerste verkenning van de evolutie in Conceptuele Schemas. Daartoe hebben we de evolutie van CSs uit de operationele praktijk van bedrijven beschreven en geanalyseerd. Voor zover wij weten is een dergelijke analyse nog niet eerder uitgevoerd. Ons materiaal levert een uniek inzicht in belangrijke aspecten van CS evolutie, en het biedt uitstekend basismateriaal waarop verdere onderzoeken kunnen voortborduren om flexibiliteit in Conceptuele Schemas te leren begrijpen, voorspellen, en beheersen.

Lex Wedemeijer
Maastricht, 5 mei 2002

Summary


Exploring Conceptual Schema Evolution

What is a Conceptual Schema
Enterprises nowadays are very dependent upon information systems to handle massive amounts of data. A core component of such systems is the Conceptual Schema, also called data model or information model. The Conceptual Schema, abbreviated to CS, models the structure of all relevant information 'out there' in the real world, being the unique and unambiguous representation of all semantics and all interrelationships in that real-world data. Every CS is composed of four constructs: entity, reference, attribute, and constraint. In this thesis, we restrict ourselves to the entity and reference constructs, largely ignoring the attribute and constraint constructs. Those unfamiliar with the notions of entity and reference may look at some of the CS diagrams in chapter 5. Each rectangle in the diagrams symbolizes an entity, and lines connecting the rectangles represent the references.

Why is the Conceptual Schema important
The CS, as a core component of information systems, must meet high quality standards. Any errors or ambiguities in the notions captured by the CS may cause problems later, when different parts of the information system are made to cooperate. This is why so much effort is spent on information analysis: what are the relevant concepts in the real world, what are their exact definitions, and how are these notions interrelated. Still, it is not enough to create a superior design: design is a once-only effort. After the design is delivered and the information system has gone operational, changes in the real world can and will emerge. Such changes in the structure of the real-world information, which we call 'semantic changes', must eventually have an effect on the information system.

CS flexibility
Over time, every enterprise and its ways of doing business will change, technological capabilities will continue to improve, and user demands will change. Therefore, enterprises need flexible information systems. In particular the Conceptual Schema, as a core component of such information systems, must be flexible. However, a review of the literature reveals that there is no consensus on how flexibility should be defined. Moreover, no clear guidelines exist on how to assess the flexibility of a particular CS, or how to compare different CSs for their flexibility. Most authors discuss the subject only in qualitative terms, and do not specify their precise understanding of the notion of flexibility. Our working definition of flexibility is: the potential of the Conceptual Schema to accommodate changes in the structure of real-world information that we are interested in, within an acceptable period of time.

Why is it important to study CS evolution
As we define it, flexibility is a potential to be harvested in the future, and there is no way to verify beforehand that a CS will prove to be 'sufficiently flexible'. It is impossible to assess flexibility in a straightforward way. However, we can learn to understand important aspects of flexibility in an indirect way. Instead of future flexibility, this thesis investigates the actual changes in four operational CSs over the past decade. Through this investigation, we hope to arrive at a better understanding of the notion of flexibility in the operational business environment. A second reason to look into flexibility is that literature on the flexibility of CSs in live business environments is scarce. What is the natural evolution of an operational CS in the business environment? What features in that environment will affect its evolution, and which aspects are merely circumstantial? And what are the implications for the long-term flexibility of the CS?

The Theoretical Track
Our research of evolving CSs follows two separate tracks that come together in the two final chapters. The theoretical track lays the groundwork that we need in order to investigate flexibility in operational CSs, and to draw reliable conclusions from our explorations. Chapter 2 introduces fundamental concepts in conceptual modeling of relational databases such as the 3-Schema Architecture, data model theory and its associated taxonomy, and quality aspects of the CS. It also reviews related topics in conceptual modeling.

Framework for flexibility
Chapter 3 introduces our framework for flexibility of operational CSs. The framework is made up of three dimensions, subdivided into eight guidelines to enhance CS flexibility:
- environment is where changes come from; this dimension is subdivided into two guidelines: select the best scope for, and capture the essence of, the relevant environment,
- timeliness is how swiftly a change in the environment is accommodated into the CS; this is subdivided into the guidelines to minimize the impact of change, and to ease change propagation to other components of the information system,
- adaptability is the ease of changing the CS to accommodate new requirements; this dimension is subdivided into four guidelines: keep the CS simple, use abstraction layering, model each feature once, and provide a modular composition of the CS.

The framework is applied in a survey of the effectiveness of current CS design strategies.
The survey is presented in three separate sections, as we distinguish:
- active strategies that strive to improve the adaptability of the CS,
- passive strategies that attempt to prevent the need for CS change in advance, and
- abstraction strategies that make the CS more abstract in order to capture less information that is liable to change.

Our survey brings out that claims towards CS flexibility are generally based on theoretical ideas about CS flexibility, not on substantiated operational evidence. We find that no strategy or best practice employs all guidelines for flexibility, and no single best theory emerges from the survey. This unsatisfactory state of affairs emphasizes the relevance of our work.

Metrics for CS flexibility
Chapter 4 develops a set of metrics to assess the degree of change, and hence the degree of flexibility, in evolving CSs. This set enables us to go beyond the kinds of informal discussions of flexibility that abound in the literature due to the lack of serviceable metrics. The metrics are developed in accordance with the framework of chapter 3:


1. 'justified change': as the CS is a valid, complete and correct model of the information structure in the real world, every change in the CS ought to be justifiable by some contemporary change in that environment.
2. 'size of change': the size of a change in the CS ought to be proportional to the severity of the corresponding real-world change. However, as there is no reliable way to quantify 'severity of change' in the real world, we have to simplify this metric to measuring the size of change in the CS only.
3. 'compatibility': a change that is compatible with the existing layout of the CS is simpler, faster and cheaper to accommodate than a change that deviates from it.
4. 'extensibility': a change that merely extends the CS is evidently compatible with the existing layout. An implicit assumption in CS extension is that all notions of the old CS remain valid in the new CS.
5. 'lattice complexity': a complex CS is more difficult to change than a simple CS. We define our metric of lattice complexity as the number of references minus the number of entities (plus one for good measure). A simple CS has lattice complexity equal to zero; higher numbers correspond to more complex CS lattices.
6. 'susceptibility to change: per entity': it is often implied that of the four constructs that make up every CS, the entity construct is least susceptible to change. To check this assumption, our metric counts the number of changes in entities.
7. 'susceptibility to change: per reference': a similar metric counts the number of changes in references.
8. 'preservation of entity identity': entity identity is of vital importance in order to capture a real-world notion. A change in entity identity is relevant only if the corresponding real-world notion alters in a fundamental way.

This provides us with a set of eight metrics that are founded on sound theory. Next, we apply the metrics in the Practical Track of this dissertation to assess the evolution in operational CSs.
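The lattice-complexity metric reduces to a one-line formula. The sketch below illustrates it; the schema representation (sets of entity names and reference pairs) and the function name are our own illustrative choices, not the dissertation's notation.

```python
# Sketch of the 'lattice complexity' metric: references minus entities, plus one.
# A CS is assumed here to be given as a set of entity names plus a set of
# references, each reference an (entity, entity) pair; this encoding is
# illustrative only.

def lattice_complexity(entities, references):
    """A strictly tree-shaped schema (n entities, n-1 references) scores 0;
    every additional cross-reference raises the score by one."""
    return len(references) - len(entities) + 1

# A simple hierarchical schema: 3 entities, 2 references.
entities = {"Policy", "Customer", "Agent"}
references = {("Policy", "Customer"), ("Policy", "Agent")}
print(lattice_complexity(entities, references))  # 0: a simple CS

# Adding one more reference makes the lattice more complex.
references.add(("Customer", "Agent"))
print(lattice_complexity(entities, references))  # 1
```

The same computation can be repeated per schema version to trace the slow growth in complexity that Chapter 7 reports.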
The Practical Track
Chapters 5, 6 and 7 explore the long-term evolution of operational CSs.

Four evolving Conceptual Schemas
Chapter 5 reports on the long-term evolution of four CSs over a number of years. Case studies of this nature, conducted in cooperation with two different companies, have not been reported before. We present each CS version as-is, including its unusual constructions and imperfections. We did not edit the schemas to improve their appearance or conceptual quality, as this would diminish the validity of our research. We based our selection of these four cases for longitudinal research on three main criteria:


- the CS is incorporated in an information system that supports a primary business area
- the CS contains between 3 and 30 entities, to allow investigation by a single researcher
- there are at least three adequately documented versions of the operational CS

In all, we observed 73 semantic changes in the long-term evolution of the four CSs.

Short-term view
Chapter 6 studies semantic changes in the short term: what happens when a new CS version goes operational? We describe a comprehensive change process that both companies participating in our research apparently conform to. The process comprises four stages: awareness, specification, CS accommodation, and data coercion. The two stages following specification can be executed concurrently or even in reverse order. A second observation is that changes in the CS are often molded in such a way that the adverse effects are limited. This is accomplished by deferring data conversions, by postponing the elimination of obsolete data, or by breaking up a single semantic change over several CS versions. The third and perhaps most important observation is that many changes are modeled on some underlying pattern. Appendix B lists a dozen semantic change patterns accounting for almost 80% of the semantic changes. We argue that the application of these semantic change patterns will make the effort to change the CS simpler and more manageable.

Chapter 7 analyzes the long-term effects of CS changes by inspecting trends in the metrics as developed in chapter 4. Our findings are as follows:
1. 'justified change': contrary to our expectations, almost half of the CS changes are not readily justifiable by some contemporary change in the environment that the CS is a model of. Instead, the cause for change originates elsewhere.
2. 'size of change': no clear conclusions could be drawn from our research material.
3. 'compatibility': the hypothesis that CS changes will be compatible with the existing layout of the CS is confirmed. The new construction was incompatible with the previous one in only 3 out of 73 semantic changes.
4. 'extensibility': approximately two out of three changes are extensions of the CS, and one out of five is a clear restriction. The remainder of changes alters the CS in other ways. Thus extensibility is clearly the dominant mode of change, but it is inadequate to meet all demands for flexibility.
5. 'lattice complexity': we observe lattice complexity to slowly increase over time, until it is curbed when specific countermeasures are undertaken.
6. 'susceptibility to change: per entity': no clear relationship is observed between semantic changes and changes in entity structures. Another observation is that, with few exceptions, the structure of every entity changes in the course of the CS evolution.

Summary

283

7. 'susceptibility to change: per reference' changes in references are less frequent than changes in entities. To put it another way: references are more stable than entities. On the other hand, reflexive references and 1:1 references are unstable as all of these were changed in the course of the schema evolutions. 8. 'preservation of entity identity' the hypothesis that entity identity is preserved is confirmed; we observe a change in entity identity in 15% of the semantic changes. Other long-term trends that we observe in the four evolving CSs are: - the size of the CS, counted as the number of entities plus references, increases, - the number of semantic changes seems to decrease over time, - the schemas appear to become less abstract over time, and - derived data in the CS are a cause of unjustified change. From our experiences in these case studies, we formulate a series of best practices. Our aim is to enable the maintenance engineer to achieve a higher level of flexibility in, and a more graceful evolution of, the evolving CS. Synthesis Chapter 8 presents a synthesis of the Theoretical and Practical Tracks. We argue that our research results provide a practical proof of concept for the Framework for Flexibility, and its dimensions and guidelines. We also conclude from the Practical Track that design strategies alone are inadequate to ensure CS flexibility once the designed CS has gone operational. Next, we propose five new Laws of Conceptual Schema Evolution: - every CS evolves, - changes in the CS are compatible, so that constructs of the old CS usually fit seamlessly in the new CS, - the top-down hierarchy in an evolving CS remains fixed, - each CS grows larger, more complex, and less abstract, and - the CS as a conceptual model of information 'out there' in the real world survives the information system that it is a core component of. 
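The long-term metrics reported above can be illustrated with a small tally over a version history. The sketch below is ours, not the thesis's: the names SchemaVersion, ChangeKind, and summarize, and all figures, are hypothetical. It merely shows how the size metric (entities plus references) and the extension/restriction shares could be computed from recorded CS versions and classified semantic changes.

```python
from dataclasses import dataclass
from enum import Enum

class ChangeKind(Enum):
    """Coarse classification of a semantic change (extensibility metric)."""
    EXTENSION = "extension"      # adds constructs to the CS
    RESTRICTION = "restriction"  # removes or narrows constructs
    OTHER = "other"              # alters the CS in some other way

@dataclass
class SchemaVersion:
    """One operational CS version; size is counted as entities plus references."""
    entities: int
    references: int

    @property
    def size(self) -> int:
        return self.entities + self.references

def summarize(versions: list[SchemaVersion], changes: list[ChangeKind]) -> dict:
    """Tally growth in CS size and the share of extensions and restrictions."""
    sizes = [v.size for v in versions]
    total = len(changes)
    ext = sum(1 for c in changes if c is ChangeKind.EXTENSION)
    res = sum(1 for c in changes if c is ChangeKind.RESTRICTION)
    return {
        "size_growth": sizes[-1] - sizes[0],
        "extension_share": ext / total,
        "restriction_share": res / total,
    }

# Hypothetical evolution: three CS versions, five semantic changes.
history = [SchemaVersion(8, 9), SchemaVersion(10, 12), SchemaVersion(12, 14)]
changes = [ChangeKind.EXTENSION] * 3 + [ChangeKind.RESTRICTION, ChangeKind.OTHER]
print(summarize(history, changes))
# → {'size_growth': 9, 'extension_share': 0.6, 'restriction_share': 0.2}
```

In this fabricated history the CS grows by nine constructs and three of five changes are extensions, mirroring the kind of trend data on which the findings above are based.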
Future directions
This thesis is a first exploration of Conceptual Schema evolution, describing and analyzing the evolution of real CSs in their natural environment: the operational business. To the best of our knowledge, this kind of work has not been reported before. Our work gives some unique insights into the main features of CS evolution, and it provides an excellent basis for the further research that is needed to understand, forecast, and control CS flexibility.

Lex Wedemeijer
Maastricht, May 5, 2002



INDEX

3-Schema Architecture 7, 22
ABP 11
access 80
adaptability 38
aggregation 43, 126, 225
analytical power 26
anomaly 3, 40
artificial difference 26
attribute 5, 225
  artificial 226
base data 177
boundary 138, 182
business environment 2, 80, 221
business function 11
business process 85, 177
cardinality 168, 226
case study 12, 85, 221
change 1, 28, 85
  coordination 29
  justified 65
  propagation 39
  proportional 66
  semantic 8
  succession 146
change driver 10, 37
coherent 39, 77
compatible 68
complete 23
complexity 71
composition 24
comprehensibility 26
Conceptual Schema. see CS
connection constraint 71, 178, 227
consistent 23
constraint 5, 47, 225
construct 177, 225
conversion 29
corporate data model 48
correct 23
CS 22
  design proposal 8
CS lattice 71
CS quality 23, 210, 222
CS version 29, 85
data
  abstraction 200
  coercion 29
  derived 23, 130, 226
  distribution 45, 80
  format 72, 225
  ownership 44
  quality 24
  validity 147, 199
data access routine 39
data coercion 147
data independence 22
data model theory 225
  essential 26
  extended 26
database management system 39
database reorganization 29
data-independent 23, 69
deferred data conversion 145
dependence 42, 177
derivation function 177
design phase 8
design quality 8
design strategy 36
diagram 30, 227
dimension 37
discrepancy 26, 69
dissociate 123, 169
diversification 31
documentation 11
documentation quality 14
domain 113
domain mismatch 225
elimination 145, 161, 240
encapsulation 40, 75
enterprise model 48
enterprise source data 178, 180
entity 225
  complex 44
  composition 167
  extent 225
  generalization 226
  identity 74
  integrity 225
  intent 30, 69, 225
  member 226
  owner 226
  specialization 226
  structure 30, 225
environment 37
evolution 2, 11, 32, 85
evolution operator 151
experience 80, 221
expressibility 26
extension 69
External Schema 22
Extra-temporal Schema 25
flexibility 2, 31, 220
fragmentation 248
functional dependency 74, 177
generalization 43, 152, 225
global CS 48, 183
granularity 72, 75
guideline 39
hierarchy 43, 100, 209, 227
historical data 88, 104
impact of change 39
implementation phase 9
incompatible 68
information chain 178
information structure 2
information system 1, 85
  legacy 9
inheritance 32
instance 225
integration 25, 48
integrity constraint 225
Internal Schema 22
intersection 238
irreducible 23, 180
justified change 3, 65
key 47, 74, 225
  weak-entity 146
lattice complexity 71
layering 200
level of abstraction 26, 71
life cycle 28, 43, 210
life cycle phase 9
localized change 66, 75
maintainability 24, 64
maintenance 66
major entity 43
Majorant Schema 25
member entity 226
metric 24, 64
minimal 23, 40
mismatch 183, 225
modeling error 9
module 40, 75
name change 225
naming conflict 25
none 226
null 226
ontology 53, 206
operational phase 9
owner entity 226
package 43
partner 104
pattern 50, 148
pension 104
persistent 3
population 225
Postkantoren 11
postponed elimination 145
precedes 180
program evolution 32
property 225
proportional change 66
protocol 12
quality 219
rate of change 67
reduction 161
redundant 23, 179
reference 226
  aggregation 71, 209
  compulsory 226
  optional 227
reference path 178, 227
referential integrity constraint 226
referred entity 226
referring entity 226
reflective capability 45
reflexive reference 168
reification 152
relation 226
relational data model 7, 225
relationship 104
relax 250
replacement phase 9
representation 40, 225
restriction 161
retention-time 179
retroactive 25
reuse 25
reversal 97, 147
reverse engineering 14, 25
rigidity 67
schema development 9
schema evolution 27, 29
schema persistence 210
schema size 80
schema transformation 41
scope 39
semantic change 8
semantic change pattern 148, 228
semantic heterogeneity 26, 41
semantic homogeneity 183
significant 15
simple 23
site 110
snapshot data 88
software complexity 32, 66
specialization 226
stability 3, 101
strict typing paradigm 29, 145
structural conflict 25
subschema 22, 71
succession 25, 146
survives 180, 210, 283
susceptibility to change 72
syntax 6, 32
taxonomy 41, 177, 227
temporal 88, 177
temporarily specified 180
test phase 9
theoretic difference 26
timeliness 38
transitive closure 43
type persistence 73, 171
unit 8, 113
Universe of Discourse. see UoD
UoD 22
user perception 3
user-view 22
valid 8, 18, 23, 59
volatility 24, 80
weak-entity key 146



CURRICULUM VITAE

Lex Wedemeijer

February 6, 1956: born in Velsen.
August 1968 - June 1974: pre-university secondary education (VWO) at the Gymnasium Felisenum in Velsen.
August 1974 - August 1980: studied Mathematics at the Rijks Universiteit Groningen; passed the doctoraal examination cum laude in Theoretical Mathematics, with a specialization in Linear Analysis.
October 1980 - September 1984: scientific staff member in the Applied Mathematics group of the Faculty of Mathematics and Natural Sciences, Rijks Universiteit Utrecht.
October 1984 - December 1995: employed initially by the Staatsbedrijf der PTT, later Koninklijke PTT Nederland, in positions within the operating companies Post and Postkantoren B.V. in Groningen.
September 1987 - June 1990: completed a postgraduate study in Business Administration at the Stichting Academie voor Bedrijfskunde in Groningen.
January 1996 - present: employed by Stichting Pensioenfonds ABP in positions within the Information Management department of ABP Pensioenen in Heerlen.