Data Modelling and Database Requirements for Geographical Data

Håvard Tveite January, 1997

Abstract

An overview of the fields of data modelling, database systems and geographical information systems is presented as a background. Requirements for a data model and a data modelling methodology for geographical data are investigated and presented. Another contribution is an extension of the traditional ER-diagrams to better communicate the semantics of geographical data. The approach is based on earlier work on Sub-Model Substitution, and adds new symbology that is relevant for geographical data. Database system requirements for geographical data servers are investigated and presented, together with new ideas on the distribution of geographical data for parallel processing.

Table of Contents

Chapter 1  Introduction
    Motivation
    Contributions
    Related work
    How this document is organised

Chapter 2  Database Systems and Data Models
    Data modelling
        Modelling concepts
        Infological data models and the infological and datalogical realm
        Metadata versus “ordinary” data
    Semantic data models
        ER models and diagrams
        EER models and diagrams
        Object-oriented data models
    Database systems
        Brief history
        Definitions
        The three-schema architecture
        Features/services of database systems
        Distributed database systems
        Database machines
        Status of database systems
    Database models
        Hierarchical DBMSs
        Network DBMSs
        Relational DBMSs
        Object-oriented DBMSs
        Deductive DBMSs

Chapter 3  Geographical Information Systems
    History
    Definitions of GIS
    The utility of geographical information systems
        Local administration GIS, an example application area
    Geographical data
        Geographical maps
        Spatial geographical data
        Non-spatial or “catalogue type” GIS data
        Historical data
        Data quality
        Data distribution and sharing
    Models for geographical data
        The raster paradigm
        The vector paradigm
        Representation of the interior of spatial objects
    Queries and operations
        GIS queries
        Use of the different GIS query types
    Current GIS technology
        ARC/INFO
        System 9
        TIGRIS
        Smallworld GIS
        GRASS
        Summary
    Trends
        Hardware trends
        Technology trends
        GIS trends
    The GIS of the future
        Servers of geographical information
    Research and research issues

Chapter 4  Data model requirements
    Introduction
    Geographical data revisited
        Borders of geographical phenomena
        Features of geographical data
    Requirements for high level geographical data models
        Traditional ER model abstractions
        Geometrical object types
        Spatial relationships
        Implicit geographical relationships
        Topology
        Aggregation
        Generalisation
        Categories
        History and time
        Quality/accuracy
        Derived objects
        Sharing of geometrical objects among geographical objects
        Roles and scale
        Spatial constraints
        Groups of related objects (themes)
        Distributed ownership
        Behaviour
    Modelling implications
    Proposed data models and exchange standards for GIS data
        ATKIS
        SDTS
        NGIS and FGIS
        MetaMap

Chapter 5  Sub-Structure Abstraction in Geographical Data Modelling
    Context
    Geographical data modelling using structure abstractions
        Extending ER-diagrams with sub-model substitution
        A forestry research example
    Translation
    Conclusion
        Future work

Chapter 6  Database management system issues for geographical data
    Basic requirements
    Data volumes and data types
        Samples
        Raster data
        Vector data
        Time
        Generalisation levels
        Summary
    Multimedia (integrated) database systems
        Hypertext
    Spatio-temporal databases
        Concepts of time in databases
        Representing time in databases
        TQuel
        Time in geographical databases
    Metadata and data dictionaries
        Quality in geographical databases
        Data dictionary issues for geographical data
    Geographical Query Languages
        Different ways of organising geographical information
        Spatial query language proposals
        Query optimisation
        Spatial data types
        Spatial constraints
        Operations
    Transactions
        Transactions on temporal geographical data
        Transaction management
        Concurrency Control
    Distribution issues
        Parallel processing
        Distribution of spatial data
        Replication
        Heterogeneous database system integration
        Fast geometrical processing
        Data exchange formats
    Some limitations of currently used database models
        Network database models
        The relational database model
        Object-oriented database models
    Conclusions

Appendix A  Data structures for spatial databases
    Basic data structures
        Digital computer storage media
        Sequences (lists/arrays)
        Randomised sequences
    Hierarchical structures
    Multi-dimensional trees
        Points
        Lines
        Regions in 2D
    Grid partitioning and spatial hashing
        Multi resolution image trees (pyramids)
        Region quad trees
        Linearisation
        EXCELL
        Grid file

Appendix B  Representation of 3D structures
    3D objects
    Storage organisation
    Point sampling
    Wire frame
    Triangulated Irregular Network
    Parametric representations
    Constructive Solid Geometry

Appendix C  The NHS Electronic Navigational Chart Database
    Introduction
    Background
    Navigational Charts
        ENC and ECDIS
        The ENCDB
        Data management
        Relating the traditional chart data to other data
    Structures for the ECDIS database
    Data modelling for ECDIS
    DBMS-aspects of an ENC-server
        The amount of data
        The data
        Response time
        Concurrency and recovery
        Security
        Reliability
        Billing
        The choice of a database system for the ECDIS server
    Conclusions

Bibliography

Index

Acknowledgements

I wish to thank my supervisor, Professor Kjell Bratbergsengen, for encouragement and support through the 8 years that have passed since I started these studies. Without his continuous commitment and goodwill, I would have given up a long time ago. Thanks also to Professor Ingolf Hådem, who agreed to be my advisor on photogrammetry. The contact with Hådem has been sporadic after the focus of my work was directed to data modelling and database management.

The first part of this study took place when I was employed as a research assistant at the Department of Computer Systems and Telematics, NTH (now a part of NTNU), for two and a half years from 1988 to 1990. Then, I was supported by a research grant from the Norwegian Research Council for one and a half years from 1990 to 1991. The rest of the work has been done now and then while being employed at the Department of Surveying at the Agricultural University of Norway (NLH).

I would like to thank everyone at the Department of Computer Systems and Telematics in Trondheim for a friendly atmosphere. In particular, I would like to mention the members of the database group. They have always been very helpful. My employer during the last years of this work, the Department of Surveying at NLH, also deserves some thanks for encouraging me to finish this work.

I have some "friends" who have been annoying me by asking about the status of my thesis work on all occasions during the last 4 years. I am not quite sure if I should thank them or not! Last, but not least, the friendly atmosphere of "Munkholmens Lægeforening" has been an important inspiration. Without such a stimulating environment, it would have been difficult to get the necessary inspiration for finishing this work.


Chapter 1 Introduction

Digital geographical data are indispensable for monitoring and managing the environment and for managing and planning geographically based human activities, such as land use, utility networks, long-distance transportation and mining, in efficient ways.

Sharing of digital geographical data both between and within organisations is of utmost importance for the efficient use of geographical information systems (GISs). One reason for this is the large effort and cost involved in collecting and maintaining high quality geographical data sets: the more users that can share the data, the easier it is to cover these expenses. Another important reason for sharing is that the availability of high quality data sets has the potential of making environmental and land-use planning and management better and more cost-efficient.

To be able to share digital geographical data, standards are necessary. Standard data models for spatial data, standard encoding formats for the exchange of spatial data and standard communication protocols for distributing spatial data are all necessary parts of a foundation for efficient geographical data sharing, with the spatial data model as the basic component. A number of national standards for the digital encoding of topographic and thematic maps have emerged in the last decade. The problem with today's standards is that they cover only a limited part of the semantics necessary for general purpose exchange of geographical data. The lack of an agreed-upon data model that covers the essential aspects of spatial data has been impeding the development of powerful exchange standards. This thesis looks into the problems of geographical data models and geographical data modelling, and outlines some possible solutions.

To be able to share geographical data between and within organisations, it is necessary to have a system for managing the data. This thesis presents a set of requirements for database management systems that are to act as servers of geographical data/information.

1.1 Motivation

Research on geographical information systems has suffered from the lack of a solid foundation. Many GIS concepts need clarification, spatial data modelling methodologies should be developed, spatial database systems and data structures need elaboration [Günther90], digital cartography and GIS user interfaces need sophistication and, finally, there is an urgent need for standards.

The use of GISs is particularly impeded by the lack of standards and the resulting limited availability of high quality data sets. Investments in GISs are risky in such a situation. It is difficult for users to find a suitable system covering their needs for spatial data support when there is no consensus on what kind of functionality and which kinds of interfaces such a system should provide. Developing and marketing geographical data sets is difficult when there are no generally accepted standards for their structuring, storage and exchange. When such standards are in place, there will be a market for geographical data servers and services. Such servers should be connected to an international public computer network, giving "everyone" access to useful spatial data. An international system of spatial data servers will have to be supported by mechanisms for finding the right data, and sophisticated spatial database systems are required to manage the geographical data on these servers.

Data modelling techniques supporting spatial data become increasingly important as the use of GISs becomes more and more widespread. There is a need for simple concepts and intuitive models in the communication process between the computer scientists and GIS experts on the one side and the spatial science experts on the other. A standardised high level approach to geographical data modelling would be a very useful tool. Such a platform for integrated use of all kinds of geographical information would be a good starting point for GIS application and database development.

If a more solid foundation for GISs can be achieved, the activity in the field must be expected to increase significantly. The serious use of GISs could blossom, and GIS related research and the use of GIS as a tool in other kinds of spatial research would accelerate.

1.2 Contributions

The main contributions of this work are in two areas. The first area is geographical data modelling, and the second area is database support for geographical data, with special emphasis on the distribution/partitioning of geographical information.

• Modelling concepts specific to geographical and spatial information are identified.
• Spatial sub-structure abstractions in ER-like diagrams [Tveite92] are proposed.
• Database issues for geographical data are outlined and investigated.
• Distribution issues for geographical data [Tveite93] are identified, and a distribution strategy for geographical data is outlined.

1.3 Related work

Research on databases for spatial data is one of the branches of database system research that has been receiving increasing attention during the last decade [Günther90]. There are now well-attended special purpose conferences on advances in spatial databases (SSD89 [Buchmann90], SSD91 [Günther91], SSD93 [Abel93], SSD95 [Egenhofer95]).

Data models for spatial and geographical databases and geographical information systems have received some attention in the 1980s and early 1990s. As in the database community, object-oriented methods have been particularly popular recently. Among the early publications on these topics are Egenhofer [Egenhofer87] [Egenhofer89a] [Egenhofer89b], Feuchtwanger [Feuchtwanger89] [Feuchtwanger93], Frank [Frank88] [Egenhofer87] [Egenhofer89a] [Egenhofer89b], Worboys [Worboys90a] [Worboys90b], Hearnshaw [Worboys90a] [Worboys90b], Maguire [Worboys90a] [Worboys90b], Morehouse [Morehouse90], Orenstein [Orenstein86] [Orenstein88] [Orenstein90b], Peucker [Peucker75] [Peucker78], Scholl [Scholl90] and Voisard [Scholl90].

Within the area of distribution and parallelisation, there has been work on the use of parallel technology for geographical information analysis at the University of Edinburgh, where parallelisation of GIS algorithms has been investigated. Some other efforts on algorithms have also been made, for instance by Mower [Mower92], but the use of parallel technology for organising general purpose spatial databases has not been given much attention.

The main part of this thesis has been written while working with the database technology group at IDT, NTH. Distributed database technology (both hardware and software) has been the focus of the group, and several prototype parallel database machines have been developed for research purposes. The research performed by this group has provided valuable input to the “distribution part” of this thesis.

1.4 How this document is organised

The thesis can roughly be divided into two parts. The first part includes chapters 2 to 6, and contains the central aspects of the thesis, namely data modelling and database system topics for GIS. The second part comprises the appendices A to C. Appendix A is a very short overview of spatial data structures. Appendix B is a short presentation of representation techniques for three-dimensional (3D) structures. Appendix C is a report submitted to the Norwegian Hydrographic Service discussing database issues for an electronic navigational chart database that was under construction some years ago. The server was to provide authorised chart information to ships.


Chapter 2 Database Systems and Data Models

This chapter is an introduction to the fields of data modelling, database systems and database models, a necessary background for the rest of the thesis. The review will be limited to a short summary of the most common data modelling approaches, an overview of the features expected from a database system and some short notes on the most popular database models.

2.1 Data modelling

An information system that is to support an activity should cover all aspects of the real world pertinent to that activity. To be able to develop such an information system, a good model of this so-called mini-world must be developed. Such a high-level data model should abstract and structure descriptions of the phenomena in the mini-world in such a way that the information becomes manageable and understandable for humans. It is important for a useful data model to [Tsichritzis82]: “… capture the appropriate amount of meaning as related to the desired use of the data”. Much research has been devoted to the development of powerful modelling formalisms, emphasising the communication (presentation and visualisation) of mini-worlds between humans and the translation of the models into formats suitable for computer handling.

2.1.1 Modelling concepts

To be able to talk about the world and our representation of the world in a model, a certain vocabulary must be defined. The following is a blend of terminology taken from different sources ([Tsichritzis82], [Chen76], [Ng81], [Elmasri89], [Rumbaugh91], [Sindre90]).

Abstraction is used to hide detail, so that one can concentrate on overall structure. The recognised data abstraction mechanisms (a small code sketch at the end of this subsection illustrates the first three) are:

• Classification is the formation of an object type from a group of similar tokens (the reverse process is called instantiation).
• Generalisation is the abstraction of similar object types into a higher level object type (the reverse process is called specialisation).
• Aggregation is the abstraction by which an object is constructed from its constituent objects [Tsichritzis82]. Aggregation and generalisation hierarchies are orthogonal, and can therefore be specified separately [Tsichritzis82]. The term Association [Elmasri89] is also used for type level aggregations.
• Association [Sindre90] is related to aggregation, but is a weaker relationship between independent objects (not really structural). Grouping [Hull87] covers the same abstraction as association. Category [Elmasri89] is also similar to association. One use of association is the grouping of different classes that play the same role in a relationship to some other class (the owner of a property can be either an organisation or a person). Associations can often be represented using generalisations.
• Identification ensures that all abstract concepts and concrete objects can be made uniquely identifiable. This can be accomplished by unique names or by other means [Elmasri89].

Attribute: a named domain that represents a semantically meaningful object … [Tsichritzis82] (for example the name of a person, the geometry of an area feature, the speed limit of a road, …).

Class: the group of all objects obeying the class’ membership condition/predicate; the set of all objects of a certain object type. Category [Tsichritzis82] is a similar concept to class. Data in the same category are supposed to have similarities [Tsichritzis82].

Constraint: in a data modelling context, inherent constraints are limitations imposed by the structures of the data model. Explicit constraints enable the modeller to include more semantics in the model than the structures of the data model itself convey.

Datum (plural: data): an existing description of some phenomenon or phenomena (measurement recordings, images, information catalogues, …).

Domain: in data modelling, homogeneous sets are called domains (examples of some traditional domains in data modelling: integers with values between 0 and 80, real numbers, strings of characters of maximum length 15, dates, …).

Extensional property: a token-/object-level property.

Intensional property: an (object) type-level property.

Object: the human interpretation of a phenomenon in a modelling context (in some modelling formalisms this is represented as an aggregation of attributes).

(Object) Type: the common characteristics of a set of similar objects can be covered by a type (abstraction is used to define a type from a class of similar tokens [Tsichritzis82]). Strictly typed data models are data models where each datum must belong to some category; loosely typed data models do not make any assumptions about categories [Tsichritzis82].

(Object) Token: an instance of an object type (a token is an actual value or a particular instance of an object [Tsichritzis82]).

Phenomenon: some interesting “thing” (event, object, …) in the real world (for example a flow of water, an organism, a building, a car accident, …). The phenomenon concept covers the Entity concept (entity: “… something with objective reality which exists or can be thought of”, as suggested by Hall in 1976 [Tsichritzis82]). Phenomenon will be used for references to the real world in this thesis. Entity will be reserved for use in the context of the Entity-Relationship (ER) modelling formalism.

Relation: a mathematical relation is a set that expresses a correspondence between (or aggregation of) two or more sets [Tsichritzis82]. In the relational model, both the entities and the relationships from the ER model are formalised using relations. N-ary relations can be visualised as tables where n-tuples constitute the rows.

Relationship: an observed or intended connection between phenomena that is interesting for the modelling of a mini-world. An n-ary relationship connects n phenomena. The most common relationship type is the binary relationship, connecting two phenomena. Rumbaugh et al. call the relationship concept an association [Rumbaugh91].

Set: in data modelling, a set is any collection of objects that is properly identified and is characterised by a membership condition [Tsichritzis82]. A classical mathematical set is not ordered, and duplicates are not allowed. An extended mathematical set allows ordering. Groupings [Vossen91] and sets are similar concepts.

Tuple: the row of a relational table, or a list of values. In the relational model, each value comes from a pre-defined domain. An n-tuple is a list of n values from a set of n domains.
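To make these abstraction mechanisms concrete, here is a minimal Python sketch (all class and attribute names are invented for illustration, not taken from any of the cited sources): generalisation appears as subclassing, aggregation as an object holding its constituent objects, and classification/instantiation as the creation of tokens from a type.

```python
from dataclasses import dataclass
from typing import List

# Generalisation: Road and LandParcel are specialisations of the
# more general object type GeographicalObject.
@dataclass
class GeographicalObject:
    identifier: str          # identification: every object is uniquely named

@dataclass
class Road(GeographicalObject):
    speed_limit: int         # attribute: a named domain (km/h)

@dataclass
class LandParcel(GeographicalObject):
    area_m2: float           # attribute drawn from the domain of real numbers

# Aggregation: a Municipality is constructed from its constituent parcels.
@dataclass
class Municipality(GeographicalObject):
    parcels: List[LandParcel]

# Classification/instantiation: creating tokens of the object types.
p1 = LandParcel("parcel-1", 12000.0)
p2 = LandParcel("parcel-2", 8500.0)
m = Municipality("municipality-A", [p1, p2])
```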

2.1.2 Infological data models and the infological and datalogical realm

The concepts of infological and datalogical data models were introduced by Langefors in a series of publications starting in 1963 [Tsichritzis82]. Infological data models represent information in ways that are supposed to be similar to how people perceive the information (the infological realm), without considering its final computer-related representation (the datalogical realm). The ideal situation for an information system designer is to have a powerful infological data model that can be easily communicated between humans, and for which there is a lossless translation into the datalogical realm.

Infological data models

In the early theoretical work on infological data models, the concepts of object, property, relationship and time were identified as basic. An elementary fact is in this framework represented as a triple (a collection of objects + a property or relationship + time), called an elementary constellation. Structured textual descriptions (natural language), formal logic (specification in, for instance, the logic-based programming language Prolog [Clocksin84]) and other structural techniques (with visualisation through diagrams) have been proposed as infological data models. Structured textual descriptions can express things in a human readable format, but have severe limitations when it comes to data structuring and formalisation for translation into the datalogical realm.
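To make the idea concrete, an elementary constellation can be written down as a simple triple. The following minimal Python sketch uses invented example values; it is an illustration, not a notation taken from Langefors:

```python
from collections import namedtuple

# An elementary constellation: (object reference(s), property or
# relationship, time). All example values are invented.
Constellation = namedtuple("Constellation", ["objects", "property", "time"])

facts = [
    Constellation(("building-17",), "height = 12 m", "1995-06-01"),
    Constellation(("road-E6", "bridge-4"), "crosses", "1995-06-01"),
]
```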


Logic has the advantage of being a formal description, having its roots in mathematics. It is therefore more easily translated into the datalogical realm. A problem is that logic lacks mechanisms for efficient communication of structure.

Diagrams have the advantage that they can show structure (relationships) in a human readable way (usually as two-dimensional maps), and diagrams have therefore become very popular for “semantic” data modelling. A problem with diagrams is that they can be difficult to translate into the datalogical realm, and there is a limit to the amount of information that can be put into a diagram without making it difficult to comprehend.

Semantic data models [Hull87] [Peckham88] introduce many useful methods for data structuring and abstraction, and constitute the most interesting branch of infological data models for database modelling. In this chapter, the ER model and an EER model are described to give a background in high level data modelling. The entity-relationship (ER) approach (or ER diagrams, initially proposed by Chen [Chen76]) has been the most popular diagrammatic representation for data modelling in the last decade. The expressiveness of the original ER model has been extended in many directions to capture more real-world semantics in the diagrams. The latest direction in real world modelling for computer representation is the object-oriented approach. Object-oriented methods add encapsulation and behaviour to the traditional structuring mechanisms of semantic data models.

The datalogical realm

Many different lower level data models (more closely tied to the datalogical realm) have been used through the years. They are by definition computer oriented, but the evolutionary trend of these data models is that they are approaching infological data models in expressiveness. The first low level data models from the 1950s and 1960s were based on simple file and record structures. Beginning in the late 1960s, there has been an “evolution” of the datalogical models, starting with the hierarchical data models and continuing with network data models and relational data models. In the last decade, object-oriented data models have been proposed. With object-oriented data models, the distinction between the datalogical and the infological realm is getting blurred: object-oriented models are claimed to cover both realms, being directly implementable through object-oriented database systems. As datalogical data models approach infological data models in expressive power and sophistication, their implementation becomes more and more complicated.

2.1.3 Metadata versus “ordinary” data

The semantics of data in a database can be described using metadata. In a relational database system, some metadata are available through the data description in the system catalogues, where all the relations (tables) are described (with relation names, field names, field types and keys). In the context of the geographical information standardisation work within CEN (Comité Européen de Normalisation, the European Committee for Standardisation), the term metadata is defined as [CEN95b]:


“Data that describes the content, representation, extent (both geographic and temporal), spatial reference, quality and administration of a geographic data set.”

The inclusion of more semantics through more elaborate (and higher levels of) data description is often desirable. As much as possible of the information from the semantic data model underlying the database should be available within the final database. Data quality, the time of validity/acquisition of the data, the constraints that pertain to the data, and the data set location and ownership in a distributed database environment are all examples of useful metadata. Metadata could be provided at a separate level, or they could be integrated with the basic data using attributes or relationships to metadata. In general, it will be difficult to draw a sharp line between what constitutes the metadata and what constitutes the “ordinary” data. The method of metadata representation (integrated or separated) will often be a matter of preference, but could also be dictated by the application type. For example: should the spatial extent/position of a geographical object be considered a metadata attribute or a basic attribute of the object? It is important to arrive at standards for the representation of metadata. If such standards are available, databases can be more self-contained (representing more of the real world semantics), and easier to utilise and validate for a larger class of users.
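As a sketch of the integrated approach, metadata in the spirit of the CEN definition could be attached to a data set description roughly as follows (the field names and example values are invented for illustration, not taken from the CEN documents):

```python
from dataclasses import dataclass

@dataclass
class DataSetMetadata:
    content: str              # what the data set describes
    representation: str       # e.g. "vector" or "raster"
    geographic_extent: tuple  # bounding box (min_x, min_y, max_x, max_y)
    temporal_extent: tuple    # (valid_from, valid_to)
    spatial_reference: str    # coordinate reference system
    quality: str              # e.g. a positional accuracy statement
    administration: str       # owner/steward of the data set

roads_1995 = DataSetMetadata(
    content="Road centre lines (example data set)",
    representation="vector",
    geographic_extent=(4.0, 57.9, 13.0, 64.0),
    temporal_extent=("1994-01-01", "1995-12-31"),
    spatial_reference="UTM zone 32 (example)",
    quality="positional accuracy better than 2 m (example)",
    administration="national mapping agency (example)",
)
```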

2.2 Semantic data models

Semantic data models [Hull87] have been a popular investigation topic since the late 1970s. One of the early data models in this category was the ER (Entity Relationship) model proposed by Chen [Chen76]. The SDM [Hammer78] is an example of a semantically richer data model, using terminology such as class, entity, object, aggregate, abstraction, event, name, attribute, subclass, restriction and subset. Semantic data models have a strong advantage over the traditional “database models” for real-world modelling, since they incorporate a wider range of data abstraction mechanisms. Developers and database designers working with complex data (CAD, CASE, GIS) face problems when they try to model their applications and data sets within the limits of the network or relational data model. The semantic data models are useful for infological data modelling, but the translation of complex semantic data models into, for instance, the relational model can be non-trivial. A common “solution” to this problem in many application areas has been to avoid traditional databases, developing custom data structures instead.

2.2.1 ER models and diagrams

The basic Entity Relationship (ER) model proposed by Chen [Chen76], and later elaborated on by Ng [Ng81] and others, offers the following primitives for modelling:

• Regular and weak entities. Weak entities are entities that cannot exist in isolation, and depend on other entity types for full identification. In the diagrams, a regular entity is represented by a labelled rectangle, and a weak entity by a double-sided labelled rectangle.


• Named relationships, involving two or more entities. In the diagrams, an n-ary relationship is usually represented by a labelled diamond with one line to each of the n participating entities.

• Constraints, such as existence dependencies (arrows instead of plain lines in the diagram) and relationship cardinalities (numbers placed along the relationship lines in the diagram). A structure example showing the symbology of ER diagrams, as proposed by Chen (except that the entities and relationships in the figure are unlabelled), is presented in Figure 2-1.

Figure 2-1 Original ER diagram symbology.

The expressiveness of the original ER diagrams has been extended (trivially) with:

• Attributes, with names and value sets/domains (value sets are represented as labelled circles) that can be attached (with a line) to both entities and relationships in the diagrams. The attribute name is placed along the line that attaches the value set circle to the entity rectangle.

• Constraints on attributes, such as keys (illustrated by underlining the attribute name). The resulting EAR model is described by an ISO document (ISO/TC97/SC5-N695). EAR (entity-attribute-relationship) diagrams have been extensively used in modelling, especially for relational database design. Whether to include attributes or not in the diagrams is a matter of preference. The problem with including attributes is that the diagrams tend to become cluttered and hence more difficult to communicate.

Complex objects (aggregations) can be modelled using the ER model by introducing consists-of/part-of (component-of) relationships between the complex entities and their member entities. Generalisation and specialisation are often modelled in the ER model by defining is-a relationships between the specialised object types and the more general object types (the vehicle object type is connected via is-a relationships from the more specialised object types: car, bus, bicycle, lorry, tractor, tram, …). Associations can be modelled using is-member-of relationships. Temporal relationships can be modelled by using precedes relationships, but history data or versioned objects do not have a particular modelling primitive (time is not included in the ER model). Time can be supported using attributes (time of creation, time of destruction). The ER modelling formalism was intended as a data modelling tool; the behavioural part of modelling is not addressed.

The big advantage of the basic ER model is that there are methods for translating all of its concepts into many popular database models (hierarchical, network and relational) [Ng81]. It is therefore fairly straightforward to implement as a database schema something that has been specified using the original ER model. Another advantage of the ER model is its limited number of modelling primitives, which makes the model easy to learn.

The limited number of modelling primitives is also a problem with the ER model. The pure ER approach can result in diagrams that are difficult to comprehend/communicate because of the necessary overloading of the very limited number of primitives. An abstraction mechanism that would allow the recognition of overall structure by grouping and hiding independent sub-models is lacking in the ER model. Omission of attributes is the only information hiding mechanism available, so it is not possible to perform multi-level modelling. As the number of entities and relationships in ER models increases, the diagrams tend to become visually unmanageable. In psychology it has been found that humans can only process 5 to 9 information items at a time (George Miller's paper in Psychological Review, March 1956, pp. 81-97 [Coad90]). According to this, diagrams with 10 to 100 information items will be very hard to digest when there is no apparent way of grouping them into more manageable pieces. In practical ER modelling of large structures it is already normal to split the diagrams in one way or another. The ER model does not, however, offer any abstraction mechanisms to support such a partitioning of the model into sub-models.

The choice of representation for a phenomenon will in many cases be a matter of preference. There are no basic rules for when to apply entities and when to apply relationships. All relationships can, in theory, also be represented as entities. This can be confusing to the users of the data modelling formalism.
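As a minimal illustration of such a translation (the entity, relationship and attribute names are invented), the usual mapping turns each entity type into a relation and each many-to-many relationship into a relation of its own. A sketch using Python's built-in sqlite3 module:

```python
import sqlite3

# A tiny ER model: entities Owner and Parcel, and a many-to-many
# relationship owns. Translated to relations in the usual way:
# one table per entity, one table per M:N relationship.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE owner (
    owner_id   INTEGER PRIMARY KEY,      -- key attribute
    name       TEXT NOT NULL
);
CREATE TABLE parcel (
    parcel_id  INTEGER PRIMARY KEY,
    area_m2    REAL
);
CREATE TABLE owns (                      -- the M:N relationship
    owner_id   INTEGER REFERENCES owner,
    parcel_id  INTEGER REFERENCES parcel,
    PRIMARY KEY (owner_id, parcel_id)
);
""")
```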

2.2.2 EER models and diagrams

Extended Entity Relationship (EER) models and diagrams have been proposed to overcome some of the deficiencies of the first generation of ER models ([Teorey86], [Batini86], [Elmasri89]). These models provide new abstraction mechanisms in addition to those provided by the original ER model. The EER approach also introduces new symbology for some of the most common abstractions to produce more easily comprehensible diagrams. Elmasri and Navathe's proposal for an EER model [Elmasri89] includes the notion of a class (that encompasses entity types), subclasses, superclasses (the set of members of a subclass is always a subset of the set of members of the superclass) and categories (associations). All classes can participate in relationships. The following symbology is added to the ER diagrams (see Figure 2-2 for an illustration):

Figure 2-2 EER symbology as used by Elmasri and Navathe [Elmasri89].

• superclass - subclass: The superclass’ and the subclass’ rectangles are connected with a line containing the subset symbol (⊂). The open end of the subset symbol points towards the superclass. A subclass can be defined by a predicate on the superclass’ group of attributes. In this case, the predicate is attached as a label to the subclass - superclass line.

• generalisation/specialisation: This is represented as a circle with a “d” (disjoint specialisation) or an “o” (overlapping specialisation) in it, connected with one line (or a double line, if the specialisation is total) to the superclass, and subset-lines (with the subset symbol (⊂) on) to all the subclasses. A specialisation can be based on the value of a single attribute, in which case it is called attribute defined. The name of this attribute is used to tag the specialisation at the superclass end of the symbol.

• categories: This is represented in diagrams as a circle with a “∪” (union) in it, having one subset line (double lined, in case the categorisation is total) to the category class (the open end of the subset symbol pointing towards the circle), and lines to all the defining classes. Predicates can be attached as labels to these lines to specify which of the members of each defining class that should be members of the category. The concept of categories makes it possible to group very different classes that play the same role in a relationship. A labelled rectangle is introduced for each category. This notion of category is similar to association.


• constraints:
  - superclass - subclass: a predicate to determine which characteristics a member of the superclass should have in order to be a member of the subclass.
  - specialisations: a double line from the superclass to the circle indicates that all the members of the superclass must be members of some subclass (total specialisation). A “d” or an “o” in the circle indicates whether the specialisation is disjoint (no superclass member can be a member of more than one subclass) or overlapping.
  - categorisation: a double line from the category class to the circle indicates that all the members of the defining classes must be members of the category class; predicates determine which members of the defining classes should be members of the category class.

This EER model does not include aggregation as a special concept, and that must be considered a weakness in the context of complex object modelling. By using aggregations it would be easier to hide detailed information and emphasise overall structure using a levelled or black-box based method. The structure of the EER model makes it possible to do some sort of multi-level modelling, but it is meant to be a single-level approach, and hence it inherits the one-level weakness from the ER model.

The EER model performs reasonably well for semantic modelling when compared to other popular modelling formalisms. In an empirical study comparing data modelling formalisms [Kim95c], the EER model [Teorey86] was compared to NIAM [Nijssen77], one of the most popular object-relationship models (a sort of binary model [Tsichritzis82]) [Biller77]. The findings of this empirical study, in which six hypotheses were tested, can be summarised as follows: (1) there was no significant difference between the NIAM user group and the EER user group in their model comprehension performance, (2) the NIAM user group did not perform significantly better than the EER user group in the discrepancy-checking task, (3a) there was no significant difference between the NIAM user group and the EER user group in their perceived difficulty of formalism, but (3b) the EER users valued their modelling formalism significantly more than the NIAM users, (4) EER analysts produced a data model of significantly higher semantic quality than NIAM analysts, (5) EER analysts did not produce a data model of significantly higher syntactic quality than NIAM analysts, and (6) the EER users perceived their modelling formalism to be significantly more useful than the NIAM users.
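To give a feel for what these constructs amount to in a concrete schema, an attribute-defined, disjoint specialisation is commonly realised with the defining attribute in the superclass table and one table per subclass. A rough sketch (table and column names invented), reusing the vehicle example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Superclass: every vehicle is in exactly one subclass (a total,
-- disjoint specialisation), recorded by the defining attribute.
CREATE TABLE vehicle (
    vehicle_id INTEGER PRIMARY KEY,
    kind       TEXT NOT NULL CHECK (kind IN ('car', 'bus', 'lorry'))
);
-- Subclass tables hold the attributes specific to each kind.
CREATE TABLE car   (vehicle_id  INTEGER PRIMARY KEY REFERENCES vehicle,
                    seats       INTEGER);
CREATE TABLE lorry (vehicle_id  INTEGER PRIMARY KEY REFERENCES vehicle,
                    max_load_kg REAL);
""")
```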

2.2.3 Object-oriented data models

Object-oriented modelling research, starting in the 1980s, had its roots in semantic data models and object-oriented programming languages (such as SIMULA [Birtwistle73] and Smalltalk [Goldberg83]). Object-oriented data models incorporate such things as encapsulation and behaviour in addition to the structuring mechanisms of semantic data models [Rumbaugh91] [Coad90]. Direct realisation of object-oriented data models in object-oriented database systems has received a lot of attention, gaining momentum in the mid 1980s [Abiteboul90]. Ideas of richer database models than the relational model were, however, starting to emerge already in the late 1970s (e.g. the SIMULA based ASTRA with the ASTRAL (extended Pascal) language [Bratbergsengen83] and PASCAL/R [Schmidt83b]). Modelling approaches that incorporate mechanisms from semantic data models are called structurally object-oriented, while those using mechanisms from object-oriented programming languages are termed behaviourally object-oriented. An object-oriented modelling approach should incorporate both the behavioural and the structural aspects.

Object-oriented programming languages

The behavioural aspect of object-oriented data models has evolved from the field of object-oriented programming languages, having their roots in SIMULA [Birtwistle73] in the late 1960s, and continuing with Smalltalk [Goldberg83] and C++ [Stroustrup91] in the 1970s and 1980s. The key features of object-oriented programming languages are:

• Abstract data types, including methods for presenting and manipulating the state of the objects.

• Communication by message passing. To ask an object about some property, a message is sent to the object. The messages constitute the interface to the object.

• Encapsulation/information hiding. Access to the internals of the objects is restricted, so information on an object is generally only available through its public interface (methods).

• Generalisation/specialisation hierarchies. A car, a bus and a lorry all have some common properties that can be captured by the more general class of vehicles. Cars, buses and lorries are specialised subsets of vehicles.

• Inheritance: properties and methods are inherited from the root of a generalisation tree and out to the leaves (the sketch at the end of this section illustrates these features together).

Object-Oriented modelling and analysis

Object-oriented data models combine abstractions from semantic data modelling and object-oriented programming languages. This makes them useful for many classes of real world modelling. Their advantage is in areas where behaviour is important. Simulation is such an application domain, often used in decision support systems. Object-oriented approaches provide an integrated framework for modelling both applications and the data the applications will be working on [Coad90]: “… it combines the data and process model into one complete model”. Object-oriented methodology has a great potential for GIS modelling, but for the geographical data modelling undertaken in this thesis, structural methods are considered sufficient, as explained further in chapter 5.
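The sketch below ties these key features together in a few lines of Python (the class and method names are invented, following the vehicle example above):

```python
class Vehicle:
    """Abstract data type: state plus methods (generalisation root)."""
    def __init__(self, registration: str):
        self._registration = registration    # encapsulated state

    def describe(self) -> str:               # public interface ("message")
        return f"{type(self).__name__} {self._registration}"

class Car(Vehicle):                           # specialisation of Vehicle
    def __init__(self, registration: str, seats: int):
        super().__init__(registration)
        self._seats = seats

    def describe(self) -> str:                # inherited and refined
        return f"{super().describe()} with {self._seats} seats"

print(Car("DK 12345", 5).describe())          # message passing
```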

2.3 Database systems

Database systems facilitate data sharing and easy access to data. This is made possible by providing standardised interfaces to the data in the database and applying mechanisms that ensure consistent access to the data for concurrent users. In addition to this, database systems ensure database consistency after system failure.


2.3.1 Brief history

The history of electronic data management started with the “process-oriented” period (1960-1970). In this period, before database systems were introduced, applications and their data were tied intimately together. Files could be shared between applications, but the structuring of the data was embedded within the applications. This meant that in order to apply a small modification to the data structure in a file, it was necessary to change all the applications that were using it. By far the easiest approach for such systems was therefore to let the data structures remain static. Consequently, new and more efficient data structuring methods were difficult to take advantage of. In this period, work on data management systems started, and early commercial systems emerged (e.g. IDS in 1962, and IMS-2/VS in 1968 [Wiederhold81]) with standardised access methods.

The “data-oriented” period (1970-) followed this first period. The necessity of controlled sharing of data was recognised, particularly for business data within large organisations. The introduction of the database system approach, as we know it today, occurred early in this period. Standard database models with standard interfaces to the data were developed (network, hierarchical and relational), hiding the internal structure of the database (access structures and internal data formats) from the applications. The security and integrity of data in multi-user centralised and distributed database systems has continuously been enhanced through advances in transaction management research (concurrency control mechanisms, recovery protocols and commit protocols).

By the beginning of the 1990s, the database needs of most business type applications had been satisfied by commercially available database system technology. Engineering applications and other applications based on complex data do, however, seem to have demands on databases that go beyond the capabilities of current database technology [Carey90] [Maier89] [Frank84] [Egenhofer87] [Frank88]. These applications have, for efficiency and modelling reasons, until now not been utilising database systems for the management of their data. Some database systems have been constructed to meet the special needs of technical applications, such as the extended relational system TECHRA [TECHRA93]. During the last decade, the need for database system support has become apparent also for applications that work on complex data. To try to meet these needs, extensions to the now maturing relational database management systems have been proposed (in competition with object-oriented databases). These new database systems should provide a more flexible and efficient environment for integrating applications and data.

2.3.2 Definitions

There have been many attempts at defining a good and consistent terminology for the research field of database systems. The descriptions provided below, taken from Elmasri and Navathe's book on database systems [Elmasri89], apply for this thesis and reflect the most common terminology in the database literature.

Database
“A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot be referred to as a database.” “A database is designed, built, and populated with data for a specific purpose. It has an intended group of users and some preconceived applications in which these users are interested.” “A database represents some aspect of the real world, sometimes called the mini-world. Changes to the mini-world are reflected in the database.”

Database management system (DBMS)
“A database management system (DBMS) is a collection of programmes that enables users to create and maintain a database. The DBMS is hence a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications.”

Database system (= database + DBMS)
… “ - we usually have a considerable amount of software to manipulate the database in addition to the database itself. The database and software are together called a database system.”

Self-contained nature of a database system
“A fundamental characteristic of the database approach is that the database system contains not only the database itself but also a complete definition or description of the database. This definition is stored in the system catalogue, …”

Distributed DBMS (DDBMS)
“A distributed DBMS (DDBMS) can have the actual database and DBMS software distributed over many sites connected by a computer network. Homogeneous DDBMSs use the same software at multiple sites. A recent trend is to develop software to access several autonomous pre-existing databases stored under heterogeneous DDBMSs. This leads to a federated DBMS (or multidatabase system), where the participating DBMSs are loosely coupled and have a degree of local autonomy.”

2.3.3 The three-schema architecture

The three-schema architecture (or the ANSI/X3/SPARC DBMS Framework [Yormark77] [Tsichritzis78]) is a recognised three-level model for database system architecture (Figure 2-3).

Figure 2-3 The ANSI/X3/SPARC three-schema architecture for database systems.

The internal schema/level is the direct interface to the data structures used to implement the database. Low level features, such as pointers, hash tables and other data structures, are available at this level. All the mechanisms provided by the conceptual schema must be translated into the operations and data structures of the internal schema. The internal schema is only used by system programmers to implement data formats and operations at the conceptual level of the database system.

The conceptual schema is described as follows [Elmasri89]: “The conceptual schema is a global description of the database that hides the details of physical storage structures and concentrates on describing entities, data types, relationships, and constraints.” A high-level data model or an implementation data model can be used at this level.

The external schema/level provides specialised views of the database. Each external view is tailored to a user or a group of users, so that only the data and operations that are of interest to these users are accessible through the view. The external level can be used both to hide data from unauthorised usage and to customise interfaces to the database.
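As a loose analogy (not an exact correspondence), the three levels can be illustrated with standard SQL facilities: a base table for the conceptual schema, an index for the internal schema and a view for an external schema. The schema and attribute names below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual schema: entities, attributes, constraints.
CREATE TABLE employee (
    emp_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    salary REAL
);
-- Internal schema: a physical access structure.
CREATE INDEX employee_name_idx ON employee(name);
-- External schema: a tailored view hiding the salary attribute.
CREATE VIEW employee_public AS
    SELECT emp_id, name FROM employee;
""")
```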

2.3.4 Features/services of database systems

A set of requirements expected to be met by database systems has evolved, and some of the most central features are listed below.

• A database system must be able to store large amounts of data.

• A database system should conceptually organise the data according to an accepted (“standard”) data model, and should allow access to the data through a well defined (“standard”) interface (at the least a data manipulation language (DML) for interfacing to general purpose programming languages), hiding details of the internal data structures from the user. Both interactive interfaces, integrated application development environments and embeddings in the most popular general purpose programming languages are expected. Content-based (associative) retrieval should be provided through set oriented operations, and it should be possible to find related objects by navigating through the structures of the conceptual schema. The data model of the conceptual schema therefore has to be able to represent complex data structures and relationships.

• Metadata, or descriptions of the information present in the database, should be available in the database, both to the DBMS itself and to users through a query interface. The system catalogue (of relational systems) or a data dictionary (an extended system catalogue) have traditionally been used for these purposes.

• It shall be possible to specify constraints on the data, such as domains of attributes, cardinality of relationships, optional or mandatory features, … Once specified, these constraints should be enforced automatically by the database system.


• A database system should provide multiple users with concurrent and controlled access to the data through transaction management [Bernstein87]. Transaction management should provide atomic transactions through the recovery system, and serialisability or other correctness criteria through concurrency control. An atomic transaction should have the ACID transaction properties. ACID stands for: Atomic, Consistency preserving, Isolated and Durable transactions [Elmasri94]. The notion of atomic transactions implies that either the whole transaction (all of the operations) is done or nothing is done. No partial execution of transactions is allowed (a minimal code sketch of an atomic transaction follows this list). A recovery system shall monitor transactions and log all changes made to the database on secure/permanent storage. If the system crashes for some reason, the recovery system will go through this log and bring the database back to a consistent state. This is done by making sure that all changes made by committed transactions (transactions that had finished when the crash occurred) are reflected in the database (REDO-ing changes made by these transactions that are not reflected in the database), while none of the changes made by transactions that were aborted by the system crash are left in the database (UNDO-ing these changes). Serialisability is currently the most recognised correctness criterion for concurrency control mechanisms in database systems. A sequence of database operations belonging to different concurrent transactions is serialisable if the resulting state of the database could have been obtained by performing some serial execution of the involved transactions. Serialisability does not seem to be a good criterion for co-operative work, such as in design and planning. New kinds of concurrency control mechanisms are needed to control the complex interactions between co-operating concurrent processes.

• Multiple views on the data should be supported to provide customised interfaces to the data and to enforce access restrictions, avoiding unauthorised usage of the data.

• Fault tolerance is a desirable feature of database systems containing vital information that has to be kept on-line at all hours. Fault tolerance means that the database system should be able to continue to operate normally (having the complete database available) also in the case of failures. Failures could be a disk crash, memory errors, loss of power, communication failure, a program error, etc. Fault tolerance can be obtained through controlled redundancy. Mirroring of disks can be used to take care of disk crashes; RAID (Redundant Array of Inexpensive/Independent Disks) technology provides the same functionality [Chen94] [Ganger94] [Patterson88]. Duplication can be used for most hardware elements in a database system to provide fault tolerance (processors, communication channels, tape drives, disk drives and controllers).

In addition to these basic features, monitoring of the database (usage statistics) is provided by most commercial database management systems.
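The following minimal sketch of transaction atomicity uses Python's built-in sqlite3 module (the account table is invented for illustration): either both updates become durable at the commit, or the rollback leaves the database unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO account VALUES (1, 100.0), (2, 50.0)")
conn.commit()

try:
    # One atomic transaction: transfer 30 from account 1 to account 2.
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    conn.commit()        # both changes become durable together
except Exception:
    conn.rollback()      # or neither change is applied at all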

2.3.5 Distributed database systems

Distributed database systems are an active area of research [Özsu91] [Garcia-Molina95]. When logically connected data is stored at different sites or computers, many interesting issues arise. Distributed transaction management (atomicity, serialisability, concurrency control, commit protocols), distributed query optimisation, reliability of distributed databases and the use of redundancy are all good examples of the complex problems that are receiving attention in this field [Bernstein87] [Breitbart92] [Ceri88]. Multidatabases or federated database systems are loosely connected database systems where the individual databases could be organised according to different database models, and each database system has a high degree of local autonomy [Hsiao92]. Methods for achieving (transparent) data sharing in this kind of environment are emerging, but still constitute a topic for research [Breitbart92] [Kim95d]. Object-oriented approaches to distributed data management have been proposed, using object-oriented abstractions to specify high-level interfaces to the databases through, for instance, a distributed conceptual schema [Papazoglou90].
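As an illustration of the commit protocols mentioned above, the following sketch shows the basic two-phase commit idea. The participants and method names are invented Python stand-ins; a real protocol also involves forced log writes and timeout handling.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit

    def prepare(self):
        # Phase 1: vote yes or no; a real participant would first
        # force-write a prepare record to its log.
        return self.can_commit

    def commit(self):
        print(self.name, "commit")

    def abort(self):
        print(self.name, "abort")

def two_phase_commit(participants):
    # Phase 1: the coordinator collects votes from all sites.
    if all(p.prepare() for p in participants):
        for p in participants:       # Phase 2: unanimous yes -> commit
            p.commit()
        return True
    for p in participants:           # any no-vote -> global abort
        p.abort()
    return False

two_phase_commit([Participant("site A"), Participant("site B")])        # commits
two_phase_commit([Participant("site A"), Participant("site B", False)]) # aborts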

2.3.6 Database machines

The management of large databases has become a problem in many application areas. This has encouraged research in reliable, high-capacity database systems. Special-purpose database machines (or database computers) [Su88] have come out of this research. One of the most promising approaches is the parallel database machine, where multiple processors co-operate in storing and retrieving data from a shared database (generally distributed over a number of disks). Such architectures are used both to achieve better performance and to improve availability [Kim84]. This research has led to commercial products, among which the Tandem NonStop System was one of the first (the NonStop fault-tolerant architecture came in 1976 [Katzman78], the (distributed) transaction manager ENCOMPASS came a little later [Kim84]). Parallel database machines can provide improved performance [DeWitt85] through distribution and parallel processing, and reliability through duplication of hardware and data. The relational database model has proved itself a good model for parallelisation, and most current parallel database machines are based upon the relational paradigm [Omiecinski95]. In Norway there have been experiments on parallel relational database machines, and several generations of experimental parallel database machines have been built at NTH in Trondheim [Bratbergsengen89].

2.3.7 Status of database systems

Vossen gives a short and readable overview of the status of database systems entering the 1990s [Vossen91]. The following is partly based on his observations. The database systems of the 1980s are good at handling:

• Simply structured data objects (record oriented)
• Simple data types (number, character string, …)
• Short transactions
• High transaction rates
• Frequent in-place updates


New areas of database applications differ significantly from the traditional database application areas, and need support for:

• Complex (evolving) data models
• New data types, for instance spatial data types, such as images and topological structures, with associated data structures and operators

• Integration of very different data types
• Relaxed consistency constraints
• Long transactions with few serious access conflicts (which must lead to re-evaluation of concurrency control and recovery mechanisms)

• Fault tolerance and 100% availability
• High data rates with guaranteed service, as required by for instance video servers
• Extremely low response times, as demanded by real-time applications (“real time DB”)

These features are not well supported by the database systems of the 1980s, and must be given more emphasis in the years to come. Geographical information systems are one example of these “new” application areas.

2.4 Database models

The three-schema architecture’s conceptual schema can presently be specified using three or four major approaches. The different approaches to conceptual schema definition will here be termed database models. The most popular models, up to 1990, have been the two set models (the hierarchical and the network model), the relational model, and recently also object-oriented models.

2.4.1 Hierarchical DBMSs

In the middle of the 1960s the first commercial hierarchical database management systems were on the market, one of them being IMS of IBM (International Business Machines), released in 1968. GIS (General Information System), also of IBM, was a hierarchical query and update system that was out even earlier (1966). There is no formal theory on hierarchical database models, but some common characteristics of the family can be identified [Tsichritzis82] [Elmasri89]. The abstractions used in hierarchical models are records (entities) and parent-child relationships. A parent-child relationship type has one owner record type (parent) and one member record type (child). A record type can act as the owner of many different parent-child relationship types, but can only act as a member of one parent-child relationship type, thereby forming a strict hierarchy. An instance of the parent-child relationship type has a unique owner record (from the owner record type) and zero or many member records (from the member record type).


Figure 2-4 Spatial topology as modelled using a hierarchical diagram.

Hierarchical models support one-to-many (1:N) hierarchical relationships in a natural way, but many-to-many (M:N) relationships and non-hierarchical structures are impossible to handle without introducing some kind of data duplication. N-ary relationships are even more problematic. Virtual records have been introduced to allow other relationship types than one-to-many. Hierarchical data models can be displayed in a hierarchical definition tree [Tsichritzis82], as illustrated by the spatial topology example in Figure 2-4 (spatial topology is described in chapter 4). Virtual record types are shown with a thicker outline, and their real record types are indicated by thin arrowed lines in the figure. Many of the early database systems were hierarchical (Mass General Hospital’s MUMPS from 1966, Informatics’ MARK IV from 1967, IBM’s IMS-2/VS from 1968, Control Data’s MARS from 1969, MRI’s System 2000/S2K from 1970 [Wiederhold81]), and many installations of these systems are still in use. The hierarchical data model’s limited expressiveness makes it inferior to the CODASYL DBTG network model for most non-hierarchical applications. The hierarchical model is optimised for hierarchical structures, and performs well in such settings.
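The strict-hierarchy restriction can be made concrete with a small sketch: each record type may own many parent-child relationship types, but may be the member (child) of at most one. The record type names are invented for illustration and do not come from any of the systems mentioned.

class HierarchicalSchema:
    def __init__(self):
        self.parent_of = {}   # child record type -> its unique parent type

    def add_parent_child(self, parent, child):
        # Enforce the hierarchy rule: a record type has at most one parent.
        if child in self.parent_of:
            raise ValueError(
                "%s is already a member under %s; a record type can be a "
                "member of only one parent-child relationship type"
                % (child, self.parent_of[child]))
        self.parent_of[child] = parent

schema = HierarchicalSchema()
schema.add_parent_child("Department", "Employee")
schema.add_parent_child("Employee", "Dependant")
# schema.add_parent_child("Project", "Employee")  # would raise ValueError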

2.4.2 Network DBMSs

The first network database management system that appeared was Honeywell’s IDS in 1962. This was also the first commercial database management system to appear [Elmasri89]. The first standardisation effort in the field of database systems was done by the CODASYL (Conference on Data System Languages) Data Base Task Group (DBTG). The results of this work were a series of proposals for a standardised interface to database systems (1969, 1971, -73 and -78) [Tsichritzis82]. These proposals have been collectively referred to as the CODASYL network data model. Many database systems that follow this standard have been implemented, and a large number of databases are organised and managed by CODASYL systems.


Figure 2-5 Spatial topology as modelled using a data structure diagram (DBTG network model).

The CODASYL network data model is more conveniently called the network model. The abstractions used in the network model are about the same as the abstractions used in hierarchical models. The DBTG network data model’s set type corresponds to the parent-child relationship type of the hierarchical data model (but should not be confused with a mathematical set). Each set type consists of an owner record type and a member record type. In the network model, a record type can be a member of more than one set type, but a member record can have at most one owner record for each set type. This means that a member record can only take part in one set occurrence for each set type it participates in. Many-to-many (M:N) relationships can therefore only be supported “non-redundantly” by introducing a “dummy” record type between the two participating record types (Island, L-border and R-border in Figure 2-5 are examples of such “dummy” record types). In addition, the network model supports the relationships available in the hierarchical model. Network models can be represented graphically using data structure diagrams, also called Bachman diagrams [Elmasri89]. Spatial topology as modelled in a data structure diagram is shown in Figure 2-5. The CODASYL proposals include a DDL (data definition language) to describe the database structure textually and a navigational DML (data manipulation language) to query and modify the database. The notions of user work area (UWA) and currency indicators are introduced to facilitate programming language interfaces and database navigation. The NDL (Network Definition Language) standard for network languages was proposed by ANSI in 1985 [Elmasri89]. After the CODASYL DBTG report in 1971 [CODASYL71], several commercial products were developed (Honeywell’s IDS II, Burroughs’ DMS II, Univac’s DMS1100, DEC’s DBMS10 and 11, HP’s IMAGE, Cullinet’s IDMS [Wiederhold81]). The network data model is very good at navigation, that is, one-item-at-a-time retrieval. It was not made for set-based retrieval, and is not very good at this. The fixed structure dictated by the model makes it painful to change the schema. This means that the model is too rigid for applications in dynamic environments. Distribution and parallelisation have not been considered useful or feasible using the network and hierarchical data models. For the many organisations that have requirements that suit this technology, the robust network database systems are still among the most powerful. Early in the 1990s, a large part of production database systems are network systems, but their share of the database market seems to be decreasing.
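The “dummy” record construction can be sketched as follows. The code is an illustration only, with invented names, and does not reproduce the DBTG DML: a polygon–edge many-to-many relationship is resolved into two 1:N set types, with one dummy record per (polygon, edge) pair.

class Record:
    def __init__(self, name):
        self.name = name
        self.members = {}            # set type -> list of member records

    def insert(self, set_type, member):
        self.members.setdefault(set_type, []).append(member)

p1, p2 = Record("polygon 1"), Record("polygon 2")
e1 = Record("edge 1")                # an edge bordering both polygons

# One dummy "border" record per (polygon, edge) pair; each dummy record
# takes part in exactly one set occurrence per set type, as the model
# requires.
for polygon in (p1, p2):
    border = Record(f"border({polygon.name}, {e1.name})")
    polygon.insert("POLYGON-BORDER", border)   # owner: the polygon
    e1.insert("EDGE-BORDER", border)           # owner: the edge

# Navigation, one record at a time, in the spirit of the DBTG model:
for border in e1.members["EDGE-BORDER"]:
    print(border.name)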


2.4.3 Relational DBMSs

The relational data model, introduced by Codd [Codd70], is a database model that builds on the mathematical concepts of sets and relations. Functional dependencies and keys are two other concepts that are important in the modelling and design of relational databases.

• The properties of sets important to the relational model are the following: Duplicates are not allowed in a set, and a set imposes no ordering on its members.

• A relation establishes a connection between an arbitrary number of domains (n-ary relations are relations which include n domains). The elements of a relation are represented as tuples. A tuple is a collection that contains one instance of each of the domains participating in the relation. The tuples of a relation are organised as unordered rows in a two-dimensional table.

• Functional dependencies. If, in a relation R, a set of attributes, B, is functionally dependent on a set of attributes, A, this means that if two tuples of R have the same value for A, they must also have the same value for B.

• Keys. A key of a relation is a minimal set of attributes that functionally determines all the attributes of the relation (since duplicates are not allowed, no two tuples in a relation can have the same key value). A relation can have many keys (e.g. the set of all attributes of a relation makes up a key), in which case one of them is chosen as the primary key.

Relations are created to describe relevant features of the phenomena being modelled. These features include relationships between phenomena in addition to the individual phenomena with their characteristics/attributes. A person could, in the relational model, be described by attributes such as name, date of birth and colour of the eyes, and by relationships to other phenomena such as father, mother, employer and place of living. All these properties can be grouped together into an (unnormalised) person relation in the relational model. Relations are used to store most of the system information in a relational database system. A table is established for each relation in the (normalised) data model.

Operations in the relational model are defined in the relational algebra or the relational calculus. The relational algebra consists of the relational operators selection (σ), a set operation that retrieves tuples based on the values of the attributes of a relation; projection (π), which picks out certain domains/attributes/columns from a relation; and join (⋈), a sophistication of the cartesian product, where two relations are combined into a new relation on the basis of the values of some common domain(s), the new relation consisting of all the domains of the original relations. In the new relation, a row from the first relation is combined with all the rows of the second relation that satisfy the condition on the join attributes. Natural join (*) is an equi-join (the condition on the join attributes is equality) where the join domains are not duplicated. In addition, the general set operations union (∪), intersection (∩) and difference (−) are available in the relational model.
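To make the algebra concrete, the following minimal sketch (an illustration only; relation and attribute names are invented) implements selection, projection and natural join over relations represented as lists of Python dictionaries, one dictionary per tuple.

def select(relation, predicate):
    """Selection (sigma): keep the tuples satisfying the predicate."""
    return [t for t in relation if predicate(t)]

def project(relation, attributes):
    """Projection (pi): keep the named attributes, dropping duplicates."""
    seen, out = set(), []
    for t in relation:
        row = tuple((a, t[a]) for a in attributes)
        if row not in seen:
            seen.add(row)
            out.append(dict(row))
    return out

def natural_join(r, s):
    """Combine tuples that agree on all common attributes."""
    common = set(r[0]) & set(s[0]) if r and s else set()
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in common)]

person = [{'name': 'Ada', 'city': 'Oslo'}, {'name': 'Tor', 'city': 'Ås'}]
city = [{'city': 'Oslo', 'pop': 500000}, {'city': 'Ås', 'pop': 15000}]

# "Names of persons living in cities with more than 100000 inhabitants":
print(project(select(natural_join(person, city),
                     lambda t: t['pop'] > 100000), ['name']))
# [{'name': 'Ada'}]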


The relational calculus is related to first-order predicate calculus, using the logical symbols ∧, ∨, ¬, ∀, ∃ (and, or, not, for all, exists). In tuple relational calculus the variables have tuples as their range, while in domain relational calculus the variables have attribute value domains as their range. πA,B(σC=1(R)) in the relational algebra is equivalent to {t.A, t.B | R(t) ∧ t.C=1} in the tuple relational calculus and {A, B | (∃C) (R(ABC) ∧ C=1)} in the domain relational calculus. Both SQL (Structured Query Language) and QUEL (the query language of the INGRES database management system) are related to the tuple relational calculus. QUEL is much more closely related to the relational calculus than SQL is [Elmasri89]. QBE (Query By Example) is related to the domain relational calculus.

Normalisation

To avoid the problems that duplication of information can introduce, normalisation is performed on relational data models before realisation in a database system [Date86] [Elmasri89]. A measure of a relational design is provided by the normal form metric, describing the properties of a relational design. Normal forms were introduced by Codd in 1971-1972 [Tsichritzis82]. In this first effort, a series of three normal forms was defined. The notion of functional dependency as introduced in Codd’s original paper [Codd70] is very important for specifying these original normal forms.

• The first normal form (1NF) requires that all attributes in a relational scheme are atomic (no group of values is allowed for a single attribute).

• The second normal form (2NF) requires that the relation is in first normal form, and that all attributes that are not part of the primary key shall be functionally dependent on the primary key of the relation, but not functionally dependent on a subset of the primary key.

• The third normal form (3NF) requires that the relation is in second normal form, and that no transitive functional dependencies exist in the relation.

Further normal forms have been specified since then, among them Boyce-Codd normal form (BCNF, which is stronger than 3NF), 4NF (introducing multi-valued dependencies) and 5NF (introducing join dependencies). The more normalised a relational schema is, the more well-behaved it will be in the case of queries and updates. A relational schema can be normalised by splitting the relations that violate the conditions of normalisation. For some kinds of normalisation, it is not possible to split relations without losing functional dependencies or introducing replication. There is also a penalty on splitting relations, because of all the joins that must be performed to reconstruct the universal relation (a relation consisting of all the attributes of the relational schema). The choice of how far to normalise will depend on the application.
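As a worked example (the relation and its attribute names are invented for illustration), consider PARCEL-OWNER(parcel, owner, owner_address) with key (parcel, owner): owner_address depends on owner alone, violating 2NF, and splitting removes the redundancy while a natural join restores the original relation losslessly.

parcel_owner = [
    {'parcel': 'P1', 'owner': 'Ada', 'owner_address': 'Oslo'},
    {'parcel': 'P2', 'owner': 'Ada', 'owner_address': 'Oslo'},  # address repeated
    {'parcel': 'P2', 'owner': 'Tor', 'owner_address': 'Ås'},
]

# Decompose into OWNS(parcel, owner) and OWNER(owner, owner_address):
owns = [{'parcel': t['parcel'], 'owner': t['owner']} for t in parcel_owner]
owner = []
for t in parcel_owner:
    row = {'owner': t['owner'], 'owner_address': t['owner_address']}
    if row not in owner:             # each owner's address stored once
        owner.append(row)

def njoin(r, s):
    """Natural join: combine tuples that agree on common attributes."""
    common = set(r[0]) & set(s[0])
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in common)]

# The decomposition is lossless: the join restores the original relation.
rejoined = njoin(owns, owner)
assert all(t in rejoined for t in parcel_owner)
assert len(rejoined) == len(parcel_owner)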

SQL

SQL (Structured Query Language) is the standard interface to relational databases. The SQL “standard” has been enhanced in a stepwise fashion to meet new user requirements [Melton90].

The traditional SQL (SQL-86 and SQL-89) data types are INTEGER, SMALLINT, CHARACTER, DECIMAL, NUMERIC, REAL, FLOAT and DOUBLE PRECISION. The SQL2 standard includes commonly available extensions such as CHARACTER VARYING, DATE, TIME, BIT, TIMESTAMP, INTERVAL and BIT VARYING [Melton90]. Traditional SQL uses the following operators: SELECT, INSERT, UPDATE, DELETE, join (a join between the tables X and Y on the column COL is specified by the condition X.COL=Y.COL), project (the columns in the projection are specified in the SELECT part of the query), UNION, comparison (=, ~=, >, >=, ~>, …
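The following small illustration runs these operators through Python’s built-in sqlite3 module; the tables and values are invented for the example, and the SQLite dialect differs in details from SQL-86. Note how a single SELECT statement combines the algebra’s selection, projection and join.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parcel (pid INTEGER, owner_id INTEGER, area REAL);
    CREATE TABLE owner  (oid INTEGER, name CHARACTER(20));
    INSERT INTO parcel VALUES (1, 10, 1200.0), (2, 11, 300.5);
    INSERT INTO owner  VALUES (10, 'Ada'), (11, 'Tor');
""")

-- = comment syntax of SQL; the query below joins on parcel.owner_id=owner.oid,
# projects name and area, and selects the large parcels:
for row in con.execute("""
        SELECT owner.name, parcel.area
        FROM parcel, owner
        WHERE parcel.owner_id = owner.oid AND parcel.area > 500"""):
    print(row)                        # ('Ada', 1200.0)

con.execute("UPDATE parcel SET area = 310.0 WHERE pid = 2")
con.execute("DELETE FROM parcel WHERE pid = 1")
con.commit()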
