REPLICATION IN DISTRIBUTED MANAGEMENT SYSTEMS

EVANGELOS GRIGORIOS KOTSAKIS

Telford Research Institute
Department of Electrical and Electronic Engineering

The University of Salford

Submitted in Partial Fulfilment of the Requirements for the Degree of Doctor of Philosophy

1998

This thesis is dedicated to my daughter Dimitra

Η διατριβή αυτή αφιερώνεται στην κόρη μου Δήμητρα


TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
ABBREVIATIONS
ABSTRACT

1. INTRODUCTION
   1.1 DISTRIBUTED MANAGEMENT SYSTEMS
   1.2 REPLICATION ON A DISTRIBUTED MIB
   1.3 THE WORK
   1.4 ROAD MAP OF THE THESIS

2. REPLICATION MANAGEMENT SYSTEM ARCHITECTURE
   2.1 MANAGEMENT FUNCTIONAL AREAS
   2.2 MANAGEMENT ARCHITECTURAL MODEL
   2.3 PROTOCOLS FOR CONTROLLING MANAGEMENT INFORMATION
       2.3.1 OSI Management Framework
       2.3.2 Internet Network Management
   2.4 OBJECT ORIENTED MIB MODELLING
   2.5 DISTRIBUTED MANAGEMENT INFORMATION BASE (MIB)
   2.6 DISTRIBUTED NETWORK MANAGEMENT
   2.7 CORBA SYSTEM
   2.8 IMPLEMENTING OSI MANAGEMENT SERVICES FOR TMN
   2.9 REPLICATION IN A MANAGEMENT SYSTEM
   2.10 NEED FOR REPLICATION TECHNIQUES IN A MANAGEMENT SYSTEM
   2.11 SYNCHRONOUS AND ASYNCHRONOUS REPLICA MODELS
   2.12 REPLICATION TRANSPARENCY AND ARCHITECTURAL MODEL
   2.13 SUMMARY

3. FAILURES IN A MANAGEMENT SYSTEM
   3.1 DEPENDABILITY BETWEEN AGENTS
   3.2 FAILURE CLASSIFICATION
   3.3 FAULTY AGENT BEHAVIOUR
   3.4 FAILURE SEMANTICS
   3.5 FAILURE MASKING
   3.6 ARCHITECTURAL ISSUES
   3.7 GROUP SYNCHRONISATION
       3.7.1 Close Synchronisation
       3.7.2 Loose Synchronisation
   3.8 GROUP SIZE
   3.9 GROUP COMMUNICATION
   3.10 AVAILABILITY POLICY
   3.11 GROUP MEMBER AGREEMENT
   3.12 SUMMARY

4. REPLICA CONTROL PROTOCOLS
   4.1 PARTITIONING IN A REPLICATION SYSTEM
   4.2 CORRECTNESS IN REPLICATION
   4.3 TRANSACTION PROCESSING DURING PARTITIONING
   4.4 PARTITION PROCESSING STRATEGY
   4.5 AN ABSTRACT MODEL FOR STUDYING REPLICATION ALGORITHMS
   4.6 PRIMARY SITE PROTOCOL
   4.7 VOTING ALGORITHMS
       4.7.1 Majority Consensus Algorithm
       4.7.2 Voting With Witnesses
       4.7.3 Dynamic Voting
       4.7.4 Dynamic Majority Consensus Algorithm (DMCA) - A Novel Approach
   4.8 SUMMARY

5. ANALYSIS AND DESIGN OF THE SOFTWARE SIMULATION
   5.1 INTRODUCTION TO SIMULATION MODELLING
   5.2 USING AN OBJECT-ORIENTED TECHNIQUE FOR MODELLING A SIMULATION SYSTEM
   5.3 OBJECT ORIENTED DISCRETE EVENT SIMULATION
   5.4 THE SIMULATION MODELLING PROCESS
       5.4.1 Problem Formulation
       5.4.2 Model Implementation
   5.5 OBJECT ORIENTED ANALYSIS AND DESIGN
       5.5.1 Analysis
       5.5.2 Design
       5.5.3 Implementation
   5.6 ATS REQUIREMENTS
   5.7 ATS ANALYSIS
       5.7.1 Object Model
   5.8 DYNAMIC MODEL
   5.9 EVALUATION OF THE SYSTEM
   5.10 SUMMARY

6. SIMULATION AND ESTIMATION OF REPLICA CONTROL PROTOCOLS
   6.1 PERFORMANCE EVALUATION
   6.2 THE SIMULATION MODEL
   6.3 FAULT INJECTION
   6.4 SIMULATED ALGORITHMS
   6.5 THE PROTOCOLS' ROUTINES
   6.6 IMPLEMENTING GROUP COMMUNICATION
   6.7 FUNCTIONAL COMPONENTS OF THE SIMULATION
   6.8 PARAMETERS OF THE SIMULATION
   6.9 AVAILABILITY AND THE CONTRIBUTION OF THE DMCA ALGORITHM
   6.10 RESULTS OF THE SIMULATION
   6.11 SUMMARY

7. CONCLUSIONS
   7.1 CONTRIBUTIONS OF THIS WORK
   7.2 FUTURE RESEARCH DIRECTION
   7.3 CONCLUDING REMARKS

APPENDIX-A PAPERS
APPENDIX-B TABLES
APPENDIX-C SOURCE CODE
LIST OF REFERENCES

LIST OF FIGURES

FIGURE 2-1: BASIC MANAGEMENT MODEL
FIGURE 2-2: VIEWS OF SHARED MANAGEMENT KNOWLEDGE
FIGURE 2-3: SIMPLIFIED MANAGEMENT SYSTEM
FIGURE 2-4: NETWORK MANAGEMENT APPROACHES (A) CENTRALISED (B) PLATFORM BASED (C) HIERARCHICAL (D) DISTRIBUTED
FIGURE 2-5: INTER-WORKING TMN
FIGURE 2-6: (A) REPLICATION (B) NO REPLICATION
FIGURE 2-7: NETWORK MANAGEMENT REPLICATION EXAMPLE
FIGURE 2-8: SYNCHRONOUS REPLICATION
FIGURE 2-9: ARCHITECTURAL MODEL FOR REPLICATION (A) NON-TRANSPARENT SYSTEM (B) TRANSPARENT REPLICATION SYSTEM (C) LAZY REPLICATION (D) PRIMARY COPY MODEL
FIGURE 3-1: RELATIONSHIP BETWEEN USER AND RESOURCE
FIGURE 3-2: FAILURE MASKING
FIGURE 3-3: GROUP MASKING
FIGURE 4-1: REPLICATION ANOMALY CAUSED BY CONFLICTING WRITE OPERATIONS (A) BEFORE ISOLATION (B) AFTER ISOLATION
FIGURE 4-2: LOGICAL AND PHYSICAL OBJECTS OF THE SENSOR ENTITY
FIGURE 4-3: REPLICATION USING THE PRIMARY SITE ALGORITHM
FIGURE 4-4: READ IN A PRIMARY SITE PROTOCOL
FIGURE 4-5: WRITE IN A PRIMARY SITE PROTOCOL
FIGURE 4-6: MAKE CURRENT IN A PRIMARY SITE PROTOCOL
FIGURE 4-7: READ IN A MAJORITY CONSENSUS ALGORITHM
FIGURE 4-8: WRITE IN A MAJORITY CONSENSUS ALGORITHM
FIGURE 4-9: MAKE CURRENT IN A MAJORITY CONSENSUS ALGORITHM
FIGURE 4-10: ISMAJORITY IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-11: READ FUNCTION IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-12: WRITE (UPDATE) IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-13: UPDATE IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-14: MAKE CURRENT IN THE DYNAMIC VOTING PROTOCOL
FIGURE 4-15: READPERMITTED IN THE DMCA
FIGURE 4-16: WRITEPERMITTED FUNCTION IN THE DMCA
FIGURE 4-17: DOREAD FUNCTION IN THE DMCA
FIGURE 4-18: DOWRITE FUNCTION IN THE DMCA
FIGURE 4-19: MAKE CURRENT FUNCTION IN THE DMCA
FIGURE 4-20: SEQUENCE DIAGRAM FOR DOREAD OPERATION
FIGURE 4-21: SEQUENCE DIAGRAM FOR DOWRITE OPERATION
FIGURE 4-22: SEQUENCE DIAGRAM FOR MAKECURRENT OPERATION
FIGURE 5-1: ATS PROCESS DIAGRAM
FIGURE 5-2: ATS OBJECT MODEL
FIGURE 5-3: ATS DYNAMIC MODEL
FIGURE 6-1: NETWORK MODEL
FIGURE 6-2: FAULT INJECTION SYSTEM
FIGURE 6-3: COMPONENTS OF THE SIMULATION MODEL
FIGURE 6-4: AVAILABILITY CURVE
FIGURE 6-5: TOTAL AVAILABILITY FOR Λ=4
FIGURE 6-6: BOUNDARIES OF TOTAL AVAILABILITY
FIGURE 6-7: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.1 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-8: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.2 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-9: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.3 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-10: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.4 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-11: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=0.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-12: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=1.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-13: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=1.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-14: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=2.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-15: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=2.5 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY
FIGURE 6-16: AVAILABILITY PROVIDED BY THE TESTED REPLICA CONTROL PROTOCOLS FOR MEAN REPAIR DELAY=3.0 (A) READ AVAILABILITY (B) WRITE AVAILABILITY (C) TOTAL AVAILABILITY

LIST OF TABLES

TABLE 2-1: CMISE SERVICES AND FUNCTIONS
TABLE 2-2: SNMP SERVICES AND FUNCTIONS
TABLE 4-1: DMCA MAPPING
TABLE 6-1: SIMULATION PARAMETERS

ACKNOWLEDGEMENTS

I would like to thank my supervisor, Dr. B. H. Pardoe, who has assisted and guided me in preparing this thesis. His kind assistance in my struggles with the English language was very helpful, and he has always been willing to answer technical questions and to provide me with the information and knowledge needed to cope with the difficult task of preparing a Ph.D. thesis. Without his kind help and encouragement, this research would never have been done.

I would especially like to express sincere appreciation for the financial support, encouragement and love given to me by my parents, Grigorios and Dimitra Kotsakis. They have given me much more than moral and material support throughout my university studies; they provided a rock-solid support system that proved invaluable. Without their support and love, this research would never have been done.

Special thanks are due to my wife Chaido for encouraging me over the last three years. I would like to thank her for her unconditional love, unfailing enthusiasm, unending optimism and confidence in my abilities. Her patience and support are boundless.

Finally, I would like to thank little Dimitra, whose birth two years ago gave me great joy, for being quiet while I was writing this thesis.


ABBREVIATIONS

ANSA      Advanced Network System Architecture
ATM       Asynchronous Transfer Mode
ATS       Availability Testing System
CMIS      Common Management Information Service
CMISE     Common Management Information Service Element
DMCA      Dynamic Majority Consensus Algorithm
FE        Front End
IP        Internet Protocol
ISO       International Standards Organisation
LAN       Local Area Network
MIB       Management Information Base
OMT       Object Modelling Technique
OSF/DCE   Open Software Foundation / Distributed Computing Environment
OSI       Open System Interconnection
ROSE      Remote Operation Service Element
SNA       Systems Network Architecture
SNMP      Simple Network Management Protocol
TCP       Transmission Control Protocol
UDP       User Datagram Protocol
WAN       Wide Area Network

ABSTRACT

Systems management is concerned with supervising and controlling a system so that it fulfils its operational requirements. The management of a system may be performed by a mixture of human and automated components. Management information is held as abstract representations of network resources known as managed objects, and a distributed management system may be viewed as a collection of such objects located at different sites in a network. Replication is a technique used in distributed systems to improve the availability of vital data components and to increase system performance, since a particular object may be accessed at multiple sites concurrently. By applying replication in a distributed management system, certain managed objects can be located at multiple sites by copying their internal data and the operations used to access or update those data. This is a great advantage: it increases reliability, availability and fault tolerance, and allows data sharing, thereby improving system performance. This thesis is concerned with methods for applying replication in such a system, and with replica control algorithms for coordinating operations on a replicated managed object. Several replication architectures are examined and the availability provided by each is discussed. A new replica control algorithm is proposed as a means of providing higher availability. A tool for evaluating the availability provided by a replica control algorithm is designed and proposed as a benchmark utility for examining the suitability of particular replica control algorithms.


1. INTRODUCTION

The importance of replication techniques for providing high availability in distributed systems has been recognised for over two decades, but their use in network management systems has been minimal. One reason is that the machinery needed to cope with network partitions and reunions is excessively complex. This thesis addresses the methods that may safely be used for replicating managed objects, and it shows how replication techniques can be used in a management system while preserving usability and availability. The rest of this chapter discusses replication techniques and the related problems. It begins with a discussion of network management systems and the use of replication in a distributed Management Information Base (MIB) to improve the availability of managed objects. It then states the goal of this thesis and concludes with a road map for the rest of the thesis.

1.1 Distributed Management Systems

The management of a communication environment is a distributed information processing application in which individual components of the management activity are associated with network resources. Management applications perform the management activities in a distributed and consistent manner that guarantees transparency and system operability. Management information is stored in a special database known as the Management Information Base (MIB). The MIB is the conceptual repository of the management information; each object stored in the MIB is associated with an individual network resource, and its attributes represent network activity. When the MIB is distributed over several sites, one site may fail while the other sites continue to operate. Distributed MIBs may also increase performance, since managed objects located at different hosts may be accessed concurrently.

A fundamental problem with a distributed MIB is data availability. Since managed objects are stored on separate machines, a server crash or a network failure that partitions a client from a server can prevent a manager from accessing managed objects. Such situations are very frustrating to a manager, because they impede computation even though client resources are still available. The problem of object availability increases over time for two reasons:

1. The frequency of network failures will increase. Networks are getting larger: they cover wider geographical areas, encompass multiple administrative boundaries and consist of multiple sub-networks joined by routers and bridges. Furthermore, there is an increased need for better network resource management, which in turn demands higher availability of managed objects and better management performance.

2. The introduction of mobile network managers will increase the number of occasions on which management agencies are inaccessible. Wireless technologies such as packet radio suffer from inherent limitations such as short range and line-of-sight operation. Because of these limitations, the network connections between management agents and mobile managers will exhibit frequent partitions.

1.2 Replication on a Distributed MIB

Replication is a technique used in distributed operating systems and distributed databases to improve the availability of system resources. In the case of the MIB, replication can be used to increase the performance of management activities and to provide high availability of managed objects. Replicating the same management


object at different sites can improve availability remarkably, because the system can continue to operate as long as at least one site is up. It also improves the performance of global retrieval queries, because the result of such a query can be obtained locally from any site; hence a retrieval query can be processed at the local site where it is submitted. To deal with replicated objects in a management information base, a control method is needed to keep all the replicas in a consistent state, even during partitioning. The techniques proposed for assuring consistency may be divided into two families: those based on a distinguished copy and those based on voting. The former designate one copy of each replicated object and direct all requests to the site that holds that copy (ALSBERG 1976, BERNSTEIN 1987, GARCIA 1982). Voting replica control algorithms are more promising. They do not use a distinguished copy; instead, a request is sent to all sites that hold a copy of the replicated object, and access to a particular copy is granted if a majority of votes is collected (GIFFORD 1979, JAJODIA 1989, KOTSAKIS 1996b). Voting algorithms are fully distributed concurrency control algorithms and exhibit greater flexibility than those based on a distinguished copy. Although voting algorithms pass many messages among sites, good performance can be expected if the round-trip time is short. Today's technology can reduce the round-trip time through the use of high-speed networks such as ATM, so messages may be transferred from one machine to another faster and more reliably.

In the voting scheme, replicated objects can be accessed in the partition group that obtains a majority of votes. In the distinguished (primary) copy scheme, availability is significantly limited in the case of a network or link failure; primary copy algorithms exhibit good behaviour only for site failures. Voting algorithms, on the other hand, provide higher availability, tolerating both network and site failures. Voting algorithms guarantee consistency at the expense of availability. To provide higher availability, one may either use a consistency relaxation technique that allows concurrent access to replicated objects across different partitions (an optimistic control algorithm) or improve the existing pessimistic voting algorithms by forming more sophisticated schemes. Optimistic control algorithms must be supported by an extra mechanism to detect and reconcile diverging replicas once the partition groups are reconnected. This complicates the replication control task and allows, at least for a short interval, inconsistency between replicas. Such an approach requires a long time to restore the state of the database after a site failure, and it does not seem appropriate for databases such as those used to store management information. Therefore, a more sophisticated replica control algorithm based on voting is a promising approach that may provide higher availability while preserving strong consistency between replicated objects. The following questions are therefore addressed in this thesis:

• Can the availability of managed objects be improved further by utilising voting techniques?

• Can replication be used effectively in a distributed MIB to ensure fault tolerance in a management system?
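The majority-voting idea described above can be sketched in a few lines. The following is a simplified, hypothetical model (the names `Replica`, `do_read` and `do_write` are illustrative and this is not the thesis's DMCA): each site holds a versioned copy of the managed object, and an operation proceeds only when more than half of the copies are reachable.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Replica:
    up: bool = True        # site reachable in the current partition
    version: int = 0       # version number of the local copy
    value: Any = None      # replicated managed-object state

def majority(replicas):
    """Return the reachable copies if they form a majority, else None."""
    alive = [r for r in replicas if r.up]
    return alive if len(alive) > len(replicas) // 2 else None

def do_read(replicas):
    quorum = majority(replicas)
    if quorum is None:
        raise RuntimeError("read rejected: no majority partition")
    # read the most current copy within the quorum
    return max(quorum, key=lambda r: r.version).value

def do_write(replicas, value):
    quorum = majority(replicas)
    if quorum is None:
        raise RuntimeError("write rejected: no majority partition")
    new_version = max(r.version for r in quorum) + 1
    for r in quorum:       # install the update on every reachable copy
        r.version, r.value = new_version, value

# Five copies; two sites are cut off, yet three still form a majority.
group = [Replica() for _ in range(5)]
group[3].up = group[4].up = False
do_write(group, "linkStatus=down")
print(do_read(group))      # prints: linkStatus=down
```

A distinguished-copy scheme, by contrast, forwards every request to the single primary site, so a partition that isolates the primary blocks all access even when most copies survive.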

1.3 The Work
The goal of this work is to investigate the potential use of replication in a network management system and to examine the practical aspects of applying such a technique in real-time systems. Intuitively, this work appears viable for the following reasons.


1. There is a great proliferation of different network technologies and a great need for network management. Keeping management information available and in a consistent state is of great importance, since network operability depends on management activities.
2. Availability of network information may be obtained only by applying redundancy. Replication is one of the most widely used techniques that ensures high availability while keeping the replicated objects in a consistent state.
3. The development of replication schemes for ensuring higher object availability and for tolerating site and communication failures is very promising and should be studied further.
4. Failures always happen. No system can work forever within its specifications. Exogenous or endogenous factors can affect the operability of the system, causing temporary or permanent failures.
5. The need for developing fault-tolerant techniques for network management systems is of great importance.

1.4 Road Map of the Thesis
The rest of the thesis consists of six chapters. Chapter 2 describes systems management and discusses the architectural models of management systems. It examines object replication in terms of system performance and availability and illustrates the replication architectural models for managing failures.


Chapter 3 discusses the nature of failures in a management system. It decomposes a management system into management agents and defines the concept of dependability between agents. It also classifies certain failures according to their disruptive behaviour and ends by specifying the architectural impact of a group of agents on the availability of replicated objects.

Chapter 4 presents the correctness criteria that should be taken into account when designing a replication system. It also introduces an abstract model in order to study certain replica control algorithms formally. It then presents a variety of replica control algorithms. It ends with a thorough discussion of the DMCA (Dynamic Majority Consensus Algorithm), a novel approach that enriches current knowledge of replication techniques and improves the overall management of replicated objects by providing higher availability.

Chapter 5 mainly evaluates the DMCA algorithm and presents quantitative results regarding the availability provided by the DMCA. It starts by specifying how the performance of certain replica control protocols can be measured and introduces the simulation model used to build the benchmark utility ATS (Availability Testing System), which estimates the effectiveness of the algorithms. It also describes the fault injection mechanism for generating faults and repairs. It ends with a thorough discussion of the simulation results, justifying the superiority of the DMCA.

Chapter 6 presents the object-oriented development process of the ATS tool. It first discusses the advantages of using object-oriented technology to develop such a complex system and then presents the static object model and the dynamic model of the ATS.


The thesis concludes with Chapter 7, which presents the contributions and includes a discussion of future work and a summary of key results.


2. REPLICATION MANAGEMENT SYSTEM ARCHITECTURE

This chapter introduces the fundamental idea behind systems management and illustrates the main features of a management system. It provides a brief discussion of the architectural model of a management system and introduces the concept of the distributed MIB as a naturally distributed database. It highlights issues related to object-oriented MIB modelling and the definition of managed objects. It also justifies the use of object replication in terms of system performance, data reliability and availability. Finally, it discusses the types of failure that may occur in a management system as well as the replication architectural models that may be used to maintain multiple replicas.

2.1 Management Functional Areas
Management of a system is concerned with supervising and controlling the system so that it fulfils its operational requirements. To facilitate the management task, Open Systems Interconnection (OSI) divides the management design process into five areas known as the OSI management functional areas (ISO 1989). The fundamental objectives of the OSI functional areas are to fulfil the following goals.
1. To maintain proper operation of a complex network (fault management).
2. To maintain internal accounting procedures (accounting management).
3. To maintain procedures regarding the configuration of a network or a distributed processing system (configuration management).


Figure 2-1: Basic Management model

4. To provide the capability of performance evaluation (performance management).
5. To allow authorised access-control information to be maintained and distributed across a management domain (security management).

In other words, a network that has a management system must be able to manage its own operations, performance, failures, modifications, security and hardware/software configuration. To fulfil these requirements, it is necessary to develop a management model capable of incorporating the vast range of services covered by the specifications of the OSI functional areas. The actual architecture of a network management model varies greatly, depending on the functionality of the platform and the details of the network management capability. A management architectural model proposed by OSI defines the fundamental concepts of systems management (ISO 1992). This model describes the information, functional, communication and organisational aspects of systems management.


2.2 Management Architectural Model
The management of a communication environment is an information processing application. Because the system being managed is distributed, the individual components of the management activities are themselves distributed. Management applications perform the management activities in a distributed manner by establishing associations between management entities. As shown in Figure 2-1, there are two fundamental types of entity that exchange management information: one takes the manager role and the other the agent role. An entity plays the manager role when it generates queries to obtain management information. An entity plays the agent role when it accepts those queries, returns responses to the manager and generates notifications regarding the state of the objects located in its domain. An agent performs management operations on managed objects as a consequence of its communication with the manager. A manager may be seen as the part of the distributed application that is responsible for generating messages related to one or more management activities (collecting information, controlling the state of remote objects, changing the configuration of managed devices, etc.). The agent, on the other hand, may be viewed as a local correspondent of the manager in the managed system, controlling access to the managed objects and looking after the distribution of events occurring in the managed system. As Figure 2-1 shows, three kinds of messages are transferred between manager and agent:
• request messages, transferred from the manager to the agent
• response messages, which are bi-directional
• notification messages, transferred from the agent to the manager


The database of systems management information, called the Management Information Base (MIB), is associated with both the manager and the agent. The MIB is the conceptual repository of the management information stored in an OSI-based network management system. The definition of the MIB describes the conceptual schema containing information about managed objects and the relations between them. It actually defines the set of all managed objects visible to a network management entity. The MIB may be viewed as the interface definition: it defines a conceptual schema which contains information about specific managed objects, which are instantiations of managed object classes. The schema also embodies relationships between these managed objects, specifies the operations which may be performed on them and describes the notifications which they may emit (ISO 1993).

2.3 Protocols for Controlling Management Information
All types of management exchanges consist of requests and/or responses. There are currently two basic architectural frameworks related to the standardisation of the exchange of messages passed between managers and agents: the OSI management framework (ISO 1989) and the Internet network management framework (CERF 1988).

2.3.1 OSI Management Framework
The Common Management Information Protocol (CMIP) is designed to convey requests and responses between managers and agents (ISO 1991a). CMIP offers specific systems management functions and services for the remote handling of management data. CMIP uses the services offered by the Remote Operation Service Element (ROSE) (ISO 1988) in order to perform the create, set, delete, get, action and event-report operations. The Common Management Information Service Element (CMISE) is the standardised application service element that is used to exchange management information in the form of requests and/or responses (ISO 1991). The CMISE is the basic vehicle that provides individual management applications with the means of executing management operations on objects and issuing notifications. The CMISE supports distributed management operations using application associations. The CMISE services shown in Table 2-1 constitute the kernel functional unit of the CMISE. A system supporting CMIP must implement the kernel functional unit of the CMISE.

Table 2-1: CMISE services and functions

                 Service           Type   Function
 Notifications   M_EVENT_REPORT    C/NC   Gives notification of an event occurring on a managed object
 Operations      M_GET             C      Request for management data
                 M_SET             C/NC   Modification of management data
                 M_ACTION          C/NC   Action execution on a managed object
                 M_CREATE          C      Creation of a managed object
                 M_DELETE          C      Deletion of a managed object

C = Confirmed, NC = Not Confirmed; M stands for management

2.3.2 Internet Network Management
The Simple Network Management Protocol (SNMP) (CASE 1990) is used to convey management information in an Internet management system, just as CMIP is used in an OSI management system. SNMP includes a limited set of management requests and responses. The managing system issues get, get_next and set requests to retrieve single or multiple objects or to set the value of a single object. The managed system sends a response to complete the get, get_next or set request. The managed system also sends an event notification, called a trap, to the managing system to identify the occurrence of an event. Table 2-2 lists the SNMP request and response messages along with their types and functions.

Table 2-2: SNMP services and functions

                 Service           Type   Function
 Notifications   Trap              C/NC   An agent sends a trap to alert the manager that an event has occurred
 Operations      GetRequest        C      Retrieves the state of a single object
                 GetNextRequest    C      Retrieves the state of the next object in a sequence of objects
                 GetResponse       NC     Response sent by the agent to the manager
                 SetRequest        C      Sets the state of a managed object

C = Confirmed, NC = Not Confirmed
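The request/response pattern of Table 2-2 can be sketched in a few lines. The fragment below is a self-contained illustration, not an SNMP implementation: the agent holds its MIB as a mapping ordered by object identifier, get answers a GetRequest, get_next walks to the lexicographically next OID, and set installs a value. The OIDs and values used are invented for the example.

```python
class Agent:
    """Toy stand-in for an SNMP agent: a MIB kept in OID order."""

    def __init__(self, mib):
        # OIDs are tuples of integers, so sorting gives lexicographic order
        self.mib = dict(sorted(mib.items()))

    def get(self, oid):
        """GetRequest: retrieve the state of a single object."""
        return self.mib[oid]

    def get_next(self, oid):
        """GetNextRequest: return the next object in OID order."""
        for candidate in self.mib:
            if candidate > oid:
                return candidate, self.mib[candidate]
        raise KeyError("end of MIB")

    def set(self, oid, value):
        """SetRequest: set the state of a managed object."""
        self.mib[oid] = value
        self.mib = dict(sorted(self.mib.items()))  # keep OID order
        return self.mib[oid]
```

The get_next operation is what lets a manager walk an agent's entire MIB without knowing its contents in advance, which is how table retrieval is done in SNMP.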

2.4 Object-Oriented MIB Modelling
The central point in a management system is the managed object. A managed object may be seen as the management view of a resource and is described by the following characteristics:
• Attributes, which denote specific characteristics of the resource.
• Operations, which are performed on a set of attributes.
• Behaviour, which specifies how the object reacts to operations performed on it.
• Notifications, which may be emitted to the managing station through a protocol, either in reaction to an external event or as a repeated action.

The managed object class provides a way to specify a family of managed objects. A managed object class is a template for managed objects that share the same attributes, operations, notifications and behaviour. A managed object is an instantiation of a managed object class.
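The class/instance relationship above can be sketched as follows. This is an illustrative analogy in Python, not the GDMO notation used by OSI; the attribute and notification names ("operStatus", "linkUp", "linkDown") are invented examples.

```python
class ManagedObjectClass:
    """Template fixing the attributes and notifications its instances share."""

    def __init__(self, name, attributes, notifications):
        self.name = name
        self.attributes = attributes          # attribute names and defaults
        self.notifications = notifications    # notification types instances may emit

    def instantiate(self):
        """A managed object is an instantiation of the class."""
        return ManagedObject(self)

class ManagedObject:
    def __init__(self, cls):
        self.cls = cls
        self.state = dict(cls.attributes)     # per-instance attribute values
        self.emitted = []

    def get(self, attr):                      # operation on an attribute
        return self.state[attr]

    def set(self, attr, value):               # behaviour: reject unknown attributes
        if attr not in self.state:
            raise AttributeError(attr)
        self.state[attr] = value

    def notify(self, kind):                   # emit a notification to the manager
        if kind not in self.cls.notifications:
            raise ValueError(kind)
        self.emitted.append(kind)
```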


The MIB is the conceptual repository containing all the related information about managed objects. MIB modelling encompasses an abstract model and an implementation model (KOTSAKIS 1995). The abstract model defines:
• the principles of naming objects
• the logical structure of management information
• the concepts related to managed object classes and the relationships between them.

The implementation model (BABAT 1991, KOTSAKIS 1995) defines the following:
• the platform for hosting a MIB
• the architectural principles for partitioning MIB information
• the database type (object-oriented or relational)
• the translation of the MIB object model into a schema.

The Management Information Model (ISO 1993) defines two types of management operations:
• operations applied to the object attributes
• operations applied to the managed object as a whole.

Attribute-oriented operations are as follows:
• get attribute value
• replace attribute value
• replace with default value
• add member
• remove member


Any operation may affect the state of one or more attributes. The operations may also be performed atomically (either all operations succeed or none is performed). The operations that may be applied to the managed object as a whole are the following:
• create
• delete
• action

An action operation requests the managed object to perform the specified action and to indicate the result of this action.

2.5 Distributed Management Information Base (MIB)
Roles are not permanently assigned to a management entity. Some management entities may be restricted to taking only an agent role, some to taking only a manager role, while others are allowed to take an agent role in one interaction and a manager role in a separate interaction. In order to perform systems management and share management knowledge, it is sometimes necessary to embody manager and agent within a single open system (see Figure 2-2). Shared management knowledge is implied by the nature of the management framework, since the management applications are distributed across a network. Therefore the management information base may naturally be viewed as a distributed database containing the managed objects that belong to the same management system but are physically spread over multiple sites (hosts) of a computer network. The MIB is considered a superset of managed objects. Each subset of this superset may constitute a set of objects associated with a device physically separated from any other managed device (ARPEGE 1994). Therefore the managed objects in each location may be viewed as a local management description of the managed device. The distributed design of a MIB may be considered a great advantage for the following reasons:

• Increased reliability and availability: Reliability is the probability that a system operates continuously during a time interval, whereas availability is the probability that the system is up at a particular moment. When the MIB is distributed over several sites, one site may fail while other sites continue to operate; only the objects associated with the failed site become inaccessible. This improves both reliability and availability. A failure in a centralised MIB, on the other hand, may make the whole system unavailable to all users.
• Data sharing with local control: A distributed MIB allows objects to be controlled locally at each agent. Objects that are available to one manager may be hidden from other managers.
• Improved performance: A distributed MIB implies the existence of smaller databases at each site. If a site combines the roles of manager and agent, the manager may gain faster access to the local MIB than any manager located at a remote site. This increases the performance of the system, since a set of managed objects may be accessed locally without the need to open a communication transaction over the network. In addition, a distributed MIB decreases the load (number of transactions) submitted to an agent compared with the load handled by a centralised MIB. Since different agents may operate independently, they may proceed in parallel, reducing response times.
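The availability gain from replication can be quantified with a back-of-the-envelope calculation that is not part of the thesis itself: if each site is up with independent probability a, a single-copy object is available with probability a, while an object replicated at n sites that needs only one live copy is available with probability 1 - (1 - a)^n. Note that majority-voting schemes require more than one live copy, so this figure is an upper bound for them.

```python
def single_copy_availability(a):
    """A single-copy object is available only when its one site is up."""
    return a

def replicated_availability(a, n):
    """An object with n independent copies, readable from any live copy,
    is unavailable only when all n sites are down simultaneously."""
    return 1 - (1 - a) ** n
```

For example, with sites that are each up 90% of the time, three copies raise the availability of a read-anywhere object from 0.9 to 0.999.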


Figure 2-2: Views of shared management knowledge

A typical arrangement of a management system is shown in Figure 2-3. The nodes may be located in physical proximity and connected via a Local Area Network (LAN), or they may be geographically distributed over an interconnected network (internet). It is possible to connect a number of diskless workstations or personal computers as managers to a set of agents that maintain the managed objects.

Figure 2-3: Simplified Management System.

As illustrated in Figure 2-3, some nodes may run as managers (such as the diskless node 1, or node 2 with disks), while other nodes are dedicated to running only agent software, such as node 3. Still other nodes may support both manager and agent roles, such as node 4. The interaction between manager and agent might proceed as follows:
1. The manager parses a user query and decomposes it into a number of independent queries that are sent separately to independent management agent nodes.
2. Each agent node processes the local query and sends a response to the manager node.
3. The manager node combines the results of the subqueries to produce the result of the original query.
4. If something occurs in an agent that changes its operational state, the agent may generate a notification and urgently send an associated message to the manager for further processing.

The agent software is responsible for local access to managed objects, while the manager software is responsible for most of the distribution functions; it processes all user requests that require access to more than one management node and it keeps track of where each managed object is located. An important function of the manager is to hide the details of data distribution from the user; that is, the user should be able to write global queries as though the MIB were not distributed. This property is called MIB transparency. A management system that does not provide distribution transparency leaves it to the user to specify the managed node associated with a managed object.
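The decompose/dispatch/combine interaction can be sketched as follows. This is a hedged illustration, not the thesis design: splitting a query by a domain prefix is an assumed stand-in for a real query planner, and the object names are invented.

```python
def decompose(query, domains):
    """Step 1: split a global query into one sub-query per agent domain."""
    return {d: [oid for oid in query if oid.startswith(d)] for d in domains}

def process(agent_mib, subquery):
    """Step 2: each agent node answers its local sub-query."""
    return {oid: agent_mib[oid] for oid in subquery}

def manager_query(query, agents):
    """Steps 1-3: decompose, dispatch, combine.  Distribution transparency
    means the caller names objects, never the agents that hold them."""
    result = {}
    for domain, subquery in decompose(query, agents.keys()).items():
        result.update(process(agents[domain], subquery))
    return result
```

The user-facing function manager_query takes only object names, which is the MIB transparency property described above: the mapping from objects to agent nodes is the manager's concern alone.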

2.6 Distributed Network Management


Systems are increasingly becoming complex and distributed. As a result, they are exposed to problems such as failures, performance inefficiency and resource allocation. An efficient integrated network management system is therefore required to monitor, interpret and control the behaviour of their hardware and software resources. This task is currently carried out by centralised network management systems, in which a single management system monitors the whole network. Most existing management systems are platform-centred; that is, the applications are separated from the data they require and from the devices they need to control. Although some experts believe that most network management problems can be solved with a centralised management system, there are real network management problems that cannot be adequately addressed by the centralised approach (MEYER 1995). Basically, there are four approaches to network management systems: centralised, platform based, hierarchical and distributed (LEINWARD 1993).

Currently, most network management systems are centralised. In a centralised management system (Figure 2-4.a) there is a single management machine (manager) which collects the information and controls the entire network. This workstation is a single point of failure: if it fails, the entire network could collapse. Even if the management host does not fail, a fault that partitions the network leaves the other part of the network without any management functionality. Centralised network management has shown itself inadequate for the efficient management of large heterogeneous networks. Moreover, a centralised system cannot easily be scaled up when the size or complexity of the network increases.

In the platform-based approach (Figure 2-4.b), the single manager is divided into two parts: the management platform and the management application. The management platform is mainly concerned with information gathering, while management applications use the services offered by the management platform to handle decision support. The advantage of this approach is that applications do not need to worry about protocol complexity and heterogeneity.

The hierarchical architecture (Figure 2-4.c) uses the concept of a Manager Of Managers (MOM) and the manager-per-domain paradigm (LEINWARD 1993). Each domain manager is responsible only for the management of its own domain and is unaware of other domains. The manager of managers sits at the higher level and requests information from the domain managers.


The distributed approach (Figure 2-4.d) is a peer architecture. Multiple managers, each responsible for a domain, communicate with each other as peers. Whenever information from another domain is required, the corresponding manager is contacted and the information is retrieved. By distributing management over several workstations, network management reliability, robustness and performance increase, while the network management cost in communication and computation decreases. This approach has also been adopted by the ISO standards and the Telecommunication Management Network (TMN) architecture (ITU 1995).

Figure 2-4: Network management approaches (a) centralised (b) platform based (c) hierarchical (d) distributed

A distributed system should use interconnected and independent processing elements to avoid having a single point of failure. Several reasons favour a distributed management architecture: a higher performance/cost ratio, modularity, greater expandability and scalability, and higher availability and reliability. Distributed management services should be transparent to users, so that they cannot distinguish between a local and a remote service. This requires the system to be consistent, secure and fault tolerant, and to have a bounded response time. Remote Procedure Call (RPC) (NELSON 1981) is a well-understood control mechanism used for calling a remote procedure in a client-server environment. The Object Management Group (OMG) Common Object Request Broker Architecture (CORBA) (OMG 1997) is also an important standard for distributed object-oriented systems. It is aimed at the management of objects in distributed heterogeneous systems. CORBA addresses two challenges in developing distributed systems (OMG 1997):
1. making the design of a distributed system no more difficult than that of a centralised one;
2. providing an infrastructure to integrate application components into a distributed system.

2.7 CORBA System
The most promising approach to solving the distributed interface and integration problem is the CORBA architecture (VINOSKI 1997). Although CORBA does not directly support a network management architecture, it provides a distributed object-oriented framework within which a management system may be developed. The main component of CORBA is the Object Request Broker (ORB). An ORB is the basic mechanism by which objects transparently make requests to each other, on the same machine or across a network. A client object need not be aware of the mechanisms used to communicate with or activate an object, how the object is implemented, or where the object is located. The ORB forms the foundation for building applications from distributed objects and for interoperability between applications in both homogeneous and heterogeneous environments. The OMG Interface Definition Language (IDL) provides a standardised way to define the interfaces to CORBA objects. The IDL definition is the contract between the implementor of an object and its clients. IDL is a strongly typed declarative language that is programming-language independent. Language mappings enable objects to be implemented in the developer's programming language of choice. CORBA services include naming, events, persistence, transactions, concurrency control, relationships, queries, security, etc. CORBA services are the basic building blocks for distributed object applications. Compliant objects can be combined in many different ways and put to many different uses in applications; they can be used to construct higher-level facilities and object frameworks that can inter-operate across multiple platform environments.
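The ORB's location transparency can be mimicked in a few lines of plain Python. This is an analogy only, not CORBA: the broker below resolves everything in-process, whereas a real ORB would marshal the invocation across a network, but the client-side contract is the same, since the client invokes through a proxy and never learns where or how the target object is implemented.

```python
class ObjectRequestBroker:
    """In-process analogue of an ORB: maps object names to implementations."""

    def __init__(self):
        self.registry = {}

    def register(self, name, obj):
        self.registry[name] = obj

    def resolve(self, name):
        """Clients receive a proxy, never the implementation itself."""
        return _Proxy(self, name)

class _Proxy:
    def __init__(self, orb, name):
        self._orb = orb
        self._name = name

    def __getattr__(self, method):
        # In real CORBA this lookup could cross machine boundaries;
        # the client code would be identical either way.
        target = self._orb.registry[self._name]
        return getattr(target, method)
```

A client written against the proxy keeps working unchanged if the implementation behind the name is later replaced or moved, which is the property CORBA exploits for heterogeneous systems.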

2.8 Implementing OSI Management Services for TMN
Recently the telecommunication industry has gained knowledge and experience in establishing management functionality through the Telecommunication Management Network (TMN) framework (ITU 1995). In the Internet community, on the other hand, the Simple Network Management Protocol (SNMP) has gained widespread acceptance due to its simplicity of implementation. Thus, TMN and Internet management will co-exist in the future. The aim of the TMN is to enhance the interoperability of management software and to provide an architecture for management systems. A TMN is a network logically distinct from the telecommunication network that it manages. It interfaces with the telecommunication network at several different points and controls its operations. The TMN information architecture is based on an object-oriented approach and the agent/manager concepts that underlie Open Systems Interconnection (OSI) systems management. The TMN is a framework for the management of telecommunication networks and the services provided on those networks, and the OSI management framework is an essential component of the TMN architecture. Each TMN function block can play the role of an OSI manager, an OSI agent or both. A managed object instance can represent a resource, and thus there is a requirement for communication between managed object instances in an OSI agent and the resources they represent. Examples of resources include telecommunication switches, bridges, gateways, etc. If a new interface card is added to a switch, the switch may send a create request to the agent for the creation of the corresponding managed object instance.


Figure 2-5 illustrates how TMN systems can inter-work within the TMN logical layer architecture (SIDOR 1998, FERIDUM 1996). In this architecture, system A manages system B, and B may, in turn, invoke operations on the information model of system C.

Figure 2-5: Inter-working TMN

The management information base (MIB) is the managed object repository and may be implemented using C++ objects through a MIB composer tool (FERIDUM 1996). In (BAN 1995) a uniform generic object model (GOM) is proposed for transparently manipulating managed objects of various specific object models (CORBA, OSI X.700, COM, etc.). Communication between managed object instances and resources can be initiated from either direction. Resource access components access managed object instances through the core agent. Some or all of the managed object instances in a MIB may be persistent, to allow fast recovery after an agent failure. There are two major design considerations in implementing persistence:

1. Performance: persistent managed objects ensure a fast restart after agent failures. For example, the instance representing a leased line between two communication nodes may need to be persistent, whereas an instance representing a connection does not (since, after an agent failure, the connection will be terminated). Object-oriented databases, traditional relational databases or even flat files can be used to implement selective persistence.
2. Synchronisation: when the agent restarts, managed objects must be updated to reflect the current state of the resources. Synchronisation requires the exchange of "are you there" and "what is the current value" type messages between managed objects and resources.

2.9 Replication in a Management System
Making managed objects persistent may increase the performance and the object availability offered during an agent failure. Replication may be used to increase further the performance and the availability of network managed objects. The major attractions of implementing replication are that it allows objects to be controlled locally at each replication site and it lets managers gain faster access to a MIB managed object by retrieving information locally, without the need to perform remote transactions. In that way the load is shared among many sites. To facilitate the use of a replication technique in a network management system, we may incorporate replication into a distributed framework (CORBA). CORBA has been designed to provide an architecture for distributed object-oriented computing, not network management. Engineers have focused their efforts on developing integrated management platforms to create, manage and invoke distributed telecommunication services; some of these efforts are (LEPPINEN 1997, MAFFEIS 1997a, RAHKILA 1997). The CORBA standard provides mechanisms for the definition of interfaces to distributed objects and for the communication of operations to those objects through messages. Unfortunately, the current CORBA standard makes no provision for fault tolerance. To provide fault tolerance, objects should be replicated across multiple processors within the distributed system (ADAMEC 1995). The motivations for applying object replication in a distributed network management system are of several types:
• One is performance enhancement. Management information that is shared by a large manager community should not be held at a single server, since this computer will act as a bottleneck that slows down responses.
• Another motivation is improved fault tolerance. When the computer holding one replica crashes, the system can proceed with the management computation using another replica.
• A further motivation is the use of replicas to access remote objects. When a remote object is to be accessed, a local replica reflecting the remote object's state is created and used instead of the remote object.


Figure 2-6: (a) replication (b) no replication

Figure 2-6 shows a typical scheme for implementing replication of management information. The agent updates the MIBs located at the manager sites by exchanging messages with the managers. Each manager gets the management information locally, without the need to issue remote requests. This yields a performance gain, since two additional instances of the MIB are used to provide information about the same resources. (MAFFEIS 1997b) discusses a CORBA-based fault-tolerant system that monitors remote objects; if some of them fail, it automatically restarts the failed objects and replicates stateful objects on the fly, migrating objects from one host to another. In (NARASIMHAN 1997) a similar system is discussed, which provides fault-tolerant services under CORBA to applications with no modification to the existing ORB.
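The update path of Figure 2-6(a) can be sketched as a simple push protocol; this fragment is an assumed illustration of the idea rather than any specific system cited above. When the agent changes a managed object, it sends the new state to every manager-site replica, so that subsequent reads at the managers stay local.

```python
class ReplicatedMIB:
    """One copy of the MIB, held either at the agent or at a manager site."""

    def __init__(self):
        self.objects = {}

    def apply(self, oid, state):
        self.objects[oid] = state

    def read(self, oid):
        # a local read: no remote request is issued
        return self.objects[oid]

class ReplicatingAgent:
    """Agent that pushes every update to the MIB copies at manager sites."""

    def __init__(self, local_mib, replicas):
        self.local_mib = local_mib
        self.replicas = replicas

    def update(self, oid, state):
        self.local_mib.apply(oid, state)
        for replica in self.replicas:   # propagate the new state to all copies
            replica.apply(oid, state)
```

Note that this sketch propagates updates unconditionally; keeping the copies consistent in the presence of failures and partitions is exactly the replica control problem addressed by the voting algorithms of later chapters.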


2.10 Need for Replication Techniques in a Management System
The Management Information Base (MIB) is viewed as a distributed database that stores information associated with the resources of a network or a remote system. Replication is applied to managed objects, which may be seen as abstract representations of the resources. In traditional database applications, the need for a replication scheme is straightforward, since the nature of the data easily allows the implementation of such a scheme. For instance, a company may have offices in different cities, or a bank may have multiple branches. It is natural for such applications to enforce replication, since such a scheme may drastically increase the fault tolerance of the system. If, for example, the software in some branch fails, information regarding the customers of that branch may be available from some other branch. A management information base differs from a traditional database in that the information stored in it corresponds to objects that represent software or hardware resources. For instance, a variable associated with a remote sensor may be considered a managed object, and its value may be viewed as an instantiation describing its state. The main questions that arise in a management database application are the following:
1. Do we really need to replicate such objects?
2. How useful is a replication scheme in a practical management system?

To answer these questions we illustrate an example of a network management system which manages network resources spread across an interconnected network. Figure 2-7 shows an interconnected network consisting of three networks. Each network constitutes a management domain. Each domain has a manager, an agent, a management information base (MIB) and some network resources that are monitored and controlled by the manager. Agents are responsible for collecting information from the resources and updating the relevant managed objects in the database to reflect the current state of the resources. The manager and the agent of each domain could be accommodated on the same host computer. However, we consider the most general case, in which the manager and agent reside on different computers. This could be the case where the manager runs on a diskless machine. There are two possible scenarios:

1. No replication: Each MIB stores information about managed resources of its own domain. 2. Replication: Each MIB contains replicated information which is associated with resources of other domains.
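The two scenarios can be contrasted in a few lines of Python. The sketch below is purely illustrative: the `MIB` class, the object names such as `ifSpeed.A`, and the lookup functions are invented for this example and do not belong to any management standard.

```python
class MIB:
    """A toy MIB: a dictionary of managed-object name -> state."""
    def __init__(self, objects):
        self.objects = dict(objects)

    def get(self, name):
        return self.objects.get(name)

def read_without_replication(name, local_mib, remote_agents):
    """Scenario 1: objects of other domains require a remote request."""
    value = local_mib.get(name)
    if value is not None:
        return value, "local"
    for agent_mib in remote_agents:        # stands in for a network request
        value = agent_mib.get(name)
        if value is not None:
            return value, "remote"
    return None, "not found"

def read_with_replication(name, local_mib):
    """Scenario 2: every MIB carries replicas, so all reads are local."""
    return local_mib.get(name), "local"

mib_a = MIB({"ifSpeed.A": 100})
mib_b_plain = MIB({"ifSpeed.B": 10})
mib_b_repl = MIB({"ifSpeed.B": 10, "ifSpeed.A": 100})  # replica of A's object

print(read_without_replication("ifSpeed.A", mib_b_plain, [mib_a]))  # → (100, 'remote')
print(read_with_replication("ifSpeed.A", mib_b_repl))               # → (100, 'local')
```

The same request that needs a cross-domain round trip in the first scenario is answered from the local MIB in the second.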


[Figure 2-7 depicts three management domains A, B and C, each with its own network, manager, agent, MIB and network resources, interconnected by bridges AB, BC and AC.]

Figure 2-7: Network management replication example

In a no-replication scheme, when a manager wants information about some network resources, it contacts the appropriate agent, which is responsible for providing this management information. The agent either collects the information dynamically from the resources or makes a relevant query to the MIB, and sends a response back to the manager. In a pure distributed management environment, if manager B wants some information about the network resources of domain A, it sends a request to agent A. Upon receiving the request, agent A undertakes to provide the requested information to manager B. In a hierarchical management system, this could be accomplished through


the manager A: manager B establishes manager-to-manager communication with manager A and asks A to request its local (domain) agent A to complete the task. In a replication scheme, replicas of managed objects exist on other MIBs. When an agent updates the state of a managed object locally, it also transmits the object state to the other domains and updates all the replicated objects that reside at other MIBs. Under this arrangement, when manager B wants some information about network resources A, there is no need to contact either agent A or manager A, but only its local agent B, since MIB B contains replicated managed objects of MIB A. This improves performance and speeds up the process of collecting network management information from remote systems. This becomes more obvious if network B is a remote network linked to network A by a low-speed leased line. In case of an agent failure, managers can still get information about particular managed objects from other MIBs, increasing the availability of the management information and making the management system more robust and fault tolerant. We may further increase performance and availability if we utilise dynamic migration of managed objects from faulty agents to other agents. This can be achieved by utilising CORBA-based replication techniques that may restart failed objects and replicate their state on the fly on another agent (MAFFEIS 1997b). It becomes clear that replicating managed objects to other agents' MIBs may increase the availability of certain managed objects and the performance of the management system, ensuring continuous network management without interrupting the monitoring and control of network resources.


2.11 Synchronous and Asynchronous replica models

In the context of research on fault tolerance, a system that has the property of always responding to a message within a known finite time interval is said to be synchronous. A synchronous replica system is one in which all update requests are ordered; that is, requests are processed at all replicas in the same order. Consider for example the replication model in Figure 2-8. The node M sends an update request r to all other nodes G1, G2, G3 and waits for responses. If it receives all the acknowledgements A1, A2, A3 from those nodes, it assumes that the update has been done successfully and proceeds to the next request. In a synchronous replication system the next request is forwarded only if the current update request has been processed at all the agencies holding replicas. A replication system not having this property is said to be asynchronous. That is, in an asynchronous replication system a node proceeds to the next request without waiting to get acknowledgements from all the recipients of the previous request. This results in an unordered processing of requests: a request received by node G1 may be processed in a different order than at node G2 or G3.

Figure 2-8: Synchronous Replication
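The discipline of Figure 2-8 can be sketched as follows; the `Replica` class and its method names are illustrative only. Node M forwards each request to every replica and waits for all acknowledgements before moving on, which guarantees that all replicas process the requests in the same order.

```python
class Replica:
    """A toy replica that applies update requests to a local log."""
    def __init__(self, name):
        self.name, self.log = name, []

    def apply(self, request):
        self.log.append(request)
        return True                          # acknowledgement A_i

def synchronous_update(replicas, requests):
    """Forward the next request only after all replicas acknowledge."""
    for r in requests:
        acks = [g.apply(r) for g in replicas]   # wait for A1, A2, A3
        assert all(acks), "update not acknowledged everywhere"

g1, g2, g3 = Replica("G1"), Replica("G2"), Replica("G3")
synchronous_update([g1, g2, g3], ["r1", "r2", "r3"])

# All replicas processed the requests in the same order:
print(g1.log == g2.log == g3.log)   # True
```

An asynchronous node would skip the wait on `acks`, so the logs at G1, G2 and G3 could end up in different orders.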

2.12 Replication Transparency and Architectural Model

A key issue relating to replication is transparency. Transparency (invisibility) determines the extent to which the users (managers) are aware that some objects are replicated. At one extreme the users are fully aware of the replication process and can even control it. At the other, the system does everything without the users noticing anything. The ANSA reference manual (ANSA 1989) and the International Standards Organisation Reference Model for Open Distributed Processing (ISO 1992a) provide definitions related to replication transparency. Among others, the standards state that a replication system is transparent if it enables multiple instances of an information object (in our case a managed object) to be replicated without knowledge of the replicas by users or application programs. A basic architectural model for controlling replicated objects may involve distinct agencies located across a network. Figure 2-9(a) shows how a manager may control the entire process in a non-transparent system. When a manager creates or updates an object, it does so on one agency and then takes responsibility for making copies or completing the update on the other agencies. An agency is a process that contains replicas and performs operations upon them directly. An agency usually maintains a physical copy of every logical item; there are cases, however, when it may not. For example, a managed object needed mostly by a manager on one LAN may never be used by a manager on another LAN. In this case the agency on the second LAN may not contain a physical copy of that object, and if this manager ever requests information about the object, the local agency may obtain the information by making a call to another agency that actually holds a physical copy. The general model for a transparent replication system is shown in Figure 2-9(b). A manager's request is first handled by a Front End (FE) component.


Figure 2-9: Architectural model for replication. (a) non-transparent system (b) transparent replication system (c) lazy replication (d) primary copy model.


The FE component is used for passing messages to at least one agency. This hides the details of how, and to which agency, a message is forwarded. The manager does not need to determine a specific agency for service; it just sends the message, and the FE component takes responsibility for determining which agency will receive the request. The FE component may be implemented as part of the manager application, or as a separate process invoked by a manager application using a form of interprocess communication (PRESOTTO 1990). Figure 2-9(c) shows a specialisation of the architectural model in Figure 2-9(b). The model in (c) is called a lazy replication model and implements what is called a gossip architecture (LADIN 1992). Here the manager creates or updates only one copy on one agency. Later the agency itself makes replicas on other agencies automatically, without the manager's knowledge. The replication server runs in the background all the time, scanning the managed object hierarchy. Whenever it finds a managed object with fewer replicas than expected, the replication server arranges to make the additional copies. The replication server works best for immutable objects, since such objects cannot change during the replication process. This architecture is called a gossip architecture because the replica agencies exchange gossip messages in order to convey the updates they have each received. In a gossip architecture the FE component communicates directly either with an individual agency or with more than one agency. Figure 2-9(d) shows another replication architectural model, known as the primary copy model (LISKOV 1991). In that model all front ends communicate with the same primary agency when updating a particular data item. The primary agency propagates the updates to the other agencies, called slaves. Front ends may also read objects from a slave. If the primary agency fails, one of the slaves can be promoted to act as the primary. Front ends may communicate with either a primary or a slave agency to


retrieve information. In that case, however, front ends may not perform updates; updates are made only to the primary copy of an object.
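The primary copy model can be sketched as follows; the `Agency` and `PrimaryCopyGroup` names are invented for illustration. Updates flow through the primary, which propagates them to the slaves, and on a primary failure the highest-ranking live agency takes over.

```python
class Agency:
    """A toy agency holding a physical copy of the replicated objects."""
    def __init__(self, name):
        self.name, self.store, self.alive = name, {}, True

class PrimaryCopyGroup:
    def __init__(self, agencies):
        self.agencies = agencies             # agencies[0] is the primary

    @property
    def primary(self):
        for a in self.agencies:
            if a.alive:
                return a                     # highest-ranking live agency
        raise RuntimeError("no live agency")

    def update(self, key, value):
        p = self.primary
        p.store[key] = value
        for slave in self.agencies:
            if slave is not p and slave.alive:
                slave.store[key] = value     # primary propagates to slaves

    def read(self, key):
        return self.primary.store.get(key)   # reads could also use a slave

group = PrimaryCopyGroup([Agency("P"), Agency("S1"), Agency("S2")])
group.update("linkStatus", "up")
group.agencies[0].alive = False              # primary fails
print(group.read("linkStatus"))              # "up": slave S1 is promoted
```

Because the slaves were updated before the crash, the promoted slave answers with the same value the primary held.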

2.13 Summary

This chapter has set the background for a replication management system. It has shown the need for using a replication scheme in a real-time application. It has examined the distributed aspects of a network management system, describing the distributed nature of the MIB. It has briefly discussed two major protocols (CMIP and SNMP) for exchanging management messages. It has also examined design aspects of the MIB, discussing the significance of the managed object as an autonomous entity for performing operations related to incoming messages. The concepts of object availability and performance have been defined and used as a measure of the quality of service of the system. Synchronous and asynchronous replica models have been examined, and finally various architectural models for replication have been discussed as a way to maintain multiple replicas transparently. The following chapters will discuss the internal mechanisms (algorithms) used to obtain transparent updates to replicated objects. We will discuss a variety of solutions that may be applied to ensure consistency among multiple replicas in the presence of node or communication link failures.


3. FAILURES IN A MANAGEMENT SYSTEM

This chapter discusses the nature of failures in a management system. It first defines the concept of dependability between management agents, and then classifies certain failures that may occur in a management system, further analysing each by its potentially disruptive behaviour. Failure semantics and masking are examined as a way to understand how failures may be masked by using certain techniques. The chapter ends by specifying some architectural issues, including synchronisation, communication and availability of certain components in a group of agents.

3.1 Dependability Between Agents

An agent provides certain management services that may be viewed as a collection of operations whose execution can be triggered by inputs from other agents (proxies), a manager, or the passage of time. An agent implements a management service without exposing to the manager the internal representation of the managed objects. Such details are hidden from the manager, who needs to know only the externally specified management service behaviour. Agents may build their services on services implemented by other agents. An agent U depends on the agent R if the correctness of U depends on the correctness of R's behaviour. The agent U is called the user and R is called the resource of U. Resources in turn might depend on other resources to provide their service, and so on, down to the managed objects. The managed object is the atomic resource, which is not analysed further and which is actually used to represent hardware or software components in a network. What is a resource at a certain level of abstraction


can be a user at another level of abstraction. The relationship between user and resource is a "depends on" relationship, as shown in Figure 3-1. A distributed management system consists of many agents. The management services provided by those agents may depend on other, secondary, low-level management services associated with operating system components as well as communication components. The union of all these management services is provided as a distributed management system service. To ensure correctness and management service availability, the classes of possible failures at the lower levels of abstraction should be studied, and redundancy in particular management services should be introduced to prevent system crashes.

3.2 Failure Classification

An agent designed to provide a certain management service works correctly if, in response to requests, it behaves in a manner consistent with the service specification. By an agent's response we mean any output that has to be delivered to the manager. An agent fails when it does not behave in the manner specified. The most frequent failures are the following:

Figure 3-1: Relationship between user and resource.


1. Omission Failure: occurs when the agent receiving a request omits to respond to that request. This happens either because the queue of incoming messages in the agent is full, so that any additional request is lost, or because an internal failure (e.g. a memory allocation failure) is experienced due to a temporary lack of physical resources for handling the incoming request. A communication service that occasionally loses messages but does not delay them is an example of a service that suffers omission failures.

2. Timing Failure: occurs when the agent response is functionally correct but untimely; the response occurs outside the specified real-time interval. The most frequent timing failure is the performance (late timing) failure, in which the response reaches the manager after the elapse of the time interval during which the manager is expecting it. This failure occurs because either the network is too slow or the agent is overloaded and is late in giving a response to the manager. An excessive message transmission or message processing delay due to an overload is an example of a performance failure.

3. Response Failure: occurs when the agent responds incorrectly: either the value of its output is incorrect (value failure) or the state transition that takes place is incorrect (state failure). A search procedure that "finds" a key that is not an entry of a routing table is an example of a response failure.

4. Crash Failure: occurs when, after the first omission to produce a response to a request, an agent omits to produce outputs for all subsequent requests.
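The taxonomy above can be summarised as a small decision procedure. The function below is purely illustrative and assumes the manager knows the expected value and the response deadline for a request.

```python
def classify(response, arrived_at, deadline, expected):
    """Classify an agent's behaviour for one request (illustrative only)."""
    if response is None:
        return "omission failure"               # no response delivered
    if arrived_at > deadline:
        return "timing (performance) failure"   # correct but too late
    if response != expected:
        return "response (value) failure"       # wrong output value
    return "correct"

print(classify(None, 0.0, 1.0, 42))   # omission failure
print(classify(42, 1.5, 1.0, 42))     # timing (performance) failure
print(classify(41, 0.5, 1.0, 42))     # response (value) failure
print(classify(42, 0.5, 1.0, 42))     # correct
```

A crash failure would show up in this scheme as a run of omission failures for every request after the first one.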

3.3 Faulty Agent Behaviour

To detect a failure, an agent should exhibit a behaviour that allows us to identify the occurrence of the failure, in order to perform the appropriate actions for handling it. The behaviour of an agent under the occurrence of a failure may be classified as follows:

 Fail-stop behaviour

 Byzantine behaviour

With fail-stop behaviour, a faulty agent just stops and does not respond to subsequent requests or produce further output, except perhaps to announce that it is no longer functioning. With Byzantine behaviour, a faulty agent continues to run, issuing wrong responses to requests and possibly working together maliciously with other faulty managers or agents to give the impression that they are all working correctly when they are not. In our study we assume only fail-stop behaviour.

3.4 Failure Semantics

The failure behaviour an agent can exhibit must be studied in order to suggest possible fault tolerance mechanisms. The recovery actions invoked upon detection of an agent failure depend on the likely failure behaviour of the agent. Therefore one has to extend the standard specification of an agent to include its failure behaviour. If the specification of an agent prescribes that the failure F may occur, it is said that the agent has F failure semantics (CRISTIAN 1991). If a communication service is allowed to lose messages, but the probability that it delays or corrupts messages is negligible, we say it has omission failure semantics. When the service is allowed to lose or delay messages, but it is unlikely to corrupt messages, we say that it has omission/performance failure semantics. Similarly, if an agent is likely to suffer only crash failures, we say that the agent has crash failure semantics. In general, if the failure specification of an agent A1 allows A1 to exhibit behaviours in the union of two failure classes F and G, we say that A1 has F/G failure semantics. An agent that has F/G failure semantics can experience more failure behaviours than an agent with F failure semantics,

and thus F/G is a weaker failure semantics than F. An agent that can exhibit any failure behaviour has the weakest failure semantics, called arbitrary failure semantics. Arbitrary failure semantics therefore includes all the previously defined failure semantics. It is the responsibility of the agent designer to ensure that the agent properly implements its specified failure semantics. In general, the stronger a failure semantics is, the more expensive and complex it is to build an agent that implements it.

3.5 Failure Masking

A failure behaviour can be classified only with respect to a certain agent specification, at a certain level of abstraction. If a management agent depends on lower-level agents to correctly provide management services, then a failure of a certain type at a lower level of abstraction can result in a failure of a different type at a higher level of abstraction. Let us consider the example in Figure 3-2. A manager M sends a request to the agent A, which in turn uses the agent B to get some information necessary to build a response to the manager's request. Suppose that B is unable to provide the necessary information to agent A due to either a communication failure (omission or performance failure) or a site failure (crash, value failure, etc.). Agent A is actually built one layer above B, and it may hide the failure of B either by using another agent, say C, that provides exactly the same information as B, or by trying to resolve the problem itself, playing the role of B as well (it may directly access the managed object hosted

Figure 3-2: Failure masking

Figure 3-3: Group masking

at agent B's site). Agent A may also change the failure semantics; that is, a crash failure in agent B may be propagated by agent A as an omission failure to the manager. Failure propagation among managers and agents situated at different abstraction levels of the "depends on" hierarchy can be a complex phenomenon. The task of checking the correctness of results provided by lower-level servers is very cumbersome, and for this reason designers prefer to use agents with failure semantics as strong as possible. Exception handling provides a convenient way to propagate information about failure detection across abstraction levels, and replication of certain services provides the mechanism for masking lower-level failures. An agent A that is able to provide certain services despite the failure of an underlying component is said to mask the component's failure. If the masking attempts of an agent do not succeed, a consistent state must be recovered for the agent before information about the failure is propagated to the next level of abstraction, where further masking attempts can take place. In this way, information about the failure of lower-level components can either be hidden from the human users by a successful masking attempt, or be propagated to human users as a failure of a higher-level service they requested. The programming of masking and consistent-state recovery actions is usually simpler when the designer knows that the components do not change their state when they cannot provide their

services. Agents which either provide their standard service or signal an exception without changing their state (called atomic (CRISTIAN 1989)) simplify fault tolerance, because they provide their users with simple-to-understand omission failure semantics. To ensure that a service remains available to managers despite agent failures, one can implement the service by a group of redundant, physically independent components, so that if some of these fail, the remaining ones provide the service. We say that a group masks the failure of a member m whenever the group (as a whole) responds as specified to users despite the failure of m. While hierarchical masking requires users to implement any resource failure-masking attempts as exception handling mechanisms, with group masking individual member failures are entirely hidden from users by the group management mechanisms. The group output is a function of the outputs of the individual group members. For example, the group output can be the output generated by the fastest member of the group, the output generated by some distinguished member of the group, or the result of a majority vote on the group member outputs. A group G has failure semantics F if the failures that are likely to be observed by users are in class F. An agent group able to mask any k concurrent member failures from its managers is termed k-fault tolerant; when k is equal to one, the group is single-fault tolerant, and when k is greater than one, the group is multiple-fault tolerant. For example, if the k members of an agent group have crash/performance failure semantics, with members ranked as primary, first back-up, second back-up, and so on, up to k-1 concurrent member failures may be masked. A group of 2k+1 members with arbitrary failure semantics, whose output is the result of a majority vote among outputs computed in parallel by all members, can mask up to k member failures. When a majority of members fail in an arbitrary way, the entire group can fail in an arbitrary way. Hierarchical and group masking are two


end points of a continuum of failure-masking techniques. In practice one often sees approaches that combine elements of both. For example, a manager M may send its request to the primary agent. If no response is received, the manager may try to send the request again. If no response is received the second time, the manager may assume that the primary agent has crashed and decide to send the same request to a secondary replica of the primary agent. The specific mechanisms needed for managing redundant agent groups in a way that masks member failures, and at the same time makes the group behaviour functionally indistinguishable from that of a single agent, depend critically on the failure semantics specified for the group members and the communication services used. The stronger the failure semantics of group members and communication, the simpler and more efficient the group handling mechanisms can be. Conversely, the weaker the failure semantics of members and communication, the more complex and expensive the group handling mechanisms become. The group handling cost increases as the failure semantics of group members becomes weaker. In (CRISTIAN 1985, EZHILCHELVAN 1986) families of solutions to a group communication problem are studied under increasingly weak group member failure semantics. Statistical measurements in practical systems confirm the general rule that the cost of group handling mechanisms rises when the failure semantics of group members is weak: while the handling cost for crash/performance failure semantics is 15% of the total throughput of a system (BONG 1989), the handling cost for arbitrary failure semantics can be over 80% (PALUMBO 1985). Since it is more expensive to build agents with stronger failure semantics, but cheaper to handle the failure behaviour of such agents at higher levels of abstraction, a key issue in designing multi-layered fault tolerant systems is how to balance the


amounts of failure detection, recovery and masking redundancy used at the various levels of a management system, in order to obtain the best overall cost/performance/dependability result. Recent research has shown that a small investment at a lower level of abstraction, for ensuring that lower-level components have stronger failure semantics, can often contribute substantial cost savings and speed improvements at higher levels of abstraction, and can result in lower overall cost (CRISTIAN 1991). On the other hand, deciding to use too much redundancy, especially masking redundancy, at the lower levels of abstraction of a system might be wasteful from an overall cost/effectiveness point of view, since such lower-level redundancy can duplicate the masking redundancy that higher levels of abstraction might use to satisfy their own dependability requirements (SALTZER 1984).

3.6 Architectural Issues

A prerequisite for the implementation of a management service by an agent group capable of masking low-level component failures is the existence of multiple hosts with access to the physical resources used by the service. For example, if a disk containing managed object instances can be accessed from four different agents, then all four agents can host management services for that management database. A four-member agent group can then be organised to mask up to three concurrent processor failures. Therefore, replication of the resources needed by a service is a prerequisite for making that service available despite individual resource failures. The use of agent groups raises a number of novel issues:

 Group synchronisation. How should group members running on different processors (or machines) maintain consistency of their local states in the presence of member failures, member joins, and communication failures?

 Group size. How many members should a group have?

 Group communication. How should agent groups communicate?

 Availability policy. How is it automatically ensured that the required number of members is maintained for agent groups despite operating system, agent and communication failures?

3.7 Group Synchronisation

The group synchronisation policy describes the degree of local state synchronisation that must exist among the agents implementing the management service. In other words, it describes the way the agents run on different machines. Two types of synchronisation may be applied: close synchronisation and loose synchronisation.

3.7.1 Close Synchronisation

Close synchronisation requires that local member states be closely synchronised with each other, by letting members execute all service requests in parallel and go through the same sequence of state transitions. The group output depends on the failure semantics assumed for its members. If group members can fail in arbitrary ways, majority voting is the most common method used. In a voting scheme a group answer is output only if a majority of the members agree. This group organisation masks minority member failures, at the price of slowing the group output down to the time needed for a majority of members to compute agreeing answers and for the voting process to take place. If a majority of members fail concurrently, then the group output can be incorrect. In the next chapter, certain voting techniques used to ensure correctness and consistency among replicated copies of a managed object are discussed further. Closely synchronised groups of servers have been used by other systems that have attempted to tolerate arbitrary server failures (HOPKINS 1978, WENSLEY 1978, HARPER 1988).
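The voting step can be sketched in a few lines; the function name is illustrative. With 2k+1 closely synchronised members, up to k arbitrarily faulty outputs are outvoted, while a faulty majority cannot be masked.

```python
from collections import Counter

def group_output(member_outputs):
    """Return the majority answer, or None if no strict majority exists."""
    value, count = Counter(member_outputs).most_common(1)[0]
    return value if count > len(member_outputs) // 2 else None

# k = 1, so 2k+1 = 3 members: one arbitrary (Byzantine) output is masked.
print(group_output([7, 7, 999]))     # 7
# A faulty majority cannot be masked:
print(group_output([999, 998, 7]))   # None
```

Returning `None` rather than a value corresponds to the group withholding its answer when no majority of members agree.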


Examples of closely synchronised groups of members with crash/performance semantics are described in (COOPER 1985, CRISTIAN 1990). A number of rules for transforming non-fault-tolerant services implemented by non-redundant application programs into fault-tolerant services implemented by closely synchronised server groups have been proposed in (LAMPORT 1984).

3.7.2 Loose synchronisation

In contrast to close synchronisation, loose synchronisation ranks the group members. It requires that only the highest-ranking group member processes service requests and records the current service state. The highest-ranking member (the primary) is also the one who sends the group output to users. All the other, lower-ranking members are regularly updated by the primary. If the primary fails, the next highest-ranking agent can be used to service the user. In this way, the failure can be masked from users, who experience only a delay in getting their responses. Examples of loosely synchronised server groups are discussed in (BIRMAN 1987, BONG 1989, CRISTIAN 1990, OKI 1988). The main advantage of loose synchronisation over close synchronisation is that only the primary agent makes full use of its share of the replicated service resources, while the secondary ones make only reduced use. The communication overhead needed in close synchronisation is also a drawback, since it requires agreement between the group members before any service is provided. The main drawback of loose synchronisation is that the delays seen by managers or other clients when group members fail are longer. The worst-case delay in answering a request after a primary failure is composed not only of the time needed to detect and reach agreement about the failure of the primary, but also of the time needed by the new primary to


restore old backups. For on-line transaction processing environments, such delays are considered critical. For real-time applications, if the required response time is smaller than the time needed to detect a member failure and restore old backups, close synchronisation has to be used.
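Loose synchronisation can be sketched as a ranked fail-over; all names below are illustrative. The client retries the primary and then falls back to the next-ranking backup, so a primary crash is experienced only as a delay.

```python
class Member:
    """A toy group member with a rank; rank 0 is the primary."""
    def __init__(self, rank):
        self.rank, self.state, self.alive = rank, {}, True

    def handle(self, key):
        if not self.alive:
            raise ConnectionError(f"member {self.rank} is down")
        return self.state.get(key)

def request(group, key, retries=2):
    """Try members in rank order, retrying each before failing over."""
    for member in sorted(group, key=lambda m: m.rank):
        for _ in range(retries):
            try:
                return member.handle(key), member.rank
            except ConnectionError:
                continue                 # retry, then move to the next rank
    raise RuntimeError("all members failed")

primary, backup = Member(0), Member(1)
primary.state["load"] = backup.state["load"] = 0.3   # primary updates backups
primary.alive = False                                # primary crashes
print(request([primary, backup], "load"))            # (0.3, 1): backup answers
```

The returned rank makes visible which member actually served the request; in a transparent system this detail would be hidden behind the FE component.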

3.8 Group Size

The more agents in a group, the greater its availability and its capacity for servicing requests in parallel. On the other hand, the more members a group has, the higher the cost of communication and synchronisation among the group members. Given a certain required service availability, one can use stochastic modelling and simulation methods to determine an optimum number of members, taking into account the individual agent failure rates, the failure semantics, the request arrival rates and the communication cost.
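The availability side of this trade-off can be illustrated numerically. Assuming independent member failures with per-member availability p, the probability that at least a quorum of the n members is up is a binomial sum; the function below is a sketch under that independence assumption, not a model taken from this thesis.

```python
from math import comb

def group_availability(n, p, quorum=1):
    """P(at least `quorum` of n members are up), failures independent."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(quorum, n + 1))

# With p = 0.95 per member, adding members quickly drives the probability
# that at least one member is up towards 1:
for n in (1, 2, 3, 5):
    print(n, round(group_availability(n, 0.95), 6))
```

Setting `quorum` to a majority, e.g. `group_availability(5, 0.95, quorum=3)`, models groups that need a majority of members to make progress rather than just one survivor.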

3.9 Group Communication

Communication among group members is more complicated than point-to-point communication using standard communication protocols such as TCP/IP, SNA or OSI. If the group state is replicated at several members, these members need to update the replicas using special group protocols that ensure replica consistency in the presence of process or communication failures. Although these special protocols are built over a transport layer such as TCP or UDP, they must provide special facilities that enhance the availability of the replicas, ensure consistency among replicas and enforce the correctness of certain operations on the replicated copies. Certain replica control protocols (algorithms) are discussed thoroughly in the next chapter. The aim of these protocols is to control the replicated data, providing facilities for group reconfiguration when certain failures occur.


3.10 Availability Policy

The synchronisation and replication policies defined for management services implemented by an agent group constitute the availability policy for that group. One possible approach to enforcing a certain group availability is to implement in each group member mechanisms for reaching agreement, as well as mechanisms for detecting member failures and procedures for handling new member joins. Another approach, adopted in (CRISTIAN 1990), requires group members to implement only an application-specific procedure needed for local members. The advantages of this approach are obvious only if we have to implement different services on different sites. The different services are provided through a service availability manager, which is used to forward requests on different services to the appropriate member or subgroup. When we need to implement just one service, this approach is not satisfactory, because it increases the total overhead. Another drawback is that no group availability policy will be enforced when the availability manager is down. In the case of a network management fault-tolerant system we have to implement just one service, associated with the management of certain network resources. Therefore the objective is to have a specified management service availability policy enforced whenever at least one site in the system works. This results in a need to replicate the global state on all working members of the group. To maintain the consistency of these replicated global managed objects at all sites, in the presence of random communication or site failures, each replica of a managed object should be updated in such a way that all sites see the same version of the managed object. If different availability managers see different sequences of updates, then their local views of the global system state will diverge. This might lead to violations of a specified agent group availability policy.

3.11 Group Member Agreement

To ensure consistency between group members, one needs to solve two major problems.
 First, to achieve agreement on member joins and failures. Every member should know those members of the group with which communication is possible. The protocols that ensure such agreement are called membership protocols.
 Second, to achieve agreement on the order of messages that are broadcast in the group. The protocols that ensure broadcast ordering are called atomic broadcast protocols.

Existing membership and atomic broadcast protocols can again be divided into synchronous and asynchronous. Synchronous protocols have strong timeliness properties and assume the existence of a time bound. They guarantee the propagation of information within bounded times and assume that group member clocks are synchronised within some constant deviation. Messages among members are exchanged within pre-specified time intervals and the delays are shorter than these intervals. If the delay exceeds the interval specified, a synchronous protocol may violate its safety requirements. Asynchronous protocols do not have time bounds and exhibit a weak timeliness property, but on the other hand they never violate their safety requirements, even when communication delays are unbounded and communication partitions occur. In the asynchronous approach, broadcasts that occur during a membership change have to be aborted. Designers using asynchronous protocols tolerate partitions by requiring that the correctly working members of the group form a quorum before any work can be done. This requirement is needed to prevent divergence among the members in distinct partitions. Synchronous protocols are discussed extensively in (CRISTIAN 1985, CRISTIAN 1988) and asynchronous ones in (BIRMAN 1987, CARR 1985).

Designers are thus faced with the following choice: attempt to ensure the existence of an upper time bound on message delays, or accept unbounded message delays. If an upper bound is achievable and this bound allows satisfactory speed, then one can adopt the synchronous approach, which guarantees strong timeliness properties and enables the system to work autonomously for as long as there is at least one member working correctly. On the other hand, if the cost of resolving inconsistency is higher than the cost of missing a recovery, one can adopt the asynchronous approach with timeout delays. The cost of using the asynchronous approach will then be weak timeliness and the need to worry about quorums.

The replica control protocols presented in the next chapter assume the existence of an asynchronous mechanism for detecting any change in the group membership and exchanging any information among group members. This is because the need to detect any failure in a consistent and safe way is greater in a critical real-time application such as network management than in a system that allows violations of its safety requirements. Timeliness is less important than safety, and the cost of compensating for inconsistency in a network management system is too high. The choice of an asynchronous approach in a network management replication system is thus an advantage, since it prevents any inconsistency among the states of group members in distinct partitions. Safety, in that case, is ensured by adopting a quorum consensus technique.


3.12 Summary

This chapter has proposed a number of concepts that are fundamental in designing fault-tolerant network management systems. Some of the concepts, such as the notion of dependability between management agents and the hierarchical structure of agents, are fundamental to any fault-tolerant distributed system. Dependability has been examined as a way to form a hierarchy of co-operative agents that work together to provide higher service availability.

Failure classification provides a way to understand the behaviour of certain failures and to set the background for a possible recovery technique. The most frequent failures have been discussed and the causes of such failures have been examined. The behaviour of a faulty agent may determine the actions adopted to recover from a particular anomaly. Fail-stop and Byzantine behaviour have been presented as two different ways in which a faulty agent may interact with other agents. A study of failure semantics has been presented and the distinction between weak and strong failure semantics has been examined in terms of failure behaviour. Hierarchical failure masking is a technique used to hide the effects of a failure by either calling replicated agents to provide certain services or trying to disguise a failure and letting a higher abstraction layer handle it.

Concepts such as group synchronisation, group size, group communication and availability policy relate to the architectural aspect of the network management system and they have been discussed from the designer's perspective. The architectural aspects are used, first, to formulate fault tolerance issues that arise in designing a replication management system and, second, to describe various design choices.

The next chapter will examine certain quorum consensus replica control protocol architectures used as the core mechanism for ensuring consistency among replicas in a group of agents. Replica availability of the proposed protocols will be used as a measure for examining the suitability of the protocols. In a following chapter we measure the replica availability by simulating the membership changes in a group of agents.


4. REPLICA CONTROL PROTOCOLS

This chapter presents the correctness criteria that should be taken into account when designing a replication system. It discusses the difference between the logical and the physical entity of a replicated object and explains the transaction processing strategy that satisfies the correctness criteria. An abstract model is introduced in order to study certain replication algorithms formally. A survey of a variety of replica control algorithms is presented and each replica control algorithm is discussed thoroughly. These algorithms constitute the internal mechanism of a replication scheme and they are basically used to ensure consistency among multiple copies of an object in the presence of network failures.

4.1 Partitioning in a Replication System

As shown in previous chapters, the technique of data replication in distributed database systems (such as a distributed MIB) is typically used to increase the availability and reliability of stored data in the presence of node failures and network partitions (DAVIDSON 1985, GIFFORD 1979, BERNSTEIN 1987, JOSEPH 1987, JAJODIA 1989, SARIN 1985). The idea of replicating an object at those sites that frequently access it may be implemented by storing copies of the object where access to it seems inevitable. By storing copies of critical objects on many nodes, the probability that at least one copy of the object will be accessible in the presence of failures increases.


In a replication system, a particular object may be replicated by replicating all its internal data and functionality at various sites. A partitioning occurs when the nodes in the network split into disjoint groups of communicating nodes, due to node or communication failure. The nodes in each group can communicate with each other but no node in one group is able to communicate with nodes in other groups. A reunion occurs when two subgroups are reunited into one single partition (or sub-network). When partitioning occurs, a dangerous situation arises: nodes in one partition might perform an update to an object while, at the same time, nodes in other partitions perform a different update to the same object. If these two updates conflict, it may be difficult or even impossible to resolve the conflict satisfactorily. The partitioned system is faced with a choice: either it accepts updates in more than one partition, in which case conflicts among copies of the same object are inevitable, or it accepts updates in at most one partition, in which case the availability of the replicated object is diminished. Therefore, the design of an appropriate replica control algorithm is necessary for handling read and write operations in the presence of network partitioning. Before proceeding to the examination of certain replica control algorithms, I will present the correctness criteria that may be taken into account when examining such algorithms.

4.2 Correctness in Replication

A database is correct if it correctly describes the external objects and processes the relevant information as expected. In the case of a distributed database, correctness is related to the effective and efficient representation of network resources. In theory, such a vague notion of correctness could be formalised by a set of data constraints, known as integrity constraints.


Let us consider a managed object X associated with a particular resource, which is stored at both sites A and B (see Figure 4-1). At the beginning both replicas XA and XB hold the same value V1. Suppose that a communication failure isolates the two sites. If there is no precaution, users can access the replicas XA and XB and update them differently. Since there is no communication between A and B, there is no way by which they may notify each other. Therefore, after the isolation, XA may hold the value V2 and XB the value V3. This causes an inconsistency, since two instances of the same logical object X hold different values. Users who access the object X transparently get different views of the represented resource. Obviously, the anomaly is caused by conflicting updates (write operations) issued in parallel by transactions executing in different partitions.

A management database is a set of logical data items (objects) that support, among others, read and write operations. Read and write operations are examined in particular because they can cause inconsistency. Other operations, such as notifications, traps etc., cannot violate consistency since they are generated by the managed object itself and not by an external user or manager. The managed object cannot change its state unless it is instructed to do so by an external element calling a write operation.

Figure 4-1 Replication anomaly caused by conflicting write operations: a) before isolation b) after isolation
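The scenario of Figure 4-1 can be replayed as a minimal sketch (the Replica type and function names are illustrative assumptions, not part of the thesis model): both copies start at V1, the partition permits only local writes, and the conflicting updates V2 and V3 leave the logical object inconsistent.

```c
#include <assert.h>

/* A physical copy of the logical object X; values stand for V1, V2, V3. */
typedef struct { int value; } Replica;

/* While sites A and B can communicate, a write reaches both copies. */
void write_connected(Replica *xa, Replica *xb, int v)
{
    xa->value = v;
    xb->value = v;
}

/* After the isolation, each site can update only its local copy. */
void write_local(Replica *x, int v) { x->value = v; }

/* Mutual consistency: both copies present the same value for X. */
int consistent(const Replica *xa, const Replica *xb)
{
    return xa->value == xb->value;
}

/* Replays Figure 4-1; returns 1 if the copies remain consistent. */
int partition_scenario(void)
{
    Replica xa, xb;
    write_connected(&xa, &xb, 1); /* before isolation: both hold V1 */
    write_local(&xa, 2);          /* site A writes V2 in its partition */
    write_local(&xb, 3);          /* site B writes V3 in its partition */
    return consistent(&xa, &xb);  /* 0: the replicas have diverged */
}
```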


All reads and writes are accomplished via transactions. Transactions are assumed to be correct. More precisely, a transaction transforms an initially correct database state into another correct state. Transactions may interact with one another indirectly by reading and writing the same data items. Two operations on the same object are said to conflict if at least one of them is a write (BERNSTEIN 1987). Conflicts are often labelled read-write, write-read or write-write, depending on the types of data operations involved and their order of execution (BERNSTEIN 1981).

A generally accepted notion of correctness for a database system is that it executes transactions so that they appear to users as isolated actions on the database. This property, referred to as atomicity, is achieved by the "all or nothing" execution of the transaction operations. In this case, either all writes succeed (committed transactions) or none are performed (aborted transactions). Correctness and consistency between operations performed in different transactions are ensured by assigning to any set of concurrent operations a serial execution that produces the same effect (serialisability). That is, a serialisable execution is a concurrent execution of many transactions that produces the same effects on the database as some serial execution of the same transactions.

Other correctness criteria may be expressed in the form of integrity constraints. Such criteria may range from simple constraints (e.g. a particular object cannot take a negative value) to more complex constraints that involve many replicas (e.g. all replicas must have the same view of a particular object any time they are accessed). In a system with integrity constraints, an operation is allowed only if its execution is atomic and its results satisfy the integrity constraints.

In a replicated database the value of each logical object X is expressed by one or more physical instances, which are referred to as the copies of X. Each read and write operation issued by a transaction on some logical data item must be mapped by the database system to corresponding operations on physical copies. The mapping must ensure that the concurrent execution of transactions on replicated objects is equivalent to a serial execution on non-replicated objects, a property known as one-copy serialisability (DAVIDSON 1985). The part of the replication system that is responsible for this mapping is called the replica control protocol (algorithm).

4.3 Transaction Processing During Partitioning

In a partitioned network, where the communication connectivity of the system is broken by failures or by communication shutdowns, each partition must determine which transactions it can execute without violating the correctness criteria. It is assumed that the network is "cleanly" partitioned (that is, any two sites in the same partition can communicate and any two sites in different partitions cannot communicate) and that one-copy serialisability is the correctness criterion. Addressing the correctness criteria implies satisfaction of the following propositions:
1. Correctness must be maintained within a single partition by assigning a single view to all the replicas in the partition.
2. Each partition must make sure that its actions do not conflict with the actions of other partitions.
Correctness within a single partition can be maintained by adopting one of the replica control algorithms. For example, the sites in a partition can implement a write operation on a logical object by writing all copies in the partition. The problem of ensuring one-copy serialisability across partitions becomes more difficult as the number of partitions increases. In theory, a replication scheme contains two algorithms: one to ensure correctness across partitions and a replica control algorithm to ensure one-copy


behaviour. In practice many replica schemes compose both algorithms into a single solution.

4.4 Partition Processing Strategy

Solving the problem of global correctness requires dealing with two matters:
1. When a partition occurs, sites executing transactions may find themselves in different partitions and thus be unable to decide whether to commit or abort the transaction.
2. When partitions are reconnected (reunited), mutual consistency between copies in different partitions must be re-established. By mutual consistency, it is meant that the copies have the same state (or value). The updates made to a logical object in one partition must be propagated to its copies in all the other partitions.

Partition processing strategies can basically be divided into two classes. The first is called optimistic and allows updates in all partitions in the network. The second is called pessimistic and allows updates to take place in only one partition. Optimistic protocols (BLAUSTEIN 1985, DAVIDSON 1984, SARIN 1985) hope that conflicts among transactions are rare. These algorithms take the approach that any copy of the replicated object must be available even when the network partitions. Optimistic algorithms require a mechanism for conflict detection and resolution. To preserve consistency, conflicting transactions are rolled back when partitions are reunited. Pessimistic protocols (GIFFORD 1979, ABBADI 1986, PÂRIS 1986a, JAJODIA 1989, KOTSAKIS 1996a) maintain the consistency of the replicated object even in the presence of network partitioning. Replicated objects are updated only in a


single partition at any given time. Thus only one partition holds the most recent copy, preventing in that way any possible conflict.

Optimistic protocols are useful in situations in which the number of replicated objects is large and the probability of partitioning small. Pessimistic protocols prevent inconsistency by limiting availability. Each partition makes worst-case assumptions about what other partitions are doing and operates under the assumption that if an inconsistency can occur, it will occur. Optimistic protocols do not limit availability and allow any transaction to be executed in a partition that contains copies of an object. Optimistic protocols operate under the optimistic assumption that inconsistencies, even if possible, rarely occur. Optimistic protocols allow conflicts among the transactions and try to resolve them when they occur. Pessimistic protocols do not allow conflicts and prevent any inconsistency by allowing updates only in a single partition. As a consequence, a pessimistic protocol is more suitable for real-time applications (network management applications etc.) than an optimistic one.

In a critical real-time application, like that of managing the operations of a satellite (or a nuclear reactor), the replicated data must be consistent at all times and any possible conflict should be prevented. Real-time processes interact dynamically with the external world (i.e. network resources). When a stimulus appears, the system must respond to it in a certain way before a certain deadline, taking into account all the current information. The time limit sometimes does not allow conflict resolution. If, for instance, the response is not delivered within a pre-specified time interval, the service may be considered unavailable (performance failure). The advantages of using a pessimistic protocol over an optimistic one in a distributed database system that is used as a repository for real-time applications are the following:
1. A pessimistic protocol prevents any inconsistency, whereas an optimistic one allows inconsistency and tries to resolve it later.


2. A pessimistic protocol has a faster response, since all the information needed by the protocol is available locally at the site. The protocol may decide to allow (or not allow) a particular update by using a local record kept at each site.
3. Optimistic protocols are useful in a situation in which the number of replicated copies is large and the probability of partitioning is small. This may be the case when applying replication over a Local Area Network (LAN), where the probability of a connection break is very small. In the case of applying replication across interconnected networks that encompass different technologies (like the satellite network presented in chapter 2), the probability of link failures increases.

When we design a replica control algorithm, the competing goals of availability and correctness must be seriously considered. Correctness can be achieved simply by suspending operations in all but one of the partition groups. On the other hand, availability can be achieved simply by allowing all nodes to process updates. It is obvious that it is impossible to satisfy both goals simultaneously; one or both must be relaxed to some extent depending on how critical the application is.

Relaxing availability is fairly straightforward: you simply disallow certain updates at certain sites. However, relaxing correctness usually requires extensive knowledge about the semantics of the replicated objects and the cost of any inconsistency.

4.5 An Abstract Model for Studying Replication Algorithms

As shown in a previous chapter, the network management database is a distributed database that may be viewed as a set of logical data items supporting two types of operations: read and write. Although the network management protocols (such as SNMP and CMIP, discussed in chapter 2) do not directly support read and write operations on managed objects, we can classify certain protocol operations into read and write activities. For instance, the M_GET operation of CMIP and the GetRequest (or GetNext) operation of SNMP may be considered read-class operations since they do not affect the state of the managed object. On the other hand, the M_SET operation of CMIP and the SetRequest operation of SNMP may be considered write-class operations since their aim is to change the current state of the managed object.

The replicated data items are physically stored at different sites. Each item is conceptually an object that encapsulates some internal data and provides a well-defined interface for accessing and updating the state of the object. The size of the objects is not important. An object may be as simple as a single variable holding a single value (known as a fine-grain object) or as complex as a subordinate database (known as a large-grain object) (CHIN 1991). Objects are physically stored at different sites. The state of an object is determined by the current values of the variables used to describe its attributes, that is, by giving a value to each of its variables. The state of the entire distributed database is composed of the individual states of all logical objects. The term logical object is used to distinguish the logical view of the object from its physical representation.

Figure 4-2 shows three different sites holding a copy of a sensor object. This object has a single attribute called temperature and two operations to read and update the temperature. The object has been instantiated with a temperature value equal to 25. In a replication scheme, each site must keep a copy (physical object) of the logical view of the sensor object. To ensure consistency, all the physical copies must adhere to the same logical view. Accessing any of the physical copies allows us to get exactly the same data.
Figure 4-2: Logical and Physical objects of the sensor entity.

The logical object provides a user-oriented view of the entity; it shows how the user expects to see the entity. In a replicated database, a logical object is assumed valid if all its physical representations are consistent and consequently have exactly the same state (same temperature). Each read and write operation issued on a logical object must be mapped to corresponding operations on physical copies.

A transaction is a process that issues read and write operations on the objects. Each of these operations may trigger a sequence of other operations in order to provide a particular access or update. For instance, the read operation may trigger an interrupt and read the temperature from a hardware device. The duration of the transaction is the time interval between the time a read or write operation is issued and the time the operation terminates. Transactions interact with one another indirectly by reading and writing the same logical object. As already noted, operations on the same logical object are said to conflict if at least one of them is a write (BERNSTEIN 1987). Therefore a conflict can occur in the following sequences of operations: read-write, write-read, write-write.

Transactions guarantee correctness only if they are executed as isolated actions. This property is referred to as atomic execution and it has the following effects:
1. The execution of each transaction is "all or nothing". Either all of the operations are performed or none are performed (atomic commitment).
2. Executing multiple transactions concurrently produces the same result as if they were executed in a serial manner one after another (serialisability).

In a replicated database, logical operations issued by a transaction are mapped to corresponding physical ones. The mapping must ensure that the concurrent execution of transactions is equivalent to a serial execution on non-replicated data, a property known as one-copy serialisability. The mechanism that performs this mapping is called the replica control protocol.
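The mapping from logical to physical operations for the sensor object of Figure 4-2 can be sketched as follows. This is a simplified read-one/write-all sketch; the function names and the three-site array are assumptions made for illustration, not part of the thesis model.

```c
#include <assert.h>

#define NUM_SITES 3

/* One physical copy of the sensor object per site, each holding the
 * single attribute "temperature" (instantiated with the value 25). */
typedef struct { int temperature; } SensorCopy;

SensorCopy copies[NUM_SITES] = { {25}, {25}, {25} };

/* Logical read: mapped to a physical read of one copy (here the first). */
int logical_read(void)
{
    return copies[0].temperature;
}

/* Logical write: mapped to physical writes on every copy, so that all
 * replicas keep the same view of the logical object. */
void logical_write(int t)
{
    for (int i = 0; i < NUM_SITES; i++)
        copies[i].temperature = t;
}

/* The logical object is valid when all physical copies agree. */
int logical_object_valid(void)
{
    for (int i = 1; i < NUM_SITES; i++)
        if (copies[i].temperature != copies[0].temperature)
            return 0;
    return 1;
}
```

Because every logical write reaches all copies, reading any physical copy yields the same data, which is exactly the validity condition stated above.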

When the system is partitioned, each partition must determine which transactions it can execute without violating the correctness criteria (atomic commitment and serialisability). This can be accomplished by considering the following statements:
1. Each partition must maintain correctness within its region.
2. Each partition must make sure that its actions do not conflict with the actions of other partitions.
Most of the proposed replica control protocols fulfil the conditions above in order to ensure consistency. The following sections examine thoroughly some replica control protocols and explain how they ensure consistency under network partitioning whilst providing at the same time a tolerable object availability.


4.6 Primary Site Protocol

This was originally presented as a resilient technique for sharing distributed resources (ALSBERG 1976). It suggests that one copy of an object is designated the primary copy and is thus responsible for all the activities of the object. All reads for an object must be performed at the primary site. Updates are propagated to all copies. In the case of partition failures, only the partition containing the primary copy can access the object. This approach does not work well in the case of general failures. In cases where it is difficult to distinguish the type of failure (site failure or communication failure), we cannot re-elect a new primary site. However, if we are able to distinguish these two types of failures, we can elect a new primary site when the original one fails (GARCIA 1982). Another very similar approach is that in (MINOURA 1982). It supports the primary copy notion except that the primary copy can change for reasons other than site failures; however, accessing a copy requires the use of a token. In principle, this approach uses the notion of the primary copy to keep consistency among distributed copies of a logical object.

The following shows how a primary site algorithm operates under partitioning. Let us consider a replication scheme that has n copies of a logical object X (Figure 4-3). The copies named X1, X2, ..., Xn depict physical replicated entities of the object X located at different sites connected via communication links. The X1 copy is hosted in the primary site P and all the others in secondary sites.

Figure 4-3. Replication using primary site algorithm.

Whenever a site wants to read the object X, it accesses the physical entity Xi nearest to the site. To avoid any inconsistency, this Xi copy should be in the same partition as the primary site. Therefore, each read(X) is translated to a read(Xi). Whenever a site wants to update the state of the object X, it broadcasts the update to all accessible sites, that is, each write(X) operation is translated to write(X1), write(X2), write(X3), ..., write(Xn). This approach is often called the "read one, write all" mechanism (BERNSTEIN 1987). When a partitioning occurs, sites that are members of a partition that does not contain the primary site cannot access the X object, that is, they cannot perform a read or write operation on it. However, sites that belong to the same partition as the primary site can fully perform any operation. Write operations performed by these sites update only those Xi entities that are in the primary partition (the primary partition is the partition that contains the primary site). When a reunion occurs, two or more partitions are united into one single partition. If this unified partition contains the primary site, all those sites that have lost previous updates become current by getting the latest version of the primary copy. In the case of a failure of the primary site, a new primary site may be elected (GARCIA 1982). When partitioning occurs, copies that are not in the primary partition are registered as unavailable or "not current". These copies cannot be accessed either for read or write.

The major functions that describe the behaviour of a primary site algorithm are as follows:

PrimaryPartitionMember(): It returns TRUE if the site is a member of the primary partition, otherwise it returns FALSE. The primary partition is that partition which contains the primary site.

The following data structures are used to describe certain concepts:
 Object: A logical object


 ObjectCopy: A physical object
 ObjectValue: The value of the logical object (it coincides with the value of the primary copy).
 ObjectCopyValue: The value of a physical object

DoRead(X): It reads the current state of the logical object X (Figure 4-4). This is translated to a physical read of the nearest copy of X. The following function shows the implementation of DoRead(X). This function returns TRUE if it succeeds, otherwise it returns FALSE. The notation used to describe the functions is C based (KERNIGHAN 1988). The function FindNearestCopy(X) returns the address of the nearest copy Xi of the logical object X.

Boolean DoRead(Object X)
{
    if (PrimaryPartitionMember()) {
        ObjectCopy x_copy = FindNearestCopy(X); /* finds the nearest copy */
        read(x_copy);                           /* read the copy */
        return TRUE;
    }
    else
        return FALSE;
}

Figure 4-4. Read in a Primary Site Protocol

DoWrite(X,v): It updates an object X to a new value v (Figure 4-5). The Update(X,v) function updates all the copies of X in the partition to the new value v and returns TRUE if it succeeds, otherwise FALSE. DoWrite returns TRUE if it succeeds.

Boolean DoWrite(Object X, ObjectValue v)
{
    if (PrimaryPartitionMember()) {
        if (Update(X,v))   /* updates all the copies in the partition */
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}

Figure 4-5. Write in a Primary Site Protocol

MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-6). The aim of this function is to update those copies that missed some updates due to partitioning. This function is executed by a site that becomes aware of a reunion occurrence.

Boolean MakeCurrent()
{
    if (PrimaryPartitionMember()) {
        /* Get the value of the object either from the primary copy or any
           other copy that resides in the primary partition */
        ObjectValue v = GetLatestCopyValue(X);
        /* Update all the copies in the partition that have missed previous
           updates. If this succeeds all these copies will become current
           and they can be accessed normally from that time on. */
        if (Update(X,v))
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}

Figure 4-6. Make Current in a Primary Site Protocol
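The PrimaryPartitionMember() predicate that all three functions above rely on can be sketched with a simple partition table. This is a hedged sketch under assumptions of my own: the thesis does not prescribe this representation, and set_partition() is a hypothetical helper standing in for the membership mechanism.

```c
#include <assert.h>

#define NUM_SITES 4
#define PRIMARY_SITE 0

/* partition_of[i] identifies the partition that site i currently belongs
 * to; sites sharing the same identifier can communicate with each other. */
int partition_of[NUM_SITES] = { 0, 0, 0, 0 };

/* Records a partitioning or a reunion by moving a site to a partition. */
void set_partition(int site, int partition)
{
    partition_of[site] = partition;
}

/* TRUE (1) if the site lies in the same partition as the primary site. */
int PrimaryPartitionMember(int site)
{
    return partition_of[site] == partition_of[PRIMARY_SITE];
}
```

A site whose identifier diverges from the primary's is exactly a "not current" site of the protocol: DoRead and DoWrite fail there until a reunion moves it back into the primary partition and MakeCurrent refreshes its copy.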

4.7 Voting algorithms In voting algorithms, every copy of a replicated object is assigned some number of votes. Every transaction must collect a read quorum of r votes to read an object and a write quorum of w votes to write an object. Quorums must satisfy the following two statement: 1) r+w > u 2) w > u/2

where u is the total number of votes assigned to a logical object. The first constraint ensures that there is a non null intersection between every read quorum and every write quorum. Any read quorum is therefore guaranteed to have a 69

current copy of the object. In a partitioned system, this constraint guarantees that an object cannot be read in one partition and written in another. Hence read-write conflicts cannot occur between partitions. The second constraint ensures that two writes cannot happen in parallel or, if the system is partitioned, that writes cannot occur in two different partitions on the same logical object. Hence write-write conflicts cannot occur between partitions. Each site that holds replicated objects maintains its own connection vector. A connection vector is recorded continuously and it indicates the connectivity of the site. It literally presents a mechanism by which the respective site knows what sites it can talk to. Communication failures and repairs are recorded in the appropriate connection vectors, so that all connection vectors in a single partition are identical. Each physical copy i is associated with a version number (VNi). The version number of a copy is an integer which counts the number of successful updates to the copy. This number is initially set to zero and it is incremented by one each time an update to the copy occurs. The current version number of a replicated object is the maximum taken over the version numbers of all copies of the object. A copy is said to be current if its version number is equal to the current version number. Through out this section, we assume that there is a logical object that is stored redundantly in n sites in a distributed system. Initially these sites are all connected and all physical copies are mutually consistent. Since the following protocols do not depend on the number of logical objects which are replicated , it is assumed for ease of exposition that there is just one logical object replicated in n sites. 4.7.1 Majority Consensus Algorithm The first voting approach was the majority consensus algorithm (THOMAS 1979). What will be described is the generalisation of that algorithm as proposed by Gifford 70

(GIFFORD 1979). We simplify the discussion of this protocol by assuming only one type of replicated physical object: weak copies are not considered, and we assume that all replicated copies are assigned the same number of votes. The following functions describe the behaviour of Gifford's approach.

DoRead(X): It reads the current state of the logical object X (Figure 4-7). This is translated to a physical read of the nearest copy of X within the partition. The function returns TRUE if it succeeds, otherwise FALSE. The function FindNearestCopy(X) returns the address of the nearest copy Xi of the logical object X. The function CollectReadVotes(X) gathers all the read votes assigned to the logical object within a certain partition. r is the read quorum, that is, the threshold for performing a read operation in the partition.

Boolean DoRead(Object X)
{
    if (CollectReadVotes(X) >= r) {
        ObjectCopy x_copy = FindNearestCopy(X);  /* finds the nearest copy */
        read(x_copy);                            /* read the copy */
        return TRUE;
    }
    else
        return FALSE;
}

Figure 4-7. Read in a Majority Consensus Algorithm

DoWrite(X,v): It updates an object X to a new value v (Figure 4-8). The Update(X,v) function updates all the copies of X in the partition to the new value v. DoWrite returns TRUE if it succeeds, otherwise FALSE. The function CollectWriteVotes(X) gathers all the write votes assigned to the logical object X within a certain partition. w is the write quorum, that is, the threshold for performing a write operation on the logical object X.


Boolean DoWrite(Object X, ObjectValue v)
{
    if (CollectWriteVotes(X) >= w) {
        if (Update(X,v))      /* updates all the copies in the partition */
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}

Figure 4-8. Write in a Majority Consensus Algorithm

MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-9). The aim of this function is to update those copies that missed some updates due to partitioning. This function is executed by a site that becomes aware of a reunion occurrence. GetLatestCopyValue(X) returns the instance of X with the greatest version number.

Boolean MakeCurrent()
{
    if (CollectWriteVotes(X) >= w) {
        /* Get the value of the object */
        ObjectValue v = GetLatestCopyValue(X);
        if (Update(X,v))      /* updates all the copies in the partition that have
                                 missed previous updates. If this succeeds, all
                                 these copies become current and can be accessed
                                 normally from that time on. */
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}

Figure 4-9. Make Current in a Majority Consensus Algorithm


4.7.2 Voting With Witnesses

The voting-with-witnesses approach was introduced by Pâris (PÂRIS 1986a, PÂRIS 1986b). In this approach a replicated object is a collection of mutually consistent entities, much like the previous approaches, but there are two types of replicated physical objects: full copies and witnesses. A full copy contains data, a version number and a pre-defined number of votes which entitles it to participate in all elections involving the replicated object. A witness contains only a version number that always reflects the most recent update of the object. Each witness is assigned a specific number of votes and is therefore also entitled to participate in all elections involving the replicated object.

Procedures for collecting read and write quorums are scarcely affected by the presence of witnesses. One can indeed select Gifford's original scheme or any of its several variants. In Pâris's approach read and write quorums are collected as if the witnesses were conventional copies, with the following restrictions:

1) every quorum must include at least one current copy
2) every write quorum must include at least one full copy.

Restriction (1) expresses the fact that one cannot read from a witness or use it to bring a copy up-to-date. Restriction (2) expresses the fact that writes have to be recorded in secondary storage.

4.7.3 Dynamic Voting

Dynamic voting (JAJODIA 1989) was introduced to increase the availability of replicated objects while maintaining consistency in the presence of partitioning caused by site or communication link failures. This algorithm belongs to the family of voting algorithms and is called dynamic because of its capability to adjust its internal overhead data dynamically in order to achieve higher availability. Jajodia's algorithm introduces

some additional attributes which are associated with each physical copy:

Update Site Cardinality (SC) reflects the number of sites participating in the most recent update to the object. Each site initially sets the Site Cardinality equal to the number of replicated copies n. Whenever an update is made to the object, the Site Cardinality is set to the number of physical copies which were updated during that update.

Among all the sites of the network that hold a copy there is a privileged site, called the Distinguished Site, that identifies one of the sites that participated in the last update. If the sites are ordered (1, 2, 3, ... etc.), this could be the site with the greatest number among those which participated in the last update.

A partition P is said to be a majority partition if either of the following two conditions holds:

1. The partition P contains more than half of the current copies of the object
2. The partition P contains exactly half of the current copies and, moreover, contains the Distinguished Site.

A copy is said to be current if its version number is equal to the current version number (the maximum taken over the version numbers). The Update Site Cardinality is an attribute similar to the connection vector, but it specifies which nodes participated in the last update. What follows are the procedures that implement Jajodia's algorithm.

IsMajority(): This determines whether a site is a member of the majority partition or not. Figure 4-10 shows the pseudo-code for the IsMajority() function.


#define AND &&

Boolean IsMajority()
{
    int n = NOfOnes(SV);   /* returns how many sites are working and can thus
                              communicate with each other. This is determined
                              by the number of flags that are up in the site
                              vector SV */
    if (n > SC/2)
        return TRUE;
    else if ((n == SC/2) AND Is_The_Distinguished_Site_In_The_Partition())
        return TRUE;
    else
        return FALSE;
}

Figure 4-10. IsMajority in the Dynamic Voting Protocol

DoRead(X): It reads the current state of the logical object X. This is translated to a physical read of the nearest copy of X. The function in Figure 4-11 shows the implementation of DoRead(X). It returns TRUE if it succeeds, otherwise FALSE. The function FindNearestCopy(X) returns the address of the nearest copy Xi of the logical object X. A read is permitted only if the site belongs to the majority partition, as determined by IsMajority().

Boolean DoRead(Object X)
{
    if (IsMajority()) {
        ObjectCopy x_copy = FindNearestCopy(X);  /* finds the nearest copy */
        read(x_copy);                            /* read the copy */
        return TRUE;
    }
    else
        return FALSE;
}

Figure 4-11. Read function in the Dynamic Voting Protocol

DoWrite(X,v): It updates an object X to a new value v (Figure 4-12). The Update(X,v) function updates all the copies of X in the partition to the new value v. DoWrite returns TRUE if it succeeds. DoWrite is executed by a site only if the site is a member of the majority partition.

Boolean DoWrite(Object X, ObjectValue v)
{
    if (IsMajority()) {
        if (Update(X,v))      /* updates all the copies in the partition */
            return TRUE;
        else
            return FALSE;
    }
    else
        return FALSE;
}

Figure 4-12. Write (update) in the Dynamic Voting Protocol

MakeCurrent(): It is called after the occurrence of a reunion (Figure 4-14). The aim of this function is to update those copies that missed some updates due to partitioning. It is executed by a site that becomes aware of a reunion occurrence. The function UpdateObject() in Figure 4-13 updates the physical copies to the most recent value.

MakeCurrent() describes all the steps needed to update a reunited partition. The first site S that becomes aware of the reunion sends a request to each site in the partition P and asks it to determine locally whether it belongs to a majority partition or not. If at least one site sends a positive response, the site S obtains the Version Number (VN) and Site Cardinality (SC) from that site and executes the update; otherwise it does the following. It finds the maximum VN of all the copies in the partition (MAX_VN) and the set I of sites that hold the most recent copy (that with the maximum version number). Let C be the Site Cardinality of any of the site members of I, and N be the cardinality of I. An update of the reunited partition can take place only if


1) N > C/2, or
2) N = C/2 and the Distinguished Site is in the current partition.

The first requirement ensures that the sites with the greatest VN form a majority within the partition. Recall that the Site Cardinality indicates the number of sites that participated in the last update; thus the Site Cardinality C of any of the sites in the set I gives the number of sites that participated in the most recent update, while the cardinality N of I indicates how many of those sites are present. Therefore, if N is greater than half of C, more than half of the sites that participated in the most recent update are present and an update is allowed, making the partition current. The second requirement ensures that, in the case N = C/2, the partition may be updated only if it contains the Distinguished Site. The Distinguished Site is used to "break ties" when a partition decomposes into two sub-partitions with an equal number of sites: the partition that contains the Distinguished Site is the one eligible to apply updates.


BOOL UpdateObject(Object X)
{
    /* Get the value of the object */
    ObjectValue v = GetLatestCopyValue(X);
    if (Update(X,v))      /* updates all the copies in the partition that have
                             missed previous updates. If this succeeds, all
                             these copies become current and can be accessed
                             normally from that time on. */
        return TRUE;
    else
        return FALSE;
}

Figure 4-13 Update in the Dynamic Voting Protocol

#define AND &&

Boolean MakeCurrent()
{
    BOOL found = FALSE;  /* TRUE if a site in the partition is a member of a
                            majority partition */
    Site S;              /* a site that is a member of a majority partition */
    for (all the sites Y in the partition) {
        Request_Majority(Y);             /* request each site to run IsMajority() */
        if (IsInAMajorityPartition(Y)) { /* is Y in a majority partition? */
            found = TRUE;
            S = Y;                       /* obtain VN and SC from S */
        }
    }
    if (found)
        return UpdateObject(X);          /* X is the replicated object */
    else {
        MAX_VN = max{VNi : Si ∈ P};      /* the maximum Version Number in the
                                            partition */
        I = {Si ∈ P : VNi == MAX_VN};    /* the sites holding the maximum
                                            Version Number */
        N = card(I);                     /* cardinality of the set I - number of
                                            sites whose version number equals
                                            MAX_VN */
        C = Site Cardinality of any site which is a member of I;
        if (N > C/2)
            return UpdateObject(X);
        else if ((N == C/2) AND (Distinguished Site ∈ P))
            return UpdateObject(X);
        else
            return FALSE;
    }
}

Figure 4-14 Make Current in the Dynamic Voting Protocol


4.7.4 Dynamic Majority Consensus Algorithm (DMCA) - A Novel Approach

The Dynamic Majority Consensus Algorithm (DMCA) is a novel approach that improves the availability of managed objects while at the same time ensuring consistency among replicated objects (KOTSAKIS 1996a, KOTSAKIS 1996b). The DMCA exploits the difference between the read rate and the write rate in order to increase the total availability of a managed object.

Choosing the objects that must be replicated is not an easy procedure; however, replicating managed objects that are read very frequently but updated rarely improves the performance of the system substantially. In a replication system, a write operation is translated to a set of physical write operations, each one applicable to a single copy of the object, whereas a read operation is translated to a single physical read applicable to the nearest copy of the object. Therefore, replicating objects that rarely change reduces the use of multiple write operations and thereby increases the performance of the system. In the DMCA approach we assume that replication is applied to those objects that are updated rarely. Under such a strategy, the rate of read operations issued to each copy of a managed object over a given interval of time is greater than the rate of write operations.

4.7.4.1 DMCA Assumptions

Before we proceed further with the algorithm it is necessary to make some assumptions.

1. Partitioning is caused by site or communication failures. When partitioning takes place, a single partition, called the main partition, is subdivided into two sub-partitions called secondary partitions. When a reunion occurs, two secondary partitions are merged into a single main partition.

2. All communication links are bi-directional.

3. A site crash or a link failure is detectable by all the sites that constituted the main partition. It is assumed that a mechanism is present by which each site knows what sites it can talk to. Each site maintains its own connection vector in which connectivity information is recorded continuously. Communication failures and repairs are recorded dynamically in the appropriate connection vector, so that all connection vectors in a single partition are identical. (DAVCEN 1985) discusses a mechanism for implementing a connection vector. If such a mechanism is not available, the alternative algorithms described in (JAJODIA 1987a, JAJODIA 1987b) can be used.

4. The sites in the network are ordered and identified by a distinct number.

5. Communication failures and repairs are recorded instantly by a mechanism similar to the connection vector. Looking up a connection vector we can find which nodes in the network can communicate. An implementation of the Connection Vector (CV) may represent the CV as a sequence of bits that reflects the connectivity of the site. If, for instance, site 7 has CV=100010101111, then sites 0, 1, 2, 3, 5, 7 and 11 (the bit positions set to 1 in CV) constitute a partition and can therefore communicate. All the sites belonging to the same partition have the same CV. Upon the occurrence of a failure or repair, the CV changes to reflect the new connectivity of the site.


6. The algorithm is applicable to a set of copies (replicas) of a single data item spread across the network in different sites. The data item is stored redundantly at n sites (n > 2).

7. Each replicated data item is associated with a set of variables used by the algorithm to ensure consistency and availability. These variables are discussed in the next section.

4.7.4.2 DMCA Maintenance Variables

Site Vector (SV). It is a sequence of bits, similar to CV, that indicates which sites participate in the most recent update (write operation) of the data item. When partitioning occurs and the data item in the main partition is current the SV is assigned the value of the CV. A data item is assumed current if either a write operation has occurred or the MakeCurrent procedure has been performed after the occurrence of a reunion. The MakeCurrent routine is explained in the following section



Site Cardinality (SC). It is an integer number that denotes the number of sites participating in the most recent update of the data item.



Read Quorum (r) determines the minimum number of sites that must be up to allow a read operation.



Write Quorum (w) determines the minimum number of sites that must be up to allow a write operation.



Current (CUR). It is a Boolean variable that indicates whether the data item is current or not. It is TRUE if the data item is current, otherwise it is FALSE.



Version Number (VN). It is an integer number that indicates how many times the data item has been updated. Each time an update is successfully performed, the VN increases by one. It is initially zero.

4.7.4.3 DMCA Basic Functions

DMCA uses the concept of a quorum to read and write a managed object. The Read Quorum (r) identifies the minimum number of sites that must be up to allow a read operation and the Write Quorum (w) identifies the minimum number of sites that must be up to allow a write operation. The DMCA algorithm is described by five routines that co-operate with each other to allow read and write operations to be performed in a partitioned system. These routines are as follows.

ReadPermitted is used by a site to determine whether a read operation is permitted or not. bitcnt(SV) is a function that counts the 1's in the Site Vector (SV). If this function is called after partitioning, given that the data item is current, the SV reflects the number of communicating sites in the partition. Therefore, if the number of sites in the partition satisfies r, a read operation is allowed.

BOOL ReadPermitted()
{
    /* It returns TRUE if a read operation is permitted */
    if (bitcnt(SV) >= r)
        /* the number of working nodes in the partition is greater than
           or equal to the read quorum */
        return TRUE;
    else
        return FALSE;
}

Figure 4-15 ReadPermitted in the DMCA


WritePermitted is used by a site to determine whether a write operation is permitted or not. This routine is similar to ReadPermitted, but it checks whether the number of sites that remain after partitioning satisfies the write quorum w.

BOOL WritePermitted()
{
    /* It returns TRUE if a write operation is permitted */
    if (bitcnt(SV) >= w)
        /* the number of working nodes in the partition is greater than
           or equal to the write quorum */
        return TRUE;
    else
        return FALSE;
}

Figure 4-16 WritePermitted function in the DMCA

DoRead is used when the site intends to read a replicated object. The only condition that must be satisfied in order to perform DoRead is that the number of bits set in the SV is greater than or equal to r.


BOOL DoRead(Object X)
{
    /* It returns TRUE if a read operation may be accomplished */
    if (ReadPermitted()) {
        /* read operation is permitted */
        ObjectCopy x_copy = FindNearestCopy(X);  /* finds the nearest copy */
        read(x_copy);                            /* read the copy */
        return TRUE;
    }
    else
        return FALSE;
}

Figure 4-17 DoRead function in the DMCA

DoWrite is used when the site intends to change the state of the replicated object. This routine first checks the write quorum w to see if a write operation is permitted. If so, it proceeds; otherwise it rejects the write operation. If the write quorum is satisfied, the site broadcasts the INTENTION_TO_WRITE message to all other sites in the partition. Each site, upon receiving this message, sends an acknowledgement. If the originator receives acknowledgements from all the sites in the partition, it performs the write operation and broadcasts the COMMIT message to all the sites in the partition; otherwise it broadcasts the ABORT message. If the connection vector changes during the operation of the algorithm we follow an approach similar to (JAJODIA 1989): if the Connection Vector changes after the issue of the INTENTION_TO_WRITE message but before the sending of the COMMIT message, the originator sends the ABORT message instead of COMMIT. Any site that has acknowledged the INTENTION_TO_WRITE message and receives the ABORT message terminates the write operation unsuccessfully.

Upon receiving the COMMIT message, a node modifies the maintenance variables associated with that object as follows:

VN = VN + 1
SC = bitcnt(CV)
SV = CV
CUR = TRUE
r = round(SC × RW / (RW + WW))
w = round(SC × WW / (RW + WW))

RW is the Read Weight and WW is the Write Weight. Because r and w are integers, the function round is used to round each result to the nearest integer. If the sum (r + w) is found equal to SC, we increase w by one to ensure consistency (r + w > SC). The Read Weight and Write Weight are associated with the probability of a read and a write respectively, as follows:

RW = 1 / ReadProb
WW = 1 / WriteProb


BOOL DoWrite(Object X, ObjectValue v)
{
    /* It returns TRUE if a write operation may be accomplished */
    if (WritePermitted()) {
        /* write operation is permitted */
        Broadcast(INTENTION_TO_WRITE);
        WaitForAck();
        if (AllAckReceived()) {
            Update(X,v);          /* updates all the copies in the partition */
            Broadcast(COMMIT);
            return TRUE;
        }
        else {
            Broadcast(ABORT);
            return FALSE;
        }
    }
    else
        return FALSE;
}

Figure 4-18 DoWrite function in the DMCA

If WriteProb is much less than ReadProb, then r is approximately zero and w is approximately equal to SC. This means that if read operations occur very frequently, they are very likely to be executed, since they require a small quorum. On the other hand, write operations are very unlikely to be executed in the case of partitioning, since they require a quorum approximately equal to SC. In most practical applications involving distributed management databases, ReadProb is approximately four or five times greater than WriteProb. This, of course, may vary depending on what policy we follow in replicating managed objects. For instance, if we choose to replicate all the objects that are updated very frequently, we should expect WriteProb to be greater than ReadProb; but such a choice does not increase the performance of the system, since each write is translated into a set of physical write operations that require extra network bandwidth.

MakeCurrent is performed after the occurrence of a reunion. This routine aims to update those copies of the object that came from a sub-partition in which write operations were not allowed due to a large write quorum. MakeCurrent is said to be successful if it sets the variable CUR=TRUE; if CUR is TRUE the object is considered current. The site that performs this routine broadcasts a request for quorum and waits for responses. If it receives the expected responses from all the sites in the partition it proceeds; otherwise it sends an ABORT message and MakeCurrent is considered to have failed. Each site in the partition, upon receiving the request for quorum, sends back to the originator the VN of its copy, the w of its copy and the state of the object. The originator collects all the responses and finds the maximum VN, the w corresponding to that VN (MWQ), the number of nodes that hold the maximum VN (MC) and the state of the object corresponding to a copy with the maximum VN. If MC ≥ MWQ, the local copy is assigned the state of the copy with the maximum VN and the following instructions are executed to update the maintenance variables:

VN = maximum VN
CUR = TRUE
SC = bitcnt(CV)
SV = CV


r = round(SC × RW / (RW + WW))
w = round(SC × WW / (RW + WW))

If MC is less than MWQ, the site should wait for some period of time and try again.

#define AND &&

Boolean MakeCurrent(Object X)
{
    Broadcast(REQUEST_FOR_QUORUM);
    WaitForResponse();
    if (AllResponseReceived()) {
        MVN = max{VNi : Si ∈ P};   /* the maximum Version Number in the
                                      partition */
        MWQ = WriteQuorum(Si : MVN == VersionNumber(Si));
                                   /* the write quorum of the sites holding
                                      the maximum Version Number */
        MC = the number of sites with version number equal to MVN;
        if (MC >= MWQ) {
            ObjectValue v = GetLatestCopyValue(X);
                                   /* v corresponds to a copy with the
                                      largest Version Number */
            Update(X,v);           /* updates all the copies in the partition */
            Broadcast(COMMIT);
            return TRUE;
        }
        else {
            WaitAndTryLater();
            return FALSE;
        }
    }
    else
        return FALSE;
}

Figure 4-19 Make Current function in DMCA

4.7.4.4 DMCA Sequence Diagram

A sequence diagram shows an interaction arranged in time sequence (UML 1997). In particular, it shows the objects participating in the interaction and the messages that they exchange. A sequence diagram has two dimensions: the vertical dimension represents time and the horizontal dimension represents objects; time proceeds down the page. Objects can be grouped into swim-lanes on a diagram. An object is shown as a vertical line called a "lifeline", which represents the existence of the object over a period of time. An activation is shown as a tall thin rectangle whose top is aligned with its initiation time. A message is shown as a horizontal solid arrow from the lifeline of one object to the lifeline of another object.

The DMCA algorithm involves two main objects: the user object, which issues read and write requests, and the replication manager object, which accepts and handles these requests and provides response messages to the user according to the DMCA replica control protocol. The user object can be part of a network management application. An instance of a replication manager object resides at every site that accommodates managed objects. A replication manager object may be seen as the object that a user communicates with in order to collect or set management information associated with network resources. In the following diagrams, for simplicity, one lifeline is drawn for all the replication manager objects involved in read and write operations and one lifeline is drawn for all user objects. Three diagrams are presented: one describing the DoRead function, one for the DoWrite function and one for the MakeCurrent function.

Figure 4-20: Sequence diagram for DoRead operation. The user object broadcasts a read request; the replication manager object checks IsReadPermitted and, if the read is allowed, finds the nearest copy, performs the read and returns the nearest-copy response.


Figure 4-21: Sequence diagram for DoWrite operation. The user object broadcasts a write request; the replication manager object counts all responses and, if all sites have responded and the write is permitted, performs the write and broadcasts the commit.

Figure 4-22: Sequence diagram for MakeCurrent operation. The originator broadcasts a request for quorum; if all responses are received it finds the maximum VN (MVN), the write quorum of the sites holding the maximum VN (MWQ) and the number of sites with VN equal to MVN (MC); if MC satisfies MWQ it calculates the latest copy value, updates all copies and broadcasts the commit.


The following table shows the mapping between the DMCA protocol operations and the CMIP and SNMP operations.

Table 4-1: DMCA mapping

DMCA     CMIP                           SNMP
Read     M_GET                          GetRequest, GetNextRequest
Write    M_SET, M_CREATE, M_DELETE      SetRequest

4.8 Summary

This chapter has presented a thorough treatment of replica control algorithms, especially those that use voting techniques. By first studying the criteria for correctness, it set the background for understanding the internal mechanisms used to ensure consistency among multiple replicas in a distributed database. Pessimistic and optimistic processing strategies were discussed as two alternative ways to establish a replication scheme. Pessimistic strategies have some advantages over optimistic ones in distributed database systems that serve as a repository for demanding applications: pessimistic algorithms provide faster response and higher availability, and they prevent any temporary inconsistency.

Certain replication protocols have been discussed. The primary site protocol is a static protocol that introduces the notion of the primary partition; only operations submitted from sites of the primary partition are allowed to execute. Among the voting algorithms, the following have been examined: Gifford's classic approach; Jajodia's dynamic voting technique, which enhances Gifford's approach by introducing a mechanism to change the read and write quorums dynamically; and a novel approach called DMCA, which improves on Jajodia's technique


since it is able to change the read and write quorums dynamically by taking into account the read/write ratio. Jajodia's algorithm implicitly assumes that reads and writes execute with the same probability (i.e. that they have the same occurrence rates); it cannot exploit a possible difference between the read rate and the write rate. Adjusting the read and write quorums according to the read and write occurrence rates may increase the availability of the replicated object and make the system more fault tolerant.

Chapter 6 provides a quantitative comparison between the approaches presented in this chapter and draws some conclusions about the availability provided by each replica control protocol. The next chapter discusses the model used to simulate certain replica control protocols.


5. ANALYSIS AND DESIGN OF THE SOFTWARE SIMULATION

This chapter presents the object-oriented development of the Availability Testing System (ATS). It first discusses the advantages of using the object-oriented paradigm for developing such a complex system. It then presents the simulation modelling process and briefly introduces the Object Modelling Technique (OMT), which has been used to construct the ATS. Following the development process imposed by the OMT, it discusses the requirements of the ATS, and then its analysis and design through a static and a dynamic object model.

5.1 Introduction to simulation modelling

Simulation should be understood as the process of designing a model of a real system and conducting experiments with this model for the purpose of understanding the behaviour of the system or of evaluating various strategies for its operation (SHANNON 1975). Simulation is classified according to the type of system studied and can be either continuous or discrete. For studying replication algorithms, discrete simulation is adequate to describe the behaviour of each algorithm.

There are two approaches to discrete simulation: event driven and process driven. Under event-driven discrete simulation, the modeller has to think in terms of the events that may change the status of the system (LAW 1991). In a replication system, for example, the status may change through the occurrence of events that cause partitions and reunions. The status of the


system is defined by a set of variables being observed. Under the process-driven approach, on the other hand, the modeller thinks in terms of the processes that a dynamic entity experiences as it moves through the system.

The simulation system that has been used to test the availability of the replication algorithms consists of certain dynamic entities. Dynamic entities are the objects that interchange information, thereby providing certain services by using the system resources. Entities may experience events which result in an instantaneous change of the system state. Some events are endogenous and occur within the system (replica updates) and some are exogenous and occur outside the system (read, write, partition and reunion operations). The aim of the simulation is to model the random behaviour of the system, over time, by utilising an internal simulation clock and sampling from a stream of random numbers.

5.2 Using an Object-Oriented Technique for Modelling a Simulation System

Simulation is a useful and essential technique for verifying the operability of systems with a large number of entities. The object-oriented paradigm has become popular in software engineering communities due to its modularity, its reusability and its support for iterative design techniques. The idea of an object-oriented simulation has great intuitive appeal in the application development process because it is very easy to view the real world as being composed of objects. An object-oriented technique introduces (1) information hiding, (2) abstraction and (3) polymorphism. Information hiding and data abstraction allow the simulation modeller to focus on the mechanisms that are important, discarding irrelevant implementation details. This gives the modeller the freedom to change implementation details of a system component at a later stage of development without redesigning or affecting other components.
The flexible behaviour of objects is realised through polymorphism and dynamic binding of methods. The binding to an actual function takes place at run-time and not at compile-time. In this way, inheritance provides a flexible mechanism by which code can be reused, since a derived class may specialise or override parts of the inherited specification. Object-oriented techniques offer encapsulation and inheritance as the major abstraction mechanisms to be used in system development. Encapsulation promotes modularity, meaning that objects must be regarded as the building blocks of a complex system. Once a proper modularisation has been achieved, the object implementor may postpone any final decisions concerning the implementation. Another advantage of an object-oriented approach, often considered the main advantage, is the reuse of code. Inheritance is an invaluable mechanism in this respect, since the code that is reused offers all that is needed. The inheritance mechanism enables the developer to modify (or refine) the behaviour of a class of objects without requiring access to the source code.

5.3 Object Oriented Discrete Event Simulation

In the object oriented paradigm, a program is described as a collection of communicating objects that represent separate activities in the real world and are able to exchange messages with each other. An object is an abstract data type that defines a set of operations performed on the internal data that express the object. Each object is an instance of a class. A class can be thought of as a template which produces objects. The object oriented paradigm has been successfully applied to a variety of fields of computer science and engineering. In distributed algorithms, the global system is decomposed into a set of communicating logical processes. These logical processes work concurrently to accomplish the objective of the distributed task. This concurrency is realised in a simulation system by sequential simulation of the execution time. The


sequential execution is achieved through a complex synchronisation mechanism, which guarantees the order in which the events are delivered to certain logical processes. The Availability Testing System (ATS) is a discrete event simulation system (MISRA 1986) and is realised through facilities which ensure synchronisation of events. The ATS manages an event list and provides event scheduling and dispatching methods. Messages are delivered to their destinations through a communication mechanism which manages the operations of sending and receiving messages. Each message is passed to a higher level class in time-stamp order. The classes that process the messages constitute a base framework which is used to test any voting algorithm. Each tested algorithm uses the facilities provided by the framework in order to complete read or write operations.

5.4 The Simulation Modelling Process

The simulation modelling methodology that has been used for the development of the ATS system has the following stages:

1. Problem formulation and objectives (requirements analysis)
2. Model design
3. Model implementation

5.4.1 Problem formulation

In this stage, the objectives (requirements) of the ATS system have been studied. The ATS is formulated as a system that should be able to measure the availability of certain replica control protocols. Such a simulation system should be able to handle random events in the same way a real system does. The types of these events are partitioning events, reunion events, read events and write events. A simulation model requires data. Without input data, the simulation model itself is incapable of generating any data about the behaviour of the real system it represents. The input data of the ATS simulation system are randomly generated according to the Poisson distribution. The distribution function of the inter-arrival time of all of the events handled by the system is therefore an exponential distribution (Poisson arrivals imply exponentially distributed inter-arrival times). The rate of occurrence of the events and the simulation time determine the number of events that occur during the observation time interval, which is equal to the simulation time interval. The ATS simulation system is a terminating system (SADOWSKI 1993). Terminating systems are systems that have a clear point in time when they start operations and a clear point in time when they end operations. ATS specifies a random sample of events and a simulation time length.

5.4.2 Model Implementation

The ATS simulation model is implemented using the C++ programming language (STROUSTRUP 1991). C++ is a good tool that supports program organisation through classes and class hierarchies. Classes help the developer to decompose a complex solution into simpler ones. Each class has its own internal data that may be updated through a set of defined operations. The encapsulation of code (operations) and data (variables) into a single entity helps the developer to focus on the design and implementation of smaller pieces of software and then unify all separate components to form the complete solution.

5.5 Object Oriented Analysis and Design

Object oriented development is a conceptual process whose greatest benefit is that it helps developers to express abstract concepts clearly. It can serve as a medium for specification, analysis and documentation of any system. This section presents the object oriented methodology used to express these concepts. The methodology, called the Object Modelling Technique (OMT) (RUMBAUGH 1991), consists of building a model of an application domain and then adding implementation details to it during the design of the system. The OMT methodology has three stages, named Analysis, Design and Implementation.

5.5.1 Analysis

Analysis is the first step of the OMT methodology. It starts from the statement of the problem and builds a model focusing on the properties of particular objects that are used to abstractly represent real world concepts. The analysis model is a precise abstraction of what the desired system must do, not how it will be done. The analysis clarifies the requirements and sets the basis for the later design and implementation. The output of the analysis phase consists of two models, named the object model and the dynamic model. The object model describes the static structure of the objects in a system and their relationships. The object model contains object diagrams. An object diagram is a graph whose nodes are object classes and whose arcs are relationships among classes. An object model captures the structural aspect of the system by showing the objects participating in the system as well as the relationships among them. The dynamic model describes the behavioural aspect of the system over time. The dynamic model is used to specify and implement the control aspects of the system. The dynamic model contains state diagrams. A state diagram is a graph whose nodes are states and whose arcs are transitions between states. Transitions are caused by events. An event is something that happens at a point in time and represents an external stimulus.

5.5.2 Design

Design emphasises a proper and effective structuring of the complex system, allowing an object oriented decomposition. During the design phase, high level decisions are made about the overall architecture of the system. The analysis phase determines what the implementation must do, and the design phase determines the full

definitions of the objects and associations used in the implementation, as well as the methods used to implement all the operations. During the design phase the development of the system moves from application domain concepts toward computer concepts. The classes, attributes and associations from analysis must be implemented as specific data structures.

5.5.3 Implementation

During implementation, all the design objects and associations are explicitly defined using a programming language (preferably an object-oriented one). The implementation language should provide facilities that help the developer to realise the concepts as defined during the design phase. One can fake an object oriented implementation using a non-object-oriented language, but it is horribly ungainly to do so. To have a smooth transition from the design phase to the implementation phase, the language should support the following features (CARDELLI 1985):

1. Objects that are data abstractions with an interface of named operations and a hidden local state
2. Objects that have an associated type (class)
3. Types (classes) that may inherit attributes from super-types (super-classes)

According to the Cardelli and Wegner definition (CARDELLI 1985), a language is said to be object oriented if it supports inheritance. Under this definition, Smalltalk (GOLDBERG 1983), C++ (STROUSTRUP 1991), Eiffel (MEYER 1992) and CLOS (KEENE 1989) are all object oriented languages and can be used to implement an object-oriented design.

5.6 ATS Requirements

The OMT methodology is used to develop a simulation for testing different replica control protocols. The final tool is called the Availability Testing System (ATS) and it aims to be used as a tool to measure the availability of certain replica control protocols. The rest of this chapter presents the analysis and design models that state the problem and describe the simulation application.

In a replicated system, the availability of a replicated object is defined as the conditional probability that an issued operation may be performed (BEAR 1988). The availability depicts the proportion of accepted operations: the system counts the total number of read and write operations issued during a specific time interval and marks those reads and writes that are performed. The ratio of the total number of operations performed to the total number of operations offered during a given observation time interval provides an estimate of the availability.

The Availability Testing System (ATS) has been built to test the availability of certain replica control protocols. The ATS system tests each protocol by simulating the network behaviour. It produces artificial failures and repairs that lead to a subsequent partition or reunion. At the same time, random read and write operations are issued at each node of the network. If an algorithm at a node can perform a particular operation, this operation is considered available. ATS is able to simulate a network of n sites. A set of m replica control protocols resides in each site. The ATS system tests the availability of each algorithm by generating events according to predefined occurrence rates. There are four types of generated events: Read, Write, Reunion and Partition. Read and Write events are generated locally at a site and therefore affect only the algorithm instances residing in that site. Partition and Reunion events are generated globally and affect all the instances of the tested algorithms across the network. Each of the m algorithms has been copied to each site.
When a site event (Read or Write) occurs, all algorithms in this site are executed and they change their state accordingly.


The ATS system provides automatic monitoring of various conditions at multiple sites. More precisely, it measures the following:

1. The number of read operations issued in each site
2. The number of write operations issued in each site
3. The number of read operations performed in each site
4. The number of write operations performed in each site
5. The mean availability of each operation across the whole network provided by each replica control protocol

ATS generates random events using an exponential distribution whose mean value is inversely proportional to the occurrence rate of the particular event. The system has a means to run all the candidate protocols at the same time and makes them react (or change their state) to exactly the same set of random events. Each instance of each algorithm runs independently on each site. It is the internal state of each algorithm which determines the execution of a read or write operation. ATS is just the vehicle which accommodates most of the overhead of running the simulation.

5.7 ATS Analysis

The purpose of the ATS system is to produce four different events (read, write, partition and reunion) and simulate the behaviour of each replica control protocol under the occurrence of each event. All the events are generated by an object called the Event Generator. The events are queued and the most imminent event is extracted from the queue. On the occurrence of a network event (partition or reunion), the Network manager object is used to handle the event and take the appropriate actions. The actions corresponding to a network event are the creation of a new partition or the unification of two sub-partitions into a single super-partition. On the occurrence of a node event (read or write), the Node object is used to handle the event by running the replica algorithm under testing. Each Node object is an autonomous entity simulating the behaviour of a separate site. The numbers of read and write operations performed are registered and finally a statistical object is called to measure the availability of each replica control protocol. All measurements are stored in a file for further processing. The main objective of the simulation is to get a practical estimate of the availability provided by certain replica protocols in order to draw useful conclusions about their effectiveness. During the evaluation of the replica control protocols, all the relevant rates with which the events are generated are taken into account. The rate at which an event is generated may affect the effectiveness of a certain protocol. Figure 5-1 shows the process diagram of the ATS system. In each site there is an instance of each algorithm. Each instance runs independently. ATS supervises all the actions that should be taken and handles all the events generated by the ATS event generator. Events may affect the state of an algorithm or the state of the whole system.

Figure 5-1. ATS process diagram


The protocol is defined as a class of objects. Each instance of this protocol is an object associated with a particular node in the network that holds replicated items. The ATS tests the protocol by simulating the network behaviour. ATS produces artificial failures and repairs that lead to a subsequent partition or reunion, and at the same time random reads and writes are issued at each node of the network. If the node can perform a particular operation, this operation is considered available. Virtually any replica control protocol can be ported to the ATS system, since the basic components used for instrumentation are fully reusable.

5.7.1 Object Model

A number of objects can be identified as the basic components of the Availability Testing System. Figure 5-2 shows the core objects and their relationships. Each object has its own responsibilities and co-operates with other objects to complete complex tasks. The basic stimulus is provided by the NetEventGenerator and the NodeEventGenerator objects. These objects inherit common characteristics from a base class called EventGenerator. The NetEventGenerator is used to generate partitions and reunions, and the NodeEventGenerator to generate read and write operations. The random events generated by the two generators are handled by certain objects of the system. A Network object handles network oriented events (partitions and reunions). Node objects handle node oriented events (reads and writes). The Network object is used to simulate the behaviour of the network. During run time, only one instance of this object may exist. The Network object consists of one or more Partitions. If there is no failure, only one partition exists. Each partition consists of one or more nodes. An undivided network has all its nodes in a single partition. Each node may hold many algorithms. Each algorithm represents a replica control protocol. An Algorithm may use many Messages. Messages are used to carry information from one node to another. Any replica control protocol is a specialisation of the Algorithm super class. Any message used by a replica control protocol is a specialisation of the Message super class. The Algorithm super class is an abstract class that provides the basic functionality that may be found useful to a replica control protocol; among other things, it provides a service for sending and receiving messages. Sending a message is implemented by forwarding the message to the local Node object. The Node

is responsible for delivering the message to the destination algorithm. Algorithm also provides read-only access to the connection vector. The connection vector is an object that represents the connectivity of a node: it indicates the nodes that constitute a partition. In my approach, the connection vector has been implemented as a sequence of bits. In the Algorithm class, the DoRead(), DoWrite() and HandleMessage() functions have been declared abstract, since their definitions are protocol specific. However, they provide a single interface over multiple implementations. The Message super class may be viewed as a message header containing the information necessary for the identification and delivery of the message. The data part of a message is defined as a specialisation of the Message class. Each Node object is able to interact with any other node within the same partition. Therefore a node can send and receive messages to and from any other node in the partition. Each partition having more than one node is able to split into two sub-partitions. Any two partitions can join into a single group. ATS has two tasks to complete: (1) to simulate the replica control protocols for a sequence of random events and (2) to collect statistics after the simulation. Statistics are collected by registering the values of four variables named readissued, writeissued, readexecuted and writeexecuted. These variables indicate the number of read and write operations issued and the number of read and write operations executed respectively. These variables provide an estimate of the availability.


[Figure 5-2 is a class diagram. It shows the EventGenerator base class with its NetworkEventGenerator and NodeEventGenerator specialisations (holding the partition/reunion and read/write rates), the Network, Partition and Node classes (a Network consists of one or more Partitions, each Partition of one or more Nodes), the Vector class implementing the connection vector as a sequence of bits, and the abstract Algorithm and Message super classes with protocol-specific specialisations (Alg1, Alg2, ...) and their messages. Algorithm declares the abstract operations DoRead(), DoWrite() and HandleMessage(Message).]

Figure 5-2. ATS object model

5.8 Dynamic Model


The dynamic model describes the flow of control, interactions and sequencing of operations in the system. The dynamic model for the ATS is shown in Figure 5-3. It basically consists of four state diagrams. Each state diagram describes an interactive aspect of the associated object. Network, Partition, Node and Algorithm are the four fundamental state diagrams that depict the sequencing of operations in the associated objects. Each of these diagrams includes sub-diagrams that refine the interactions and provide more details about the sequencing of the operations. The states of a sub-diagram are totally determined by shared events and conditions. For instance, the Network state diagram includes two sub-diagrams: one for simulation (Simulate) and one for collection of the statistics (CollectStatistics). The NetworkEventGenerator and NodeEventGenerator objects deliver events to both Network and Node. Each of these events triggers a sequence of operations. Dashed lines represent transitions between objects. Shared events carry information transferred from one object to another and trigger activities in those objects.

[Figure 5-3 contains the Network, Partition, Node and Algorithm state diagrams, with entry actions such as sending Simulate, CollectStat and HandleEvent to partitions, nodes and algorithms, Read and Write events that increase readissued and writeissued, and Send/Receive transitions that deliver messages to the destination algorithm.]

Figure 5-3. ATS dynamic model

5.9 Evaluation of the System

In the ATS system, the protocol is defined as a separate module which is then compiled with the rest of the system to form an executable module. More than one protocol may be tested at the same time (in a single run). All the protocols respond to the same set of events, allowing us to draw conclusions about the suitability of a certain replica control protocol. The ATS model incorporates most of the characteristics of an object oriented system (classification, polymorphism and inheritance) in order to define highly reusable components. It allows the testing of the availability of multiple replica control protocols without the need to modify the testing system. Porting a new protocol is fairly easy, since all it requires is the definition of the protocol and the definition of any related messages it uses. The ATS system remains unchanged, providing higher reusability with less effort.

5.10 Summary


Discrete event simulation using the object oriented paradigm has been shown to be a suitable approach for building complex simulation systems. Modularity and reusability help to decompose the system into co-operating processes that are related to independent simulation entities. The Object Modelling Technique (OMT) supports all of the necessary facilities for expressing object oriented concepts. OMT has been used extensively for analysing the requirements of the ATS system as well as for designing it. The whole ATS system is summarised in two diagrams: the object model and the dynamic model. The object model describes the static structure of the system, whereas the dynamic model describes its behavioural aspect. ATS allows the testing of the availability of multiple replica control protocols without the need to modify the main procedures of the testing system. This provides extra flexibility, since it makes it easy to port a new protocol without disturbing the core procedures of the simulation.


6. SIMULATION AND ESTIMATION OF REPLICA CONTROL PROTOCOLS

This chapter presents how the measurements of the performance of certain replica control algorithms have been obtained. It introduces the simulation model used to build a benchmark test utility for estimating the effectiveness of the algorithms. It discusses a fault injection mechanism for generating faults and repairs and specifies the environment in which the simulation evolves. It also defines the functional components of the simulation, as well as the parameters used to estimate the availability of the tested algorithms. The chapter ends with a thorough discussion of the contribution of the DMCA algorithm. It shows why the DMCA provides higher total availability and presents the results of the benchmark test. The DMCA is compared with two other representative voting algorithms (GIFFORD 1979, JAJODIA 1989).

6.1 Performance Evaluation

The performance evaluation of replica control protocols has become an area of great practical interest. In most cases, the most important aspect of this performance is the availability of the replicated objects managed by the protocol. The availability of the replicated data objects represents the steady-state probability that an object is available at any given moment. Several techniques have been used to evaluate the availability of replicated data. Combinatorial models are very simple to use (PU 1988) but cannot represent complex recovery modes like those found in voting protocols (GIFFORD 1979), (PÂRIS 1986b), (JAJODIA 1989) and (KOTSAKIS 1996b). Stochastic models have been extensively used to study replication protocols (JAJODIA 1990), (PÂRIS 1991) but suffer from two important limitations:

1. Stochastic models quickly become intractable, unless all failure and repair processes have exponential distributions.
2. Stochastic models do not describe communication failures well, since the number of distinct states in a model increases exponentially with the number of failure modes being considered.

Discrete event simulation does not suffer from these limitations. Simulation models allow the relaxation of most assumptions that are required for stochastic models. They can also represent systems with communication failures. For all its advantages, simulation has one major disadvantage: it provides only numerical results. This makes it more difficult to predict how the modelled system would behave when some of its parameters are modified. Each time the parameters change, the simulated system must be run again to obtain new results.

6.2 The Simulation Model

Most studies of replicated data availability have depended on probabilistic models to evaluate the availability of replica control protocols (JAJODIA 1990).

These models do not generally consider the effect of network partitioning, because of the enormous complexity that would be involved. As a result, the data that they present are for ideal environments that are unlikely to exist under actual conditions. Discrete event simulation has been used to observe the behaviour of three replica control protocols under more realistic conditions. Many parameters can affect the availability of replicated data. The simulation model considers the following types of failures:

1. Hardware failures, which result in a site being down for hours or even days.
2. Software failures, which result in the temporary disorder of a site. The site is in an operational state again after a simple reboot.
3. Link failures, which result in temporary partitioning.

The simulation model presented in this chapter is equally capable of handling all of these possible failure modes and also has the ability to simulate arbitrary network configurations. The modelled networks are assumed to be collections of nodes linked together by gateway sites or repeaters. Partitioning is caused by generating failures in gateways or repeaters, and reunions are realised by repairing failed gateways or repeaters.

Figure 6-1: Network Model

Figure 6-1 shows a typical network model that may be considered for hosting a replication network. It consists of two carrier-sense segments (IEEE 802.3 based LANs) and two token ring segments (IEEE 802.5 based LANs). The repeater and the gateways link together all of the network segments. Site failures are assumed to be fail-stop (SCHLICHTING 1983). The network will be partitioned into two or more partitions if the repeater or a gateway fails. Sites attached to the same local area network can still communicate after a repeater or gateway failure, but they are not able to communicate with a site attached to another LAN. All replicated objects are assumed to be available as long as they can be accessed from any site in the network.

6.3 Fault Injection

To evaluate the availability provided by each algorithm, simulation-based fault injection has been used. Simulation-based fault injection assumes that errors or failures occur according to a predetermined distribution. The faults are injected into the system in order to:

1. Study the behaviour of the algorithms in the presence of faults.
2. Evaluate the availability provided by the algorithms.

Figure 6-2: Fault Injection System

Figure 6-2 shows a fault injection environment, which typically consists of the replication system with the target replication algorithms, a fault injector, a repair injector, a work-load generator, a data collector and a data analyser. The replication system comprises all the groups of replicated nodes that can host the replication algorithms, as well as the necessary facilities for interchanging messages between nodes of the same group. The fault injector injects partitioning faults into the target system by injecting faults into certain gateways or repeaters. The repair injector injects reunions into the system. Each reunion corresponds to a repeater or gateway repair. The system also executes read and write operations on replicated objects. The read and write operations are generated by the work-load generator. The controller is physically the simulation program that runs and controls all the parts of the testing system. It also tracks the execution of read and write operations and initiates data collection. The data collector performs on-line data collection and the data analyser performs data processing and analysis. The injection of faults is done at run-time using the time-out technique. A timer expires at a predefined time, triggering an injection. The inter-arrival time between faults follows the exponential distribution. When the timer expires, fault operations occur, interrupting the normal operation of the system.

6.4 Simulated Algorithms

The three algorithms (GIFFORD 1979, JAJODIA 1989, KOTSAKIS 1996b) presented in the previous chapter have been tested. Each algorithm has been tested under exactly the same sequence of events. When an event (partition, reunion, read or write) occurs, it is inserted into a queue and then each algorithm performs the necessary housekeeping operations to reflect any change in the replication system. Each algorithm keeps its own record; when the simulation finishes and each algorithm reaches a steady-state condition, the simulation control unit counts the percentage of read and write operations that have been performed, giving in that way an estimation of the

114

availability provided by the algorithms for a particular set of parameters. The parameters considered in each run are the following:

1. Number of network nodes
2. Partitioning rate (number of partitions per time unit)
3. Repair delay (in time units)
4. Read rate (number of read operations per time unit)
5. Write rate (number of write operations per time unit)
6. Simulation interval (in time units)

The simulation interval should be large enough to guarantee a large number of reunions and partitions during the run; the greater the simulation interval, the more accurate the simulation will be.

Each physical site is represented by a site process, which is identified by a unique identity number. Each site process contains a work-load generator which generates Poisson read and write events; the inter-arrival times of the read and write operations follow the exponential distribution. The mean inter-arrival time for read operations is equal to 1/(read rate) and the mean for write operations is equal to 1/(write rate). The read and write operations

generated in each site realise the periodic access process to each replicated object. Each site provides a process which calculates the percentage of successful accesses; an access is considered successful if the corresponding operation (read or write) can be performed.

6.5 The Protocols' Routines

A general form for accessing the replicated objects is used in a common way by all of the tested protocols. When a read or write occurs, the read or write routine of each protocol is activated. Each protocol has its own view of the network state, and according to that view it executes all the subroutines needed to accomplish a read or

write access. The results of successful and unsuccessful accesses are recorded separately for each protocol. These results are gathered during execution and later compared in order to draw useful conclusions about the availability provided by each protocol.

A critical part of the model is determining whether two sites can communicate. Since all the protocols rely on communication between sites to determine the status of the replicated data, a fast and simple means of determining communication links is needed. For sites on the same LAN the solution is simple: if any two sites are up and running, it can be assumed that they can communicate. For sites not on the same network segment, this assumption cannot be made, since they may be separated by one or more gateway sites or repeaters. A solution is found by viewing the network as a tree structure, whose nodes consist of the different network segments and their respective sites. One segment is chosen as the basis, and communication is determined by traversing the tree between two sites. The tree structure is conceptual and is represented by doubly linked lists. The connectivity between sites is shown by a communication vector, which is realised through an array of bits. Each site is assigned a unique identity number; given the identity numbers of two sites, the communication routines determine whether they can communicate by checking the connectivity vector.

6.6 Implementing Group Communication

Sites in a communicating group exchange messages by using the multicast model. A simple implementation is based on simple message sending, which might take the form:

void multicast (PortID *destination, Message msg)
{
    /* Send msg to every member of the group. The loop bound was lost
       in the source text; nmembers is assumed here to denote the
       number of group members. */
    for (int i = 0; i < nmembers; i++)
        send(destination[i], msg);
}

The availability provided by Gifford's algorithm follows that of DMCA, but this is due to the fact that in Figure 6-7 the repair delay is too short.
As we see later, when the repair delay increases, the total availability of Gifford's algorithm becomes much smaller than that of DMCA, especially for λ