In Proceedings of the Third Communication Networks Symposium, July 8-9, 1996, Manchester, pp. 197-200
Dynamic Quorum Adjustment: A Consistency Scheme for Replicated Objects E.G.Kotsakis and B.H.Pardoe University Of Salford, Electrical and Electronic Engineering Dept., Salford M5 4WT, United Kingdom
ABSTRACT
Replication of networked objects may increase the availability of the objects and the fault tolerance of the system. Network failures may split a group of nodes into separate partitions. To ensure consistency among the replicated copies, we need a protocol (distributed algorithm) for controlling read and write operations on each copy. This paper proposes a protocol for handling multiple copies of an object. Our protocol exploits the difference between read rate and write rate in order to provide higher availability.

I. INTRODUCTION
Various distributed computing systems maintain replicated copies of the same object at different sites. A distributed object may be a file entity or an individual database entry. Since an object is replicated, no single failure can destroy any data. A replicated object is an abstract entity implemented as a set of physical objects containing the same data and allocated at different sites in the network. Most of the time, this complexity is hidden from the system's users; the system therefore has complete responsibility for maintaining consistency among the physical copies of each replicated object. This task is greatly complicated because each individual node is subject to failure. Several consistency schemes have been developed for maintaining mutual consistency [1,2,3,4]. One of the best-known schemes is the Quorum Consensus (QC) algorithm. QC algorithms tolerate both communication and site failures, thereby increasing the availability of the replicated object. QC algorithms are based on voting techniques: each replica is assigned a number of votes, a read is performed if the number of votes collected from the current partition is greater than a Read Threshold (RT), and a write is performed if the number of votes is greater than a Write Threshold (WT). In this paper, we present a QC protocol allowing dynamic adjustment of RT and WT. The dynamic threshold adjustment is accomplished by taking into account the rates of read and write operations. This approach allows us to configure the quorum according to the read and write operations issued, achieving higher total availability since the execution of the most frequent operations is promoted.
Section II discusses the specifications of a quorum consensus protocol and the assumptions under which the protocol is presented. Section III describes the maintenance variables that each copy must maintain. Section IV discusses the protocol and its fundamental procedures. The rest of the paper presents the results of a simulation program that has been used to measure the availability of replication protocols (algorithms).
II. SPECIFICATIONS OF A QC PROTOCOL
A distributed database system consists of a collection of independent computers, called nodes, connected via communication links. Both nodes and links may fail. A replica control protocol ensures correct management of the replicated object in the presence of communication failures. When such failures occur, the network may be partitioned into two or more independent groups called partitions. Nodes within a group are able to communicate with each other, but no node in one group is able to communicate with nodes in other groups. To ensure consistency, at most one group (partition) is allowed to perform updates (write operations). When a node or communication link recovers, two or more sub-partitions may be reunited, forming a new partition; the nodes of that partition are then updated accordingly, obtaining any missing updates. Quorum Consensus algorithms are a general class of synchronisation protocols for distributed systems. An operation proceeds to completion only if it can obtain permission from the other nodes that constitute a quorum group [5]. Two thresholds, one for reading (RT) and one for writing (WT), are used to ensure consistency. We require both (2WT) and (RT+WT) to be greater than the total weight of all copies of a given object [6]. These two conditions are called the quorum intersection invariants. The first one (2WT > total weight) prevents two simultaneous write operations on the same object. The second one prevents a read and a write operation from taking place simultaneously in different partitions. In our approach, we consider the following assumptions:
1. All nodes are assumed to have the same weight.
2. All communication links are bi-directional.
3. A site crash or a link failure is detectable by the sites affected by the failure.
4. The sites in the network are ordered (i.e. site 0, site 1, site 2, ...).
5. Communication failures and repairs are recorded instantly in the Connection Vector (CV). The CV is a sequence of bits that reflects the connectivity of the site. If, for instance, site 7 has CV = 100010101111, then sites 0, 1, 2, 3, 5, 7, 11 (the bit positions set to 1 in the CV) constitute a partition, and therefore these sites can communicate. All sites belonging to the same partition have the same CV. Upon the occurrence of a failure or repair, the CV changes accordingly to reflect the new connectivity of the site.
6. The algorithm is applicable to a set of copies (replicas) of a single data item spread across the network in different sites. The data item is stored redundantly at n sites (n > 2).
7. Each replicated data item is associated with a set of variables used by the algorithm to ensure consistency and availability. These variables are discussed in the next section.
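To make assumption 5 concrete, the CV bit string can be decoded as follows. The paper gives no code; this is a minimal Python sketch, and the function name and the right-to-left bit convention (least significant bit = site 0, which matches the example above) are our assumptions.

```python
def sites_in_partition(cv: str) -> list:
    """Decode a Connection Vector given as a bit string.

    Bit position i, counted from the right (least significant bit),
    corresponds to site i; a 1 means the site is reachable.
    """
    return [i for i, bit in enumerate(reversed(cv)) if bit == "1"]

# The example from the text: site 7 has CV = 100010101111.
print(sites_in_partition("100010101111"))  # [0, 1, 2, 3, 5, 7, 11]
```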
III. MAINTENANCE VARIABLES
Each replicated copy maintains a set of variables used to control the read and write operations issued to that copy. These variables are the following:
• Site Vector (SV): a sequence of bits, similar to the CV, that indicates which sites participated in the most recent update (write operation) of the data item.
• Site Cardinality (SC): an integer denoting the number of sites participating in the most recent update of the data item.
• Read Threshold (RT): the minimum number of sites that must be up to allow a read operation.
• Write Threshold (WT): the minimum number of sites that must be up to allow a write operation.
• Site Vector Instant Update (SVIU): a Boolean variable indicating whether the Site Vector should be updated instantly after a partitioning.
• Version Number (VN): an integer indicating how many times the data item has been updated. Each time an update is successfully performed, the VN increases by one. It is initially zero.
Dynamic adjustment of RT and WT is obtained using the following formulas. Let

    β = ReadRate / WriteRate

Then

    WT = (1/2)·SC        if β ≤ 1
    WT = (β/(1+β))·SC    if β > 1

    RT = (1/2)·SC        if β ≤ 1
    RT = (1/(1+β))·SC    if β > 1
If (RT+WT) is equal to SC, WT is increased by one to satisfy the consistency conditions described in the previous section.
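The formulas and the adjustment rule above can be sketched in Python as follows. The paper does not say how fractional thresholds are handled, so rounding up to whole sites is our assumption, as is the function name.

```python
import math

def adjust_thresholds(sc: int, beta: float) -> tuple:
    """Recompute (RT, WT) from the Site Cardinality SC and the
    read/write rate ratio beta, following the piecewise formulas
    above. Fractional thresholds are rounded up to whole sites
    (an assumption; the paper does not specify the rounding)."""
    if beta <= 1:
        rt = wt = math.ceil(sc / 2)
    else:
        wt = math.ceil(beta / (1 + beta) * sc)
        rt = math.ceil(sc / (1 + beta))
    # If RT + WT equals SC, bump WT by one to restore the quorum
    # intersection invariant RT + WT > SC.
    if rt + wt == sc:
        wt += 1
    return rt, wt

# With 10 copies and reads three times as frequent as writes,
# writes need a large quorum while reads get a small one:
print(adjust_thresholds(10, 3.0))  # (3, 8)
print(adjust_thresholds(10, 0.5))  # (5, 6): the bump restores RT+WT > SC
```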
IV. THE REPLICATION PROTOCOL
The protocol uses three basic procedures:
• DoRead: performs a read operation.
• DoWrite: performs a write operation.
• MakeCurrent: updates copies that missed some updates due to partitioning.
When a system event occurs (partition or reunion), the system executes a callback function to handle that event. This callback does the following:
• If the event is a partition and the SVIU flag is up (instant update of SV is allowed), then SV is assigned the value of the Connection Vector (CV).
• If the event is a reunion, the MakeCurrent procedure is executed.

DoRead operation
Every time a DoRead operation is issued, the following condition is checked to see whether a read quorum can be formed: if the number of 1's in SV is greater than RT, the read is performed; otherwise the read is rejected.
DoWrite operation
A write operation is performed if the following condition is satisfied: if the number of 1's in SV is greater than WT, the write is performed and committed; otherwise the write is rejected.
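The two quorum checks can be sketched as follows; this is an illustrative Python sketch (all names are ours), omitting the physical read/write and locking.

```python
def count_ones(sv: str) -> int:
    """Number of 1 bits in a Site Vector given as a bit string."""
    return sv.count("1")

def do_read(sv: str, rt: int) -> bool:
    """DoRead check: accept the read only if the number of 1's in
    SV exceeds the Read Threshold."""
    return count_ones(sv) > rt

def do_write(sv: str, wt: int) -> bool:
    """DoWrite check: accept the write only if the number of 1's in
    SV exceeds the Write Threshold; an accepted write then runs
    under the commit protocol described in the text."""
    return count_ones(sv) > wt

sv = "0101111"            # five sites hold the current version
print(do_read(sv, 3))     # True: 5 > 3, a read quorum forms
print(do_write(sv, 5))    # False: 5 > 5 fails, the write is rejected
```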
The write operation is executed using a commit protocol [6,7]. Let A denote the node that is to perform the write. We assume that the state of the partition does not change during message passing; if such a change occurs, the procedure aborts. Node A broadcasts an intention_to_update message. Upon receiving that message, each node in the partition sends an acknowledgement to A along with its version number (VN). When A has collected all the acknowledgements, it performs a physical write and broadcasts the commit message along with any missing updates. Upon receiving the commit message, each node performs the following:
    VN = VN + 1
    SC = number of 1's in CV
    SV = CV
    AdjustThreshold()
    SVIU = TRUE
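These steps can be sketched as a per-node commit handler. This is an illustrative Python sketch, not the paper's implementation: the Replica record, the inlined AdjustThreshold (the section III formulas, with thresholds rounded up to whole sites), and the assumption that each node tracks β locally are all ours.

```python
import math
from dataclasses import dataclass

@dataclass
class Replica:
    vn: int = 0        # Version Number
    sc: int = 0        # Site Cardinality
    sv: str = ""       # Site Vector (bit string, rightmost bit = site 0)
    rt: int = 0        # Read Threshold
    wt: int = 0        # Write Threshold
    sviu: bool = True  # Site Vector Instant Update flag

def on_commit(replica: Replica, cv: str, beta: float) -> None:
    """Steps a node performs on receiving a commit message,
    following the list above."""
    replica.vn += 1
    replica.sc = cv.count("1")
    replica.sv = cv
    # AdjustThreshold(), inlined: the piecewise formulas of section III.
    if beta <= 1:
        replica.rt = replica.wt = math.ceil(replica.sc / 2)
    else:
        replica.wt = math.ceil(beta / (1 + beta) * replica.sc)
        replica.rt = math.ceil(replica.sc / (1 + beta))
    if replica.rt + replica.wt == replica.sc:
        replica.wt += 1
    replica.sviu = True

r = Replica()
on_commit(r, "111110", beta=3.0)
print(r.vn, r.sc, r.rt, r.wt)  # 1 5 2 4
```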
AdjustThreshold() is a procedure that configures WT and RT as described by the equations in section III. The dynamic nature of the algorithm allows reconfiguration of the thresholds RT and WT as the Site Cardinality (SC) changes.
MakeCurrent procedure
MakeCurrent is used after the occurrence of a reunion. Its purpose is to make current any copy that missed previous updates. Let A denote the node that performs the MakeCurrent procedure. Node A broadcasts a cardinality_request message. Upon receiving that message, each node responds with a cardinality_response message along with its VN and WT. Node A collects all the responses and determines the following:
1. MVN: the maximum VN.
2. N: how many copies hold the MVN.
3. MWT: the WT that corresponds to a copy with version number equal to MVN.
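The collection of MVN, N, and MWT, and the decision A then takes on them (described next in the text), can be sketched as follows; representing each cardinality_response as a (vn, wt) pair is our assumption, as are the names.

```python
def make_current_decision(responses: list) -> str:
    """Decide the outcome of MakeCurrent from the collected
    cardinality_response messages, each given as a (vn, wt) pair.

    Returns "modify" when the copies holding the newest version are
    numerous enough to have formed a write quorum, so that version
    is committed and can be propagated; otherwise "reset_sviu"."""
    mvn = max(vn for vn, _ in responses)                 # MVN
    n = sum(1 for vn, _ in responses if vn == mvn)       # N
    mwt = next(wt for vn, wt in responses if vn == mvn)  # MWT
    return "modify" if n >= mwt else "reset_sviu"

# Three of four copies hold version 7, whose write threshold was 3,
# so version 7 was committed and the stale copy can be brought current:
print(make_current_decision([(7, 3), (7, 3), (7, 3), (6, 3)]))  # modify
```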
If N is greater than or equal to MWT, node A broadcasts a modification message, much like a commit message. The modification message is used to update any out-of-date copy in the partition; upon receiving it, each node performs the same operations as if the message were a commit. If N is less than MWT, node A broadcasts a ResetSVIU message. Upon receiving that message, each node resets the Boolean variable SVIU (SVIU = FALSE). This marks that the copies in that partition cannot form a write quorum and must therefore be handled as non-current.

V. AVAILABILITY
All algorithms adopting the quorum scheme have a weak point, and that is availability. Replication was originally introduced to improve the performance and fault tolerance of the system. However, allowing operations to be performed in only a single partition decreases the availability of the object; on the other hand, this restriction is required to ensure consistency. Therefore, the greater the availability, the better the algorithm. Availability is considered a very important measurement when designing and analysing a replication protocol.
Figure 1: Total availability versus β for (a) α=6, (b) α=1, (c) α=0.1, (d) α=0.5. Our protocol (KOT_TO), [3]'s protocol (JAJ_TO), [1]'s protocol (GIF_TO).
VI. SIMULATION
The algorithm has been tested using a simulation program and compared with the algorithm in [3] and with a non-dynamic algorithm, that in [1]. Our simulation uses a Poisson model for producing events such as read/write operations and partitions/reunions. In our model, availability is the conditional probability that an issued operation can be performed; it depicts the proportion of accepted operations. We count the total number of read and write operations issued in a specific time interval and mark those reads and writes that are actually performed. The ratio of the number of operations executed to the total number of operations offered during a given observation interval provides an estimate of the availability. In Figure 1, we present a comparison between the availability provided by our algorithm (KOT_TO) and the algorithms described in [3] (JAJ_TO) and [1] (GIF_TO). The factor α could be defined as the reunion-to-partition ratio; defined that way, α could not be greater than 1, since a reunion is meaningful only after the occurrence of a partition. α is therefore defined as a factor indicating how many times greater the probability of a reunion after partitioning is than the probability of an additional partitioning. The factor β is defined as the read-to-write ratio. The simulation is implemented for a network of 50 nodes. Partitions and reunions occur according to the factor α, and reads and writes occur according to the factor β. [3]'s algorithm provides dynamic adjustment, but it keeps about the same level of availability for both read and write operations. [1]'s algorithm does not provide dynamic adjustment at all, so there is no reconfiguration that would permit updates when the number of available nodes keeps decreasing at a constant rate.
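The availability estimate described above (the ratio of accepted operations to issued operations over an observation interval) can be sketched as a simple counter; the class and its names are illustrative, not the paper's simulation code.

```python
class AvailabilityMeter:
    """Estimate availability as the fraction of issued operations
    (reads and writes) that the protocol accepted during the
    observation interval."""

    def __init__(self) -> None:
        self.issued = 0
        self.accepted = 0

    def record(self, performed: bool) -> None:
        """Count one issued operation and whether it was performed."""
        self.issued += 1
        if performed:
            self.accepted += 1

    @property
    def availability(self) -> float:
        return self.accepted / self.issued if self.issued else 0.0

m = AvailabilityMeter()
for ok in [True, True, False, True]:  # 3 of 4 operations accepted
    m.record(ok)
print(m.availability)  # 0.75
```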
Our algorithm exploits the difference between read rate and write rate, giving a total availability greater than that of both [3] and [1]. It is worth noting that in our approach the availability increases as the read-to-write ratio β increases. This is because the aim of our algorithm is to allow low-cost operations (such as reads) issued at large rates to be performed easily, whereas high-cost operations (such as writes) are performed rarely. Under this strategy, the majority of the operations issued are executed. This makes our algorithm suitable for applications in which the read-to-write ratio is well above 1. Such applications may be found in the areas of network management and real-time fault-tolerant systems [8].

VII. CONCLUSION
In this paper we have described a consistency algorithm for controlling replicated objects. Our algorithm achieves greater total availability than those of [1,3]. This is obtained by exploiting the difference between read rate and write rate. Achieving such a difference requires a replication policy that replicates those objects that are updated rarely and read frequently.
REFERENCES
[1] Gifford, David K., "Weighted Voting for Replicated Data", Proceedings of the Seventh Symposium on Operating Systems Principles, ACM, 1979, pp. 150-162.
[2] Paris, Jehan-Francois, "Voting with a Variable Number of Copies", IEEE International Symposium on Fault Tolerant Computing, IEEE, 1986, pp. 50-55.
[3] Jajodia, Sushil and Mutchler, David, "A Pessimistic Consistency Control Algorithm for Replicated Files which Achieves High Availability", IEEE Transactions on Software Engineering, Vol. 15, No. 1, January 1989, pp. 39-46.
[4] Agrawal, D. and El Abbadi, A., "The Generalised Tree Quorum Protocol: An Efficient Approach for Managing Replicated Data", ACM Transactions on Database Systems, Vol. 17, No. 4, December 1992, pp. 689-717.
[5] Barbara, D., Garcia-Molina, H., and Spauster, A., "Mutual Exclusion in Partitioned Distributed Systems", Distributed Computing, 1 (1986), pp. 119-132.
[6] Bernstein, Philip A., Hadzilacos, Vassos, and Goodman, Nathan, "Concurrency Control and Recovery in Database Systems", Addison-Wesley, 1987.
[7] Skeen, D., "A Quorum Based Commit Protocol", Proceedings of the 6th Berkeley Workshop on Distributed Data Management and Computer Networks, ACM/IEEE, February 1982, pp. 69-80.
[8] Kotsakis, E. G. and Pardoe, B. H., "Replication of Management Objects in a Distributed MIB", Proceedings of the International Conference on Telecommunications, April 1996, Istanbul.