Continuous Replica Placement Schemes in Distributed Systems

Thanasis Loukopoulos
Department of Computer Science, Hong Kong University of Science and Technology, Clearwater Bay, Hong Kong
[email protected]

Petros Lampsas
Department of Computer and Communication Engineering, University of Thessaly, Volos, Greece
[email protected]

Ishfaq Ahmad
Department of Computer Science and Engineering, The University of Texas at Arlington, Texas, USA
[email protected]

ABSTRACT
The Replica Placement Problem (RPP) aims at creating a set of duplicated data objects across the nodes of a distributed system in order to optimize certain criteria. Typically, RPP formulations fall into two categories: static and dynamic. The first assumes that access statistics are estimated in advance and remain static, and, therefore, a one-time replica distribution is sufficient (1RPP). In contrast, dynamic methods change the replicas in the network potentially upon every request. This paper proposes an alternative technique, named the Continuous Replica Placement Problem (CRPP), which falls between the two extreme approaches. CRPP can be defined as: given an already implemented replication scheme and estimated access statistics for the next time period, define a new replication scheme, subject to optimization criteria and constraints. As we show in the problem formulation, CRPP is different in that the existing heuristics in the literature cannot be used either statically or dynamically to solve it. In fact, even with the most careful design, their performance will be inferior, since CRPP embeds a scheduling problem to facilitate the proposed mechanism. We provide insight into the intricacies of CRPP and propose various heuristics.

Categories and Subject Descriptors
C.2.4 [Distributed Systems], D.4.2 [Storage Management].

General Terms
Algorithms, Performance, Experimentation.

Keywords
Replica placement, allocation, scheduling, heuristics, greedy method, content distribution networks, Grid, video allocation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'05, June 20–22, Boston, MA, USA. Copyright 2005, ACM 1-59593-167-8/06/2005…$5.00.

1. INTRODUCTION
Distributed systems often use replication in order to increase their availability, performance and fault tolerance. Typical examples include distributed file systems [27], distributed databases [1], [28], video servers [5], [18], content distribution networks (CDNs) [2] and the Grid [11]. Tackling data management issues, e.g., how replicas of objects are created, distributed, accessed and updated, is critical to the success of the above systems. Related to data management is the replica placement problem (RPP), also known in the literature as the file allocation problem [9]. Although the first formulations date back to the early 70s [8], with the advance of distributed systems spanning WANs, interest in RPP was renewed (see [14], [26] for some recent publications). A generic RPP formulation can be summarized as: given a set of M servers and N objects, find the object-server allocation (allowing multiple replicas of the same object) that optimizes certain performance criteria.

Typically, RPP formulations fall into two categories: static and dynamic (see the related work section). Most static RPP variations assume that access statistics do not change, and hence the replication scheme needs to be computed only once and remains as is for a large time period. For this reason, they usually do not consider the cost of creating replicas, since it is amortized over a large time period. We call such formulations 1RPP, e.g., [5], [8], [9], [12], [13], [17], [20], [23], [26], [27]. Dynamic formulations, on the other hand, e.g., [3], [24], [28], change the replication scheme potentially upon every request. Intuitively, they are most useful when the objects considered for replication are relatively small in size and replicas are created at points along the request path [28]. This is what happens, for instance, in Web proxy caching. On the other hand, if the objects are of large size and availability is of concern, dynamic schemes become less useful, e.g., distributed video servers [5].

Summarizing, we can say that static solutions act as push-based prefetching schemes, while dynamic ones are closer to pull-based methods. Even if 1RPP solutions are useful in order to guarantee some minimum availability requirements, this does not mean they should remain unchanged. The indirect assumption made so far in the literature is that whenever we need to recalculate the replication scheme, presumably due to changes in user preferences, we can rerun one of the 1RPP algorithms and obtain a new solution. Here, we demonstrate that this approach is inferior, since it does not consider in depth the difficulties associated with performing the necessary replica transfers in order to move from one replication scheme to another. Therefore, we propose an extension to 1RPP formulations, called the Continuous Replica Placement Problem (CRPP), that allows for more frequent updates on the replication scheme. The extension brings to light an underlying scheduling problem that offers new optimization opportunities. Our contributions include the following: i) we formulate CRPP and identify the underlying intricacies of the problem, ii) we illustrate the scheduling sub-problem and propose heuristics to solve it, iii) we demonstrate how to modify existing 1RPP algorithms in order to make them work for CRPP, iv) we introduce a new heuristic alternative called ReplicaEstimation that achieves a good trade-off between execution time and solution quality, v) motivated by the inherent difficulties of the algorithms in estimating the actual cost of replica creation, we propose a two-phase optimization method that is shown to achieve the best performance overall.

The rest of the paper is organized as follows. Section 2 formulates CRPP. Section 3 illustrates the heuristics, while Section 4 presents the performance evaluation. An overview of the related work is included in Section 5. Finally, Section 6 offers some concluding remarks and future work directions.

2. PROBLEM FORMULATION
In this section we formalize CRPP as an integer programming problem. We begin by presenting some assumptions on the system model and proceed with the formulation.

2.1 System Model
Consider a generic distributed system (Fig. 1) consisting of M servers. Let S_i and s(S_i) be the name and the total storage capacity of server i, where 1 ≤ i ≤ M. The M servers of the system are interconnected through a communication network, and the communication cost between two servers S_i and S_j, denoted by l(i, j), is the cumulative cost of the shortest path between the two nodes (e.g., the total number of hops). We assume that the values of l(i, j) are known a priori, that l(i, j) = l(j, i) ∀i, j, and that l(i, i) = 0 ∀i.

The servers of the system store N objects, denoted by O_k. Let s(O_k) be the size of object O_k. We assume that every object has at least one stored replica. We call this replica the primary one, and the server hosting it the primary server for the object. Furthermore, we assume that primary replicas do not change their position over time. Let P_k denote the primary server for O_k. Each primary server P_k contains information about the whole replication scheme of O_k. This can be done by maintaining a list of the servers at which the kth object is replicated, called from now on the replicators of O_k. Moreover, every server S_i stores a two-field record for each object. The first field is its primary server P_k, and the second the nearest server N(i, k) of S_i which holds a replica of O_k. In other words, N(i, k) is the server for which the requests for O_k arriving at S_i, if served there, would incur the minimum possible communication cost. It is possible that N(i, k) = S_i. Another possibility is that N(i, k) = P_k, if the primary server is the closest one holding a replica of O_k. Consequently, the overall functionality of the system may be summarized as follows. A client issues a request towards one of the servers (step (1) in Fig. 1). We will call this server the first hop server. In case the requested object is not replicated at the first hop server, the client request is redirected to the corresponding nearest server N(i, k) (steps (2) and (3) in Fig. 1).

Figure 1. A generic distributed system using replication.

2.2 Cost Calculation
We distinguish two kinds of generated traffic, one due to reads and another due to updates. Client requests that are satisfied by a first hop server account for no cost, while for the rest the cost is proportional to the size of the data transferred (not always equal to the object size) and the communication cost between the first hop server S_i and N(i, k). Updates for O_k are handled in the following manner: i) the primary replica is updated at no cost, ii) P_k updates the remaining replicas of O_k at a cost proportional to the link cost and the size of the data transferred (step (4) in Fig. 1).

Let X be an M × N binary matrix (the replication scheme), whereby an element X_ik equals 1 if S_i is a replicator of O_k and 0 otherwise. Let r_ik denote the total number of bytes that must be transferred from S_i in order to satisfy local requests for O_k (over a time period T). We can calculate the total read cost for O_k (denoted by R_k) using the following:

R_k = Σ_{i=1}^{M} (1 − X_ik) l(i, N(i, k)) r_ik

Let u_k be the total size of data transferred from P_k to each replicator of O_k due to updates. The total update cost U_k, due to P_k updating all the replicators of O_k, is given by:

U_k = Σ_{i=1}^{M} X_ik l(i, P_k) u_k

Let C denote the total cost due to both reads and updates. Clearly:

C = Σ_{k=1}^{N} (R_k + U_k)    (1)

The aim is to minimize the overall traffic. Therefore, we can define 1RPP as: find the values in matrix X that minimize function C subject to constraints, e.g., the storage capacity constraint:

Σ_{k=1}^{N} X_ik s(O_k) ≤ s(S_i)    ∀S_i

Notice that the resulting optimization problem is useful for deciding the initial replication scheme. For this reason (1) does not include the cost of replica creation, since it will be amortized over a large time period. In the related literature it is assumed that whenever replica redistribution is necessary, we can rerun the same algorithms that solve 1RPP and define the new allocation. However, as we demonstrate in the rest of the paper, this is by no means trivial. In the sequel, we define CRPP.
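The read/update cost model above can be sketched in a few lines of Python. The topology, request volumes, and placement below are hypothetical toy values (not taken from the paper's evaluation); the functions mirror R_k, U_k, and Eq. (1) directly.

```python
M, N = 3, 2
l = [[0, 1, 2],
     [1, 0, 1],
     [2, 1, 0]]                  # l(i, j): symmetric link costs, zero diagonal
X = [[1, 0], [0, 1], [1, 1]]     # X[i][k] = 1 iff S_i holds a replica of O_k
P = [0, 1]                       # P[k]: primary server of O_k
r = [[5, 3], [2, 7], [4, 1]]     # r[i][k]: bytes requested at S_i for O_k
u = [2, 1]                       # u[k]: bytes pushed per update to each replicator

def nearest(i, k):
    """N(i, k): the replicator of O_k closest to S_i (may be S_i itself)."""
    return min((j for j in range(M) if X[j][k]), key=lambda j: l[i][j])

def total_cost():
    """C = sum over k of (R_k + U_k), Eq. (1)."""
    C = 0
    for k in range(N):
        R_k = sum((1 - X[i][k]) * l[i][nearest(i, k)] * r[i][k] for i in range(M))
        U_k = sum(X[i][k] * l[i][P[k]] * u[k] for i in range(M))
        C += R_k + U_k
    return C

print(total_cost())
```

Note that requests served locally (X_ik = 1) contribute zero read cost, exactly as the (1 − X_ik) factor dictates, while every non-primary replica pays the update-push cost from P_k.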

2.3 Continuous Replica Placement
Consider a distributed system implementing a replication scheme X^(old). Changes in read/update patterns may result in the need to define a new replication scheme X^(new). Let C^(old) and C^(new) be the costs (calculated from (1)) for X^(old) and X^(new) respectively, under the new patterns. We can define the following benefit function in order to decide whether the new replication scheme should be implemented:

B(old, new) = C^(old) − C^(new) − I(old, new)    (2)

where I(old, new) is the cost of implementing the new replication scheme given the old one.

Computing the implementation cost gives rise to a separate scheduling problem. We illustrate it through the example of Fig. 2. Assume that objects a, b, c, d all have the same size, link costs are equal to the number of hops between two servers, and all servers can store two objects, except for D which stores all of them. Clearly, the best way to move from the Old scheme to the New one is by following the 4 steps illustrated in the figure. If instead we decide to create the replicas of server B first (step (4)), then in order to create c at server A we need to fetch it from either C or D, both of which are more distant than B. Although the situation described in Fig. 2 might seem simple, the underlying scheduling problem is NP-complete in the generic case.

Figure 2. Example of implementing a new replication scheme.

In order to keep the problem formulation as simple as possible, we use the following strategy for calculating the implementation cost (Sec. 3 discusses other alternatives). Initially, all replica deletions are performed at no cost and afterwards, all replica creations are done by fetching the objects from the closest servers. Creating a replica of O_k at S_i incurs a cost equal to s(O_k) l(i, j), where S_j is the closest server to S_i for which X_jk^(old) = X_jk^(new) = 1. Notice that the worst case for the above cost is s(O_k) l(i, P_k), i.e., the cost of creating the new replica from its primary copy. Summing up the above remarks, we can calculate the implementation cost as:

I(old, new) = Σ_{i=1}^{M} Σ_{k=1}^{N} s(O_k) l(i, j) | X_ik^(new) − X_ik^(old) = 1
              ∧ X_jk^(old) = X_jk^(new) = 1
              ∧ ¬∃x: X_xk^(old) = X_xk^(new) = 1 ∧ l(i, x) < l(i, j)    (3)

The first constraint in (3) means that S_i is not a replicator of O_k in X^(old) but it is in X^(new); the second constraint is self-explanatory, while the third denotes that S_j should be the closest replicator of O_k to S_i. Thus, we can formulate CRPP as: find the entries of matrix X^(new) that maximize (2), given X^(old) and (3), without violating any constraints (e.g., capacity constraints). In the following, we illustrate the (0,1) integer programming formulation of CRPP.

2.4 (0,1) Integer Programming Formulation
We define Y_ijk^(old) and Y_ijk^(new) to be 1 iff requests for O_k arriving at S_i are satisfied by S_j in X^(old) and X^(new) respectively, and 0 otherwise. X_ik^(old) and X_ik^(new) are 1 iff S_i is a replicator of O_k in X^(old) and X^(new) respectively, and 0 otherwise. Finally, let Z_ijk be 1 if the replica of O_k at S_i is created from S_j, and 0 otherwise. The objective is to minimize the following target function:

min f = Σ_{i,j,k} r_ik l(i, j) Y_ijk^(new) + Σ_{i,k} u_k l(i, P_k) X_ik^(new)
        + Σ_{i,j,k} s(O_k) l(i, j) Z_ijk
        − [ Σ_{i,j,k} r_ik l(i, j) Y_ijk^(old) + Σ_{i,k} u_k l(i, P_k) X_ik^(old) ] d

Notice that Y_ijk^(old) and X_ik^(old) are not variables but constants computed directly from matrix X^(old). Therefore, Σ_{i,j,k} r_ik l(i, j) Y_ijk^(old) + Σ_{i,k} u_k l(i, P_k) X_ik^(old) is a constant and accounts for only one dummy variable (represented by d) which is always set to 1. The constraints, together with their explanation, follow:

Σ_k s(O_k) X_ik^(new) ≤ s(S_i)    ∀i    (4)

X_jk^(new) ≥ Y_ijk^(new)    ∀i, j, k    (5)

Σ_j Y_ijk^(new) = 1    ∀i, k    (6)

X_{P_k,k}^(new) = 1    ∀k    (7)

Σ_j Z_ijk = X_ik^(new)    ∀i, k    (8)

X_jk^(new) ≥ Z_ijk    ∀i, j, k    (9)

X_jk^(old) ≥ Z_ijk    ∀i, j, k    (10)

d = 1    (11)

Y_ijk^(new), X_ik^(new), Z_ijk = 0 ∨ 1    ∀i, j, k    (12)
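As a concrete companion to Eq. (3), the following sketch computes I(old, new) by charging each newly created replica the cost of fetching the object from the nearest server that is a replicator in both schemes (deletions are free, as in the text). The scheme matrices, object sizes, and link costs are hypothetical toy values; the donor set is never empty because primary replicas do not move.

```python
M, N = 3, 2
l = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]   # l(i, j): link costs
s = [10, 4]                              # s(O_k): object sizes in bytes
X_old = [[1, 1], [0, 0], [0, 1]]
X_new = [[1, 0], [1, 1], [0, 1]]         # S_1 gains both objects, S_0 drops O_1

def impl_cost(X_old, X_new):
    """I(old, new), Eq. (3): each created replica is fetched from the
    nearest server that is a replicator in BOTH schemes; deletions cost 0."""
    cost = 0
    for i in range(M):
        for k in range(N):
            if X_new[i][k] == 1 and X_old[i][k] == 0:        # creation at S_i
                donors = [j for j in range(M)
                          if X_old[j][k] == 1 and X_new[j][k] == 1]
                # donors is non-empty since primary replicas never move
                cost += s[k] * min(l[i][j] for j in donors)
    return cost

print(impl_cost(X_old, X_new))
```

This is the I2 calculation of Sec. 3.3; replacing the `min` over donors with `l[i][P[k]]` would give the worst-case create-from-primary cost instead.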

(4) represents the storage capacity constraint. (5) and (6) represent the read-from-the-nearest policy. (5) means that the server S_j which satisfies the requests from S_i for O_k should have a replica of O_k. Notice that we cannot write X_jk^(new) = Y_ijk^(new), since this does not capture the case where S_j is a replicator of O_k but S_i does not send requests to it. (6) declares that only one replicator is selected to fetch the object from. The requirement that S_j should be the nearest replicator of O_k to S_i is not included as a constraint, but is implied by the minimization of the target function f. More clearly, in order for f to be minimum, Σ_{i,j,k} r_ik l(i, j) Y_ijk^(new) must be minimized, which means that the l(i, j)s must be minimum, i.e., the nearest neighbor should be selected. (7) declares that the primary server of O_k should hold a replica of it. Constraints (8), (9), (10) are related to the implementation cost. (9) and (10) mean that the server S_j from which S_i will fetch O_k must be a replicator of O_k in both the X^(new) and X^(old) matrices, while (8) says that O_k is fetched from only one server and that servers that are not replicators of O_k in X^(new) pay no cost. Notice that instead of Σ_j Z_ijk = X_ik^(new), the constraint could have been Σ_j Z_ijk ≥ X_ik^(new), but not Σ_j Z_ijk ≤ X_ik^(new) or Σ_j Z_ijk = 1. Finally, constraints (11) and (12) are self-explanatory.

As formulated, CRPP is NP-hard (the proof is based on a reduction from 1RPP, which is known to be NP-hard [13], [20]). Using linear programming (LP) relaxation one can achieve a good solution. However, the number of variables in the problem (2M²N + MN + 1) and the number of constraints (3M²N + 2MN + M + N + 1) are extremely high, making the LP relaxation applicable only to small networks. In the following section we present heuristics in order to tackle larger problem sizes.

3. HEURISTICS FOR CRPP
First, we present two algorithms based on the greedy paradigm, i.e., GreedyGlobal (GG) and Greedy1 (G1). Simpler variants of these algorithms were proposed earlier in the literature [15] in order to tackle 1RPP. Next, we propose a heuristic called ReplicaEstimation (RE) that is based on calculating an estimate of the number of replicas needed for each object. We proceed by discussing some alternatives for calculating the implementation cost and provide insight into their effects on the algorithms. Based on the observations made, we propose two-phase optimization variants of G1, GG and RE.

3.1 Greedy Heuristics
GG works in iterations. In each iteration it considers the impact of every possible single-element flip in matrix X^(old) and selects the one with the maximum benefit as computed from (2). Only elements that do not represent primary replicas and were not changed during a previous iteration are considered. If the storage capacity is exceeded, a list of objects for deallocation is constructed. The rationale is that previously beneficial replicas may no longer be as good; therefore, replacing them with a new one might have a positive effect on (2). The criterion used to construct the list is the impact of deallocation on the cost function (1). The objects that free enough storage space with the minimum possible negative impact are selected. The process ends when either the storage capacity of all servers is reached, or all element changes result in negative benefit. Fig. 3 shows a description of the algorithm in pseudocode.

GG(X^(old), X^(new)) // takes X^(old) as input and outputs X^(new)
  Bbest ← 0;
  while (Bbest ≥ 0 && server capacity not reached)
    for (all (S_i, O_k) | S_i ≠ P_k ∧ X_ik^(old) = X_ik^(new)) // i.e., not already changed in a previous iteration
      X_ik^(new) ← 1 − X_ik^(old);
      if (X_ik^(new) = 0 || (X_ik^(new) = 1 && Capacity(S_i) OK))
        Updatebest(X_ik^(new), Bbest);
      else // X_ik^(new) = 1 and capacity exceeded
        repeat deallocate object with the lowest cost coefficient until capacity is no longer violated;
        Updatebest(X_ik^(new), Bbest);
    X_besti,bestk^(new) ← 1; // and X_ik^(new) ← 0, k = 1..x if deallocations are required
  endwhile

Updatebest(X_ik^(new), Bbest)
  if (B(old, new) ≥ Bbest)
    Bbest ← B(old, new); besti ← i; bestk ← k;
  // after benefit calculation, restore X^(new)
  X_ik^(new) ← X_ik^(old); // and X_ik^(new) ← 1, k = 1..x if deallocs were required

Figure 3. Pseudocode for GreedyGlobal.

G1 is identical to GG, apart from the fact that G1 flips an element in X^(old) as soon as the benefit in (2) is positive, whereas GG additionally requires that the benefit be the maximum among all possible changes.

A subtle issue concerns the implementation cost of creating a new replica. We examine two alternatives. The first is to create from the primary, while the second is to create from the nearest replicator in X^(old). Depending on the estimation used, we refer to the GG and G1 variants as GG-prim, GG-near and G1-prim, G1-near, respectively. Getting the object from the primary server accounts for the worst case in cost terms. Intuitively, create-from-the-nearest seems a better choice. Unfortunately, the nearest server might not remain a replicator in X^(new). Thus, the estimate will only be accurate if the replica transfer is scheduled before the replica deletion, something that cannot always be guaranteed. Fig. 4 illustrates an example. Notice that although the case resembles a deadlock, in fact it is not, since the transfers can always be performed from the primary servers. It demonstrates, however, that obtaining a replica from the nearest server is not always feasible.

Figure 4. Example of a deadlock-like case.

On top of deadlock-like cases, any estimate of the implementation cost will likely be inaccurate, since a closer server that was not a replicator in X^(old) might become one in X^(new). This case cannot be tackled, since it would require the algorithm to know its output in advance.

A final note concerns the algorithms' complexity. In the worst case, where all the elements of X^(old) change, GG performs O(M²N²) evaluations of (2), while G1 performs only O(MN). We should note that the cost calculation in (1) can be done efficiently by maintaining an M × N matrix storing the cost coefficients of each change. Whenever an element changes, only the columns affected by the change need to be updated.
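A minimal, self-contained sketch of the GG iteration follows, using the create-from-primary (GG-prim) implementation-cost estimate. The deallocation-list handling of Fig. 3 is omitted for brevity (capacity-violating flips are simply rejected), and all instance data are hypothetical.

```python
import copy

M, N = 3, 2
l = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]    # link costs (hops)
s_obj = [4, 3]                            # object sizes
cap = [8, 8, 8]                           # server storage capacities
P = [0, 1]                                # primary servers; primaries never flip
r = [[5, 3], [2, 7], [4, 1]]              # r[i][k]: read bytes at S_i for O_k
u = [2, 1]                                # u[k]: update bytes per replicator

def cost(X):
    """Total read + update traffic, Eq. (1)."""
    C = 0
    for k in range(N):
        for i in range(M):
            if X[i][k]:
                C += l[i][P[k]] * u[k]    # update push from P_k
            else:
                C += min(l[i][j] for j in range(M) if X[j][k]) * r[i][k]
    return C

def gg(X_old):
    """Apply the maximum-benefit single flip while the benefit stays >= 0."""
    X = copy.deepcopy(X_old)
    while True:
        best, best_ik = 0, None
        for i in range(M):
            for k in range(N):
                if i == P[k] or X[i][k] != X_old[i][k]:
                    continue              # skip primaries and already-flipped entries
                X[i][k] ^= 1              # tentative flip
                if sum(s_obj[q] * X[i][q] for q in range(N)) <= cap[i]:
                    impl = s_obj[k] * l[i][P[k]] if X[i][k] else 0  # GG-prim estimate
                    b = cost(X_old) - cost(X) - impl                # Eq. (2)
                    if b >= best:
                        best, best_ik = b, (i, k)
                X[i][k] ^= 1              # undo tentative flip
        if best_ik is None:
            break
        X[best_ik[0]][best_ik[1]] ^= 1    # commit the best flip
    return X

print(gg([[1, 0], [0, 1], [0, 0]]))
```

Because each entry may flip at most once, the loop terminates after at most MN commits, matching the O(M²N²) benefit-evaluation bound stated above.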

3.2 Replica Estimation Heuristic
For each O_k, RE estimates the number of required replicas E_k. It then uses this estimate to create one replica for each object (in order of decreasing E_k). After one replica for every object has been created, the E_k s are recalculated. The pseudocode is as follows:

estimate the number of replicas E_k for each O_k;
while (B(old, new) > 0 && servers' capacity not reached)
  sort objects according to E_k;
  ∀k | E_k > 0: create one replica in the most beneficial server;
  recalculate the E_k s;
endwhile

Calculating the estimator. Let p_k denote the per-byte total popularity of O_k. Let r_k = Σ_i r_ik, i.e., the total number of bytes read for O_k. First assume no updates exist. Then we can write:

p_k = r_k / (s(O_k) Σ_x r_x)

Let E be the total number of expected replicas. E is given by E = Σ_i s(S_i) / s̄(O), where s̄(O) is the average object size. The rationale behind the calculation of E is that the only thing that restricts replica creation (with no updates present) is the storage capacity. E_k can then be calculated as:

E_k = p_k E / Σ_x p_x    (13)

By including updates in the estimator we come up with the following equations:

p_k = (r_k − u_k) / (s(O_k) Σ_x (r_x − u_x))    (14)

E = min{ Σ_i s(S_i) / s̄(O), r_k / u_k }    (15)

E_k is then calculated again using (13). The rationale behind (15) is that both the storage capacity and the updates constrain the number of replicas; therefore, the constraint that is more dominant should dictate the result. (13), (14) and (15) give good replica estimates when objects exhibit the same popularities at all servers. They fail, though, when the opposite is true. In order to remedy the situation, RE creates the object replicas one by one and updates the estimator after each creation.

3.3 Scheduling Heuristics
The implementation cost of moving from X^(old) to X^(new) depends on how replica deletions and creations are scheduled. The simplest approach is the following: (i) perform all necessary replica deletions first, (ii) create all replicas by fetching the object from the primary server. We call this approach I1. We have already presented another alternative in Sec. 2.3, which we briefly repeat here: (i) perform all necessary deletions first, (ii) create all replicas by fetching the objects from the closest replicators that remain. (3) gives the cost of this method, which we call I2. Notice the difference between the implementation cost estimation discussed in Sec. 3.1 and the calculation here. The former is used by the CRPP heuristics in order to decide whether a particular replica should be created or not, and involves imperfect knowledge of the location from which the object will eventually be fetched. The calculation, on the other hand, takes X^(new) as input and, given a scheduling policy, outputs the actual cost of implementing the new scheme.

Fig. 5 presents the pseudocode of another scheduling alternative, called I3. Let A_ik denote the number of servers that use S_i in order to satisfy requests for O_k, i.e., S_i is the closest replicator of O_k for them. The algorithm starts by creating all replicas in X^(new) from the closest replicator, until no more replica creations can be performed due to storage capacity limitations. In order to continue, replica deletions must be performed. The algorithm orders the servers by the number of times they appear as closest replicators over all (server, object) pairs, i.e., Σ_k A_ik. The one with the smallest value (call it S_x) is selected and, within it, the object with the smallest A_xk is deleted. The rationale behind this approach is that deleting replicas in servers that are not "good" as nearest replicators will minimally affect the implementation cost, since most servers use less distant replicators. Having freed enough storage space, S_x creates replicas until either its

ImplementationCostCalculation3(X^(old), X^(new))
  A_ik ← 0 ∀i, k; ImplCost ← 0;
  // we use X^(temp) to keep track of the scheduling process
  // X_ik^(temp) = 1 iff the relevant replica creation/deletion was scheduled
  X_ik^(temp) ← 1 ∀i, k | X_ik^(old) = X_ik^(new) = 1, 0 otherwise;
  while (server capacities not reached)
    create replicas from closest neighs; // updating X^(temp) and ImplCost
  endwhile
  while (∃ X_ik^(temp) = 0) // X^(new) not reached
    for (all (S_i, O_k) pairs)
      if (X_ik^(new) = X_ik^(temp) = 0 && X_ik^(old) = 1) // S_i loses the replica but this is not scheduled yet
        calculate A_ik; // A_ik = 0 for (S_i, O_k) that do not satisfy the if-condition
    sort servers in ascending order according to Σ_k A_ik;
    while (replica cannot be created) // due to capacity constraints
      select S_x | min Σ_k A_xk and delete the object with min A_xk;
      update X^(temp);
    endwhile
    create replica from closest neigh; // updating X^(temp) and ImplCost
  endwhile
  return ImplCost;

Figure 5. Pseudocode for I3 scheduling.
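The I1 and I2 calculations of this section can be contrasted in a short sketch. The instance below is hypothetical; primaries are assumed to keep their replicas in both schemes, so a surviving donor always exists for I2.

```python
M, N = 3, 2
l = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]    # link costs
s = [10, 4]                               # object sizes
P = [0, 1]                                # primary servers (replicas kept in both schemes)
X_old = [[1, 0], [1, 1], [0, 1]]
X_new = [[1, 0], [1, 1], [1, 1]]          # one creation: O_0 at S_2

def schedule_cost(X_old, X_new, policy):
    """Deletions first (free), then creations.
    I1: fetch every new replica from the primary server.
    I2: fetch from the closest server that is a replicator in both schemes."""
    cost = 0
    for i in range(M):
        for k in range(N):
            if X_new[i][k] and not X_old[i][k]:       # creation needed at S_i
                if policy == "I1":
                    cost += s[k] * l[i][P[k]]
                else:                                  # "I2"
                    donors = [j for j in range(M)
                              if X_old[j][k] and X_new[j][k]]
                    cost += s[k] * min(l[i][j] for j in donors)
    return cost

print(schedule_cost(X_old, X_new, "I1"), schedule_cost(X_old, X_new, "I2"))
```

In this toy instance the single creation of O_0 at S_2 is cheaper under I2 because the surviving non-primary replicator S_1 is closer than the primary S_0; I3 would further interleave creations and deletions to preserve such close donors for as long as possible.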

implementation costs do not differ. Finally, notice that X^(old) never changes, which means that G1-prim always uses X^(prim) as input. The alternative of using X^(best) (after suitably tuning the factor calculation) gave inferior results. We presented two-phase optimization using G1-prim as a basis. The case of G1-near is symmetric. The only thing that changes is that the implementation cost estimation may now start as an underestimation of the actual cost (Iestim1 we reset it to 1 if (B(old, new) > B(old, best)) X^(best) ← X^(new); else break; while (ratio