


A Distributed Algorithm for the Replica Placement Problem

Sharrukh Zaman, Student Member, IEEE, and Daniel Grosu, Senior Member, IEEE

Abstract—Caching and replication of popular data objects contribute significantly to the reduction of the network bandwidth usage and the overall access time to data. Our focus is to improve the efficiency of object replication within a given distributed replication group. Such a group consists of servers that dedicate a certain amount of memory for replicating objects requested by their clients. The content replication problem we are solving is defined as follows. Given the request rates for the objects and the server capacities, find the replica allocation that minimizes the access time over all servers and objects. We design a distributed approximation algorithm that solves this problem and prove that it provides a 2-approximation solution. We also show that the communication and computational complexity of the algorithm is polynomial with respect to the number of servers, the number of objects, and the sum of the capacities of all servers. Finally, we perform simulation experiments to investigate the performance of our algorithm. The experiments show that our algorithm outperforms the best existing distributed algorithm that solves the replica placement problem.

Index Terms—Replication, distributed replication group, distributed algorithm, approximation algorithm.



1 INTRODUCTION

Replication of popular data objects at a server closer to the users can improve the access time for the users and reduce the network bandwidth usage as well. Replication of an object refers to maintaining a fixed copy of it for a specific time interval at a given server [1]. To efficiently use the server storage we need to replicate objects that will yield the best performance. Among different models of object replication, we consider the distributed replication group model and study the problem of replica placement within such a group.

A distributed replication group consists of several servers dedicating some storage for the replicas. A server has to serve requests from its clients and also from other servers in the group. When a server receives a request from a client, it immediately responds to the client if the object is in its local storage. Otherwise, the object is fetched from other servers within the group at a higher access cost or from the origin server, at an even higher cost, in the case no server within the group stores a replica of the object. The origin server may be the actual source of that object or, if the servers are part of a hierarchical system, the parent replicator of these servers. The access cost is the highest when an object is accessed from the origin server. The purpose of the replication group is to achieve minimum access cost over all users of the participating servers and over all objects considered for replication. Thus, the replica placement problem we are solving is defined as follows. Given the request rates for the objects and the server capacities, find the replica allocation that minimizes the access time over all servers and objects. The replica placement should consider the constraint that each server employs a limited storage capacity for replication. There are several approaches to solve this problem and hence different solutions exist in the

• The authors are with the Department of Computer Science, Wayne State University, 5143 Cass Avenue, Detroit, MI 48202. E-mail: [email protected], [email protected].

literature. In the context of the Internet, a distributed solution is more acceptable than a centralized one.

The replica placement problem we are considering here is a generalized version of the multiple knapsack problem in that it allows multiple copies of the objects to be placed in the bins, and that object profits vary with the bin and the items already inserted in the bins. Since the multiple knapsack problem is NP-hard [2], it follows that the replica placement problem is NP-hard. We design an approximation algorithm that guarantees that the total access time is within a factor of two of the optimal. The algorithm runs in polynomial time and has a communication cost that is polynomial in the number of servers, objects, and total server capacities.

1.1 Related Work

The replica placement problem we are considering has some similarities with several other optimization problems such as the generalized assignment problem [3], the multiple knapsack problem [2], the facility location problem [4], and the transportation problem [5]. The transportation problem was solved in [5] by extending the Auction Algorithm for linear network flow problems proposed in [6]. The closest work to ours is by Leff et al. [7], which presented the design of a family of distributed approximation algorithms for remote caching architectures and determined their performance by simulation. The model used in [7] assumes that servers have equal-sized caches, while in our proposed model and algorithm we eliminate this restriction. Although approximation algorithms are proposed, the authors of [7] do not provide and prove theoretical bounds on the approximation ratios of their algorithms. We provide a theoretical proof of the approximation ratio of our proposed algorithm. Laoutaris et al. [1] extended the model from [7] considering caches of different sizes and a setting where servers act


selfishly. They showed that the selfish behavior of the servers leads to a Nash equilibrium [8] and determined the price of anarchy induced by the selfish behavior. Although the servers can act selfishly, they have to communicate the replication decision in each iteration. Each server has to know the request rate for all objects from all servers in the initial phase. The servers have to go through multiple rounds to converge to the best possible solution. Our algorithm synchronizes the object placement decisions to achieve a solution close to the optimal. It achieves this performance without requiring more communication overhead than the algorithm presented in [1].

Some other papers (e.g., [9], [10]) also studied the game-theoretic aspect of the problem of caching and replication assuming selfish behavior of the servers. The problem of selfish caching was investigated in [11]. The object placement problem was also studied in [12], where approximation algorithms for object placement in networks modeled as general graphs were proposed. Khan and Ahmad [13] performed an extensive performance evaluation of several replication algorithms. Optimal placement of transparent en-route caches (TERCs) was studied in [14]. TERCs are caches placed along the paths from clients to servers. They work without requiring the clients or servers to be aware of them. Qiu et al. [15] studied different algorithms to place a maximum of k replicas of each object in a content distribution network, where k is determined beforehand and is given as an input to the algorithms. They showed by simulation that the greedy algorithm provides the closest solution to the optimum. The idea of utilizing neighbor caches to reduce the requests to parent proxies was explored in [16]. Centralized and distributed approximation algorithms for data caching in ad hoc networks were proposed in [17]. Kumar and Norris [18] proposed an improvement over the LRU algorithm by introducing a quasi-static portion of the cache. Rabinovich et al. [19] proposed protocols for cooperative caching among Internet Service Providers. They considered a scenario where servers cooperatively cache objects and the cost to access objects from the servers can be larger than the cost to fetch them directly from the Internet. Baev and Rajaraman [20] showed that the data placement problem in arbitrary networks is MAXSNP-hard for objects of uniform size. For non-uniform size objects, they proved that no polynomial-time approximation scheme exists unless P = NP. They also designed a 20.5-approximation algorithm for the former problem. Recently, Baev et al. [21] presented a 10-approximation algorithm for the data placement problem. Moscibroda and Wattenhofer [4] developed a distributed approximation algorithm for the facility location problem. Data replication in grids is addressed in [22], [23]. Research has also been conducted in the area of multicast replication [24] and file system replication [25]. Other recent work on replication evaluated different architectures [26], used artificial intelligence techniques [27], and proposed replicating web services at the operating system level [28]. An implementation of a content delivery network (CDN) is presented in [29]. The CDN is implemented with users' computers and provides replication solutions to ensure content availability, low user latency, and fault tolerance.


Complexity results for problems ranging from the knapsack problem to the generalized assignment problem (GAP) are given in [2]. The replica placement problem we are considering is a general case of the multiple knapsack problem, which is NP-hard [2].

1.2 Our Contribution

We design a distributed approximation algorithm that solves the replica placement problem. We show that the communication and computational complexity of the algorithm is polynomial in the number of servers and objects and the sum of the server capacities. The closest work to ours [7] proposed distributed approximation algorithms for the replica placement problem and investigated them by simulation, but no theoretical proofs of the approximation ratios have been provided. We prove that our algorithm is a 2-approximation algorithm. We conducted extensive simulation experiments to compare the performance of our algorithm with that of the best distributed algorithm provided in [7]. In these experiments, our proposed algorithm performs better than the best existing distributed algorithm in more than 97.28% of the cases. We also compare the performance of our algorithm with that of a centralized algorithm based on A-Star search [13] that produces near-optimal solutions but suffers from excessive running time. Our algorithm exhibited only a 1% degradation in performance compared to the centralized algorithm.

1.3 Organization

The rest of the paper is organized as follows. In Section 2, we describe the replica placement problem and the system model. In Section 3, we describe the proposed distributed approximation algorithm that solves the replica placement problem. In Section 4, we analyze the complexity and show the approximation guarantees of our algorithm. In Section 5, we analyze the performance of our algorithm by simulation. In Section 6, we conclude the paper and discuss future research directions.

2 REPLICA PLACEMENT PROBLEM

In this section, we formally define the replica placement problem we are solving. We use the system model described in [1], with different notation. We consider that the replication group is composed of m servers s_1, ..., s_m with capacities c_1, ..., c_m. There are n unit-sized objects o_1, ..., o_n that will be placed in the server caches in order to achieve the minimum possible access cost over all objects. The access costs are determined by the location and the request rates of the objects. We assume that a server can access an object with a cost of t_l if it is stored in its own cache. The cost becomes t_r when it has to access another replicator's cache to fulfill its client's request. The highest access cost, t_s, is incurred if that particular object is not stored at any server in the group and it has to be accessed from the origin or source of that object. Obviously, t_l ≤ t_r ≤ t_s. The motivation behind choosing this model is that distributed replication groups are effective when there is a high degree of proximity among the servers [1]. An example is a replication group composed of servers belonging to different departments



and offices in a university. In such a setting, we can consider the access costs among the servers in the replication group to be equal and the distance to the origin server to be much larger than the distances among the servers in the replication group.

A server s_i knows the request rates r_ij, j = 1, ..., n, of its local users for all objects. We denote by r_i = (r_i1, r_i2, ..., r_in) the vector of request rates of the users at server s_i, and by r = (r_1, r_2, ..., r_m)^T the m × n matrix of request rates of all the objects at all servers. We denote by X the placement matrix, an m × n matrix whose entries are given by:

$$X_{ij} = \begin{cases} 1, & \text{if object } o_j \text{ is replicated at server } s_i \\ 0, & \text{otherwise} \end{cases}$$

for i = 1, ..., m and j = 1, ..., n. The system's goal is to minimize the access time at each server over all objects, that is:

$$\min \sum_{i=1}^{m} \left( \sum_{j:\, X_{ij}=1} r_{ij}\, t_l \;+ \sum_{\substack{j:\, X_{ij}=0 \\ \text{and}\ rc_j>0}} r_{ij}\, t_r \;+ \sum_{j:\, rc_j=0} r_{ij}\, t_s \right) \qquad (1)$$

subject to:

$$X_{ij} \in \{0,1\}, \quad i = 1, \ldots, m;\ j = 1, \ldots, n \qquad (2)$$

$$\sum_{j=1}^{n} X_{ij} \le c_i, \quad i = 1, \ldots, m \qquad (3)$$

where rc_j = ∑_{i=1}^{m} X_ij is the "replica count" of object o_j. The first term of the objective function represents the access time corresponding to the objects that are stored locally at server s_i. The second term represents the access time corresponding to the objects that are not stored locally at server s_i but are cached in one of the servers belonging to the replication group. The third term represents the access time for the objects that are not cached at any of the servers. The first constraint says that an object o_j is either allocated or not allocated to server s_i. The second constraint, which is the capacity constraint, says that the number of objects allocated to server s_i should not exceed the capacity c_i of server s_i.

The above minimization problem can be translated into an equivalent maximization problem in which we maximize the overall gain in the access time obtained by replicating the objects. The overall gain in access time is given by the difference between the total access time over all objects if no objects are replicated (∑_{i=1}^{m} ∑_{j=1}^{n} r_ij t_s) and the total access time obtained by replication (given by the objective function in equation (1)). Thus, the equivalent maximization problem is as follows:

$$\max \sum_{i=1}^{m} \left( \sum_{j:\, X_{ij}=1} r_{ij}(t_s - t_l) \;+ \sum_{\substack{j:\, X_{ij}=0 \\ \text{and}\ rc_j>0}} r_{ij}(t_s - t_r) \right) \qquad (4)$$

subject to constraints (2) and (3). The first term of the objective function represents the gain obtained by caching the objects locally at server s_i, while the second term represents the gain obtained by caching the objects at other servers within the replication group.

To understand the design of our algorithm we rewrite the objective function in equation (4) as follows. Since (t_s − t_l) can be written as (t_r − t_l) + (t_s − t_r), we split the first term to obtain the following equivalent expression:

$$\sum_{j:\, X_{ij}=1} r_{ij}(t_r - t_l) \;+ \sum_{j:\, X_{ij}=1} r_{ij}(t_s - t_r) \;+ \sum_{\substack{j:\, X_{ij}=0 \\ \text{and}\ rc_j>0}} r_{ij}(t_s - t_r) \qquad (5)$$

Since X_ij = 1 implies rc_j > 0, the union of the sets {j : X_ij = 1} and {j : X_ij = 0 and rc_j > 0} is the set {j : rc_j > 0}. Equation (5) is thus equivalent to

$$\sum_{j:\, X_{ij}=1} r_{ij}(t_r - t_l) \;+ \sum_{j:\, rc_j>0} r_{ij}(t_s - t_r)$$

This leads to an equivalent maximization problem defined as:

$$\max \sum_{i=1}^{m} \left( \sum_{j:\, X_{ij}=1} r_{ij}(t_r - t_l) \;+ \sum_{j:\, rc_j>0} r_{ij}(t_s - t_r) \right) \qquad (6)$$

subject to constraints (2) and (3). The first term represents the additional gain obtained by replicating objects locally at server si , while the second term represents the gain obtained by replicating objects within the replication group. In the next section, we design a distributed approximation algorithm that solves this problem. The algorithm decides the placement of objects based on the value of the total gain defined by the objective function above.
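As an illustration of the objective just derived, the following minimal sketch (ours, not from the original text; all names are illustrative) evaluates the gain of equation (4), which equals the value of equation (6), for a given request-rate matrix and placement matrix.

    # Hypothetical sketch: evaluate the replication gain of equation (4)
    # (equivalently (6)) for a given placement. r[i][j] is the request rate of
    # server i for object j, X[i][j] is the 0/1 placement, and t_s >= t_r >= t_l.

    def total_gain(r, X, t_s, t_r, t_l):
        m, n = len(r), len(r[0])
        # rc[j]: number of replicas of object j in the group
        rc = [sum(X[i][j] for i in range(m)) for j in range(n)]
        gain = 0.0
        for i in range(m):
            for j in range(n):
                if X[i][j] == 1:            # object cached locally
                    gain += r[i][j] * (t_s - t_l)
                elif rc[j] > 0:             # object cached at another server
                    gain += r[i][j] * (t_s - t_r)
                # rc[j] == 0: fetched from the origin, no gain
        return gain

    # Tiny example with 2 servers and 3 objects (capacities not checked here).
    r = [[5, 1, 0], [0, 2, 4]]
    X = [[1, 0, 0], [0, 0, 1]]
    print(total_gain(r, X, t_s=63, t_r=6, t_l=1))   # prints 558.0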

3 DISTRIBUTED REPLICA PLACEMENT ALGORITHM

3.1 Preliminaries

We propose a distributed approximation algorithm, called DGR (Distributed Greedy Replication), that solves the replica placement problem. The algorithm has as input five parameters: r, c, t_s, t_r, and t_l. The first parameter, r, is the matrix of request rates as defined in the previous section. The second parameter, c = (c_1, ..., c_m), is the m-vector of server capacities. The last three parameters are the access costs of the objects from the source, a remote replica, and a local replica, respectively. In order to describe the algorithm we define two additional parameters, the insertion gain and the eviction cost. The insertion gain for object o_j and server s_i is defined as follows:

$$ig_{ij} = \begin{cases} p_j(t_s - t_r) + r_{ij}(t_r - t_l), & \text{if } rc_j = 0 \\ r_{ij}(t_r - t_l), & \text{if } X_{ij} = 0,\ rc_j > 0 \\ 0, & \text{if } X_{ij} = 1 \end{cases} \qquad (7)$$

where p_j = ∑_{i=1}^{m} r_ij is the "popularity" of object o_j. As can be seen from the definition of ig_ij, it represents the increase in overall gain the system would experience if it replicates object o_j in server s_i's cache. The highest insertion gain is for an object which does not have a replica in the group. It reduces to only the local gain of a server when that object is already replicated elsewhere. Otherwise, it is zero. The eviction cost of object o_j at server s_i is defined as:

$$ec_{ij} = \begin{cases} 0, & \text{if } X_{ij} = 0 \\ r_{ij}(t_r - t_l), & \text{if } X_{ij} = 1,\ rc_j > 1 \\ p_j(t_s - t_r) + r_{ij}(t_r - t_l), & \text{if } X_{ij} = 1,\ rc_j = 1 \end{cases} \qquad (8)$$
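The following sketch (ours, not part of the paper; the helper names are hypothetical) restates equations (7) and (8) as code, only to illustrate how a server would evaluate a candidate insertion or eviction.

    # Hypothetical sketch of equations (7) and (8). r[i][j] is the request rate,
    # X[i][j] the current placement, rc[j] the replica count of object j, and
    # p[j] = sum_i r[i][j] its popularity.

    def insertion_gain(i, j, r, X, rc, p, t_s, t_r, t_l):
        if X[i][j] == 1:                       # already replicated at server i
            return 0
        if rc[j] == 0:                         # would be the first replica in the group
            return p[j] * (t_s - t_r) + r[i][j] * (t_r - t_l)
        return r[i][j] * (t_r - t_l)           # only the local gain remains

    def eviction_cost(i, j, r, X, rc, p, t_s, t_r, t_l):
        if X[i][j] == 0:                       # nothing to evict
            return 0
        if rc[j] == 1:                         # the only replica in the group
            return p[j] * (t_s - t_r) + r[i][j] * (t_r - t_l)
        return r[i][j] * (t_r - t_l)           # other replicas remain elsewhere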


The eviction cost, ec_ij, is the decrease in the system gain that would result if object o_j were evicted from server s_i's cache. The eviction cost has the highest value for an object that has only one replica in the group, since evicting this object will cause all servers to access it from the origin. The insertion gain and the eviction cost are used to characterize each "local" decision of replicating or evicting an object at a server. In making these decisions the algorithm considers the effect of replicating the objects on the overall system gain.

3.2 The Proposed Algorithm

The proposed distributed approximation algorithm for replica placement is given in Algorithm 1. The algorithm is executed by each server within the replication group. It starts with an initialization phase (lines 2 to 7) in which the servers initialize their local variables and compute the popularity of each object. In order to compute the popularity, p_j, of each object o_j, all servers participate in a collective communication operation called all-reduce-sum [30]. This collective communication operation is defined by the communication primitive all-reduce-sum(r_i, p), which works as follows. Before executing the primitive each server has a vector r_i = (r_i1, ..., r_in) of size n, and as the result of the primitive execution, each server will contain a vector p = (p_1, ..., p_n) whose entries are given by p_j = ∑_{i=1}^{m} r_ij. Thus, all-reduce-sum computes the popularity of each object, and the result (the popularity vector) is made available at each server. In line 4, the algorithm initializes row i of the allocation matrix X to zero, which means that no objects are allocated. It also initializes the available capacity e_i to c_i, the capacity of server s_i. The insertion gain for each object is initialized to the maximum value, which corresponds to the case in which no replica exists in the replication group. The eviction cost and the replica count for each object are initialized to 0.

The second phase of the algorithm is the iterative phase, consisting of the while loop in lines 13 to 52. Before entering the loop, the global maximum insertion gain, ig_max, is computed through another collective communication operation called all-reduce-max(send_msg, recv_msg) (lines 8 to 11). The parameters are the send buffer and the receive buffer, respectively. Both are ordered lists of four variables (ig_max, i, j, j′), where ig_max is the maximum insertion gain, i and j are the indices of the server and object that give the maximum insertion gain, and j′ is the object to be evicted, if necessary. To participate in this operation each server s_i determines its highest insertion gain ig_max and the object o_j that gives this highest gain (lines 8-9). There is no object for eviction at this point, so j′ = 0. We shall discuss more about j′ later in this subsection. In line 9, each server s_i prepares the buffer send_msg with ig_max and the indices i, j, and 0 for j′. The primitive all-reduce-max returns the send_msg with the highest ig_max to each server through the output buffer recv_msg (line 10). After the all-reduce-max execution, each server s_i knows the global maximum insertion gain ig_max and the server and the object that have this ig_max. It also knows the index j′ of the object to be evicted if needed. At this point the servers are ready to enter the main loop (line 13) of the algorithm.
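To make the two collective operations concrete, the following single-process sketch (ours; a real deployment would use an MPI-style collective rather than these toy functions) mimics their semantics: every server contributes a send buffer, and all servers receive the same reduced result.

    # Hypothetical single-process stand-ins for the collectives used by DGR.

    def all_reduce_sum(vectors):
        # vectors[i] is server i's request-rate vector r_i; the element-wise sum
        # (the popularity vector p) is what every server would receive.
        n = len(vectors[0])
        return [sum(v[j] for v in vectors) for j in range(n)]

    def all_reduce_max(messages):
        # messages[i] is server i's tuple (ig_max, i, j, j_prime); the tuple with
        # the largest ig_max is delivered to every server.
        return max(messages, key=lambda msg: msg[0])

    # Example: popularities over two servers, then the winning candidate message.
    print(all_reduce_sum([[5, 1, 0], [0, 2, 4]]))            # [5, 3, 4]
    print(all_reduce_max([(310, 1, 1, 0), (248, 2, 3, 0)]))  # (310, 1, 1, 0)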


Algorithm 1 DGR(r, c, t_s, t_r, t_l)
1: {Server s_i:}
2: {Initialization}
3: all-reduce-sum(r_i, p)
4: X_i ← 0; e_i ← c_i
5: for j := 1 to n do
6:   ig_ij ← r_ij(t_r − t_l) + p_j(t_s − t_r); ec_ij ← 0; rc_j ← 0
7: end for
8: ig_max ← max_k ig_ik; j ← arg max_k ig_ik
9: send_msg ← (ig_max, i, j, 0)
10: all-reduce-max(send_msg, recv_msg)
11: (ig_max, i′, j, j′) ← recv_msg
12: {i′ is the server that has ig_max for object j; j′ is the object to be evicted from server i′ (0 if none)}
13: while ig_max > 0 do
14:   if i′ = i then
15:     {this server has the maximum insertion gain}
16:     X_ij ← 1
17:     ec_ij ← ig_ij; ig_ij ← 0; e_i ← e_i − 1; rc_j ← rc_j + 1
18:     if j′ ≠ 0 then
19:       X_ij′ ← 0
20:       ig_ij′ ← ec_ij′; ec_ij′ ← 0
21:       e_i ← e_i + 1; rc_j′ ← rc_j′ − 1
22:     end if
23:   else
24:     {another server has the maximum insertion gain}
25:     rc_j ← rc_j + 1
26:     if X_ij = 0 then
27:       ig_ij ← r_ij(t_r − t_l)
28:     else
29:       ec_ij ← r_ij(t_r − t_l)
30:     end if
31:     if j′ ≠ 0 then
32:       rc_j′ ← rc_j′ − 1
33:       if X_ij′ = 1 and rc_j′ = 1 then
34:         ec_ij′ ← r_ij′(t_r − t_l) + p_j′(t_s − t_r)
35:       end if
36:     end if
37:   end if
38:   {prepare the next iteration}
39:   ig_max ← max_k ig_ik; j ← arg max_k ig_ik
40:   ec_min ← min_k (ec_ik : ec_ik > 0)
41:   j′ ← arg min_k (ec_ik : ec_ik > 0)
42:   if e_i = 0 or c_i − e_i ≥ n then
43:     if ig_max ≤ ec_min then
44:       ig_max ← 0; j′ ← 0
45:     end if
46:   else
47:     j′ ← 0
48:   end if
49:   send_msg ← (ig_max, i, j, j′)
50:   all-reduce-max(send_msg, recv_msg)
51:   (ig_max, i′, j, j′) ← recv_msg
52: end while


During each iteration, if server s_i has the maximum global gain for an object, it performs the allocation and, if necessary, the deallocation of objects (lines 14 to 22). If s_i does not have the maximum global gain, it only updates some local values to keep track of the changes that resulted from the allocation/deallocation of objects at other servers (lines 24 to 37). These updates are performed according to equations (7) and (8). Allocation (deallocation) of object o_j at server s_i is performed by setting the X_ij entry of the allocation matrix to 1 (0). In the case of an allocation, the replica count, rc_j, is incremented and the available capacity, e_i, is decremented; the reverse is done for a deallocation. In the case of allocating an object, the ig value before the allocation becomes the new ec value for that object (this is according to equations (7) and (8)). For example, if before the allocation object o_j does not have any replica in the group (i.e., rc_j = 0), the value of ig_ij is equal to the first entry in equation (7). After the allocation, X_ij = 1 and rc_j = 1, so the value of ec_ij is equal to the third entry in equation (8). This holds true for all other cases and, therefore, we can assign ig_ij to ec_ij when we allocate o_j to s_i and do the reverse when we evict o_j from s_i. Obviously, the insertion gain becomes zero after an insertion and the eviction cost becomes zero after an eviction. An eviction happens if j′ ≠ 0, in which case object o_j′ is evicted from server s_i's cache (lines 18-22).

Lines 24 to 37 simply update rc, ig and ec, since another server s_i′ (i′ ≠ i) performed the allocation and s_i needs to keep track of it. rc is incremented or decremented and equations (7) and (8) are used to update ig and ec. If another server replicates o_j, server s_i updates the values of its insertion gain and eviction cost for o_j (lines 26-30). If object o_j′ was evicted from another server (i.e., j′ ≠ 0), the replica count is decremented and the insertion gain and the eviction cost corresponding to o_j′ are updated. If object o_j′ is now replicated only at s_i, then server s_i updates only the eviction cost, ec_ij′. If the object is not replicated at any server in the group, then s_i updates only the insertion gain, ig_ij′.

Then, each server participates in another all-reduce-max operation that determines the next candidate server and object(s). Each server prepares the send_msg as follows. The maximum insertion gain ig_max and j are determined as before. Server s_i also determines a candidate object o_j′ for eviction. This is the object that has the minimum eviction cost at s_i. A server is eligible to be considered for an allocation only if one of the following holds: it has available capacity to store more objects, or it is full but some inserted object o_j′ has an eviction cost less than the insertion gain of some uninserted object o_j. Otherwise, it reports its ineligibility by setting ig_max to 0. In lines 40 and 41, ec_min and j′ are determined, and in lines 42-43, ec_min is compared with ig_max only when the available capacity e_i = 0. If both eligibility conditions fail, ig_max is set to zero. If e_i > 0 then server s_i has space for new objects, and hence, no eviction is necessary (j′ = 0). The algorithm terminates when each server reports ig_max = 0.
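The following sketch (ours, not the authors' implementation) shows the same greedy choice in a centralized, insertion-only form: in each iteration it picks the (server, object) pair with the largest insertion gain among servers with free capacity. It omits the replacement iterations and the all-reduce coordination, and it reuses the insertion_gain helper sketched in Section 3.1.

    # Hypothetical, simplified (insertion-only, centralized) view of DGR's greedy loop.

    def greedy_placement(r, c, t_s, t_r, t_l):
        m, n = len(r), len(r[0])
        X = [[0] * n for _ in range(m)]                          # placement matrix
        rc = [0] * n                                             # replica counts
        p = [sum(r[i][j] for i in range(m)) for j in range(n)]   # popularities
        e = list(c)                                              # available capacities
        while True:
            best, best_i, best_j = 0, -1, -1
            for i in range(m):
                if e[i] == 0:                                    # server is full
                    continue
                for j in range(n):
                    g = insertion_gain(i, j, r, X, rc, p, t_s, t_r, t_l)
                    if g > best:
                        best, best_i, best_j = g, i, j
            if best <= 0:                                        # no positive gain left
                break
            X[best_i][best_j] = 1
            rc[best_j] += 1
            e[best_i] -= 1
        return X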

4 ANALYSIS OF DGR

In this section, we analyze the complexity of the DGR algorithm and determine its approximation ratio.


4.1 Complexity

We analyze the computational and communication complexity of DGR. To calculate the running time we determine the number of iterations of the main loop. We differentiate the iterations based on whether an eviction occurs (j′ > 0) or not. An insertion iteration is one that does not involve an eviction. If an eviction takes place in an iteration, we call it a replacement iteration. It is clear from the algorithm description that each iteration falls into one of these two categories. We denote by C = ∑_{i=1}^{m} c_i the total capacity of the replication group. Finally, to represent the value of a variable after an iteration is executed, we use the notation variable^iteration. For example, the value of ig_ij after iteration t is completed is denoted by ig_ij^t. The initial value of this variable is ig_ij^0. This notation is used to show the state of the variables that change during the main loop iterations.

Lemma 1: The main loop of DGR requires at most C insertion iterations.

Proof: Each e_i is set to c_i in the initialization phase (line 4). Therefore, ∑_{i=1}^{m} e_i = C at the beginning. An insertion iteration decreases an e_i by one (line 17). No insertion iteration takes place once ∑_{i=1}^{m} e_i = 0, or earlier, if all objects are replicated. Also, a replacement iteration does not have any effect on any e_i. Hence, DGR requires at most C insertion iterations.

Lemma 2: For some object o_j and server s_i, ig_ij^t > ig_ij^{t−1} only if t is a replacement iteration that evicts o_j from s_i.

Proof: In the main loop of DGR, the insertion gain of an object is assigned a value in three places: lines 17, 20 and 27. Only line 17 or line 27 is executed in an insertion iteration. ig_ij becomes zero in line 17. Line 27 is executed if server s_i does not contain object o_j and therefore ig_ij^{t−1} is at least r_ij(t_r − t_l). Hence, an insertion iteration cannot increase ig_ij. Line 20 is executed only in replacement iterations. ig_ij′ is increased from zero in line 20, since server s_i evicts object o_j′.

We next show that in a given iteration an object replica will not be evicted if in the previous iteration it is the only replica stored by the replication group.

Lemma 3: An object o_j will not be evicted in iteration t if rc_j^{t−1} = 1.

Proof: We prove this lemma by induction on the order of the objects' first replica insertion at any server within the group. Let us assume an ordering o_j1, ..., o_jn of objects such that the first replica of object o_j1 is inserted before the first replica of object o_j2, and so on. As the base case of the induction we show that object o_j1 will not be evicted when it has only one replica in the group. In the inductive step we prove that if each of the objects o_j1, ..., o_j(k−1) has at least one replica in the group and object o_jk has only one replica, that replica will not be evicted.

Base case: Since object o_j1 is first replicated by the algorithm, the replication must occur in the first iteration. Let s_i be the server that replicates o_j1 in iteration 1. Therefore, the insertion gain of o_j1 at s_i has the highest possible value among all objects and servers. According to line 17 of DGR, ec_ij1 is larger than any ig value after iteration 1. ec_ij1 retains this value as long as the replica of object o_j1 at server s_i is the only


replica of object o_j1. Therefore, as long as o_j1's only replica remains stored at server s_i, this replica cannot be evicted.

Inductive step: Suppose that after some iteration, objects o_j1, ..., o_j(k−1) have one or more replicas in the group and object o_jk has only one replica. Let us assume that server s_l holds the replica of object o_jk. It is clear that objects o_j(k+1), ..., o_jn cannot have higher insertion gains than object o_jk because otherwise they would have been replicated before o_jk. Only the objects inserted earlier than o_jk may have an increased insertion gain due to eviction from some server. Objects replicated at server s_l have insertion gains equal to zero, which can increase upon their eviction from s_l (Lemma 2). Obviously, these insertion gains will not be larger than object o_jk's eviction cost, since otherwise o_jk would be selected for eviction instead. If an object is evicted from some other server, its insertion gain at server s_l will remain fixed if there exist other replicas of that object in the group after the eviction (equation (7)). If an object's only replica in the group is evicted from some other server in a replacement iteration, its insertion gain at server s_l can increase (equation (7)). Only in this case can the increased insertion gain exceed the eviction cost of o_jk at server s_l. Therefore, the only replica of object o_jk cannot be evicted if objects o_j1, ..., o_j(k−1) have at least one replica in the group.

Clearly, a replacement will occur only if some object's eviction cost decreases or some other object's insertion gain increases, or both. From Lemma 3, we see that neither way of increasing an insertion gain can by itself be sufficient for evicting an object. We can conclude with the following corollary.

Corollary 1: A replacement cannot occur only because the insertion gain of an object is increased. A decrease in the eviction cost of some object must occur for a replacement to take place.

Next, we show that only the "first inserted" replica of an object with multiple replicas is subject to a decrease in eviction cost, and thus, it can possibly be evicted.

Lemma 4: If object o_j is replicated at server s_i in iteration t, then ec_ij^{t′} < ec_ij^{t} is possible for some iteration t′ > t only if rc_j^{t−1} = 0, as long as o_j has a replica at s_i.

Proof: Equation (8) shows the three possible values for ec_ij. It is zero when o_j is not replicated at s_i. If object o_j is replicated at s_i when there is no replica of o_j in the group, ec_ij has the maximum value. This value can decrease if some other server replicates object o_j. On the other hand, when object o_j is replicated at server s_i and there already exist other replicas of o_j in the group, ec_ij is set to the second highest value shown in equation (8). This value will not decrease as long as the replica remains stored at server s_i. Therefore, as long as o_j remains replicated at server s_i, ec_ij can decrease later only if it was the first replica of object o_j in the group.

Lemma 5: The main loop of DGR requires at most C replacement iterations.

Proof: By Corollary 1, we only need to investigate the cases in which an eviction cost can decrease. Lemma 3 says that only objects with multiple replicas may be evicted. Lemma 4 states that only the "first inserted" replica of an object with multiple replicas is subject to a decrease in eviction cost, and thus, it can possibly be evicted. In the worst case DGR


will allocate two copies each of C/2 objects. Of them, the C/2 replicas will be replaced by another C/4 objects, each having two replicas. This will continue until a single replica of each of C objects exists in the system. Thus, the maximum number of replacement iterations is C/2 + C/4 + ... ≤ C.

Theorem 1: The running time of DGR is O(n + C log n).

Proof: From Lemmas 1 and 5, the maximum number of main loop iterations is 2C. Each iteration consists of constant time operations with the exception of the two max and min operations in lines 39 and 40, which can be implemented using one max-heap and one min-heap. Thus, the running time of each iteration is O(log n). The initialization phase consists of an n-iteration loop and a build-heap operation of cost n. Therefore, the worst case running time of DGR is n + 2C log n = O(n + C log n).

Theorem 2: The communication complexity of DGR is O((n + C) log m).

Proof: The standard primitives all-reduce-sum and all-reduce-max contribute to the communication cost of DGR. They are basically the same operation except that they use a different associative operator. Grama et al. [30] show that the communication cost of all-reduce operations is O(w log m), where w is the message size in words and m is the number of participating servers. Therefore, the communication complexity of all-reduce-sum during initialization is O(n log m), since each message r_i is of length n. The primitive all-reduce-max is called with constant size messages and it is executed 2C times in the main loop. Thus, its communication complexity is O(2C log m). We conclude that the total communication cost of DGR is O((n + C) log m).

Since we are comparing the performance of DGR with the performance obtained by the best distributed algorithm presented in [7], we also need to compare them in terms of their complexity. The algorithm given in [7], which we refer to as LWY from the names of the authors, has the same communication overhead as DGR, that is O((n + C) log m), and a running time in O(mn log n). Clearly, DGR has a better running time since n ≫ C in practice. The communication and computational complexity of the two algorithms are given in Table 1. We determine the actual differences in communication and running time of these two algorithms in Section 5.

TABLE 1
Complexity of DGR and LWY

Complexity       DGR               LWY
Communication    O((n + C) log m)  O((n + C) log m)
Computational    O(n + C log n)    O(mn log n)
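As an aside, one possible way to realize the max/min selections of lines 39-41 with the heaps mentioned in the proof of Theorem 1 is a lazy-deletion heap: stale entries are skipped at query time by comparing them with the current ig (or ec) values. This is our own sketch, not code from the paper.

    import heapq

    # Hypothetical lazy-deletion max-heap over the insertion gains of one server.
    # A symmetric min-heap over the positive eviction costs can be built the same way.

    class LazyMaxHeap:
        def __init__(self, values):
            # heapq is a min-heap, so values are stored negated
            self.heap = [(-v, j) for j, v in enumerate(values)]
            heapq.heapify(self.heap)

        def update(self, j, new_value):
            # push the new value; the outdated entry for j remains and is skipped later
            heapq.heappush(self.heap, (-new_value, j))

        def peek_max(self, current):
            # discard entries that no longer match the current value of their object
            while self.heap and -self.heap[0][0] != current[self.heap[0][1]]:
                heapq.heappop(self.heap)
            if not self.heap:
                return 0, -1
            return -self.heap[0][0], self.heap[0][1]   # (ig_max, arg max)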

4.2 Approximation Ratio

In the following analysis, we assume that t_s − t_r ≥ t_r − t_l. This assumption captures the benefit a server can expect from participating in a distributed replication group. A server will benefit if the remote access time for an object is considerably lower than the access cost from the origin. Also, a server will access an object from another replicator if that access cost is not unreasonably higher than the local access cost.



We show that OPT/DGR ≤ 2, where OPT and DGR denote the total gain of the optimal and the DGR allocations, respectively. An equivalent expression is (OPT − DGR)/DGR ≤ 1 or, OPT − DGR ≤ DGR. To determine OPT − DGR we characterize the difference in allocation by means of "primitive operations" such as allocating or deallocating objects. We show that any DGR allocation can be converted into an optimal one by a finite set of such primitive operations. An operation increases, decreases, or does not affect the total system gain. We term their effect on the system gain the "gain of the operations". Let OP be the set of operations that converts a particular DGR allocation into the optimal one. Therefore,

$$\sum_{op \in OP} G_{op} = OPT - DGR$$

where G_op is the gain of the operations that change DGR into OPT. It is sufficient to show that

$$\sum_{op \in OP} G_{op} \le DGR \qquad (9)$$

to prove that DGR is a 2-approximation algorithm.

Definition 1: A discrepancy is the existence of a (server, object) pair, (s_i, o_j), such that o_j is replicated at s_i in the DGR allocation but not in OPT, or vice versa. Resolving a discrepancy means performing the necessary insertions or evictions on the DGR allocation so that the discrepancy is eliminated.

Definition 2: We define a primitive operation to be one of the following: (i) INSERT(j, i): inserts object o_j in the cache of server s_i. (ii) EVICT(j, i): evicts object o_j from the cache of server s_i. (iii) MOVE(j, i1, i2): equivalent to EVICT(j, i1) followed by INSERT(j, i2). Note that INSERT and EVICT increase, respectively decrease, the number of objects replicated at a server. MOVE decreases the number of objects at one server and increases it at another.

Definition 3: A feasible set of operations is a finite set of primitive operations that: (i) can resolve a nonempty set of discrepancies; (ii) does not change the number of objects replicated at any server; and (iii) can increase the overall gain from the DGR allocation.

Corollary 2: A set consisting of a single primitive operation is not feasible. This corollary follows directly from Definition 3.

Lemma 6: A feasible set of operations can only be one of the following:
• IME: A set composed of one INSERT(j1, i1) operation, k MOVE(jl, il−1, il) operations, l = 1, ..., k + 2, and one EVICT(jk+3, ik+2) operation.
• MM: A set composed of k − 1 MOVE(jl, il−1, il) operations, l = 1, ..., k − 1, and one MOVE(jk−1, ik−1, i1) operation.

Proof: We prove this lemma by first showing that IME and MM sets of operations satisfy the feasibility properties and then showing that other sets of operations either are not feasible or can be transformed into an IME or a MM set.

Now we show that IME and MM satisfy the feasibility conditions in Definition 3. First, an INSERT can resolve

a discrepancy in which an object is replicated in the optimal allocation but not in DGR. Similarly, an EVICT can resolve the opposite type of discrepancy, and a MOVE can resolve a set of two discrepancies. Being sets of these primitive operations, IME and MM both satisfy the first condition.

IME satisfies the second feasibility condition as follows. The number of objects at s_i1 is increased by the INSERT operation, but the first MOVE decreases that number and replicates an object at another server, which is nullified by the next MOVE, and so on. The number of objects incremented at s_i(k+2) by the last MOVE is balanced by the EVICT operation. The same logic applies to prove that MM satisfies the second property too.

The third condition of feasibility is that the set of operations should be able to increase the overall gain from the DGR allocation. We show that both IME and MM sets can increase the overall gain. An INSERT operation increases the overall gain in access time since it allocates an object to a server, and an EVICT operation decreases the system gain. A MOVE operation has the effects of both an INSERT and an EVICT. The sets for which the total increase in gain surpasses the total decrease in gain will increase the overall system gain. The same logic applies to MM, since MM contains only MOVE operations. This shows that an IME or a MM set of operations can increase the overall system gain. It is worth mentioning here that not all such sets increase the overall gain; we consider the ones that increase the gain from the DGR allocation to characterize the difference between the DGR and the optimal allocations.

Now we show that other types of sets of primitive operations are either not feasible or can be represented as an IME or a MM. A set with an unequal number of INSERT and EVICT operations violates the second condition. A set with one INSERT and one EVICT satisfies the second condition if they operate on the same server. But they violate the third condition since, due to the greedy choice made by DGR, the INSERT operation cannot gain more than the EVICT loses in this case. Similarly, we can show that a set of multiple pairs of INSERT and EVICT operations must follow the patterns of IME or MM sets to maintain the last property, and thus, it can be represented as an IME or a MM.

Lemma 7: Let an IME resolve a set of discrepancies and let A be the subset of the DGR allocations that are converted into optimal allocations in the process. Then, ∑_{op∈IME} G_op ≤ ∑_{a∈A} G_a, where G_op is the gain of operation op and G_a is the gain of the DGR allocation a.

Proof: Let an IME be composed of the following operations: INSERT(j1, i1), MOVE(j2, i1, i2) and EVICT(j3, i2). Therefore, A = {(s_i1, o_j2), (s_i2, o_j3)} and B = {(s_i1, o_j1), (s_i2, o_j2)} are the subsets of the DGR and optimal allocations that give the discrepancies resolved by this IME. Without loss of generality, we assume that all allocations are single replicas of the respective objects in the group.



Therefore,

$$\sum_{a \in A} G_a = r_{i_1 j_2}(t_r - t_l) + p_{j_2}(t_s - t_r) + r_{i_2 j_3}(t_r - t_l) + p_{j_3}(t_s - t_r) \qquad (10)$$

$$\sum_{b \in B} G_b = r_{i_1 j_1}(t_r - t_l) + p_{j_1}(t_s - t_r) + r_{i_2 j_2}(t_r - t_l) + p_{j_2}(t_s - t_r) \qquad (11)$$

By subtracting equation (10) from equation (11), we find the gain of the operations in IME as

$$\sum_{op \in IME} G_{op} = r_{i_1 j_1}(t_r - t_l) + p_{j_1}(t_s - t_r) - r_{i_2 j_3}(t_r - t_l) - p_{j_3}(t_s - t_r) + r_{i_2 j_2}(t_r - t_l) - r_{i_1 j_2}(t_r - t_l) \qquad (12)$$

Therefore, we need to prove that equation (12) is less than equation (10). First, we claim that

$$r_{i_2 j_1}(t_r - t_l) + p_{j_1}(t_s - t_r) \le r_{i_2 j_3}(t_r - t_l) + p_{j_3}(t_s - t_r) \qquad (13)$$

Here the first term is the overall gain for replicating o_j1 at s_i2 and the second one is the overall gain for replicating o_j3 at s_i2. The inequality holds because otherwise o_j1 would be replicated at s_i2 by DGR instead of o_j3. We prove the main result in two parts. First, we consider the first four terms from equation (12):

$$\begin{aligned}
r_{i_1 j_1}(t_r - t_l) + p_{j_1}(t_s - t_r) - r_{i_2 j_3}(t_r - t_l) - p_{j_3}(t_s - t_r) &= (r_{i_1 j_1} - r_{i_2 j_1})(t_r - t_l) + r_{i_2 j_1}(t_r - t_l) + p_{j_1}(t_s - t_r) \\
&\quad - r_{i_2 j_3}(t_r - t_l) - p_{j_3}(t_s - t_r) \\
&\le r_{i_2 j_3}(t_r - t_l) + p_{j_3}(t_s - t_r)
\end{aligned} \qquad (14)$$

since (r_i1j1 − r_i2j1)(t_r − t_l) ≤ r_i2j1(t_r − t_l) + p_j1(t_s − t_r), because (r_i1j1 − r_i2j1) ≤ p_j1 and (t_r − t_l) ≤ (t_s − t_r), and the rest follows from equation (13). Now we claim that the remaining two terms from equation (12) satisfy

$$(r_{i_2 j_2} - r_{i_1 j_2})(t_r - t_l) \le r_{i_1 j_2}(t_r - t_l) + p_{j_2}(t_s - t_r) \qquad (15)$$

since (r_i2j2 − r_i1j2) ≤ p_j2 and (t_r − t_l) ≤ (t_s − t_r). Adding equations (14) and (15) gives us the result

$$\sum_{op \in IME} G_{op} \le \sum_{a \in A} G_a \qquad (16)$$

We showed that the inequality holds for an IME that includes only one MOVE operation. We claim that it also holds for IME sets with more than one MOVE operation. Equation (14) shows that the combined gain of the INSERT and EVICT operations is less than the gain of the DGR allocations they change. This will hold for each IME set, since by definition an IME includes only one INSERT and one EVICT. Equation (15) shows that the gain by a MOVE operation is less than the corresponding allocation in DGR. Therefore, as we add more MOVE operations, we have one such inequality for each MOVE operation. Therefore, the inequality in (16) holds for any IME set. Again, in these cases we assume that the respective objects have one replica each. But we use the overall system gain in our analysis. Therefore, we claim that these results are valid for an IME with any number of MOVE operations.

Corollary 3: The results of Lemma 7 also hold for an MM set of operations.

An MM is a subset of an IME set with multiple MOVE operations. Therefore, the inequality in equation (15) applies to each MOVE operation in the set and we can conclude that ∑_{op∈MM} G_op ≤ ∑_{a∈A} G_a, where A is the set of allocations in DGR that were changed by the operations in MM.

Theorem 3: DGR is a 2-approximation algorithm for the replica placement problem.

Proof: From Lemma 6, Lemma 7 and Corollary 3, we find that if OP is the set of operations that resolves the set of discrepancies (if any) between a DGR and an optimal allocation, then OP is the union of zero or more IME and MM sets of operations. Also, a discrepancy can be resolved only once; therefore these IME and MM sets are disjoint, and so are the sets of DGR allocations they change. Let A be the set of allocations in DGR which are changed by the operations in OP. Therefore,



$$\sum_{op \in OP} G_{op} = \sum_{IME_x \in OP}\ \sum_{op \in IME_x} G_{op} \;+ \sum_{MM_y \in OP}\ \sum_{op \in MM_y} G_{op} \;\le\; \sum_{x} \sum_{a \in A_x} G_a \;+\; \sum_{y} \sum_{a \in A_y} G_a \;\le\; DGR$$

Here, A_x ⊆ A is the set of allocations affected by IME_x and A_y ⊆ A is the set of allocations affected by MM_y. Thus, we showed that the inequality in equation (9) is satisfied, and therefore, DGR is a 2-approximation algorithm for the replica placement problem.

5 EXPERIMENTAL RESULTS

In this section, we perform simulation experiments to determine the performance of DGR in practice. We perform three sets of experiments. In the first set we compare the performance of DGR with the performance of the best distributed algorithm presented in [7] (which we call LWY from the names of the authors). To the best of our knowledge, [7] is the closest work to ours that proposed the distributed algorithm with the best performance to date. We focus on investigating how the variability of the request rates affects the performance of the two algorithms. The second set of experiments compares DGR and LWY focusing on their scalability in terms of the number of servers and objects. In the third set of experiments we compare DGR with a centralized algorithm, the Aε-Star search algorithm presented in [13]. We selected this algorithm because it provides the best performance in terms of reducing object access costs among those compared in [13].

5.1 Experimental Setup

In the first set of experiments we compare the performance of DGR and LWY [7] for different types of data distributions. The problem we are considering is similar to the one investigated in [7] except that their model considers that each server deploys the same amount of memory for replication, while we remove this restriction in our model. To be able to compare our algorithm with LWY [7], we chose an experimental setting in which the servers have the same storage capacity for


replication. There is also another subtle difference between the two models. In our model, we assume that the request rates are integers, i.e., they represent the number of requests, while in [7], the request rates are between 0 and 1.

The LWY algorithm works as follows. In the beginning, each server exchanges the request rates for all objects with the other servers. Then, each server works independently, but the servers exchange information about their decisions of replicating objects. First, the allocation matrix X is initialized to zero. Then, a server s_i is randomly chosen. Server s_i calculates the insertion gains of all objects using equation (7). Then, it replicates the first c_i objects in decreasing order of insertion gain, updates row i of matrix X accordingly, and shares this information with the other servers. (Here c_i is the capacity of server s_i.) Next, another server s_i′ is chosen randomly. This server has the information about the allocation matrix X with the replication decisions made by server s_i. Server s_i′ now calculates the insertion gains based on the updated information, performs its replication as before and updates the matrix X accordingly. Thus, one server is randomly chosen in each step to perform the above actions. When all servers have completed the replication process, a "round" is completed. The algorithm continues until the allocation converges, i.e., when no round can improve the overall gain in access cost.

The authors of [7] used three parameters to characterize the distribution of the request rates. The first parameter, called the hot-set parameter, θ, determines the distribution of the request rates of all objects for one server as r_ij = e^{−θj}/T, where T = ∑_{j=1}^{n} e^{−θj} and 0 ≤ θ ≤ 1. The distribution is flat if θ = 0, and its skewness increases with θ. In [7], θ was varied from 0.001 to 0.082 with increments of 0.009. We found that for 100 objects, the request rates vary between 1.05% and 0.95% for θ = 0.001. This is almost a flat distribution. On the other hand, when θ = 0.082, the request rates of the first 60% of the objects add up to almost 100% of the total request rates. To get a steeper curve, in our experiments, we chose to extend the upper limit of θ to 0.208. This way we obtain a setting in which around 20% of the objects account for about 100% of the total request rates. The request rate curves become even steeper as we increase the number of objects. Thus, we can obtain steeper curves by only increasing the number of objects. Therefore, in our experiments, we varied θ from 0.001 to 0.208. We kept the same increments of 0.009 as in [7].

The second parameter, called the correlation of site hot-set, ρ, determines the randomness of the request rates of an object at different servers. Its values are between 0 and 1. It can be characterized as follows. First we determine the request rates for server s_1 for a given θ. Therefore, server s_1 has the j'th highest request rate for object o_j (note that the request rates are exponential in −θj). Now, at a server other than s_1, object o_j will have the k'th highest request rate among the objects, where k is a random number between 1 and min(ρ·n + j − 1, n). We varied ρ from 0.01 to 0.96 with 0.05 increments. As ρ increases from zero, the randomness of the access rates of the objects at different servers increases.

The third parameter is the relative site activity, η.
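The following sketch (ours; it is one plausible reading of the construction above, not the authors' generator) produces a request-rate matrix from the hot-set parameter θ and the correlation parameter ρ. Server 1 gets the exponential distribution directly; every other server gets a permutation of the same rates in which object o_j is assigned a rank drawn uniformly from [1, min(ρ·n + j − 1, n)], with ties broken arbitrarily. The relative-activity scaling by η, described next, would then be applied as a per-server multiplier.

    import math
    import random

    # Hypothetical generator for the synthetic request rates of Section 5.1.
    def request_rates(m, n, theta, rho, seed=0):
        rng = random.Random(seed)
        T = sum(math.exp(-theta * j) for j in range(1, n + 1))
        base = [math.exp(-theta * j) / T for j in range(1, n + 1)]  # decreasing in j
        rates = [base[:]]                                           # server 1
        for _ in range(2, m + 1):
            # draw a target rank for every object, then order objects by that rank
            keyed = []
            for j in range(1, n + 1):
                upper = max(1, min(int(rho * n) + j - 1, n))
                keyed.append((rng.randint(1, upper), rng.random(), j - 1))
            row = [0.0] * n
            for rank, (_, _, obj) in enumerate(sorted(keyed)):
                row[obj] = base[rank]       # object gets (approximately) its drawn rank
            rates.append(row)
        return rates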


After we determine the object access rates at different servers with the above two parameters, we multiply the request rates at server s_i by a_i = 1/(A · i^{1−η}), where A = ∑_{i=1}^{m} 1/i^{1−η} and 0 ≤ η ≤ 1. That means all servers are equally active when η = 1, and the activity levels vary more as η decreases. In our experiments we varied η from 0 to 1 with 0.1 increments.

The other parameters, t_l, t_r, and t_s, were kept fixed at 1 ms, 6 ms, and 63 ms, respectively (as in [7]). We varied the cache size of the servers so that the total capacity varies from 10% to 100% of the total number of objects. Here we kept the same cache size for all servers, as in [7]. In the first set of experiments we consider the following scenarios: 10 servers and 100, 1000, 2000, and 5000 objects, and 2000 objects and 10, 30, and 60 servers. Our dataset is a very large superset of that used in [7], with 52,800 data points for each (number of servers, number of objects) combination. In each experiment, we calculate the ratio of the gain achieved by DGR and LWY for analysis.

In the above set of experiments, we mainly focused on how the variability of requests affects the performance of DGR; hence we selected wide ranges of parameter values with small intervals for each (m, n) combination. In the second set of experiments we test how well DGR and LWY scale when we increase the problem size. We considered every combination of m and n, with m = (8, 16, 32, and 64) and n = (8192, 16384, 32768, and 65536). For each (m, n) pair, we choose the other parameter values as follows: θ = (0.005, 0.01, 0.02, 0.04, 0.08), ρ = (0.05, 0.1, 0.2, 0.4, 0.8), and η = (0, 0.5, 1). We consider the following server capacities: (125, 250, 500, 1000, 2000, and 4000). Along with the replication performance, we compare the running time and message communication overhead of the two algorithms for these large size problems.

The third set of experiments compares DGR with a variant of the Aε-Star search algorithm proposed in [13]. Aε-Star uses a technique to reduce the running time of the A-Star [31] search at the expense of achieving a (1+ε)-approximation solution. We devised an admissible and monotone heuristic function for our problem and use it in the Aε-Star algorithm. It turned out that although Aε-Star yields a better gain over DGR, it has a prohibitively high running time. Therefore, we had to limit our experiments to small values of m and n and small variations for each (m, n). We performed the experiments with 4 servers and 50, 100, 200, 400, 800, and 1600 objects, 8 servers and 50, 100, 200, and 400 objects, and 16 and 32 servers with 100 and 200 objects. For the problem instances with 4 servers and the one with 8 servers and 50 objects, we considered the following values: θ = (0.005, 0.01, 0.02), ρ = (0.05, 0.1), and η = (0, 0.5, 1). For the rest of the cases, we fixed θ, ρ, and η at 0.005, 0.1, and 0.5, respectively. The server capacities were (5, 10, 20, and 40) in all of the cases. In the next subsection we discuss the results of all these experiments.

5.2 Performance Analysis

In the first set of experiments we compare DGR with LWY for varying distributions of the request rates. We calculate the system gain achieved by both algorithms using equation (4).
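For reference, the following sketch (ours; one plausible reading of the LWY rounds summarized in Section 5.1, not the code from [7]) rebuilds a randomly chosen server's row of X on each turn using the insertion gains of equation (7), and repeats rounds until the overall gain of equation (4) stops improving. It reuses the insertion_gain and total_gain helpers sketched earlier.

    import random

    # Hypothetical sketch of the LWY-style rounds described in Section 5.1.
    def lwy_placement(r, c, t_s, t_r, t_l, seed=0):
        rng = random.Random(seed)
        m, n = len(r), len(r[0])
        X = [[0] * n for _ in range(m)]
        p = [sum(r[i][j] for i in range(m)) for j in range(n)]
        best = total_gain(r, X, t_s, t_r, t_l)
        while True:
            order = list(range(m))
            rng.shuffle(order)                       # servers take turns in random order
            for i in order:
                X[i] = [0] * n                       # rebuild this server's allocation
                rc = [sum(X[k][j] for k in range(m)) for j in range(n)]
                ranked = sorted(range(n),
                                key=lambda j: insertion_gain(i, j, r, X, rc, p,
                                                             t_s, t_r, t_l),
                                reverse=True)
                for j in ranked[:c[i]]:              # cache the c_i most profitable objects
                    X[i][j] = 1
            gain = total_gain(r, X, t_s, t_r, t_l)
            if gain <= best:                         # round brought no improvement
                break
            best = gain
        return X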



Fig. 1. Distribution of relative gain of DGR over LWY for m = 10: (a) n = 100, (b) n = 1000, (c) n = 5000. [Histograms of the frequency (%) of cases versus the DGR/LWY gain ratio.]

Fig. 2. Distribution of relative gain of DGR over LWY for n = 2000: (a) m = 10, (b) m = 30, (c) m = 60. [Histograms of the frequency (%) of cases versus the DGR/LWY gain ratio.]

TABLE 2
DGR Gain vs. LWY Gain (10 Servers)
(Min, Max, Mean, and StDev refer to the DGR/LWY gain ratio.)

n      DGR > LWY   Min        Max        Mean       StDev
100    97.28%      0.988524   1.059139   1.016007   0.012780
1000   98.98%      0.990726   1.077364   1.023941   0.020080
2000   99.21%      0.989642   1.076756   1.017410   0.020296
5000   97.47%      0.995193   1.075472   1.008909   0.015777

We then divide the gain of DGR by that of LWY. This "gain ratio" is used to determine the relative gain in performance of DGR over LWY. First, we calculate some statistics of the relative gains based on the number of servers (m) and the number of objects (n). We summarize the results in Tables 2 and 3. In Table 2, we show the statistics of the relative gains for a replication group composed of 10 servers and considering different numbers of objects. The first column shows the number of objects. The second column indicates the percentage of cases in which DGR performed better than LWY. The other columns show the minimum, maximum, mean, and standard deviation of the relative gains. Table 3 shows the same statistics where the number of objects is fixed at 2000 and the number of servers is varied. The first column in this table shows the number of servers and the rest of the columns are the same as in Table 2.

The results show that DGR obtains a better gain than LWY in almost all cases. For example, for a system composed of 10 servers and 100 objects, LWY yielded better or equal gains than DGR in less than 2.72% of the cases. For a small number of objects, the minimum ratio of the DGR

TABLE 3
DGR Gain vs. LWY Gain (2000 Objects)
(Min, Max, Mean, and StDev refer to the DGR/LWY gain ratio.)

m    DGR > LWY   Min        Max        Mean       StDev
10   99.21%      0.989642   1.076756   1.017410   0.020296
30   98.95%      0.988029   1.076831   1.030088   0.019329
60   98.12%      0.988650   1.071726   1.029507   0.018256

gain to the LWY gain is about 0.988, which means that LWY achieves a gain that is at most 2% better than DGR. On the other hand, in this set of experiments DGR achieves a gain of more than 7% over that obtained by LWY in the best case.

The above results suggest that our algorithm obtains a higher overall gain in performance than the LWY algorithm in more than 97.28% of the cases. The reason is that in DGR, the servers coordinate the replication decisions in each iteration. In each iteration, each server determines the object that would give the highest system gain upon replication. The replication with the highest gain among all servers is chosen in each iteration. We note here that, according to equation (7), the preferences of all servers change after one object is replicated somewhere in the group. The DGR algorithm captures these changes and re-evaluates the replication gains after each object replication. But in LWY, a server makes decisions for all objects at the time of its turn, simply replicating the objects that would yield the highest gain from replication until the server's capacity is exhausted. This prevents the detection of cases where some objects might have better gains if replicated elsewhere. On the other hand, we know that DGR is an


[Fig. 3 panels: percentage of cases with DGR gain below LWY gain, plotted (a) over η and (b) over ρ, for m = 10 and n = 100, 1000, 2000, and 5000 objects.]

Fig. 3. Distribution of cases in which DGR gain is less than LWY gain vs. η and ρ

[Fig. 4 panels: (a) minimum and (b) maximum DGR/LWY gain ratio vs. the number of servers (m = 8, 16, 32, 64), for n = 8192, 16384, 32768, and 65536.]

Fig. 4. DGR/LWY gain ratio vs. m, n

On the other hand, we know that DGR is an approximation algorithm, because the greedy choice cannot always obtain the optimal allocation. Therefore, it is possible for LWY to obtain a better performance than DGR in a very small number of cases. However, the low percentage of cases in which LWY performs better than DGR shows that the strategy applied in DGR is more robust in finding better solutions. Although in LWY the servers do exchange information about which objects they cache and how often each object is requested, LWY performs worse than DGR because DGR synchronizes the servers and updates the gain information incrementally after every placement. We also plot the frequency distributions of the relative gain of DGR with respect to LWY as histograms in Figures 1 and 2. A bar in these histograms represents the percentage of cases in which the ratio of the DGR gain to the LWY gain falls within a specific interval. For example, Figure 1a shows that in almost 7% of the 52,800 cases the gain ratio is between 1 (inclusive) and 1.001 (exclusive). We now discuss the results, grouping them according to Table 2 and Table 3. Figure 1 shows the distributions summarized in Table 2; in these histograms, m is fixed while n is varied. In each of these histograms, the cases with relative gain between 1 and 1.001 are the most frequent, and this frequency increases as the number of objects increases. The next range of gains, from 1.001 to 1.002, has frequencies close to 5% in all four panels, and from there the frequencies decrease gradually at different rates. Although LWY achieves gains closer to those of DGR in more cases as the number of objects increases, DGR still offers considerable improvement over LWY in the remaining cases.

We observe that when the number of objects per server increases, LWY performs close to DGR in more cases. This is because the number of allocations that differ between DGR and LWY becomes small compared to the total number of allocations. Figure 2 details the results summarized in Table 3. Here the number of objects is fixed at 2000 and the number of servers is 10, 30, and 60. The shape of the histogram changes somewhat as the number of servers increases: the percentage of cases in which LWY performs close to DGR decreases significantly. With more servers, the LWY algorithm performs worse because of the limited amount of information available to a server when making its decisions. For example, with 10 servers, the second server to take its turn has knowledge of 1/10th of the overall replica placements, and the third server has information about 2/10ths of the placements. With 30 servers, these ratios become 1/30th and 2/30ths, respectively. Thus, the more servers there are, the less informed the decisions made by LWY. In DGR, on the other hand, the servers coordinate every replication decision, and therefore DGR makes informed decisions regardless of the number of servers. We now examine the cases in which LWY performed better than DGR, with respect to different parameters. We note that these cases constitute at most 2.72% of the total number of cases considered in the experiments. Figure 3a shows the cases with better LWY gains for different values of η, and Figure 3b shows the same cases grouped according to different values of ρ (the site hot-set correlation parameter).


[Fig. 5 panels: average and maximum DGR/LWY gain ratio vs. total server capacity (% of total objects), for (a) 8 servers and (b) 64 servers.]

Fig. 5. DGR/LWY gain ratio vs. server capacity

TABLE 4
Running time & total amount of communication of DGR vs. LWY

                     DGR                       LWY
  m    n         Time (ms)  Comm. (KB)    Time (ms)  Comm. (KB)
  8    8192          6         136           21         388
  8    16384         9         231           47         650
  8    32768        14         426          103        1175
  8    65536        25         818          233        2224
  16   8192         17         224           48         861
  16   16384        33         344          122        1384
  16   32768        52         600          311        2432
  16   65536        70        1122          752        4531
  32   8192         52         386          103        1888
  32   16384       113         521          267        2938
  32   32768       156         832          778        5033
  32   65536       309        1479         2275        9228
  64   8192        174         721          198        4115
  64   16384       323         845          525        6211
  64   32768       508        1196         1660       10404
  64   65536      1632        1960         5973       18795

In Figure 3a, we see that LWY obtains larger gains than DGR in less than 0.35% of the cases for the lower η values, and increasing η produces only a few cases in which LWY performs better. This shows that when the servers are not equally active (i.e., η is small), LWY can beat DGR in a few cases. This happens when the highly active servers take their turns at the beginning, so that the remaining servers can choose objects more efficiently for themselves; in DGR, by contrast, a few servers may occasionally lose out for the sake of coordination. In Figure 3b, we see a clear relationship between ρ and the frequency of cases with a higher LWY gain; DGR performs better for higher ρ values. Since a higher ρ means more randomness in the request patterns among servers, this suggests that coordination among servers is very important when the request rates are not similar. In practice it is very unlikely that the servers have identical request patterns, so we expect DGR to consistently outperform LWY. We now discuss the second set of experiments, in which we compare the scalability of DGR and LWY. In Figures 4a and 4b, we plot the minimum and the maximum gains of DGR over LWY for different numbers of servers and objects. We see that the maximum gain steadily increases with both m and n in the majority of cases, while the minimum gain of DGR over LWY is never less than 0.995.

The maximum gains, ranging from about 18% to about 27%, clearly indicate that DGR can handle certain instances of the problem that LWY cannot. We also compare the running time and the communication overhead of the two algorithms with respect to m and n in Table 4. The first two columns of this table give the number of servers and objects, respectively. The next two columns show the execution time of DGR in milliseconds and the total amount of data it communicates in kilobytes, and the last two columns give the same metrics for LWY. We find that in every case DGR is much faster and requires less data to be communicated among the servers than LWY. This is because in LWY a server needs to perform a costly sort operation over the insertion gains of the objects, while DGR relies on a series of less costly all-reduce-max operations. Similarly, in each DGR iteration a single small message suffices to determine the current maximum and the corresponding allocation decision, whereas in LWY the servers need to perform a bulk data broadcast to inform the others of their replication decisions (the sketch after this paragraph contrasts these two patterns). We have seen a large difference between the maximum gains of DGR over LWY when measured across different m and n. We now present the data summarized along other dimensions, which allows us to assess whether the cases in which DGR performs better are common in practice. Figure 5 plots the average and maximum DGR/LWY gain ratio against the ratio of the total server capacity to the total number of objects. This ratio varies from as small as 1.53% to as large as 3125% in the data we generated for the experiments. In Figure 5a, we see that for 8 servers DGR performs better when the total server capacity is much smaller than the total number of objects considered for replication. This is the behavior expected of an efficient algorithm, since DGR performs better under constrained conditions. When the server capacities increase, all algorithms naturally tend to converge in performance, because every server is able to replicate its highly requested objects and the differences are caused only by the less requested objects. We observe this trend for every value of m and present the plot for m = 64 in Figure 5b as a representative case. As the last comparison of DGR with LWY, Table 5 presents the ten individual experiments in which DGR obtains its largest gains over LWY, and Table 6 gives the ten cases in which DGR performs worst.
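As referenced above, the following sketch gives a rough illustration of the difference in per-decision work and message size between the two algorithms. It is only an illustration under simplified assumptions, not the authors' implementation: the gain values are placeholders and the message contents are hypothetical.

def lwy_turn(server_id, gains, capacity):
    # LWY-style turn: sort all insertion gains once (O(n log n)), keep the top
    # `capacity` objects, and broadcast the whole replica list in one bulk message.
    chosen = sorted(gains, key=gains.get, reverse=True)[:capacity]
    bulk_message = {"server": server_id, "replicas": chosen}   # size grows with capacity
    return bulk_message

def dgr_step(server_id, gains):
    # DGR-style iteration: each server only needs its current best object (O(n)),
    # and a single constant-size (gain, server, object) proposal feeds the
    # all-reduce-max that selects the group-wide winner.
    best_obj = max(gains, key=gains.get)
    proposal = (gains[best_obj], server_id, best_obj)
    return proposal

gains = {"obj%d" % i: 10.0 - i for i in range(10)}   # toy insertion gains
print(lwy_turn("A", gains, capacity=3))
print(dgr_step("A", gains))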


TABLE 5
Ten Best Performance Cases of DGR vs. LWY

  DGR/LWY   m    n      Capacity(%)   η    θ      ρ
  1.2689    64   65536  12.2          1    0.005  80
  1.2639    64   65536  12.2          1    0.005  80
  1.2633    32   65536  6.1           1    0.005  80
  1.2607    32   32768  12.2          1    0.005  80
  1.2590    64   65536  12.2          1    0.005  80
  1.2561    32   65536  6.1           1    0.005  80
  1.2547    64   32768  24.4          1    0.005  80
  1.2537    64   65536  12.2          1    0.005  80
  1.2535    32   32768  12.2          1    0.005  80
  1.2531    32   65536  6.1           1    0.005  80

TABLE 6
Ten Worst Performance Cases of DGR vs. LWY

  DGR/LWY   m    n     Capacity(%)   η     θ      ρ
  0.9960    8    8192  48.8          1.0   0.005  5
  0.9963    16   8192  48.8          1.0   0.010  5
  0.9964    64   8192  391.0         0.5   0.005  5
  0.9965    64   8192  391.0         0.5   0.005  5
  0.9966    64   8192  391.0         0.5   0.005  5
  0.9968    16   8192  97.7          0.5   0.005  5
  0.9968    16   8192  48.8          1.0   0.010  5
  0.9968    32   8192  195.0         0.5   0.005  5
  0.9970    32   8192  195.0         0.5   0.005  5
  0.9971    8    8192  24.4          1.0   0.010  5

We first note that the highest gain is about 27%, whereas in the worst case the DGR gain falls short of the LWY gain by less than 1%. In Table 5, we see that DGR performs best in cases with large numbers of servers and objects, which indicates that its performance advantage scales with the problem size. The other parameter values reflect the fact that DGR is expected to yield better solutions when server capacities are constrained, the servers are equally active with high variability between their object request rates, and the objects are of comparable importance. On the other hand, Table 6 shows that DGR performs worse than LWY in cases where the server capacities are large. As mentioned before, every algorithm performs almost equally in settings in which the servers have large capacities, and therefore these results do not suggest any specific pattern. This set of experiments reveals that DGR is expected to perform better than LWY in the majority of cases. We also established that both the computation time and the communication overhead of DGR are much lower than those of LWY. Finally, we determined the problem characteristics for which DGR offers significant improvements over the LWY algorithm. We now briefly discuss the third set of experiments, in which we compare DGR with the Aε-Star algorithm [13]. The results for different m and n are summarized in Table 7. Since Aε-Star is a variant of the A-Star algorithm [31], which is designed to search for the optimal solution, it performs at least as well as DGR on average. However, the difference is only about 1% on average and 3% in the worst case. On the other hand, Aε-Star spends a huge amount of time finding the solution (as shown in the sixth column of Table 7), despite using pruning techniques to reduce the search space.

TABLE 7
DGR vs. Aε-Star Comparison

               DGR / Aε-Star Gain         Time (ms)
  m    n       Mean     Min           DGR      Aε-Star
  4    50      0.994    0.986         0.43     18095
  4    100     0.994    0.977         0.48     111024
  4    200     0.994    0.976         0.54     152982
  4    400     0.998    0.972         2.83     518259
  4    800     0.999    0.970         6.67     612236
  4    1600    0.998    0.977         12.92    906406
  8    50      0.997    0.990         0.80     117718
  8    100     0.997    0.995         2.76     428126
  8    200     0.996    0.995         0.25     989748
  8    400     0.998    0.998         0.75     3851200
  16   100     0.999    0.998         2.75     1099573
  16   200     0.998    0.995         0.75     5843789
  32   100     1.000    0.999         1.5      3391153
  32   200     0.998    0.996         1.75     37011437

5.3 Summary of Results
The DGR algorithm has clear advantages over the existing algorithms. We first showed that it outperforms the best known distributed algorithm for the replica placement problem in the majority of the cases. The experiments show that DGR produces higher quality results as the problem size increases. The execution time and the communication overhead of DGR are lower than those of LWY. We showed that DGR performs far better than LWY in cases with constrained server capacities, equally active servers with high variability between their object request rates, and objects of comparable importance. Finally, we showed that DGR also performs close to the centralized Aε-Star algorithm, which was shown to outperform many current algorithms in [13]. DGR is much faster than the Aε-Star algorithm and thus more suitable for practical implementation.

6 CONCLUSION

A distributed replication group helps create a large replication storage by combining server caches and coordinating replication decisions. The efficiency of the group depends on how effectively the servers can place the replicas to minimize the overall object access cost. Therefore, an efficient distributed algorithm with minimum overhead is highly desirable in this setting. We designed a distributed approximation algorithm for the replica placement problem. We showed that the proposed algorithm runs in polynomial time and has a polynomial communication overhead. We also proved that the proposed algorithm is a 2-approximation algorithm. We compared by simulation the performance of our algorithm with that of the best-performing distributed algorithm known so far in the literature. The comparison results show that our algorithm performs better in at least 97.28% of all cases, yielding a gain in performance of up to 26.9%. We also showed that our algorithm scales very well in terms of performance and of computational and communication complexity, and we established that it is suitable for practical problem instances. Finally, we showed that the proposed algorithm performs within 1% of the best known centralized algorithm. Hence, we claim that DGR is a very good candidate for practical implementation in distributed replication groups.


In future work, we plan to implement the proposed algorithm in a real system and extend it to more general settings.

ACKNOWLEDGMENTS
This research was supported in part by NSF grant DGE0654014. A short version of this paper [32] was published in the Proc. of NCA 2009. The authors wish to express their thanks to the editor and the anonymous referees for their helpful and constructive suggestions, which considerably improved the quality of the paper.

REFERENCES

[1] N. Laoutaris, O. Telelis, V. Zissimopoulos, and I. Stavrakakis, “Distributed selfish replication,” IEEE Trans. Parallel Distrib. Syst., vol. 17, no. 12, pp. 1401–1413, 2006.
[2] C. Chekuri and S. Khanna, “A PTAS for the multiple knapsack problem,” in Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 213–222.
[3] D. B. Shmoys and E. Tardos, “An approximation algorithm for the generalized assignment problem,” Mathematical Programming, vol. 62, no. 3, pp. 461–474, 1993.
[4] T. Moscibroda and R. Wattenhofer, “Facility location: distributed approximation,” in Proc. 24th Ann. ACM Symp. Principles of Distributed Computing, 2005, pp. 108–117.
[5] D. P. Bertsekas and D. A. Castanon, “The auction algorithm for the transportation problem,” Annals of Operations Research, vol. 20, no. 1-4, pp. 67–96, 1989.
[6] D. Bertsekas, “A distributed algorithm for the assignment problem,” Laboratory for Information and Decision Systems Unpublished Report, M.I.T., 1979.
[7] A. Leff, J. L. Wolf, and P. S. Yu, “Replication algorithms in a remote caching architecture,” IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 11, pp. 1185–1204, 1993.
[8] M. J. Osborne, An Introduction to Game Theory. Oxford University Press, USA, 2003.
[9] B. Chun, K. Chaudhuri, H. Wee, M. Barreno, C. H. Papadimitriou, and J. Kubiatowicz, “Selfish caching in distributed systems: a game-theoretic analysis,” in Proc. 23rd Ann. ACM Symp. Principles of Distributed Computing, 2004, pp. 21–30.
[10] S. U. Khan and I. Ahmad, “Discriminatory algorithmic mechanism design based www content replication,” Informatica, vol. 31, no. 1, pp. 105–119, 2007.
[11] N. Laoutaris, G. Smaragdakis, A. Bestavros, I. Matta, and I. Stavrakakis, “Distributed selfish caching,” IEEE Trans. Parallel Distrib. Syst., vol. 18, no. 10, pp. 1361–1376, 2007.
[12] N. Laoutaris, V. Zissimopoulos, and I. Stavrakakis, “Joint object placement and node dimensioning for internet content distribution,” Information Processing Letters, vol. 89, no. 6, pp. 273–279, 2004.
[13] S. U. Khan and I. Ahmad, “Comparison and analysis of ten static heuristics-based internet data replication techniques,” J. Parallel and Distributed Computing, vol. 68, no. 2, pp. 113–136, 2008.
[14] P. Krishnan, D. Raz, and Y. Shavitt, “The cache location problem,” IEEE/ACM Trans. Networking, vol. 8, pp. 568–582, 2000.
[15] L. Qiu, V. N. Padmanabhan, and G. M. Voelker, “On the placement of web server replicas,” in Proc. 20th Ann. IEEE Conf. on Computer Communications, 2001, pp. 1587–1596.
[16] S. Bakiras, T. Loukopoulos, D. Papadias, and I. Ahmad, “Adaptive schemes for distributed web caching,” J. Parallel and Distributed Computing, vol. 65, no. 12, pp. 1483–1496, 2005.
[17] B. Tang, H. Gupta, and S. R. Das, “Benefit-based data caching in ad hoc networks,” IEEE Trans. Mobile Computing, vol. 7, no. 3, pp. 289–304, 2008.
[18] C. Kumar and J. B. Norris, “A new approach for a proxy-level web caching mechanism,” Decision Support Systems, vol. 46, no. 1, pp. 52–60, 2008.
[19] M. Rabinovich, J. Chase, and S. Gadde, “Not all hits are created equal: cooperative proxy caching over a wide-area network,” Computer Networks and ISDN Systems, vol. 30, no. 22-23, pp. 2253–2259, 1998.
[20] I. D. Baev and R. Rajaraman, “Approximation algorithms for data placement in arbitrary networks,” in Proc. 12th Ann. ACM-SIAM Symp. Discrete Algorithms, 2001, pp. 661–670.


[21] I. Baev, R. Rajaraman, and C. Swamy, “Approximation algorithms for data placement problems,” SIAM J. Computing, vol. 38, no. 4, pp. 1411–1429, 2008.
[22] U. Čibej, B. Slivnik, and B. Robič, “The complexity of static data replication in data grids,” Parallel Computing, vol. 31, no. 8-9, pp. 900–912, 2005.
[23] J. Zhou, Y. Wang, and S. Li, An Optimistic Replication Algorithm to Improve Consistency for Massive Data, ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2005, pp. 713–718.
[24] Z. Begic, M. Bolic, and H. Bajric, “Centralized versus distributed replication model for multicast replication,” in Proc. 49th Int’l Symp. ELMAR, 2007, pp. 187–191.
[25] B. Cai, C. Xie, and G. Zhu, “EDRFS: An effective distributed replication file system for small-file and data-intensive application,” in Proc. 2nd Int’l Conf. on Communication Systems Software and Middleware, 2007, pp. 1–7.
[26] R. T. Hurley and B. Y. Li, “A performance investigation of web caching architectures,” in Proc. 2008 C3S2E Conf., 2008, pp. 205–213.
[27] S. Sulaiman, S. M. H. Shamsuddin, F. Forkan, and A. Abraham, “Intelligent web caching using neurocomputing and particle swarm optimization algorithm,” in Proc. 2nd Asia Int’l Conf. on Modelling and Simulation, 2008, pp. 642–647.
[28] V. Stantchev and M. Malek, “Addressing web service performance by replication at the operating system level,” in Proc. 3rd Int’l Conf. on Internet and Web Applications and Services, 2008, pp. 696–701.
[29] G. Pierre and M. van Steen, “Globule: A collaborative content delivery network,” IEEE Commun. Mag., vol. 44, pp. 127–133, 2006.
[30] A. Grama, G. Karypis, V. Kumar, and A. Gupta, Introduction to Parallel Computing (2nd Edition). Addison Wesley, 2003, ch. 4.
[31] P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the heuristic determination of minimum cost paths,” IEEE Trans. Syst. Sci. Cybernetics, vol. 4, no. 2, pp. 100–107, 1968.
[32] S. Zaman and D. Grosu, “A distributed algorithm for web content replication,” in Proc. 8th IEEE Int’l Symp. on Network Computing and Applications, 2009, pp. 284–287.

Sharrukh Zaman received his Bachelor’s degree in Computer Science and Engineering from Bangladesh University of Engineering and Technology, Dhaka, Bangladesh. He is currently a PhD candidate in the Department of Computer Science, Wayne State University, Detroit, Michigan. His research interests include distributed systems, game theory and mechanism design. He is a student member of the IEEE.

Daniel Grosu received the Diploma in engineering (automatic control and industrial informatics) from the Technical University of Iaşi, Romania, in 1994 and the MSc and PhD degrees in computer science from the University of Texas at San Antonio in 2002 and 2003, respectively. Currently, he is an associate professor in the Department of Computer Science, Wayne State University, Detroit. His research interests include distributed systems and algorithms, resource allocation, computer security and topics at the border of computer science, game theory and economics. He has published more than sixty peer-reviewed papers in the above areas. He has served on the program and steering committees of several international meetings in parallel and distributed computing. He is a member of the ACM and a senior member of the IEEE and the IEEE Computer Society.
