IEEE/CIC ICCC 2014 Symposium on Social Networks and Big Data
CRMS: a Centralized Replication Management Scheme for Cloud Storage System

Kangxian Huang, Dagang Li
School of Electronic and Computer Engineering, Peking University
Email: [email protected]

Yongyue Sun
Xunlei Networking Tech. Ltd
Abstract—As distributed storage clusters have been used more and more widely in recent years, data replication management, which is the key to data availability, has become a hot research topic. In storage clusters, internal network bandwidth is usually a scarce resource. Misplaced replicas may take up too much network bandwidth and greatly deteriorate the overall performance of the cluster. Aiming to reduce the internal network traffic and to improve load balancing of distributed storage clusters, we developed a centralized replication management scheme referred to as CRMS. A model is proposed to capture the relationships of block access probability, replica location and network traffic. Based on this model, the replica placement problem is formulated as a 0-1 programming optimization problem. Based on the feasible solution to this problem, a heuristic is proposed to process the replica adjustments step by step. Our CRMS is evaluated by using the access history from a distributed storage cluster of Xunlei Inc., one of the leading Internet companies in China. The experimental results show that CRMS can greatly reduce the amount of internal network bandwidth consumption, while keeping the cluster’s storage usage in balance.
I. INTRODUCTION

The demand for distributed storage clusters has been growing dramatically in recent years, especially in the form of cloud storage services. Rather than relying on a few central storage arrays, storage clusters consolidate a large number of geographically distributed servers into a single storage pool and provide a large-capacity, high-performance storage service at low cost in an unreliable and dynamic network environment [1]. Data replication divides files into multiple data blocks and replicates them across data nodes. This technique has been widely used for the great advantages it can bring. Firstly, it increases the data availability of distributed storage clusters. In a heterogeneous and large-scale cloud environment, failure is the norm rather than the exception [1]. With data replication, even if a data node fails, the blocks that it stores can still be accessed through other data nodes. Secondly, by keeping all replicas active, the additional copies can be used to improve load balance and overall performance if replicas and requests are reasonably distributed: a block can be accessed with high aggregate bandwidth by means of parallel data transfer. But replication also brings a challenging issue: how should the replicas be placed across the nodes?

In many papers, performance, data availability and fault tolerance are the main concerns of replication. However, misplaced replicas may cause unnecessary traffic that also affects overall performance. In a large and busy storage cluster, internal traffic is generated whenever a server accesses a block that is not stored locally. If the block is accessed frequently, the extensive traffic it generates can become a real issue. Placing replicas close to where they are accessed most frequently therefore saves a great deal of bandwidth. To do that, we propose CRMS, a centralized replication management scheme in which the placement of replicas is adjusted according to the measured user access pattern to reduce internal traffic.

The remainder of this paper is organized as follows. Section II provides an overview of related work. The design of the centralized replication management scheme (CRMS) is presented in Section III. Experimental results are discussed in Section IV. Section V concludes the paper with possible future work.

This work was supported by Shenzhen Basic Research Program JCYJ20130329175455606 and partly supported by the Shenzhen Engineering Lab of Three-dimensional Media Technology.

II. RELATED WORK
Data replication and placement was first studied in the context of the file assignment problem [2] and was shown to be a complex combinatorial optimization problem. Replica placement has received attention from diverse research areas, e.g., content delivery networks, web caching, web proxy services, distributed storage systems, etc. A thorough categorization of papers on replica placement can be found in [9]. In the distributed storage cluster context, replica placement is one of the main topics and various methods have been proposed. Tewari and Adam [13] proposed a method aiming to incorporate the behavior of consistency control algorithms into the replica placement problem; they take probabilistic reliability into account to ensure mutual consistency of replicated data. Scarlett [4] replicates blocks based on the access probability observed in the past. It computes a replication factor for each file and creates budget-limited replicas distributed among the cluster, with the goal of minimizing hotspots. Replicas are also aged to reserve space for new replicas. These methods are rather static in the sense that they assume the access patterns are estimated in advance and remain unchanged, so a one-time replica scheme is put in place and lasts for a long period. In such a case, the cost imposed by replica adjustment is amortized and can be ignored. However, the performance of these methods deteriorates when the access pattern changes over
time. In contrast, dynamic methods adapt the replicas in the cluster as frequently as upon every request, which is more responsive but comes at a higher system cost. One such dynamic method is DARE [5], which uses probabilistic sampling and a competitive aging algorithm independently at each node to determine the number of replicas to allocate for each file and the location of each replica, while trying not to consume extra network and computation resources. Based on data popularity, ERMS [6] increases the number of replicas for hot data and cleans up these extra replicas when the data cools down; its main purpose is to improve the reliability and performance of HDFS. CDRM [3] builds a cost model to capture the relationship between availability and replication factor. The model is used to find the lower bound on the number of replicas, which are then placed among the distributed nodes to minimize blocking probability. Dynamic methods can bring better performance on the criteria they target, but they also introduce non-trivial extra replica adjustment cost that may degrade the system's overall performance. Of course, if the replica adjustment cost is taken into consideration from the start, the overhead of the dynamic approach can be kept under control.

The methods mentioned above all try to estimate a suitable access pattern for the near future. Based on this access pattern, they focus on different criteria, including access latency [7], system availability [8], [13], minimizing resource hotspots [4], [6], or resource consumption [5]. In this paper, an access pattern is estimated too, but we focus on reducing internal traffic, because internal network bandwidth is a scarce resource, and for the data center studied in our research, file serving is the main job, which is affected by heavy internal traffic even more than in computing-centric data centers. Other papers have also considered traffic reduction, for example [12], but there the traffic is reduced at the price of storage imbalance, with data stored as much as possible on a small set of nodes in the cluster. In a real system, storage usage should be reasonably balanced to spread the load evenly across all the nodes, so that each node still has spare room for performance tuning and for sudden bursts in user demand. This is an important aspect considered in our method.

III. CENTRALIZED REPLICATION MANAGEMENT SCHEME
Our CRMS was developed with three goals in mind:

1) adjusting the location of replicas to reduce internal network bandwidth consumption effectively;
2) balancing the storage usage between data nodes;
3) adding little overhead to the cluster's service.

In this section, we first describe the background of this work, and then formulate it as a 0-1 integer programming (IP) problem. Based on the solution to this problem, a heuristic algorithm is used to carry out the appropriate adjustments.

A. Problem Description

The problem that we study comes from real-world services provided by Xunlei Inc., one of the leading Internet companies in China. These services mainly involve file downloading and other storage-related business. Xunlei uses many HDFS-like [10] clusters to support its services, which are distributed all around China geographically. Files are divided into blocks whose replicas are stored in these clusters. One such cluster is made up of a set of interconnected nodes. These nodes fall into two categories: nodes that not only store replicas but also answer user requests are full server nodes (S), while those that only store replicas are called non-server nodes (N). The replicas in a cluster are for a set of data blocks (K) and are normally of the same size in practice. In the discussion below, s and n represent a node from S and N respectively, and k is a block from K.

The Xunlei cluster in question has a double-star topology as shown in Fig. 1, composed of a number of server nodes (SN) and some non-server nodes (NN). All the nodes are interconnected by the central switch SW1 to form the first star, which is the internal network of the cluster. The server nodes are also interconnected by another switch SW2 to reach the Internet and serve external users. When a request for a data block is received, one of the server nodes is chosen at random to process it, for the purpose of load balancing. This server node checks whether it has a replica of the requested block. If so, the block is sent to the user directly and no internal network traffic is generated. Otherwise it has to retrieve the requested block from one of the nodes that do have a replica of it. In this topology, internal traffic is therefore mostly caused by this replica forwarding process.

Fig. 1. A double-star topology network
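To make this routing rule concrete, the following is a minimal simulation sketch that is not part of the paper: it assumes the placement is given as a dict mapping each node id to the set of block ids it stores, and counts one unit of internal traffic whenever the randomly chosen server node lacks a local replica.

```python
import random

def simulate_internal_traffic(requests, placement, servers, seed=0):
    """Count internal transfers under the double-star routing rule:
    each request is handled by a uniformly random server node, and a
    transfer from another node is needed only when that server has no
    local replica of the requested block."""
    rng = random.Random(seed)
    traffic = 0
    for block in requests:
        s = rng.choice(servers)            # random server for load balancing
        if block not in placement[s]:      # no local replica: fetch internally
            traffic += 1
    return traffic

# Toy example with two server nodes and one non-server node.
placement = {"s1": {"b1"}, "s2": {"b2"}, "n1": {"b1", "b2"}}
print(simulate_internal_traffic(["b1", "b2", "b1"], placement, ["s1", "s2"]))
```

Averaged over many requests drawn from the block popularities, the count produced this way corresponds to the expected internal traffic modeled by Equ. 1 in Section III-B.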
B. 0-1 Programming Model

The placement adjustment is carried out in epochs across the whole time span of the cluster's serving period. During each epoch, the access probability of each data block is estimated based on the user access pattern from the previous epoch. With these access probabilities we can calculate a target placement scheme for the replicas by solving a 0-1 IP problem, which is described below.
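As a minimal sketch of this estimation step (the paper does not specify the trace format, so the list-of-requested-block-ids input below is an assumption), the total request count c and the per-block probabilities p_k can be computed as follows:

```python
from collections import Counter

def estimate_access_pattern(access_log):
    """Derive the total request count c and the access probability p_k of
    every block from the previous epoch's access history.
    `access_log` is assumed to be an iterable of requested block ids,
    one entry per request."""
    counts = Counter(access_log)                    # requests per block
    c = sum(counts.values())                        # total number of requests
    p = {k: cnt / c for k, cnt in counts.items()}   # access probability p_k
    return c, p

# Example: block "b1" is requested twice as often as "b2" and "b3".
c, p = estimate_access_pattern(["b1", "b2", "b1", "b3"])  # c == 4, p["b1"] == 0.5
```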
1) Internal network bandwidth consumption: We can formulate the replica placement problem as a 0-1 IP problem, in which the replicas of the different data blocks are to be placed optimally among all the nodes of the storage cluster. The replica placement can be represented as a two-dimensional array store, with one row per node and one column per data block, so the sizes of the two dimensions are |S ∪ N| and |K|. When store_nk equals 1, the replica of block k is stored on node n; when store_nk equals 0, block k has no replica on node n. For each epoch, the total number of requests c over all blocks and the access probability p_k of each block k are calculated from the access history of the previous epoch. The total internal network traffic in this epoch is O_r as defined in Equ. 1, where the average number of requests for block k that is assigned to a given server node s is c p_k / |S|. When node s does not have a replica of block k, it has to retrieve the data from another node in the cluster at the cost of one unit of internal network traffic; otherwise it can respond to the request directly.

O_r = \sum_{s \in S} \sum_{k \in K} \frac{c\, p_k}{|S|} (1 - store_{sk})    (1)

Intuitively, we can reduce O_r by moving requested blocks onto the server nodes and moving non-requested ones away in exchange. However, the adjustment itself also causes internal network traffic when replicas are moved around. The adjustment to be carried out in each epoch can be represented by an array adjust, which records the differences between the old placement store and the target placement store' obtained from the solution to the IP problem, as shown in Equ. 2. adjust_nk can take three values. When adjust_nk equals 1, we call it a creating adjustment, as a replica of block k should appear on node n; when it equals -1, we call it a deleting adjustment, since the existing replica of block k should be removed from node n. When adjust_nk equals 0, no adjustment is needed for the replica of block k on node n, whether it exists or not.

adjust_{nk} = store'_{nk} - store_{nk}, \quad \forall n \in S \cup N,\ k \in K    (2)

With adjust_nk from Equ. 2 we can define the internal traffic caused by the adjustment itself as O_a, as in Equ. 3. A unit of internal traffic is generated only for creating adjustments; for deleting adjustments the affected replicas are simply removed.

O_a = \sum_{n \in S \cup N} \sum_{k \in K} \max\{0, adjust_{nk}\}    (3)

The objective of our IP problem is to minimize O_r + O_a, so that the total internal traffic after the adjustment is minimized, subject to two constraints: one on the storage usage of each node and one on the number of replicas of each data block, discussed below.

2) Node storage usage balancing: The balance of node storage usage is also an important aspect that we care about. The average storage usage over all nodes, C_s, and the storage space used at each node, C_s^n, are defined in Equ. 4 and Equ. 5. For storage usage balancing we constrain C_s^n in the IP problem as in Equ. 6, which ensures that the storage usage of each node never strays too far from the overall average. The fluctuation range is determined by the parameter θ; a smaller θ balances node storage usage more strictly.

C_s = \frac{\sum_{n \in S \cup N} \sum_{k \in K} store_{nk}}{|S \cup N|}    (4)

C_s^n = \sum_{k \in K} store_{nk}    (5)

(1 - \theta)\, C_s \le C_s^n \le (1 + \theta)\, C_s, \quad \forall n \in S \cup N    (6)

3) Number of replicas: In this paper the adjustment of the number of replicas of each data block is not included, since at Xunlei Inc. this is done separately in a different module; combining placement and replication-factor adjustment can be part of our future work. The number of replicas b_k of a data block k has to be maintained throughout each adjustment epoch, which is the second constraint of the IP problem, as shown in Equ. 7.

\sum_{n \in S \cup N} store_{nk} = b_k, \quad \forall k \in K    (7)

We use the 0-1 IP solver LocalSolver [11] to obtain a satisfactory feasible solution store'_nk. The adjustment can then be performed according to adjust_nk from Equ. 2, preferably at a time when the cluster is relatively lightly loaded, to alleviate the impact on the cluster's serving performance. However, adjusting to the full extent of adjust_nk might not be a good idea, because the adjusted replica placement may over-fit the access history, which will not exactly match what is seen in the current epoch. On the other hand, not all replica adjustment steps yield the same internal traffic reduction; some may not be worth the effort. Next we develop a heuristic adjustment algorithm to deal with these issues.
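For illustration only, the same 0-1 model can also be written down with an off-the-shelf MIP toolkit. The sketch below uses PuLP rather than LocalSolver (which the paper actually uses), assumes store, p and b are plain Python dicts keyed by node and block ids, and linearizes the max in Equ. 3 with auxiliary variables; it is a sketch under these assumptions, not the authors' implementation.

```python
import pulp

def build_model(S, N, K, c, p, b, store, theta):
    """0-1 placement model of Equ. 1-7 (illustrative PuLP sketch)."""
    nodes = list(S) + list(N)
    prob = pulp.LpProblem("replica_placement", pulp.LpMinimize)
    # Decision variables: the target placement store'.
    x = pulp.LpVariable.dicts("store_new", (nodes, K), cat=pulp.LpBinary)
    # O_r (Equ. 1): expected serving traffic under the new placement.
    Or = pulp.lpSum(c * p[k] / len(S) * (1 - x[s][k]) for s in S for k in K)
    # O_a (Equ. 3): max{0, store' - store} linearized with aux variables a >= 0.
    a = pulp.LpVariable.dicts("create", (nodes, K), lowBound=0)
    for n in nodes:
        for k in K:
            prob += a[n][k] >= x[n][k] - store[n][k]
    Oa = pulp.lpSum(a[n][k] for n in nodes for k in K)
    prob += Or + Oa                        # objective: total internal traffic
    # Storage balancing (Equ. 4-6); C_s is a constant because Equ. 7 fixes b_k.
    Cs = sum(b[k] for k in K) / len(nodes)
    for n in nodes:
        usage = pulp.lpSum(x[n][k] for k in K)      # C_s^n (Equ. 5)
        prob += usage >= (1 - theta) * Cs
        prob += usage <= (1 + theta) * Cs
    # Fixed replication factor per block (Equ. 7).
    for k in K:
        prob += pulp.lpSum(x[n][k] for n in nodes) == b[k]
    return prob, x
```

Solving the resulting model with any MIP backend yields a feasible store'_nk, from which adjust_nk follows via Equ. 2; the results reported in this paper are obtained with LocalSolver.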
C. Heuristic Replica Adjustment Algorithm

For each block k, we define a quantity profit_k that characterizes the level of traffic reduction obtained if the adjustment is performed on this block. Based on this quantity, we can identify the blocks with the better adjustment reward and choose which ones are actually adjusted for the best overall result. profit_k is defined as the difference between revenue_k and cost_k, where revenue_k is the amount of traffic reduction brought by moving the replicas of block k to better locations, and cost_k is the traffic caused by the adjustment itself on block k.

profit_k = revenue_k - cost_k    (8)

revenue_k = -(O'_{r_k} - O_{r_k}) = \sum_{s \in S} \frac{c\, p_k}{|S|}\, adjust_{sk}    (9)

cost_k = \sum_{n \in S \cup N} \max\{adjust_{nk}, 0\}    (10)

As shown in Equ. 9, revenue_k is defined as the difference between O_{r_k} and O'_{r_k}, which are the amounts of internal traffic for block k before and after it is adjusted, respectively.

O_{r_k} = \sum_{s \in S} \frac{c\, p_k}{|S|} (1 - store_{sk})    (11)

O'_{r_k} = \sum_{s \in S} \frac{c\, p_k}{|S|} (1 - store'_{sk})    (12)
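As a small sketch of how profit_k can be evaluated from the two placements (again assuming dict-of-dict 0/1 arrays, which is an assumption rather than the paper's actual data layout):

```python
def block_profit(k, S, nodes, c, p, store, store_new):
    """profit_k = revenue_k - cost_k (Equ. 8-10) for a single block k,
    given the current placement `store` and the IP target `store_new`."""
    adjust = {n: store_new[n][k] - store[n][k] for n in nodes}   # Equ. 2
    # Equ. 9: serving traffic saved on the server nodes by adjusting block k.
    revenue = sum(c * p[k] / len(S) * adjust[s] for s in S)
    # Equ. 10: one unit of adjustment traffic per creating adjustment.
    cost = sum(max(a, 0) for a in adjust.values())
    return revenue - cost
```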
The adjustment starts from the block with the highest profit. To make the process easier, all the blocks are reordered in descending order of profit_k. At the same time, the server nodes and the non-server nodes are also reordered, separately, in ascending order of their storage usage. The result is shown in Fig. 2.

Fig. 2. Four steps in one batch of adjustments

Balance of storage usage is always kept in mind during the adjustment: replicas are always moved away from more heavily loaded nodes and onto nodes with more spare space whenever possible. To do that, up to four adjustment steps are batched together as described below. In each batch, the block with the highest profit, k1, is considered first. We identify the creating adjustment of k1 on the server node n1 that has the largest spare space and perform it as step A, as shown in Fig. 2; this step brings a high profit. Then we try to identify a deleting adjustment B for block k1, which should be on the most heavily loaded non-server node. Step B makes room on a highly used storage node, and together with step A the number of replicas of block k1 stays unchanged. Next we try to find a deleting adjustment C on node n1 that incurs the smallest negative profit (moving a replica away from a server node actually moves it away from the users). Step C removes the replica of some block k2 from node n1, and together with step A the storage usage of server node n1 stays unchanged. The last step D is a creating adjustment for block k2 on the non-server node with the most spare space, so functionally steps C and D together move the replica of k2 to a lightly used storage node with the least internal traffic penalty.

After this batch of four steps is processed, the profit vector is updated and the next batch of adjustments can be identified and performed. The process is repeated until a satisfactory result is reached. Since the user access pattern of Xunlei's cluster approximately follows a power law, performing only a small number of high-profit adjustments achieves a large proportion of the overall profit, while skipping the bulk of small-profit adjustments avoids the danger of over-fitting and at the same time reduces the adjusting overhead. The stop criterion can be, for example, a threshold on the overall profit or on the proportion of adjusted blocks, whichever suits the actual need. The pseudo code of the heuristic is shown in Fig. 3.

Fig. 3. Pseudo code for replica adjustment heuristic
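The paper's actual pseudo code is given in Fig. 3 and is not reproduced here; the following is a simplified, assumption-laden sketch of the batched A-D steps. It assumes profit maps block ids to their current profit, placement maps node ids to sets of block ids, and usage maps node ids to replica counts, and it omits the re-sorting, profit updates and tie-breaking of the real algorithm.

```python
def run_batches(profit, placement, usage, S, N, stop_fraction=0.1):
    """Simplified sketch of the four-step batched adjustment heuristic."""
    ranked = sorted(profit, key=profit.get, reverse=True)   # highest profit first
    for k1 in ranked[:max(1, int(stop_fraction * len(ranked)))]:
        # Step A: create a replica of k1 on the server node with most spare space.
        n1 = min((s for s in S if k1 not in placement[s]),
                 key=usage.get, default=None)
        if n1 is None:
            continue
        placement[n1].add(k1)
        usage[n1] += 1
        # Step B: delete k1 from the most heavily loaded non-server node holding it.
        nb = max((n for n in N if k1 in placement[n]), key=usage.get, default=None)
        if nb is not None:
            placement[nb].remove(k1)
            usage[nb] -= 1
        # Step C: delete from n1 the block whose removal costs the least profit.
        k2 = min((k for k in placement[n1] if k != k1),
                 key=lambda k: profit.get(k, 0.0), default=None)
        if k2 is None:
            continue
        placement[n1].remove(k2)
        usage[n1] -= 1
        # Step D: recreate k2 on the non-server node with the most spare space.
        nd = min((n for n in N if k2 not in placement[n]),
                 key=usage.get, default=None)
        if nd is not None:
            placement[nd].add(k2)
            usage[nd] += 1
    return placement
```

Stopping after a fraction of the highest-profit blocks (stop_fraction above) corresponds to the stop criterion on the proportion of adjusted blocks discussed in the previous paragraph.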
IV. EXPERIMENTAL RESULTS

The cluster in our experiments is composed of 49 identical nodes with roughly the same storage space, interconnected in the double-star topology discussed before. Among them there are 20 server nodes and 29 non-server nodes. The access pattern is calculated from the access history collected by Xunlei Inc. from a working cluster during the period of January 4, 2014 to January 10, 2014. The adjustment epoch is set to a full day. According to the access history, the number of requests per block indeed approximately matches a power-law distribution: the most popular data block was requested 1,340,746 times, while the least popular blocks were not requested at all. We derive the location and the number of replicas of each block from the access history. For simplicity, in our experiments we only consider blocks that are actually requested during the period. The minimum number of replicas of a block is 1 and the maximum is 7. Most blocks have 2 replicas; these amount to 1,677,792 blocks, or 64.9% of the total. The blocks with just one replica number 885,468, or 34.2% of the total. The remaining 23,049 blocks have more than 2 replicas, making up the remaining 0.9%.

Fig. 4 shows the comparison between our heuristic algorithm and a greedy approach similar to [12], in which the highest-profit adjustment is always chosen at each step. In both approaches, the total cumulative profit in internal traffic reduction and the variance of storage usage across the 49 nodes converge to the optimal solution of the IP problem once all the adjustments are finished. However, the dynamics along the adjustment process are very different: since the greedy approach tries to maximize the profit gain at each step and cares nothing about storage balance, its cumulative profit grows faster in the beginning at the cost of a much higher variance growth; only after the profit reaches its highest point do the subsequent adjustments slowly tighten the variance, until it finally comes back into the range bounded by Equ. 6. In contrast, our heuristic algorithm keeps the variance low all the time, so the adjustment process can stop anywhere in the middle with a placement scheme that is still feasible. The ability to keep the replica placement feasible at all times, even when the adjustment process stops early, is crucial for skipping the small-profit adjustments to lower the overhead and to avoid the danger of over-fitting, as discussed at the end of Section III.

Fig. 4. The comparison between our heuristic and the greedy approach
Next we examine the performance of CRMS across a successive adjustment period of 7 days when it stops at different proportions of the total adjustments. The results of the heuristic stopping at 1% and at 1/10 of the adjustments, compared with those at full adjustment, are shown in the last two figures. Fig. 5 shows the internal traffic reduction each of them achieves relative to the original replica placement of Xunlei's cluster. Here we can see that with a 90% reduction in adjustment effort, about 50% of the gain can still be achieved across the board, and with just 1% of the effort we still get more than 10% of the gain in internal traffic reduction. Fig. 6 shows the variance achieved by CRMS as a percentage of that of the greedy approach. The variance difference is much larger when more adjustments are performed in each epoch, which also leads to a better traffic reduction gain as seen in Fig. 5. Moreover, the correlation of the access pattern between consecutive days is exploited better by CRMS than by the greedy approach to alleviate the imbalance of storage usage among the nodes of the cluster.

Fig. 5. The internal traffic reduction for the heuristic algorithm

Fig. 6. The variance ratio between the heuristic and the greedy approach

V. CONCLUSION AND FUTURE WORK

In a busy distributed storage cluster, misplaced replicas cause unnecessary internal traffic to move them to the serving nodes. In this paper, we design and implement CRMS, a centralized replication management scheme to address this problem. The problem is first formulated as a 0-1 integer programming problem, and a feasible placement scheme is obtained by solving this IP problem based on the access history collected in the cluster. In order to alleviate the overhead of adjustment and to avoid over-fitting, only a fraction of the replica adjustments is performed, according to a heuristic adjustment algorithm. CRMS batches 4 adjustment steps together to keep node storage usage in balance and stops adjusting when a satisfactory internal traffic reduction is reached. The experimental results show that CRMS meets our expectations and greatly reduces internal traffic compared to the unadjusted situation in a real-world cluster from Xunlei Inc. In the future, we plan to improve CRMS by also dynamically adjusting the number of replicas, so that popular data blocks can have more replicas at the server nodes to further reduce internal traffic.

REFERENCES

[1] S. Ghemawat, H. Gobioff, S. Leung, The Google File System, Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP 2003), pp. 29-43, 2003.
[2] L. Dowdy, D. Foster, Comparative Models of the File Assignment Problem, ACM Computing Surveys, 14(2), pp. 287-313, 1982.
[3] Q. Wei, B. Veeravalli, B. Gong, L. Zeng, D. Feng, CDRM: A Cost-effective Dynamic Replication Management Scheme for Cloud Storage Cluster, Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 10), pp. 188-196, 2010.
[4] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica, D. Harlan, E. Harris, Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters, Proceedings of the 6th Conference on Computer Systems (EuroSys), pp. 287-300, 2011.
[5] L. Abad, Y. Lu, R. Campbell, DARE: Adaptive Data Replication for Efficient Cluster Scheduling, Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 11), pp. 159-168, 2011.
[6] Z. Cheng, Z. Luan, Y. Meng, Y. Xu, D. Qian, A. Roy, N. Zhang, G. Guan, ERMS: An Elastic Replication Management System for HDFS, Proceedings of the IEEE International Conference on Cluster Computing Workshops (CLUSTER WORKSHOPS), pp. 32-40, 2012.
[7] L. Qiu, V. Padmanabhan, G. Voelker, On the Placement of Web Server Replicas, Proceedings of IEEE INFOCOM, pp. 1587-1596, 2001.
[8] H. Yu, A. Vahdat, Minimal Replication Cost for Availability, Proceedings of the 21st ACM Symposium on Principles of Distributed Computing (PODC), pp. 98-107, 2002.
[9] M. Karlsson, C. Karamanolis, Choosing Replica Placement Heuristics for Wide-Area Systems, Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS'04), pp. 350-359, 2004.
[10] K. Shvachko, H. Kuang, S. Radia, R. Chansler, The Hadoop Distributed File System, Proceedings of the IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST 10), pp. 1-10, 2010.
[11] T. Benoist, B. Estellon, F. Gardi, R. Megel, LocalSolver 1.x: A Black-Box Local-Search Solver for 0-1 Programming, 4OR: A Quarterly Journal of Operations Research, 9(3), pp. 299-316, 2011.
[12] T. Loukopoulos, P. Lampsas, I. Ahmad, Continuous Replica Placement Schemes in Distributed Systems, Proceedings of the 19th ACM International Conference on Supercomputing (ICS'05), 2005.
[13] R. Tewari, N. Adam, Distributed File Allocation with Consistency Constraints, Proceedings of the 12th International Conference on Distributed Computing Systems (ICDCS'92), 1992.