CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2017; 00:1–30 Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cpe

LAR: Locality-Aware Reconstruction for Erasure-Coded Distributed Storage Systems

Fangliang Xu1,2, Yijie Wang1,2,∗†, Xiaoqiang Pei1,2, Xingkong Ma1,2

1 National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, Hunan, P. R. China, 410073
2 College of Computer, National University of Defense Technology, Changsha, Hunan, P. R. China, 410073

SUMMARY

Many modern distributed storage systems adopt erasure coding to protect data from frequent server failures for cost reasons. Reconstructing the data of failed servers efficiently is vital to these erasure-coded storage systems. To this end, tree-structured reconstruction mechanisms, where blocks are transmitted and combined through a reconstruction tree, have been proposed. However, existing tree-structured reconstruction mechanisms build reconstruction trees from the perspective of the available network bandwidths between servers, which are fluctuating and difficult to measure. Besides, these reconstruction mechanisms cannot reduce data transmission. In this study, we overcome these limitations by proposing LAR, a locality-aware tree-structured reconstruction mechanism. LAR builds reconstruction trees from the perspective of data locality, which is stable and easy to obtain. More importantly, by building reconstruction trees that combine blocks closer to each other first, LAR reduces the data transmitted through the network core and hence speeds up reconstruction. We prove that a minimum spanning tree is an optimal reconstruction tree that minimizes core bandwidth usage. We also design and implement a general reconstruction framework that supports all tree-structured reconstruction mechanisms and nearly all erasure codes. Large-scale simulations on commonly deployed network topologies show that LAR consumes 20%–61% less core bandwidth than previous reconstruction mechanisms. Thorough experiments on a testbed consisting of 40 physical servers show that LAR improves proactive recovery throughput by at least 23% and improves degraded read rate by up to 68%. Copyright © 2017 John Wiley & Sons, Ltd.

KEY WORDS: Distributed Storage System, Erasure Coding, Data Reconstruction, Locality-Aware

1. INTRODUCTION

Modern large-scale distributed storage systems often consist of thousands or even tens of thousands of servers [1, 2]. Frequent server failures witnessed by these storage systems [1, 3, 4] make introducing data redundancy essential to ensure data reliability. Replication [1, 5], the commonly used technology for generating redundancy, incurs storage overheads too high to be acceptable for petabytes or exabytes of storage [6, 7]. For this reason, large-scale distributed storage systems are progressively turning to erasure coding [8, 9] for generating redundancy, which costs much less storage and can offer the same or even higher data reliability [10, 3] at the same time. Such distributed storage systems include Microsoft Azure [11], Facebook HDFS [4], Google Colossus [12] and HDFS in Hadoop 3.0 [13].

∗ Correspondence to: Yijie Wang, National Laboratory for Parallel and Distributed Processing, College of Computer, National University of Defense Technology, Changsha, Hunan, P. R. China, 410073
† Email: [email protected]


When examining erasure-coded distributed storage systems, data reconstruction always draws the most attention. Broadly speaking, data reconstruction is performed in two cases. One is proactive recovery, where blocks in failed devices are reconstructed by a background maintenance task to maintain a certain redundancy level. The recovery time of failed blocks largely determines the system's data reliability once the erasure code in use is given [3, 14]. The other case is degraded read, where failed data is requested by normal applications and must be reconstructed immediately. Degraded read has a great impact on read performance in the presence of device failures. In both cases, reconstruction is supposed to complete fast. Unfortunately, reconstructing erasure-coded data is drastically inefficient, as it requires multiple blocks to be transmitted for each failed block. What is worse, the high-volume data transmission incurred by data reconstruction occupies a large portion of network resources [4]. This seriously impairs co-running normal workloads.

Previous work on addressing the reconstruction problem mainly focuses on new erasure codes (e.g., LRCs [4] and SHEC [15]) that require less data for reconstruction, where the improvement often comes with higher storage overheads. Recently, reconstruction mechanisms that can further improve reconstruction for some existing erasure codes have emerged. Among them is a tree-structured reconstruction schema [16], where blocks are transmitted through and combined on the intermediate servers of a reconstruction tree, defined as a spanning tree covering all servers that participate in the reconstruction. Existing reconstruction mechanisms [16, 17, 18] based on the tree-structured schema all attempt to speed up reconstruction by exploiting the heterogeneous available network bandwidths between servers. However, their help is limited in practice due to a few shared drawbacks. One is that they only redirect data transmission to links with higher available bandwidth but cannot reduce the overall data transmission. Therefore, they cannot improve the overall reconstruction throughput of proactive recovery, where many failed blocks are reconstructed concurrently to exploit all surviving servers [19]. Another drawback is that measuring the fluctuating available bandwidths [20, 21] takes time and incurs high data transmission overheads. This cost can completely mask any improvement such measurement might bring.

In this study, we make the tree-structured reconstruction schema practical and helpful by proposing LAR, a locality-aware tree-structured reconstruction mechanism. LAR builds reconstruction trees from a novel perspective, the data locality in practical systems. As we know, most network topologies of modern data centers show a tree-like hierarchy, and distances between servers vary. LAR uses network distance to measure data locality and generates the tree with the minimum total network distance as the reconstruction tree. The advantages of LAR are twofold. First, data locality is stable and easy to obtain. Therefore, building a reconstruction tree incurs very low overheads. Second and more important, awareness of data locality enables LAR to combine blocks that are closer to each other (e.g., blocks located in the same aggregation domain) first, so that the data transmitted through the network core is reduced.
Tree-like network topologies often have high oversubscription ratios [22], which means the aggregate bandwidth of all servers is many times that available at the highest layer (the network core). It is widely acknowledged that the network core is easily overloaded and often becomes the bottleneck [22, 20, 23]. Therefore, by reducing core bandwidth usage, LAR can speed up reconstruction and reduce the impairment of reconstruction on normal workloads.

The challenging part of LAR is to find a general and quick way to determine the proper reconstruction tree, as practical network topologies are complicated and varied. We first propose to measure data locality with a metric called network distance, which is defined as the number of physical links contained in the path between two servers. The advantages of the network distance are its stability and generality; it can be obtained easily in all kinds of network topologies. With this, we prove that a spanning tree with the minimum total network distance is optimal in that it minimizes core bandwidth usage. For an erasure code against multiple server failures, there are multiple choices of providers for reconstruction. We also propose a way to quickly select the optimal providers, so that the core bandwidth usage of the optimal reconstruction tree can be further reduced. Besides, we offer a method to minimize the degree of an optimal reconstruction tree without increasing core bandwidth usage. This mitigates the TCP-Incast problem [24] and further speeds up data reconstruction, especially for degraded read.


Prior studies based on the tree-structured reconstruction schema are purely analytical because it is challenging to implement the tree-structured reconstruction schema. In this study, we design and implement a general tree-structured reconstruction framework in our Raid Distributed Storage System (RDFS) [25, 26], which is based on HDFS-RAID [27] and HDFS [2] and has been integrated with many of our novel ideas for tackling the problems facing erasure-coded distributed storage systems. The framework is easy to extend to support all tree-structured reconstruction mechanisms and all linear erasure codes, which represent most if not all existing erasure codes for distributed storage.

We evaluate LAR through both large-scale Monte Carlo simulations and extensive experiments on a testbed consisting of 40 physical servers. Simulations reveal that LAR consumes 23%–61% less core bandwidth than prior tree-structured reconstruction mechanisms on commonly deployed network topologies. Experiments show that LAR increases proactive recovery throughput by at least 23% and improves degraded read rate by up to 68%. More importantly, we observe a clear inversely proportional relation between reconstruction performance and core bandwidth usage. Putting the simulations and experiments together, we conclude that LAR is indeed able to speed up data reconstruction significantly in real-world systems.

We organize the remainder of this paper as follows. In Section 2, we provide the background necessary to understand our work. In Section 3, we introduce the tree-structured reconstruction schema and generalize it to all linear erasure codes. Section 4 delves into LAR. The implementation of LAR is shown in Section 5, following which the evaluation of LAR is provided in Section 6 and Section 7. In Section 8, we discuss related work. Section 9 presents the discussion. At last, we conclude this paper in Section 10.

2. BACKGROUND

In this section, we provide the background necessary to understand our work. We first introduce erasure coding and data reconstruction in distributed storage systems. We then briefly describe the commonly deployed large-scale network topologies.

2.1. Erasure Coding

Distributed storage systems provide scalability by scattering data across numerous storage servers. The common way to scatter data is to split original data objects (files, blobs, etc.) into data blocks of equal size (tens or hundreds of megabytes) and distribute them to different servers. For erasure coding, these data blocks are divided into a set of stripes. Each stripe comprises k data blocks and performs encoding and reconstruction as an independent unit. Generally, an erasure code can be denoted by a triple (n, k, k′), k ≤ k′ < n. For each stripe, an (n, k, k′)-code encodes the k data blocks and generates n coded blocks, guaranteeing that any k′ blocks out of the n are sufficient to decode the original k data blocks. When k′ = k, erasure codes can be denoted by (n, k) and are usually called Maximum Distance Separable (MDS) codes, which achieve the best trade-off between fault tolerance and storage overheads. The widely used RS codes [28] are examples of MDS codes. For good access performance, almost all erasure codes for storage systems are systematic, which means that the k data blocks remain unaltered and only n − k parity blocks are generated after encoding.

2.2. Data Reconstruction

Server failures happen so frequently in distributed storage systems that they have come to be considered the norm rather than the exception [1, 3, 4]. When failures occur, failed data has to be reconstructed. To reconstruct a failed erasure-coded block, multiple blocks must be transmitted from their residing servers, called providers, to another surviving server, called the newcomer, where the failed block will be regenerated and stored. As aforementioned, data reconstruction needs to be performed in two cases: proactive recovery and degraded read.
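To make the notions of stripes, providers, and the newcomer concrete, here is a toy sketch (not one of the codes evaluated in this paper) of the simplest (k + 1, k) MDS code: k data blocks protected by a single XOR parity block, where any one failed block is rebuilt as the XOR of the k surviving blocks. All names in the sketch are illustrative.

```python
# Toy (k + 1, k) erasure code: k data blocks plus one XOR parity block.
# Any single failed block equals the XOR of the k surviving blocks.

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

def encode(data_blocks):
    """Return the stripe: the k data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def reconstruct(stripe, failed_index):
    """Newcomer-side repair: rebuild the failed block from the survivors."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

k = 4
data = [bytes([i + 1]) * 8 for i in range(k)]   # four tiny 8-byte "blocks"
stripe = encode(data)                           # n = k + 1 coded blocks
assert reconstruct(stripe, 2) == stripe[2]      # recover a lost data block
assert reconstruct(stripe, k) == stripe[k]      # recover the lost parity block
```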


[Figure 1. A typical network topology for data centers [30]: a core layer of L3 core routers (CR) connected to the Internet, an aggregation layer of L3 access routers (AR) and L2 aggregation switches (AS), and an edge layer of L2 switches (S) and top-of-rack (ToR) switches connecting the servers. The figure also marks a single aggregation domain.]

2.2.1. Proactive Recovery

The primary goal of proactive recovery is to maintain the same level of redundancy; otherwise, failed data would eventually become permanently unavailable as more data fails. To reduce the vulnerability window, blocks in failed servers are reconstructed concurrently in practice to speed up proactive recovery. The concurrent nature of proactive recovery makes overall reconstruction throughput matter most. However, prior tree-structured reconstruction mechanisms cannot reduce the overall load of reconstruction. Therefore, they can hardly improve the overall reconstruction throughput of proactive recovery. By reducing the load at the network core, which is often the performance bottleneck, our locality-aware tree-structured reconstruction mechanism overcomes this limitation.

2.2.2. Degraded Read

Although degraded read shares the same basic reconstruction process with proactive recovery, the reconstruction in degraded read differs in two aspects. First, reconstruction in degraded read is quite sequential, in that very few failed blocks are requested concurrently. This makes the sequential reconstruction rate matter more. Second, only the requested part of a block, instead of the whole block, needs to be reconstructed. Therefore, degraded read is very sensitive to the cost of building the reconstruction tree, which is unfortunately very high in prior tree-structured reconstruction mechanisms.

Just like the tree-structured reconstruction schema, we focus only on reconstruction from a single failure. We emphasize that "single failure" here means not the case where only one server in a system has failed but the case where only one block in a stripe has failed. For load balance and incremental scalability, the coded blocks of each stripe are usually distributed on a different set of servers. Therefore, most failures are still single even when multiple servers fail. Studies on production systems show that reconstruction from a single failure accounts for more than 98% of reconstruction in proactive recovery [29, 7]. The reason why we do not consider reconstruction from multiple failures is that the better way is to reconstruct all failed blocks from the same stripe simultaneously with the same set of blocks [26], which cannot be done through the tree-structured reconstruction schema without incurring more data transmission. In degraded read, only the requested data needs to be reconstructed; therefore, all failures in degraded read can also be considered single.

2.3. Network Topology

Large-scale network topologies today show a tree-like hierarchy [22], consisting of a three-level tree of switches or routers, with a core layer (layer 1) at the root of the tree, an aggregation layer (layer 2)


in the middle and an edge layer (layer 3) at the leaves. Figure 1 shows a typical network topology for data centers. A widely known problem of such network topologies is serious oversubscription. Although much work has been done to mitigate this problem, commonly deployed networks today still have high oversubscription ratios (4:1–10:1) [22, 31] because of the extremely high cost of building non-oversubscribed networks. Studies [22, 20, 23] have shown that links in the network core are the most utilized compared with those at aggregation and ToR switches. Thus, they benefit the most from a reduction in usage. However, existing reconstruction mechanisms start large data transmissions without awareness of data locality. This makes most of the blocks transmitted for reconstruction flow through the network core, which overloads the core and thus impairs reconstruction as well as co-running normal workloads. By minimizing core bandwidth usage, our locality-aware reconstruction mitigates this problem.

2.4. Block Placement

In most modern distributed storage systems, e.g., HDFS-RAID [27] and HDFS in Hadoop 3 [13], the coded blocks of every stripe are distributed to a different set of servers. That is, n servers are randomly selected, based on certain rules, to store the n coded blocks of each stripe. In these systems, every server stores both data blocks and parity blocks, which mitigates the performance bottleneck. As the coded blocks of every stripe are stored on a different set of servers and stripes perform encoding and reconstruction independently, many stripes can be encoded and recovered simultaneously. This significantly improves encoding and recovery performance.

Recent HDFS distributions include a block placement policy that distributes 3 copies among 3 servers following a rack-aware placement strategy, and modern distributed storage systems often distribute the blocks within a stripe to as many different racks as possible to improve reliability. However, this does not mean that there is no aggregation-level locality to exploit. First, these block placement policies do not require that the blocks within a stripe be placed in a single aggregation domain. As shown in Figure 1, one aggregation domain contains many racks. These block placement policies only consider how blocks are distributed among racks, as racks often fail. From the view of aggregation domains, blocks within a stripe are placed randomly. Therefore, there is a big chance that the blocks within a stripe are spread over multiple aggregation domains and, at the same time, that multiple blocks within a stripe land in the same aggregation domain, as shown in Figure 2. Second, placing all blocks of a stripe in a single aggregation domain is not a good choice, as an aggregation domain can also fail because of power and network problems, making data unavailable. Therefore, it is better to place the blocks within a stripe across multiple aggregation domains.


Figure 2. An example of block placement.
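The claim that rack-level random placement usually spreads a stripe over several aggregation domains while still co-locating some of its blocks can be checked with a small Monte Carlo sketch; the topology parameters below are illustrative assumptions, not measurements from the paper.

```python
import random

def placement_simulation(stripe_blocks=9, domains=10, racks_per_domain=8,
                         trials=100_000, seed=1):
    """Place each stripe's blocks on distinct, uniformly chosen racks and
    report (i) how often a stripe spans more than one aggregation domain and
    (ii) how often at least two of its blocks share an aggregation domain."""
    random.seed(seed)
    racks = [(d, r) for d in range(domains) for r in range(racks_per_domain)]
    spans_many = shares_domain = 0
    for _ in range(trials):
        chosen = random.sample(racks, stripe_blocks)        # distinct racks
        used = {d for d, _ in chosen}
        spans_many += len(used) > 1
        shares_domain += len(used) < stripe_blocks
    print(f"stripes spanning multiple domains: {spans_many / trials:.3f}")
    print(f"stripes with co-located blocks:    {shares_domain / trials:.3f}")

placement_simulation()
```

Under these assumed parameters, essentially every stripe spans multiple domains and the vast majority also co-locate at least two blocks in one domain, which is the locality LAR exploits.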

3. TREE-STRUCTURED RECONSTRUCTION SCHEMA

In this section, we first introduce the tree-structured reconstruction schema [16] proposed for RS codes [28] (Section 3.1). We then discuss existing bandwidth-aware tree-structured reconstruction


(Section 3.2). Finally, we generalize the tree-structured reconstruction to all linear erasure codes (Section 3.3), which represent most if not all erasure codes for storage systems.

3.1. Overview

The conventional reconstruction is a star-structured scheme, where blocks are transmitted from the providers directly to the newcomer. The authors of [16] showed that reconstruction from a single failure for RS codes [28] can be done by computing a linear combination of blocks. Based on this, they proposed a tree-structured reconstruction schema, where blocks are transmitted from the providers to the newcomer through a reconstruction tree, defined as a spanning tree covering the newcomer and all the providers. During transmission, an intermediate server in the reconstruction tree first receives blocks from its child servers, then combines the received blocks with the block it stores, and finally sends the intermediate result to its parent server. Figure 3 illustrates an example of the tree-structured reconstruction schema. This tree-structured transmission scheme has recently also been used to optimize updates of erasure-coded data [32, 33].

[Figure 3 layout: the servers holding c1, c2 and c4 compute m1 = β1 c1, m2 = β2 c2 and m4 = β4 c4; the server holding c3 receives m1 and m2 and computes m3 = m1 + m2 + β3 c3; the newcomer receives m3 and m4 and computes c6 = m3 + m4.]

Figure 3. An example of the tree-structured reconstruction schema for (6, 4)-RS code, where the unavailable block c6 can be reconstructed by performing β1 c1 + β2 c2 + β3 c3 + β4 c4 . Each circle in this example represents a server.
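The following minimal sketch shows what each server in such a reconstruction tree computes: a provider scales its own block by its reconstruction coefficient, adds the partial results received from its children, and forwards the sum to its parent, while the newcomer only sums the partial results of its children. The tree encoding and names are illustrative, and the coefficient scaling is reduced to a keep/drop placeholder; real codes use the GF(2^q) arithmetic of Section 3.3.

```python
def xor(a, b):
    """Addition of blocks in GF(2^q) is byte-wise XOR."""
    return bytes(x ^ y for x, y in zip(a, b))

def scale(beta, block):
    """beta_i * c_i. Placeholder for GF(2^q) multiplication (Section 3.3);
    here beta is assumed to be 0 or 1, so scaling degenerates to keep/drop."""
    return block if beta else bytes(len(block))

def partial_result(server, children, beta, block_of):
    """What a provider sends to its parent: its own scaled block combined
    with the partial results received from its child providers."""
    m = scale(beta[server], block_of[server])
    for child in children.get(server, ()):
        m = xor(m, partial_result(child, children, beta, block_of))
    return m

def reconstruct_at_newcomer(newcomer, children, beta, block_of):
    """The newcomer holds no block of the stripe; it only sums the partial
    results arriving from its children in the reconstruction tree."""
    parts = [partial_result(c, children, beta, block_of)
             for c in children[newcomer]]
    result = parts[0]
    for p in parts[1:]:
        result = xor(result, p)
    return result
```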

3.2. Bandwidth-Aware Reconstruction

The challenging issue of the tree-structured reconstruction schema is to determine the appropriate reconstruction tree, as there exist many spanning trees given the newcomer and the providers. In existing reconstruction mechanisms [16, 17, 18] based on the tree-structured schema, reconstruction trees are built from the perspective of the heterogeneous available network bandwidths between servers. Specifically, bandwidth-aware reconstruction mechanisms denote the servers participating in a reconstruction job and the network that connects them as a complete weighted graph G, where each vertex denotes a participating server and the weight of an edge is the available network bandwidth between the two servers it connects. It was proved that, for each spanning tree T of graph G, the reconstruction time depends on the edge with the minimum weight, and that a maximum spanning tree of graph G is an optimal reconstruction tree.

Figure 4 illustrates an example of the bandwidth-aware reconstruction for a (6, 4)-RS code, where the unavailable block c6 can be reconstructed by performing β1 c1 + β2 c2 + β3 c3 + β4 c4 and Server3 is the newcomer. Figure 4(a) shows the graph denoting the involved servers and the network. Figure 4(b) shows the data flows in the reconstruction. Figure 4(c) shows the reconstruction tree of Figure 4(b), which is the maximum spanning tree of the graph shown in Figure 4(a). According to the theory, the reconstruction time depends only on the link between Server1 and Server6. If the amount of data to reconstruct is 1000 KB, it will take 1000/25 = 40 seconds.

While they are attractive theoretically, previous bandwidth-aware reconstruction mechanisms are impractical under real-world deployments. A key assumption in their theory is that when multiple servers send equal amounts of data to the same server simultaneously, the time depends only on the edge with the narrowest bandwidth. For example, suppose Server1 and Server2 both send 450 KB of data to Server3 simultaneously in Figure 4.


[Figure 4 panels: (a) Network Model — the complete weighted graph over Server1 (c1), Server2 (c2), Server3 (the newcomer), Server5 (c3) and Server6 (c4); (b) Data Flows; (c) Reconstruction Tree — the maximum spanning tree used for the transfer.]

Figure 4. An example of the bandwidth-aware tree-structured reconstruction for (6, 4)-RS code, where the unavailable block c6 can be reconstructed by performing β1 c1 + β2 c2 + β3 c3 + β4 c4 . The number on each edge is the available network bandwidth (in KB/s) between the two servers it connects.

In that case, the time depends on the link between Server1 and Server3 only, and it will take 450/15 = 30 seconds. This assumption does not hold unless there is an independent physical link between each pair of servers. However, this is not the case in real-world network topologies, where most links are shared by many servers. Therefore, previous bandwidth-aware reconstruction mechanisms cannot be applied to real-world deployments.

3.3. Generalization

The basis of the tree-structured reconstruction schema is the linear property of reconstruction, which has been proven in prior studies [16] for RS codes. In this study, we give a more general proof and show that all linear erasure codes have this property. By doing this, we generalize the tree-structured reconstruction schema to all linear erasure codes.

For linear erasure codes, each coded block can be generated by computing a linear combination of the data blocks. Mathematically, for an (n, k, k′)-code, let α_i = (α_{i,1}, α_{i,2}, ..., α_{i,k}) be the vector of coefficients for encoding the k data blocks {d_1, d_2, ..., d_k} into coded block c_i, 1 ≤ i ≤ n, 1 ≤ j ≤ k, α_{i,j} ∈ F_q, where F_q is the Galois field of size 2^q. Then, the coded blocks can be generated by

\[
c_i = \begin{pmatrix} \alpha_{i,1} & \alpha_{i,2} & \cdots & \alpha_{i,k} \end{pmatrix}
\times
\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_k \end{pmatrix},
\quad i = 1, 2, \cdots, n. \tag{1}
\]

Here, all operations are performed in F_q over units of q bits. The matrix G = (α_1 α_2 ··· α_n)^T is the generator matrix. For systematic codes, the first k rows of G compose the identity matrix I_{k×k}, thus c_i = d_i, i = 1, 2, ..., k.

Without loss of generality, suppose block c_n is lost and some k′ surviving coded blocks {c_1, c_2, ..., c_{k′}} are sufficient to decode the original k data blocks. For convenience, let G′ denote (α_1 α_2 ··· α_{k′})^T, C′ denote (c_1 c_2 ··· c_{k′})^T and D denote (d_1 d_2 ··· d_k)^T. According to Equation (1), we get G′ × D = C′. As the k′ coded blocks are sufficient to decode the k data blocks, there exists one and only one solution to the linear equations G′ × D = C′. Therefore, G′ must have a left inverse; let us denote it by G′^{-1}_L. Then we get G′^{-1}_L × G′ × D = G′^{-1}_L × C′, which is

\[
\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_k \end{pmatrix}
= G'^{-1}_{L} \times
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{k'} \end{pmatrix}. \tag{2}
\]


Substituting Equation (2) into Equation (1), we get c_n = α_n × G′^{-1}_L × C′. Let (β_1 β_2 ··· β_{k′}) denote the result of α_n × G′^{-1}_L. Then c_n can be reconstructed by performing

\[
c_n = \sum_{i=1}^{k'} \beta_i c_i. \tag{3}
\]

We call β_1, β_2, ..., β_{k′} the reconstruction coefficients.

Although the number of coded blocks required to decode the k data blocks is k′, far fewer blocks are often sufficient to reconstruct a failed block for many codes. Take the (10, 7, 5)-LRCs used in Facebook [4] as an example. Although up to eleven blocks are required to decode the original ten data blocks in some cases, five or four blocks are sufficient to reconstruct an unavailable block in most cases. In fact, reducing the number of blocks actually required for reconstruction is the primary goal of many recently proposed erasure codes. This does not contradict Equation (3), as it is equivalent to the case where some reconstruction coefficients in Equation (3) are zero. Without loss of generality, in this study we denote the reconstruction for an (n, k, k′)-code as

\[
c_n = \sum_{i=1}^{r} \beta_i c_i, \tag{4}
\]

where c_n is the block to reconstruct, c_1, c_2, ..., c_r are the blocks actually required for the reconstruction, and β_1, β_2, ..., β_r are the reconstruction coefficients corresponding to c_1, c_2, ..., c_r respectively, 1 ≤ r ≤ k′.
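As a concrete illustration of Equation (4), the sketch below evaluates c_n = Σ β_i c_i byte by byte over GF(2^8), using the common 0x11D reduction polynomial; the coefficients and blocks are illustrative placeholders rather than those of any particular code in the paper.

```python
GF_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, a common GF(2^8) reduction polynomial

def gf_mul(a, b):
    """Multiplication of two bytes in GF(2^8) by shift-and-add ("Russian
    peasant") with reduction modulo GF_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
        b >>= 1
    return result

def reconstruct(betas, blocks):
    """Equation (4): c_n = sum_i beta_i * c_i, applied independently to every
    byte of the r required blocks (addition in GF(2^8) is XOR)."""
    out = bytearray(len(blocks[0]))
    for beta, block in zip(betas, blocks):
        for i, byte in enumerate(block):
            out[i] ^= gf_mul(beta, byte)
    return bytes(out)
```

When every β_i equals 1, the sum degenerates to a plain XOR of the blocks, which is why single-parity reconstruction is simply the XOR of the survivors.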

4. LOCALITY-AWARE RECONSTRUCTION

Data reconstruction in erasure-coded distributed storage systems incurs heavy network traffic, as multiple blocks must be transmitted for each unavailable block. The primary drawback of previous reconstruction mechanisms is that they overlook the significant impact of data locality on performance. In this study, we take this into consideration and propose a locality-aware tree-structured reconstruction mechanism named LAR. LAR exploits data locality in large-scale networks to reduce the data flowing through the network core during reconstruction, which is widely acknowledged to be the performance bottleneck. The basic idea of LAR is to combine blocks that are closer to each other (e.g., in the same aggregation domain) first, and then combine the intermediate results with blocks farther away (e.g., in another aggregation domain). Thus, the number of blocks that flow through the higher layers of the network is reduced. Since the network core lies at the highest layer, the number of blocks flowing through the network core is also reduced.

4.1. Problem Statement

Given a reconstruction job described by Equation (4), the primary goal of LAR is to minimize core bandwidth usage, which is defined as follows.

Definition 1 (Core Bandwidth Usage)
Core bandwidth usage is the amount of data that flows through the core layer of the network topology for reconstruction, measured as the average number of blocks that flow through the network core for each unavailable block.

The challenging part of LAR is to find a general way to determine and build the optimal reconstruction tree quickly. Practical network topologies today are often very complicated for throughput and reliability reasons, although they still show a tree hierarchy. Therefore, it is difficult to determine in practice which blocks should be combined first and which second. In this study, we use the path lengths between servers to measure data locality and prove that a spanning tree with the minimum total path length is an optimal reconstruction tree.
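On a tree-like topology a transmitted block crosses the core exactly when its sender and receiver sit in different aggregation domains (see Section 2.3 and the proof of Theorem 1 below), so the core bandwidth usage of a concrete reconstruction tree can be counted directly. The sketch below is illustrative; it assumes the tree is given as child-to-parent pointers, and its example values mirror the motivating example of Section 4.2.

```python
def core_bandwidth_usage(parent, domain_of):
    """Blocks crossing the network core for one reconstruction tree: each tree
    edge carries one (intermediate) block, and it crosses the core exactly when
    its two endpoints lie in different aggregation domains."""
    return sum(1 for child, par in parent.items()
               if domain_of[child] != domain_of[par])

# Figure 5 example (Section 4.2): the star tree sends 4 blocks across the core,
# the locality-aware tree only 2 (Server1 -> Server3 and Server5 -> Server3).
domain = {"Server1": "AS1", "Server2": "AS1", "Server3": "AS2",
          "Server5": "AS3", "Server6": "AS3"}
star = {s: "Server3" for s in ("Server1", "Server2", "Server5", "Server6")}
lar = {"Server2": "Server1", "Server1": "Server3",
       "Server6": "Server5", "Server5": "Server3"}
assert core_bandwidth_usage(star, domain) == 4
assert core_bandwidth_usage(lar, domain) == 2
```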

Concurrency Computat.: Pract. Exper. (2017) DOI: 10.1002/cpe

LAR: LOCALITY-AWARE RECONSTRUCTION FOR ERASURE-CODED DISTRIBUTED STORAGE SYSTEMS9

4.2. A Motivating Example

Let us look at a specific example that illustrates how LAR works and how it reduces core bandwidth usage. Consider a distributed storage system on the simplified network topology shown in Figure 5, where Server4 has failed. We now need to reconstruct coded block c6, previously stored on Server4. Assume Server3 has been selected as the newcomer and coded blocks c1, c2, c3 and c4 are sufficient to reconstruct c6.

[Figure 5 panels. (a) Locality-Aware Reconstruction: Server2 computes m2 = β2 c2 and sends it to Server1; Server1 computes m1 = m2 + β1 c1; Server6 computes m4 = β4 c4 and sends it to Server5; Server5 computes m3 = m4 + β3 c3; the newcomer Server3 computes c6 = m1 + m3. (b) Star-Structured Reconstruction: all four providers send their blocks to Server3, which computes c6 = β1 c1 + ... + β4 c4. Server1 and Server2 are under AS1, Server3 and the failed Server4 under AS2, and Server5 and Server6 under AS3.]

Figure 5. Comparison of data flows in (a) Locality-Aware Reconstruction and (b) the conventional star-structured reconstruction mechanism. Each formula represents the computation performed by the participating server in the rack above it during reconstruction.

Figure 5(a) illustrates LAR. Since Server1 and Server2 are located in the same aggregation domain, Server2 can compute and send β2 c2 to Server1 first. Server1 combines β2 c2 with its block c1 and then sends the intermediate block m1 to Server3. Similarly, Server6 computes and sends β4 c4 to Server5, where it is combined with block c3 and then sent to Server3. Finally, Server3 obtains c6 by adding the intermediate blocks m1 and m3 together. Noting that an intermediate block is equal in size to a coded block, LAR reduces core bandwidth usage by 50%. We emphasize that LAR does not consume more bandwidth in the lower layers. While the traffic on the path between AS1 and Server1 and on the path between AS3 and Server5 is increased by one block, the traffic on the path between AS2 and Server3 is decreased by two. In fact, LAR merely balances the traffic in the lower layers.

As a comparison, Figure 5(b) illustrates the conventional star-structured reconstruction schema. In this schema, the four providers send their blocks directly to Server3, which collects all four blocks and completes the computation by itself. While it is straightforward both in concept and in implementation, this reconstruction schema has one major drawback: all four coded blocks are transmitted through the core layer of the network topology. As aforementioned, the core layer is often the performance bottleneck. The heavy traffic will congest the core layer, not only

Concurrency Computat.: Pract. Exper. (2017) DOI: 10.1002/cpe

10

F XU, ET AL.

constraining the highest recovery throughput that can be achieved but also seriously impairing co-running normal workloads.

Previous tree-structured reconstruction mechanisms also transmit blocks through trees. However, they build reconstruction trees according to the available network bandwidths between servers, which are known to be fluctuating and difficult to measure. Besides, these mechanisms cannot reduce core bandwidth usage compared with the conventional star-structured reconstruction mechanism.

4.3. Measuring the Data Locality

We propose to use the length of the path between servers to measure data locality. We refer to this metric as the network distance and define it formally as follows.

Definition 2 (Network Distance)
The network distance between server u and server v is the minimum number of physical links that data will flow through when transmitted between these two servers.

The physical links in Definition 2 include all links that connect servers and switching/routing elements such as hubs, switches and routers. For example, the minimum number of physical links between Server1 and Server3 in Figure 5 is 6. The advantages of the network distance are threefold. First, Definition 2 applies to all kinds of network topologies, not only tree-like networks like VL2 [30] but also other networks like Jellyfish [34]. Second, it measures data locality accurately. Finally, the network distance is very stable and can be obtained very easily. These advantages enable us to exploit all levels of locality and make it possible to find a way that works on all network topologies to determine the optimal reconstruction tree.

There are many approaches to obtaining the network distance at little cost. Location-related IP addresses of servers, the routing algorithms of network topologies, manually configured information, etc., all offer ways to compute the network distance quickly. Some systems such as HDFS [2] have even been integrated with corresponding APIs [35]. Even if all the approaches above are unavailable, we can still estimate network distances accurately from the network latencies between servers. Within a data center, the latency between two servers is roughly proportional to the network distance between them, especially when the network is lightly loaded. Measuring latencies between servers does take time. Fortunately, we only need to do it at long intervals (e.g., once a day when the system is least loaded) and store the results for later use, as the locations of servers rarely change.

The network distance offers us a very concise view of the complicated network topology. In this study, we denote the network that interconnects all the servers participating in a reconstruction job as an edge-weighted, undirected, complete graph G_r = (V, E, ω), where V = {v_0, v_1, ..., v_r} and E = {(v_i, v_j) | i, j = 0, 1, ..., r, i < j}. Here, vertex v_0 represents the newcomer, vertices v_1, v_2, ..., v_r represent the r providers, edge (v_i, v_j) represents the path between server v_i and server v_j, and ω(v_i, v_j) is the network distance between server v_i and server v_j.

4.4. Building the Reconstruction Tree

The concise view of the network topology provides us with a general and quick way to build the optimal reconstruction tree.

Theorem 1
Given the newcomer and the providers, a minimum spanning tree (MST) T_r of G_r is a reconstruction tree with minimum core bandwidth usage on a tree-like network.
Proof
All tree-like network topologies such as VL2 [30] and FatTree [36] can be modeled as hierarchy trees just like the one shown in Figure 5 [37]. On such a network, only data transmitted across aggregation domains consumes core bandwidth. As the amount of data transmitted on each edge of the reconstruction tree is uniform and equal to the size of a block, we only need to prove that an MST of G_r contains the minimum number of inter-aggregation-domain edges. Assume the (r + 1) servers of G_r are located in p aggregation domains, 1 < p ≤ r + 1. We know that at least (p − 1) inter-aggregation-domain edges are required to make a spanning tree of G_r. Therefore, Theorem 1 follows if we can prove that an MST of G_r contains only (p − 1) inter-aggregation-domain edges. We prove this by contradiction. Assume that there exists an MST of G_r that contains more than (p − 1) inter-aggregation-domain edges and denote it by T_r′. In this case, there must exist a cycle in T_r′ when we consider each aggregation domain as a large virtual server, as shown in Figure 6. Noting that there is no cycle in a tree, there must exist a disconnected subgraph of T_r′ that is contained in one aggregation domain. Let T and T′ be two independent subtrees contained in such a subgraph. As T is not connected with T′, T_r′ contains no edge between T and T′. However, the Cut Property of MSTs† tells us that an MST contains at least one of the edges crossing T and T′, because these edges are cheaper than any other edge that connects T to the rest of T_r′. Hence, T_r′ is not an MST. Theorem 1 follows.


Figure 6. An example of the case when a spanning tree of G_r contains more than p − 1 inter-aggregation-domain edges, where p is the number of aggregation domains contained in G_r.
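When each server's location is known as a hierarchical path such as /domain/rack/server (HDFS exposes similar topology strings through its rack-awareness APIs [35]), the network distance of Definition 2 can be derived directly. The sketch below assumes a simplified model in which every level above the lowest common ancestor contributes two physical links; this matches the Server1–Server3 distance of 6 in Figure 5, but real deployments may instead use routing information or measured latencies as discussed in Section 4.3. All names are illustrative.

```python
def network_distance(path_a, path_b):
    """Network distance (Definition 2) under a simplified hierarchical model:
    locations are paths like '/AS2/rack5/server17', and every level above the
    lowest common ancestor contributes two physical links (one up, one down)."""
    a, b = path_a.strip("/").split("/"), path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 2 * (len(a) - common)

def distance_matrix(locations):
    """Edge weights of the complete graph G_r over the newcomer and providers."""
    return {(u, v): network_distance(locations[u], locations[v])
            for u in locations for v in locations if u != v}

# Matches Figure 5: Server1 (domain AS1) and Server3 (domain AS2) are 6 links apart.
locs = {"Server1": "/AS1/rack1/Server1", "Server3": "/AS2/rack3/Server3"}
assert network_distance(locs["Server1"], locs["Server3"]) == 6
```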

Theorem 1 tells us not only what spanning tree is optimal but also how to build it. Given the network distances between the servers participating in the reconstruction, we can use any minimum spanning tree algorithm to build the optimal reconstruction tree.

4.5. Other Enhancements

4.5.1. Selecting the Providers

For some erasure codes against multiple server failures, there are multiple choices of providers for reconstruction from a single failure. The providers should be selected carefully, as these choices may differ in core bandwidth usage. Figure 7 shows such an example. In this example, any four blocks out of c1, c2, ..., c5, which are stored on Server1, Server2, Server5, Server6 and Server7 respectively, are sufficient to reconstruct the unavailable block c6 stored on the failed Server4. The best choice of providers is Server1, Server2, Server5 and Server6, in which case only two blocks flow through the core layer. With any other choice, three blocks flow through the core layer. However, if we built reconstruction trees for all choices and then selected the best one, it would take too much time, because there are too many choices for some erasure codes. For example, there are C(13, 10) = 286 choices for a (14, 10)-RS code in the single-failure case.

We propose a method that can quickly select the optimal providers. The basic idea is to select as many servers as possible that are located in the same aggregation domain as one another. To this end, we give each candidate provider a weight, which is the sum of its network distances to all the other candidate providers and the newcomer. In this way, the weight of a candidate provider that is located together with other candidate providers or the newcomer will be smaller than that of a candidate provider located alone in an aggregation domain. Moreover, the more candidate providers are located in the same aggregation domain, the smaller the weights of the candidate providers in that aggregation domain will be. Therefore, the r candidate providers with minimum weight are the optimal providers.

† For any cut C of a graph, if the weight of an edge in the cut-set of C is strictly smaller than the weights of all other edges in the cut-set of C, then this edge belongs to all MSTs of the graph.
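Returning to the tree construction of Section 4.4: given the pairwise network distances (e.g., from the distance_matrix sketch above), any minimum spanning tree algorithm yields an optimal reconstruction tree per Theorem 1. A minimal Prim-style sketch follows; the dictionary layout and names are assumptions of this sketch, and dist is expected to contain both orderings of each pair.

```python
def build_reconstruction_tree(newcomer, providers, dist):
    """Prim's algorithm over the complete graph G_r, grown from the newcomer.
    dist[(u, v)] is the network distance between u and v. Returns child ->
    parent pointers: every provider forwards its combined block to its parent,
    and providers attached directly to the newcomer send to the newcomer."""
    in_tree = {newcomer}
    parent = {}
    remaining = set(providers)
    while remaining:
        # Cheapest edge crossing the cut between the current tree and the rest.
        child, par = min(((v, p) for v in remaining for p in in_tree),
                         key=lambda edge: dist[edge])
        parent[child] = par
        in_tree.add(child)
        remaining.remove(child)
    return parent
```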


Figure 7. An example of the case when the providers should be selected carefully. Here, any four blocks out of c1 , c2 , ..., c5 , are sufficient to reconstruct the unavailable block c6 and Server3 has been selected as the newcomer. Therefore, any four out of Server1 , Server2 , Server5 , Server6 and Server7 can be used as providers. If the first four servers are selected, only two blocks will flow through the core layer. Otherwise, three blocks will flow through the core layer.

Algorithm 1 shows more details of this method. Take the reconstruction in Figure 7 as an example. The weights of Server1, Server2, Server5, Server6 and Server7 are 22, 22, 22, 22 and 24 respectively. This guarantees that Server1, Server2, Server5 and Server6 are selected.

Algorithm 1: selectProviders(newcomer, candidates, r, distances[][])
  Input:  newcomer: the newcomer.
          candidates: all the servers that can be used as providers.
          r: the number of providers to be selected.
          distances[][]: the network distances between the involved servers.
  Output: providers: the selected providers.

  providers ← ∅; selectedNumber ← 0;
  foreach p in candidates do
      weight(p) ← distances[newcomer][p] + Σ_{q ∈ candidates, q ≠ p} distances[q][p];
  while selectedNumber < r do
      let p be a minimum-weight server in candidates;
      add p to providers;
      delete p from candidates;
      selectedNumber ← selectedNumber + 1;
  return providers;
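A direct Python rendering of Algorithm 1 is given below; because the weights are fixed before the selection loop, repeatedly extracting the minimum-weight candidate is equivalent to sorting once. The nested-dictionary layout of distances is an assumption of this sketch.

```python
def select_providers(newcomer, candidates, r, distances):
    """Python rendering of Algorithm 1. distances[u][v] is the network
    distance between servers u and v; each candidate's weight is its total
    distance to the newcomer and to all other candidates, and the r
    lightest candidates are returned."""
    weight = {p: distances[newcomer][p] +
                 sum(distances[q][p] for q in candidates if q != p)
              for p in candidates}
    return sorted(candidates, key=weight.__getitem__)[:r]
```

With network distances matching Figure 7, the computed weights are 22, 22, 22, 22 and 24, so Server1, Server2, Server5 and Server6 are selected, as in the example above.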

4.5.2. Mitigating the TCP-Incast Problem

Besides reducing core bandwidth usage, reducing the degree of a reconstruction tree also matters a great deal to the tree-structured reconstruction schema. The degree of a tree is the maximum number of children of any server in the tree. For example, the degree of the tree shown in Figure 3 is two. A degree of λ means that one of the servers in the reconstruction tree needs to fetch data from λ child servers simultaneously. If the degree is large, there will be a large influx of messages, which leads to the TCP-Incast problem [24] and substantially reduces the overall transmission throughput of TCP. What is worse, the server with the most child servers becomes a bottleneck and lowers the sequential reconstruction rate, which relates directly to the performance of degraded read. Therefore, reducing the degree makes reconstruction more efficient, especially for degraded read.

However, common MST algorithms tend to generate trees with relatively large degrees. This is because they tend to connect many servers to the same one when the costs of the edges between these

Concurrency Computat.: Pract. Exper. (2017) DOI: 10.1002/cpe

LAR: LOCALITY-AWARE RECONSTRUCTION FOR ERASURE-CODED DISTRIBUTED STORAGE SYSTEMS13

servers are equal, which is often the case on tree-like network topologies. This problem can be solved by modifying common MST algorithms slightly. Algorithm 2 shows the algorithm we use to build reconstruction trees for LAR. It is a version of Prim's MST algorithm. The important point is that we intentionally use "≤" in Step 10 of Algorithm 2.
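Algorithm 2 itself is not reproduced in this excerpt, so the sketch below is only one plausible reading of the "≤" tie-break: a Prim-style construction that, among equally cheap attachment points, prefers the parent that currently has the fewest children, so that equal-cost ties no longer pile all children onto a single server. It is an illustrative interpretation, not the paper's exact Algorithm 2.

```python
def build_low_degree_tree(newcomer, providers, dist):
    """Prim-style construction that, when several attachment points have the
    same minimum edge weight, picks the parent with the fewest children so
    far, keeping the tree degree small without changing the total weight."""
    in_tree = {newcomer}
    parent = {}
    children_count = {newcomer: 0}
    remaining = set(providers)
    while remaining:
        best = None        # (weight, parent_degree, child, parent)
        for v in remaining:
            for p in in_tree:
                cand = (dist[(v, p)], children_count[p], v, p)
                if best is None or cand[:2] < best[:2]:
                    best = cand
        _, _, child, par = best
        parent[child] = par
        children_count[par] += 1
        children_count[child] = 0
        in_tree.add(child)
        remaining.remove(child)
    return parent
```

Because ties are broken only among edges of equal weight, the result is still a minimum spanning tree, so core bandwidth usage is unchanged while the maximum fan-in is reduced.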
