Minimum-Cost Coding for Distributed Storage Systems - DiVA portal

Minimum-Cost Coding for Distributed Storage Systems

MAJID GERAMI

Master’s Degree Project Stockholm, Sweden April 2011

XR-EE-KT 2011:001

Minimum-Cost Coding for Distributed Storage Systems

MAJID GERAMI

Master Thesis at Communication Theory Lab Supervisor: Ming Xiao Examiner: Mikael Skoglund,

TRITA XR-EE-KT 2011:001

iii

In Memory of Japan Earthquake 2011 Victims

iv Abstract In distributed storage systems reliability is achieved through redundant storage nodes distributed in the network. Then a data collector can recover source information even if some nodes fail. To maintain reliability, an autonomous and efficient protocol should be used to reconstruct the failed node. The repair process causes traffic in the network. Recent results in e.g., [1], [2] found the optimal traffic-storage tradeoff, and proposed regenerating codes to achieve the optimality. We investigate the link costs and the impact of network topologies during the repair process. We formulate the minimum cost repair problem in joint and decoupled methods. We investigate the required field size for the joint method. For the decoupled method, we show that the problem is linear for the linear cost. We further show that the cooperation of surviving nodes could efficiently exploit the network topology and reduce the repair cost. The numerical results in tandem, star and grid networks show the benefits of our methods in term of the repair cost.

v

Acknowledgment I am heartily thankful to my supervisor, Assistant Prof. Ming Xiao, whose encouragement, guidance and support from the initial to the final level enabled me to develop an understanding of the subject. I remember the time when I was hesitant of working on the subject of network coding before our first meeting and then I got interested more and more every day till now that I am eagerly following the topic. These all could not be happened without the Ming’s guidance. My deepest thanks to him for kindly and patiently correcting various documents of mine with attention and care. I would also like to thank Prof. Mikael Skoglund for his willingness to serve as an examiner and for allowing me to do this thesis with the Communication Theory group. It is a pleasure to thank those who made this thesis possible such as my wife (Mahsa), my mother in law, my father, and my family who give me the true love and moral support. I would like to take this opportunity to acknowledge Swedish government to offer me such a great chance to study in Sweden, without whose policy and generous I can not imagine I could be here and share the same education resource with the native students. I can not stop giving my thanks to KTH who supplies open and international atmosphere to accept every nation and color who show their personality in the equal stage.

Contents Contents

vi

1 Introduction 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Background and Related Works 2.1 Network Coding . . . . . . . . . . . 2.1.1 Min-Cut Max Flow Theorem 2.2 Distributed Storage Systems (DSS) . 2.2.1 Performance Metrics in DSS 2.2.2 Erasure Codes . . . . . . . . 2.2.3 Regenerating Codes . . . . . 2.3 Other Related Works . . . . . . . . .

1 1 2 2

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

3 3 3 4 6 6 8 10

3 Problem Formulation and Solution 3.1 Surviving Node Cooperation . . . . . . . . . . . . . . . . . 3.1.1 Motivation Example . . . . . . . . . . . . . . . . . 3.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . 3.2.1 Modified Information Flow Graph . . . . . . . . . 3.3 Joint Method . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Decoupled Method: . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Constraint Region . . . . . . . . . . . . . . . . . . 3.4.2 Convexity of Optimization Problem . . . . . . . . 3.4.3 LP Formulation . . . . . . . . . . . . . . . . . . . . 3.4.4 Augmented Form of Linear Optimization . . . . . 3.4.5 LP Formulation and Solution for the Toy Example

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

. . . . . . . . . . .

13 13 13 15 16 18 20 20 21 21 22 22

4 Numerical Results 4.1 Tandem Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Star Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Grid Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

27 27 28 29

vi

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

CONTENTS

vii

5 Conclusion and Future Works

33

Bibliography

35

6 Appendices 6.1 Appendix 6.2 Appendix 6.3 Appendix 6.4 Appendix 6.5 Appendix

A B C D E

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

39 39 41 42 44 46

Chapter 1

Introduction Increasing applications (e.g., video, VoIP, social networks) have caused the exponential growth of data and traffic [3] in Internet during the last decades. The requirement of QoS (quality of service) in e.g., fast and ubiquitous access and reliability poses high challenge in information storage. Furthermore, the limitation on read/write speed and network bandwidth renders the centralized storage system insufficient [4]. Meanwhile, the development of fast computing devices [3], [5] and pervasive network structures makes it possible to distribute the storage system/node among the network. The benefits of distributed storage systems include fast and ubiquitously access of stored data, high reliability, availability and scalability. These properties are especially suitable for Internet applications (e.g., Google Bigtable [6]), peer-to-peer networks, wireless sensor networks, cloud and mobile computing [4] etc. Reliability is usually obtained through redundant nodes in a distributed storage system, in which error control (EC) codes are normally used to increase storage efficiency. An EC code with the MDS (maximum distance separable) property is optimal in term of the redundancy-storage tradeoff. Yet, MDS codes may not be optimal considering the traffic for reconstructing failed nodes. References [1], [2] investigate the traffic (repair bandwidth) and propose the optimal storage-bandwidth tradeoff. A new class of erasure codes namely regenerating codes based on network coding ( [8], [9]) are proposed in [1], [2] to achieve the optimal tradeoff. To further decrease the repair traffic, multiple simultaneously-failed nodes [11] cooperate in the repair process. The exact reconstruction of failed nodes is studied in [12].

1.1

Motivation

Though references [1], [2], [11], [12] have well addressed the optimal storage-bandwidth tradeoff in distributed systems from different aspects, the link cost (transmission cost in a channel) and the impact of the network topology have not been considered. In a practical system, the cost is an important design consideration and different links (channels) of a network may have different cost. The topology of networks may 1

2

CHAPTER 1. INTRODUCTION

impact the repair schemes too. Recently, reference [10] considers the cost for the traffic from surviving nodes to the new node and then the cost-bandwidth tradeoff is derived. Yet, reference [10] does not consider the impact of the network topology on the cost, and the storage-cost tradeoff has not been studied either. Here we will study the link cost in the repair process. The impact of the network topology on repair will also be considered. Further, we will propose a surviving node cooperation (SNC) scheme to reduce the repair cost. For a general setting, we shall formulate the minimum-cost problem and study the solution for them. A related work is also found in [13], in which the minimum-cost multi-cast is studied in networks with network coding.

1.2

Thesis Contribution

Our contributions in this thesis are summarized as follows: • We bring the dimension of link cost into the repair problem. Links weight differently based on the physical or higher layer transmission cost so that repair process takes the cost into account. Consequently the repair coststorage tradeoff is derived. • We propose surviving node cooperation (SNC) in distributed storage systems built on multi-hop structures to reduce repair cost. Based on this idea, intermediate nodes are allowed to combine the received messages with their own stored data. • We introduce modified information flow graph and then formulate optimization problems in joint and decoupled ways. The joint-method refers to determining the optimum coding and subgraph (how much flow to put on each link [13]) jointly. However in the decoupled-method, these two problems are separated. Furthermore, the decoupled-method presents the constrains on the subgraph vector by cut analysis on the network which finally leads to a linear programming problem for linear cost.

1.3

Outline

The thesis is organized as follows. Chapter2 reviews fundamental results and theories in network coding and distributed storage systems. In Chapter 3, we state the problem and formulate optimization problems for finding the optimal-cost repair in joint and decoupled methods. Chapter 4 shows the advantages of the optimization in tandem, star, and grid networks. Finally, in Chapter 5, we conclude the thesis and suggest the road for future works.

Chapter 2

Background and Related Works In this chapter, we provide a brief discussion of the essential background needed throughout the thesis. This includes illustration of network coding, distributed storage systems and an application of network coding in distributed storage systems.

2.1

Network Coding

Network coding allows combining of the received data in intermediate nodes. Thus it can be seen as a generalized form of the conventional method of store and forward method. This elegant idea first time was introduced in satellite communication by Yeung and Zhang in 1999 ( [7]) and later has been completed by Ahlswede et. al in [8]. There, the authors noted the mixing of data in intermediate nodes as network coding and show that the maximum rate in theory of max-flow min-cut is achievable in multi-cast networks by using network coding.

2.1.1

Min-Cut Max Flow Theorem

Reference [8] showed that by using network coding max-flow min-cut bound is achievable in multi-cast networks-i.e the maximum rate is equal to the minimum cut of the network. A cut in a network is a subset of edges which divides nodes on ¯ in which one part contains source node (s) and the network to two parts of U and U the other part holds destination (t) so that every path from source to destination (t) use at-least one of the edges in the cut subset. The value of a cut is the sum of edges’ capacities from the source to the destination as shown in Figure 2.1. The minimum-cut is a cut with the minimum cut value. Later, reference [16] showed that linear network coding is sufficient to achieve multi-cast capacity. Subsequently, Koetter and Medard in [9] introduced an algebraic platform for network coding analysis. In continue, references [17], [18] brought random linear network coding into the subject and showed that a random selection of coding coefficients from a large finite field guarantees the multi-cast capacity achievement with a high probability. 3

4

CHAPTER 2. BACKGROUND AND RELATED WORKS

Figure 2.1. Cut analysis description.

Network coding can improve throughput, enhance networks’ power and bandwidth efficiency, lower complexity and boost network robustness [19]. Here an example of a network known as a butterfly network may clarify the advantage of network coding in increasing throughput of a multi-cast network [8]. Assume a source s is willing to transmit bits of information to destinations t1 and t2 where each edge has the capacity of one bit per time unit. As it is shown in the Figure 2.2 by using xor coding in the intermediate node each destination can receive 2 bits per time unit which is equal to the multi-cast capacity (min-cut). However, this rate can not be achieved by a routing method. Another advantage of network coding is in power efficiency especially in wireless systems. Network coding exploits the broadcast nature of wireless systems which can result in a better power efficiency. Consider the network shown in Figure 2.3 where nodes are using omnidirectional antennas. In this example, transmitting one message to destinations {t1 , t2 } costs 5 transmissions while by applying network coding it needs totally 9 transmissions for receiving two messages by destinations {t1 , t2 }. This benefits half a transmission less cost. Applying network coding on nodes simplifies the minimum-cost multi-cast problem. Finding the minimum-cost routing in multi-cast networks without using network coding leads to solving of a Steiner tree problem which is known to be NPhard [13], [19]. However, if network coding is used, it can be solved by a linear programming with polynomial-time complexity [13].

2.2

Distributed Storage Systems (DSS)

Demand for storage is increasing due to widespread application of email, photos and videos [5], [22]. In this situation saving information in a large disk is expensive and unreliable-i.e., if the disk fails all the data is lost. Besides, centralized storage system

2.2. DISTRIBUTED STORAGE SYSTEMS (DSS)

5

Figure 2.2. Network coding achieves multi-cast capacity. Figure (a) shows capacity of each edge of the network.(adapted from [8])

Figure 2.3. Network coding can reduce number of transmissions in wireless network. In Figure (a) 5 transmissions are needed for sending a message to destinations t1 , t2 whereas by network coding it is possible to transmit 2 messages by 9 transmissions. (adapted from [20])

leads to a large latency for read/write of data. Instead, if the data is distributed with redundant information among some (possibly cheap and unreliable) nodes, then if a node fails we still have enough information to regenerate the failed node or reconstruct the the original file. Therefore, distributed storage systems bring

6


about lower cost, faster access and higher reliability in comparison to centralized one. This is the key success of google Bigtable by distributing information in the Internet [6].

2.2.1

Performance Metrics in DSS

Conventionally there is the following metrics for evaluating performance of distributed storage systems: Reliability : Tolerance to storage node failure. Availability : Promptly access to the original file. Scalability : Flexibility to network dynamic; replacement or adding new nodes is easy. Storage efficiency : The amount of redundant information stored in the system. Rebuild time : The time is needed for reconstructing the failed node (regenerating new node). Encoding/Decoding complexity : The computational power which is needed for regenerating a new node or reconstructing the original file. Read/write latency : The delay in a read/write from/to a storage system. As a node fails, to keep reliability of the system the failed node should be reconstructed. Dimakis et. al. in [1] introduced a new criterion termed as repair bandwidth and then find the optimal repair bandwidth given the amount of storage per node. In addition, we define repair-cost as a new performance metric and try to minimize it during the repair process.

2.2.2

Erasure Codes

Reliability is usually obtained through redundant nodes in distributed storage systems. The simplest way of providing redundancy might be by repetition. In repetition, the source file of size M bits is divided to k parts and then nodes, containing M/k bits, may be repeated several times. This method is implemented in RAID1 -1 as shown in Figure 2.5. If l and p denote the number of repetition and the probability of individual node failure, respectively, then the probability of losing the source file is kpl . Another approach of using redundant nodes is by using erasure code. MDS2 codes distribute the original file of size M bits to n nodes in such a way that each 1 2

Redundant Array of Independent Disks Maximum Distance Separable


7

node stores M/k bits and any k nodes can rebuild the original file. By this way, the probability of losing the original file is: n X r=n−k+1

!

n r p (1 − p)n−r r

(2.1)

which is significantly lower than repetition code by the fixed amount of storage. MDS-codes are optimum in the redundancy-reliability tradeoff [1]. This code can be constructed by Reed-Solomon codes. RAID-5 has used erasure code for providing redundancy as shown in Figure 2.6.

Figure 2.4. The first distributed storage system is implemented as RAID-0.(adapted from [22])

Figure 2.5. Repetition is used in RAID-1 for providing reliability.(adapted from [22])

Although Reed-Solomon codes could achieve redundancy-reliability tradeoff, they are complicated in encoding and decoding. Over the last years, many studies have been done in creating simpler erasure codes [1], [3]. Reference [3] proposes parity array codes using the xor function which is termed as X-Erasure codes. LDPC codes is another code with lower encoding/decoding complexity. Another complication in MDS-codes arises when a node fails. An MDS-code needs to download all the original file for regenerating a failed node which has M/k bits. In other words, downloading M bits for constructing M/k bits looks inefficient from the amount of traffic posing on the network. In literature, bandwidth traffic for repair a failed node is termed as repair bandwidth. For couple of years, it was supposed this extra traffic for the repair in the MDS-code is the price of reliability.

8


Figure 2.6. MDS-code is used in RAID-5 for providing reliability.(adapted from [22])

However, Dimakis et. al. in their elegant work in [1] show this repair bandwidth can be reduced. The authors introduce a new class of erasure codes called regenerating code which are optimum from bandwidth traffic needed to be downloaded for the repair of a failed node.

2.2.3

Regenerating Codes

Regenerating code is a new class of erasure codes which is optimum in term of repair bandwidth traffic. It preserves the MDS-code property on which any k of n nodes can rebuild the original file. Dimakis et. al. in [1] proved the fundamental tradeoff between storage and repair bandwidth traffic. Through cut-based analysis in information flow graph they proved the existence of linear network coding for achieving the optimal trade-off but they have not discussed about the construction of the code. Later, Wu in [2] proposed a code construction by a method called pathweaving. In this section, we describe an information flow graph for a distributed storage system and then explain the fundamental ’storage-bandwidth tradeoff’. Information Flow Graph An information flow graph models a distributed storage system to a directed acyclic graph. As illustrated in [1], this graph consists of three kind of nodes: (1) a source node, denoted as (s) in Figure 3.2., contains the original file. (2) storage nodes which models by two nodes as input and output nodes (xiin ,xiout ) which are connected by a link with capacity equal to α (the amount of storage per each node). (3) Data Collector (DC) which can rebuild the original file by connecting to storage nodes assumably by links of infinite capacity. The information flow graph can evolve in time. At initial time, source (s) is active which distributes encoded information to storage nodes through links of infinite capacity. Then (s) gets inactive and storage nodes become active. Whenever a new node j joins the system, it connects to d active storage nodes and downloads β bits from each node by a link of capacity equal to β. The information flow graph related to (4,2) erasure code and d = 3 is depicted in Figure 2.7. A cut between source (s) and DC is a subset (C) of edges on which each path from s to DC has at-least has an edge in C. The minimum cut is the cut between


9

Cut

node 1 α X1 in

∞

X1 out

∞

node 2 X2 in

∞

α

β

X2 out

DC S

node 3

∞

X3 in

α

β X3 out

∞

β

∞

node 4 X4 in

α

X4 out

X5 in

α X5 out

node 5 Figure 2.7. Information flow graph for (4,2)-Erasure code

source (s) and DC on which the summation of the edge capacities is minimum. Min-cut analysis refers to the Max Flow-Min Cut theorem in multi-cast networks which states that the maximum achievable rate between a source and destinations equals to the minimum-cut in the network [8]. Satisfying min-cut constraints on the information flow graph guarantees the existence of linear network coding solution for the multi-cast problem [17]. Storage-Bandwidth Tradeoff Using notation in [1], in a distributed storage system with parameters (n, k, d, α, β) n denotes number of nodes, k is the least number of nodes for recovering the original file and each node stores α bits. When a node fails, the new node downloads β bits from d surviving nodes. For this set of parameters there is a family of information flow graphs regrading each node’s failure and repair [1]. By cut-based analysis through this graph the fundamental storage-bandwidth tradeoff is derived. This is re-stated here from the theorem 1 in [1]. Theorem 1 [1]: For any α ≥ α∗ (d, β), the points (n, k, d, α, β) are feasible, and linear network codes suffice to achieve them. It is information theoretically impossible to achieve with α ≤ α∗ (d, β). The threshold function with the source file of size M bits and γ = dβ is the following: ( ∗

α (d, β) = where f (i) ,

M k , M −g(i)γ k−i ,

γ[f (0), +∞) γ[f (i), f (i − 1))

2M d (2k − i − 1)i + 2k(d − k + 1)

and

(2d − 2k + i + 1)i 2d The optimal tradeoff curve for n=15, k=10 is depicted in Figure 2.8. g(i) ,

(2.2)

10


Figure 2.8. The optimal storage-bandwidth tradeoff for n=15, k=10, and d=14,12,10. ( [1])

2.3

Other Related Works

Many research studies in the last decades have justified the gain of cooperative communication e.g., in the efficient usage of resources like energy and bandwidth. The idea of fail node cooperation has been introduced in [11] so as to repair several nodes mutually where the number of simultaneous failed node is equal or greater than 2. In our approach surviving nodes cooperate in order to minimize the repairing cost. Therefore even with one node failure, it’s yet applicable. The exact reconstruction of a failed node has been studied ( [14], [15], [12]) recently which has benefits like simplifying recovery of the original file. However, hereafter in this thesis from repairing we aim functional repairing. Minimizing cost with the exact repairing can be followed in the future. In a functional repairing the reconstructed node is not exactly the same as before but the new node beside other nodes preserve the code’s property-i.e, any k of n nodes suffice for rebuilding the source file.

2.3. OTHER RELATED WORKS

11

Distributed storage systems is yet an interesting research topic. In a recent work in [23], the optimum storage allocation using a probabilistic optimization platform has been studied. In the other work, reference [21] by assuming that new node can flexibly download different size of data from surviving nodes introduces a Flexible Regenerating codes. Similarly, our study shows that to have an optimum repair cost the amount of data on the links might be different due to the link cost and network topology.

Chapter 3

Problem Formulation and Solution In this chapter, we study the optimal-cost repair problem in distributed storage systems. For this task, we define a new performance metric in distributed storage systems which is termed as repair-cost and then propose surviving node cooperation (SNC) to reduce the repair-cost. A toy example in section 3.1.1 illustrate how SNC works. In continue, we formulate optimization problems by joint and decoupled methods in the presence of SNC. By introducing Modified Information Flow Graph we model the optimal-cost repair problem to a minimum-cost multi-cast problem which can be solved by a linear programming [13].

3.1

Surviving Node Cooperation

In the SNC an intermediate node combines the received fragments with own stored fragments. For the analysis of system when network coding is applied, we use linear algebra methods illustrated by Koetter and Medard [9]. Coefficients in linear network coding can be chosen either determinately based on Jaggi et al. [18] or randomly from a large finite filed size [2], [26]. Generally, deterministic way requires larger finite field than random coding. In the SNC, all the available links are likely to adjoint in the repair process. However, to minimize cost, the use of a link and the amount of traffic on that link is determined by an optimization algorithm. In analysis, we form a vector on which its arrays represent the number of fragments on the corresponding links. We denote this vector as a subgraph [13].

3.1.1

Motivation Example

We first give an example to illustrate the idea. Consider a distributed storage system in a four-node tandem network shown in Figure 3.1. A file of 4 mega-bits (M = 4 mega-bits) is coded with a regenerating code [1] and distributed among 4 nodes (n = 4). Each node stores 2 one-mega-bit fragments (α = 2) and the source file can be recovered by any 2 nodes (k = 2). When a node fails (say node 4), a 13

14

CHAPTER 3. PROBLEM FORMULATION AND SOLUTION

Figure 3.1. Distributed storage in a tandem network. Dashed lines denote how network codewords are formed. Node 4 fails and node 5 is the new node. Cut

node 1 α X1 in

∞

X1 out

∞

node 2

∞

X2 in

α

β

X2 out

DC S

node 3

∞

X3 in

α

β X3 out

∞

β

∞

node 4 X4 in

α

X4 out

X5 in

α X5 out

node 5 Figure 3.2. Cut analysis. S denotes the source file.

new node downloads β fragments from each of 3 surviving nodes (d = 3). Here we follow the notation of the regenerating codes [1], [2], which are specified by a vector (n, k, α, d, β) denoting the number of storing nodes, the minimum number of nodes to recover the source file, the storage amount of each node, the number of surviving nodes used to repair a failed node and the amount downloaded from a surviving node, respectively. Then the distributed storage system can be represented with a directed acyclic graph known as information flow graph [1]. In this representation, as shown in Figure 3.2, there is a source file (is denoted as S) which connected to storage nodes through infinite capacity links. Each storage node is depicted by input (in) and output (out) nodes relating together by a link of storage node capacity (α). Also when a node fails, new node downloads β amount of data from existing nodes. DC as a data collector recovers the source file from storage nodes

3.2. PROBLEM STATEMENT

15

through (assumably) infinite capacity links. By cut analysis of the information flow graph in Figure 3.2, we can see that for a data collector (DC) to recover the source file, it requires α + 2β ≥ M . Since α = 2, M = 4, β ≥ 1, the new node must download at least β = 1 mega-bits from each surviving node. The optimal-traffic repair to achieve this lower bound is to download p1 , p2 , p3 as in Figure 3.1 from node 1, node 2 and node 3, respectively. Here p1 , p2 and p3 are formed by coding at nodes 1, 2 and 3, respectively. Yet, the analysis is different if we consider the link cost. We assume that each fragment of 1 mega-bits transmitted in a single channel costs one transmission unit. To reach the new node (node 5), we can easily see that p1 passes the route (node1 7→ node2 7→ node3 7→ node 5) with a cost of 3 units. Accordingly, p2 passes node 2 7→ node3 7→ node 5 with a cost of 2 units and p3 passes node3 7→ node 5 with a cost of 1 unit. Thus, the total cost in repair is 6 units. However, consider the repair scheme in Figure 3.3, where we allow surviving node cooperation (SNC). Here the cooperation means that one surviving node can combine/encode the codeword of another node. That is, surviving nodes are allowed to linearly combine its own fragment and the fragments of other nodes. For instance, at node 2, p2 and p3 are encoded from the received p1 of node 1 and stored fragments in node 2. At node 3, p4 and p5 are obtained by encoding p2 and p3 with the fragments of node 3. Note that, we only consider functional repair, in which the reconstructed node may not be identical to the failed node but it has the same code property. That is, with the new node 5, any 2 out of 4 nodes can rebuild the source. It is easy to see that the cost of repair is reduced to 5 units as shown in Figure 3.3 (only two packets transmitted from node 3 to node 5). Here we further note that SNC applies to the scenario of one or multiple node failure, but the failed-node cooperation in [11] should have multiple node failure.

3.2

Problem Statement

Above we consider the repair cost and propose SNC to reduce the cost of a specific network. Naturally, we may ask what is the optimal cost and how to achieve the optimality for a more general scenario. To study the problem of repair cost optimization for general networks, we define a cost matrix C on which each array represents the cost of direct transmission between two nodes. We denote that each element of the cost matrix just indicates the direct link cost between two nodes. Obviously, in a multi-hop structure where intermediate storage node relays other node fragments, from this matrix one can extract available pathes between any pair of nodes either directly or indirectly with possibly varied costs. Thus, optimization algorithm by using C has enough information about network topology, intermediate nodes and link costs. It may worth to note that in this work we only consider linear cost. The linearity of cost means if transmission of a fragment from node i to j, costs cij , then it costs m · cij for transmitting m fragments from node i to j.

16


Figure 3.3. Reconstruction by surviving node cooperation in a tandem network.

   C=   

0 c21 .. . cn1



c12 . . . c1n 0 . . . c2n   .. . . ..  . .  .  ... ... 0

where    Transmission cost between nodes i and j

cij =

 

0

if there is a direct connection i →j; if there is not a direct connection i 9j. or i=j

Considering network structure in the repair process of a failed node with a given cost matrix (C), we can express information flow in the repair process by a directed and acyclic graph which is termed as modified information flow graph.

3.2.1

Modified Information Flow Graph

Similar to the information flow graph in [1], every node is shown with (in) and (out) nodes connecting by a link of capacity equals to amount of nodes’ storage. Also, nodes are coupled to the source through infinite capacity links and DC can recover the original file by connecting to storage nodes. The only difference is in the repair process where new node gets the data through available links based on network topology, and the amount of this data can be changed. For instance, the modified information flow graph for a four node distributed storage system in tandem network regarding failed node 4 is depicted in Figure 3.4.

3.2. PROBLEM STATEMENT

17

node 1

α ∞

z(12)

∞

DC

node 2

α ∞

S

z(23)

∞ node 3

∞

α

∞

z(34) α node 4

Figure 3.4.

α

new node

Modified information flow graph

.

The graph is denoted by G = (N, A), where N is the set of nodes and A is the set of directed links. We use a vector to denote amount of fragments transmitted on the links. The vector is called as a subgraph [13]. For a subgraph (Z = [z(ij) ], where link (ij) ∈ A is from node i to node j), the repair cost is

σc ,

X

cij z(ij) .

(3.1)

(ij)∈A

Our design objective is to minimize σc during the repair process. To find the minimum-cost repair, we consider two different methods: (1) For a given distributed storage system, we find the minimum-cost regenerating codes defined in [1], [2]. In this way, we find jointly the subgraph and corresponding coding coefficients. We call this approach joint method. (2) Decoupling the minimum-cost problem into two steps: We first find the optimum-cost subgraph (without considering coding constraints) and then find network codes for the subgraph [13]. Therefore, we use modified information flow graph to achieve the optimum-cost solution. In decoupled method code construction in decoupled method We call this approach as a decoupled method. The motivation behind using joint method is complexity of the code construction in the decoupled method. In the joint-method, we initialize the code on nodes by applying regenerating code using the path-weaving approach in [2] which is independent of network topology and link costs. However in the decoupled method, the optimal-cost code construction even from the initialization step depends on the network topology and link costs. In what follows, we shall discuss them separately.

18

3.3


Joint Method

We consider a source file of size M fragments coded by (n, k, α, d, β) regenerating codes [1], [2]. The coding block of node i is denoted by a matrix Qi = [xi1 , · · · , xiα ] of size M × α where each column vector xil on the matrix denotes a fragment of node i. When a node fails (without loss of generality, we assume node 1 fails), the system seeks to find the minimum-cost regenerating codes to reconstruct node 0 1 as Q1 from d = n − 1 surviving nodes. The resulting codes should satisfy the (n, k, α, d, β) regenerating code property (RCP). Here the RCP refers to a property that in a distributed storage system coded by (n, k, α, d, β) regenerating codes, the source file can be recovered by any k out of n nodes. If a code satisfies the RCP, the corresponding subgraph is among the feasible set. The optimized solution is the subgraph in the feasible set with the minimum cost; therefore optimization problem can be expressed as: minimize σc 0 subject to: {Q1 , · · · , Qn } satisfies RCP. zij ≥0

(3.2)

We can verify if a code satisfies the RCP by the regenerating code rank test [2]: Definition 1 Regenerating code rank test [2]. Let vector c as a length-n vector c , [c1 , · · · , ck , 0, · · · , 0] where ci , min{(d − i + 1)β, α}. Also, assume a test vector h = (h1 , . . ., hn ), where each array has integer value in interval [0, α], and the vector c majorizes h (denoted as c < h)-vector P P a majorize b when, after ascend sorting of the arrays of a, b, ni ai = ni bi and Pk Pk 0 i ai ≥ i bi for all k ∈ {1, · · · , n}. A code with coded blocks Q1 , Q2 , . . . , Qn passes the rank test with a test vector h = (h1 , . . ., hn ) if [Q1 Eh1 , . . . , Qn Ehn ] has a full rank M , where Qi Ehi is the first hi columns of Qi . 0 To preserve the RCP it is required to det([Q1 Eh1 , . . . , Qn Ehn ]) 6= 0 for every n |c < h}. Thus, the optimization problem can be formulated as h ∈ H , {h ∈ Z+ minimize subject to

Q

σc det([Q1 Eh1 , . . . , Qn Ehn ]) 6= 0. z(ij) ≥0 0

h∈H

(3.3)

Now we discuss the existence of the solution for the problem (3.3). First we recall the approach in [1], [2] where in the repair process, a new node downloads β fragments from each surviving node (using e.g. random network coding as regenerating codes). With the received γ = dβ codewords, a new node is reconstructed. We denote this approach as the non-optimized method since the link costs are not optimized. Note that the optimality here means link costs rather than the storagetraffic tradeoff [1], [2]. In the joint-optimization method, SNC may be used to reduce the cost. Then some surviving nodes may encode the fragments of other surviving nodes. For instance, consider intermediate node i where it receives Y 1 (j, i) coded fragments from nodes j. Hence, Y r (i, p), the r−th transmitted fragment (note that

3.3. JOINT METHOD

19

in SNC, multiple fragments might be sent over a link) from node i to next node p, is calculated as: Y r (i, p) =

α X

i ζl,r xil + ρi Y 1 (j, i),

(3.4)

l=1 i , ρ are selected randomly from a finite field F. Then, new node where coefficients ζl,r i is reconstructed by network coding of its received fragments. The optimization process finds the regenerating codes incorporated SNC with the minimum cost in the feasible set (the RCP is preserved). In either non-optimized or joint-optimization method, code coefficients are chosen from a sufficiently large finite field to guarantee the existence of a solution. We generally are interested to smaller finite field since it poses less coding and decoding complexity. One can refer to Schwartz-Zippel theorem to find requirement of the field size |F| in the case of using random network coding. Lemma 1: (Schwartz-Zippel theorem [26]) Consider polynomial g(α1 , α2 , ..., αn ) where variables α1 , ..., αn are selected randomly and uniformly from finite field F. If the maximum degree of polynomial g is d0 and g is a non-zero polynomial then

P r[g(α1 , α2 , ..., αn ) = 0] ≤

d0 |F|

(3.5)

For non-optimized method, reference [2] illustrates the required field size |F| using lemma 1. Lemma 2: In the non-optimized approach of repair a failed node (let say Q1 ) of distributed storage system coded by (n, k, α, d, β) regenerating code, if the field nα size |F| ≥ d1 , max{ M , 2M |H|}, then there exists a linear code in the finite field F by which the failed node is reconstructed and the RCP is preserved. Proof : See proof of theorem 3 in [2].¥ For finding the field size in joint-method, we note that the difference between joint and non-optimized methods arise from possibility of using SNC in joint method. 0 SNC affects maximum degree of Q1 since in each intermediate nodes a random network is performed on the received fragments. So in joint-method network structure and number of intermediate nodes on that structure would affect the maximum degree of new node. Lemma 3: In joint-optimization repair process of a failed node of distributed 0 storage system coded by (n, k, α, d, β) the maximum degree of the new node (Q1 ) is n. Proof : Given n nodes in the network the maximum number of intermediate nodes is in tandem network as it constitutes n − 2 intermediate nodes. Maximum degree is when in tandem network, the last node (node n) is joined in repair process 0 of the first node (Q1 ) then by (3.4) , the new node fragment is calculated as: n n n x1 + · · · + ζ1,α xnα )ρ1 ρ2 · · · ρn−1 + · · · x1l = (ζ1,l

where xn1 is first fragment of node n. Clearly we see that maximum degree equals to n as n variables are multiplying to the fragment xnl .¥

20


Proposition 1: For joint-method optimization if the finite field size |F| ≥ d2 , then there exists a linear network coding solution in which the failed node is reconstructed and the RCP is preserved, where !

nα d2 , max{ , nM |H|}. M

(3.6)

Proof : We follow the same way of [2] for initial construction of (n, k, α, d, β) regenerating code. So we use generator matrix G of size M × α in F for generating (nα, M )-MDS code. From lemma 1, it is required that |F| > nα M . Now when a node fails the difference between non-optimized approach and joint-method optimization is in possibility of using SNC. In joint-method, from lemma 3, the maxQ 0 imum degree of failed node is n. So to have h∈H det([Q1 Eh1 , . . . , Qn Ehn ]) 6= 0, by lemma 1, the field size have to be greater than nM |H|. Consequently, |F| > d2 = max{ nα M , nM |H|}. ¥ Thus, to solve the problem in a joint method, we just find the regenerating codes (SNC is allowed) in a sufficiently large field. The optimized one is obtained by searching among the codes in the feasible sets. However, we note that the complexity of finding all feasible codes and their cost may be high. Since in repair process at worst case there is γ = dβ fragments on a link so the complexity of search is O(γ |A| ) where |A| is number of available links. To reduce complexity, we can limit the search among the subgraph with the cost less than non-optimized approach. Yet the complexity is high

3.4

Decoupled Method:

In the decoupled method, we first minimize σc in a subgraph with the constraint that any k of n nodes can rebuild the original file. Then we find the network code for the subgraph. The constraint region of a subgraph is analyzed as follows.

3.4.1

Constraint Region

We use Min-Cut Max-Flow theorem [8] on the graph corresponding the distributed storage system. So providing that minimum cut between source and Data Collector (DC) is greater than original file size then there is a linear network coding which original file can be recovered by data collector (DC). Therefore, by connecting DC to new node and k − 1 other nodes, we find the minimum cut of the graph. Running Min-Cut analysis for each selection of k − 1 of n − 1 nodes gives us n−1 k−1 linear constraints. These linear inequalities present a region in multi-dimensional space which subgraph vector must satisfy in order to be among the feasible solutions. The feasible solution region, generally called polytope in optimization literature [27]. To denote the polytope in a matrix form, we present the subgraph as a column vector n−1 Z = [z(ij) ], edge (ij) ∈ A and use a k−1 × |A| matrix L as a coefficient matrix [27] (each row is for one inequality denoting the requirement of the min-cut). Then we

3.4. DECOUPLED METHOD:

21

put all constants to the right side of inequality which constitutes a column vector n−1 b with k−1 rows. Thus, we can express n−1 k−1 min-cut constraints by the matrix LZ ≥ b. Consequently the polytope is S = {Z = [z(ij) ] | z(ij) ≥ 0, LZ ≥ b}.

(3.7)

Here the comparison of two vectors e.g, B ≥ D means every element in B is greater or equal to the element in D with the same position.

3.4.2

Convexity of Optimization Problem P

Since target function (fz = (i,j)A cij z(ij) ) is a linear function, if zij could take rational number then the constraint region would be be convex region. Note that the file transmitted in the link are quite large, we can consider z(ij) real valued since one fragment has lots of bits. Assuming reality of z(ij) Polytope S constitutes a convex region : S = {Z = [z(ij) ] | z(ij) ≥ 0, z(ij) R, LZ ≥ b}

(3.8)

Where R represents set of real numbers. Lemma 4: Polytope S with assumption of z(ij) R constitutes a convex region. Proof : By assumption of zij R, now we shall prove whenever Z 1 , Z 2 S and 0 ≤ ρ≤1 then ρZ 1 + (1 − ρ)Z 2 S Z 1 S ⇒ LZ ≥ b Z 2 S ⇒ LZ ≥ b ⇒ L(ρZ 1 + (1 − ρ)Z 2 )

= ρLZ 1 + (1 − ρ)LZ 2 ≥ ρb + (1 − ρ)b ≥b

⇒ ρZ 1 + (1 − ρ)Z 2 S.¥ Hence, problem of finding minimum repair cost is a convex optimization problem. Convexity eases the optimization problem in two ways. Firstly, local optimum point is globally optimum. Secondly, optimum points are located on the vertex of constraint region. So the search for finding the optimum point is limited to vertex points as simplex algorithm runs in this way.

3.4.3

LP Formulation

Since target function and constraint inequalities are linear function we can model the problem as a linear programming problem as: minimize fz = subject to:

P

(i,j)A cij

LZ ≥ b zij ≥0

· z(ij) (3.9)

22


Proposition 2: In problem (3.9), subgraph z is in feasible solution subset if and only if there is a network code on which data collector by connecting any k out of n nodes can recover the original file. Proof : First, if subgraph z is among feasible solution then from [8], [9] there is network code because min-cuts are greater than M . Inversely, if there is a network code on which the any data collector can recover the original file then again by [8], [9] all the min-cuts are greater than M . Consequently corresponding subgraph is in feasible solution subset.¥ By proposition 2, we can always find a linear coding scheme (possibly with SNC) for the repair with the optimal subgraph.

3.4.4

Augmented Form of Linear Optimization

For solving the problem by simplex method, the problem should be presented in augmented form (Converting inequalities to equalities). It can be done by adding some non-negative temporary variable known as slack variables to Z matrix. If we name Z with added slack variables as Z + the augmented form of optimization problem can be: P minimize fz = (i,j)A cij .z(ij) (3.10) subject to: L+ Z + = b zij ≥0

3.4.5

LP Formulation and Solution for the Toy Example

Here we formulate the optimization problem in distributed storage system in our first example (Figure 3.1) with the given cost matrix as: 

0 0  C= 0 0

1 0 0 0

0 1 0 0



0 0   1 0

(3.11)

Step1: Min Cut Analysis Min-Cut analysis when DC connects to nodes 1 and new node 4 gives the following constraint equation (see Figure 3.5): z(34) + α ≥ M = 4 ⇒ z(34) ≥2

(3.12)

Connecting DC to node 3 and new 4 (Figure 3.6) gives the second constraint: z(23) + α ≥ M = 4 ⇒ z(23) ≥2 and final constraint as shown in Figure 3.7 is:

(3.13)


23

node 1

α

z(12)

∞

∞

DC

node 2

α ∞

S

z(23)

∞ node 3

∞

α

∞

z(34) α α

node 4

Figure 3.5.

new node

Min-Cut when DC connects to node 1 and new node

.

Min-Cut node 1

α

∞

z1 2 node 2

∞

DC α

∞

S z2 3

∞ node 3

∞ α

∞

z3 4 α node 4

Figure 3.6.

α

new node


z(12) + z(34) ≥ 2.

.

(3.14)

24


node 1

cut

α

∞

z12 ∞ node 2

∞

DC

α

S z23 ∞ node 3

∞ α

∞

z34 α α

node 4

Figure 3.7.

new node


.

Step2: Linear Optimization Formulation Consequently, we have linear optimization problem for finding optimum cost repair on node 4, as: P minimize fz = (i,j)A cij .z(ij)    z(34) ≥ 2 (3.15) subject to: z(23) ≥ 2    z (12) + z(34) ≥ 2 Step3: LP in Augmented Form 0

00

000

By using (z , z , z ) as slack variables, the augmented form is obtained: minimize

subject to:

fz =

            

P

(i,j)A cij .zij 0

z(34) − z = 2 00 z(23) − z = 2 000 z(12) + z(34) − z = 2 0

00

z(ij) , z , z , z

000

(3.16)

≥0

Step4: Solving the LP problem Solving the linear optimization problem (e.g., by a simplex method [27]) gives the optimal subgraph (z(12) , z(23) , z(34) ) = (0, 2, 2) with a cost of 4 units. A network coding scheme with SNC in Figure 3.8 can meet the subgraph as p1 = 0, p2 = 2a2 + b2 , p3 = a2 + 2b2 , p4 = p2 + (a1 + b1 + a2 + b2 ) = a1 + b1 + 3a2 + 2b2 and


25

p5 = p3 + (a1 + 2b1 + a2 + 2b2 ) = a1 + 2b1 + 2a2 + 4b2 . Here p4 , p5 are fragments for the new node.

Figure 3.8. Decoupled optimal-cost reconstruction in four node tandem network.

Chapter 4

Numerical Results We present advantage of optimization approach in the tandem, star and grid networks as follows. We note that they are quite fundamental network topologies and more complicated networks can be built by some combinations of them. Here we define the optimization gain as gc =

σnon−Opt , σc

(4.1)

where σnon−Opt and σc are the cost of the scheme without and with cost optimization respectively.

4.1

Tandem Networks

For a general tandem network, the repair costs depend on the number of nodes and the failing node location. In Figure 4.1, we compare the repair storage-cost tradeoff for the joint method, decoupled method and non-optimization for the example in Figure 3.1. The results are obtained by changing α and calculating the cost. The figure highlights the gain of optimization and SNC. The difference between decoupled and joint methods shows that regenerating code ( [2], [1]) used by the joint method might be sub-optimum for repair-cost. The decoupled method can choose the optimal subgraph without the constraint of codes. For instance, a less-cost scheme to repair node 4 may use no codewords from node 1 (while regenerating codes use since d = 3). Further, we formulate minimum repair-cost in tandem network for the case of equality cost of communication between neighboring nodes. Claim 1: Suppose a tandem distributed storage system consisting n nodes where each node stores α fragments and every k nodes can recover the original file of size M . Assuming cost of a link (between neighbor nodes) equals to one unit cost, the optimum repair cost equals to σc = k(M − (k − 1)α) where α ≥ Proof : See appendix A. 27

M . k

(4.2)

28

CHAPTER 4. NUMERICAL RESULTS Tradeoff in tandem network for n=4, k=2

α: storage per node

0.6 0.58 0.56 0.54

non−Optimization Joint−metod Optimization Decoupled−method Opt.

0.52 0.5 0.8

0.9

1

1.1

1.2

1.3

1.4

σc : repairing cost

1.5

Figure 4.1. Storage-cost tradeoff in tandem networks. M is normalized to 1.

4.2

Star Networks

In a star network, communication between any two nodes passes through central node. Figure 4.2 shows an example of a distributed storage system in a star network. A file containing 8 fragments (x1 , x2 , · · · , x8 ) is encoded and distributed among 6 nodes. Every node stores 2 fragments such that any 4 nodes can recover the original file. If node 6 fails, in non-optimized repair, each of 5 (d = n − 1 = 5) surviving nodes transmits one fragment (β = 1 by information flow min-cut analysis) to node 6. Fragments p1 , p2 , p3 , p4 and p5 are formed by network coding at node 1, 2, 3, 4, 5, respectively. Then these fragments are sent to the new node through the central node. We assume that a fragment cost one unit for a single channel. The total costs σnon−Opt = 9 units. As shown in Figure 4.3, in the joint method (using SNC), the central node encode the received p1 , p2 , p3 , p4 to form p6 and p7 fragments, where p6 = p1 + p2 + p3 + p4 + y1 = 2x1 + 2x2 + 6x3 + x4 + 3x5 + 6x6 + 4x7 + 4x8 and p7 = p1 + p2 + p3 + p4 + y2 = x1 + 3x2 + 5x3 + 2x4 + 2x5 + 7x6 + 3x7 + 5x8 . We can see that p6 and p7 preserve the RCP and they would be the new node’s code. The total cost is 6 units. Thus, the optimization gain for the joint method is gc = 69 = 1.5, which means around 33% cost reduction. In the example, we can show that the decoupled method can also lead to the same gain. Claim 2: Assume (6,4)-MDS code distributed storage system on star platform of Figure 4.3 and a failed node 6 given cost matrix C 2 . Optimum-cost in the repair of node 6 equals to 6 transmission costs with corresponding subgraph of (z15 , z25 , z35 , z45 , z56 ) = (1, 1, 1, 1, 2) 

0 0  0  C2 =  0  0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

1 1 1 1 0 0



0 0  0   0  1 0

(4.3)

Proof : See appendix B. Further, we can obtain the storage-cost tradeoff by solving the optimization problem. The result is shown in Figure 4.3. Similar to tandem networks, the joint method with regenerating codes might be suboptimal in term of repair cost. In

4.3. GRID NETWORKS

29

a equality of cost of transmission from surrounding node to central node we can express the optimum-cost solution as: Claim 3: In star distributed storage system consisting n nodes where each node stores α fragments and every k nodes recover the original file of size M then the optimum repair cost equals to (

σc =

(1 + k2 )(M − (k − 1)α) k(M − (k − 1)α)

for k ≥ n + 2 for k = n + 1.

(4.4)

Proof : See appendix C.

Figure 4.2. Repair without SNC in star network with 6 nodes.

node 1

x1 x2

node 2 x3 x4

node 4 p2=x4+5x3

node 5

x7 x8

p1=x1+2x2

p4=3x7+4x8

y1 y2

p6

p3=2x5+6x6 p7 x5 x6

node 3

y1=x1+x3+x5+x7 y2=x2+x4+x6+x8

y3 y4

node 6

Figure 4.3. Repair with SNC in star network with 6 nodes.

4.3

Grid Networks

Consider a 2 × 3 grid network in Figure 4.5. Each node only connects to their neighbors. We assume the cost of every link is 1 unit. Similar to that of star

30

CHAPTER 4. NUMERICAL RESULTS

α : storage per node

Tradeoff in star network for n=6, k=4 non−Optimization Joint−method Optimization Decoupled−method Opt.

0.34 0.32 0.3 0.28 0.26 0.24

0.6

0.7

0.8 0.9 1 1.1 σc : repairing−cost

1.2

Figure 4.4. Storage-cost tradeoff of the star network in Figure 4.3. M is normalized to 1.

networks, we assume here 6 nodes storing 8 fragments (x1 , x2 , · · · , x8 ). If node 6 fails, the non-optimized repair costs 9 unit as shown in Figure 4.5. In the decoupled method, Claim 4: The optimized repair costs 7 units with the subgraph (z(12) , z(14) , z(23) , z(25) , z(36) , z(45) , z(56) )= (0, 1, 0, 1, 1, 2, 2). Proof : See appendix D. The optimal subgraph and the corresponding network code is shown in Figure 4.6. Thus the new node has fragments y3 = p3 + p6 = 3x1 + 3x2 + 3x3 + 3x4 + 7x5 + 8x6 + 2x7 , y4 = p3 + p7 = 6x1 + 10x2 + 2x3 + 4x4 + 6x5 + 9x6 + x7 + 2x8 . Accordingly, gc = 79 = 1.28. Formulating the minimum repair cost in grid network in general is more complicated than tandem and star networks but we can show the maximum of the optimum repair cost, Claim 5: In grid distributed storage system consisting n nodes where each node stores α fragments (α ≥ M k ) and every k nodes recover the original file of size M then the maximum of optimum repair cost equals to max{σc } = k(M − (k − 1)α). Proof : See appendix E.

Figure 4.5. Non-optimized repair in a 2 × 3 grid network. Dashed lines show available links which are not used in the repair process.

(4.5)

4.3. GRID NETWORKS

31

node3

node2

x5 x6

node1

x3 x4

x1 x2

p2=2x3+3x4

p3=6x5+8x6

p1

p1=2x1+3x2 p6=y1+p4+p2 y3 y4

p4=p1+x7 x7 x8

y1 y2 p7=y2+p2+p4+p5

node6

node5

p5=2p1+x8 node4

Figure 4.6. Decoupled optimization repair in the 2 × 3 grid network.

Chapter 5

Conclusion and Future Works We studied the repair-cost problem in distributed storage systems. Considering the network effect we proposed surviving nodes cooperation to reduce the repair-cost. Furthermore, by joint and decoupled methods the optimization problems have been formulated. In continue, the repair cost-storage tradeoff is derived. Comparing the repair-cost tradeoff from joint and decoupled methods shows that regenerating codes might be suboptimum from the repair-cost point of view. Moreover, there are some practical issues in our approach which can be followed in future works. • In the joint method way, we extracted the coding and the optimum subgraph jointly. However, our algorithm is naive algorithm of search among all possible cooperations. • In the decoupled method, the optimum subgraph is obtained through cutbased analysis. To model the problem as a convex optimization problem, we assumed the number of fragments in each link zij can be a rational number. If the zij is an integer value then our constraint region in the optimization problem would not be convex. Therefore, the problem shall be solved by an integer programming (IP) method which is more difficult than a linear optimization method. One approach is assuming that zij is a rational number and then round up the answer. That’s the LP relaxation of IP. • In addition, finding task.

n−1 k−1

min-cut constraints in a big network is not an easy

Finally a fast algorithm is demanding and can be followed in future work.

33

Bibliography [1] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran, “Network coding for distributed storage systems," IEEE Trans. on Info. Theory, Sep. 2010. Preliminary version appeared in Infocom 2007. [2] Y. Wu, A. G. Dimakis, and K. Ramchandran, “Deterministic regenerating codes for distributed storage," In Allerton Conference on Control, Computing, and Communication, Urbana-Champaign, IL, September 2007. [3] L. Xu, “Highly available distributed storage system," PhD thesis, California Institue of Technology Nov.1999. [4] M. Placek and R. Buyy, “A taxonomy of distributed storage system," www.cloudbus.org/reports/DistributedStorageTaxonomy.pdf, University of Melborne. [5] P. N. Yianilos and S. Sobti, “The evolving field of distributed storage," IEEE Internet Computing, Sep. - Oct. 2001. [6] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach M. Burrows, T. Chandra, A. Fikes and R. E. Gruber, “Bigtable: A distributed storage system for structured data," To appear in OSDI 2006. [7] R. W. Yeung, Z. Zhang, ”Distributed source coding for satellite communications,”IEEE Trans. on Info. Theory, vol. 45, no. 3, pp. 1111-1120, May 1999. [8] R. Ahlswede, N. Cai, S. Y. Robert Li and R. W. Yeung, “Network information flow," IEEE Transaction on Information Theory, Vol. 46, No.4, July 2000, pages 1204-1216. [9] R. Koetter and M. M´edard, “An algebraic approach to network coding," IEEE/ACM Trans. Networking, vol. 11, no. 5, pp. 782-795, Oct. 2003. [10] S. Akhlaghi, A. Kiani and M. R. Ghanavati, “Cost-bandwidth tradeoff in distributed storage systems," arXiv:1004.0785v2 [cs.IT] 14 Apr 2010. [11] Y. Hu, Y. Xu, X. Wang, Ch. Zhan and P. Li, “Cooperative recovery of distributed storage systems from multiple losses with network coding," IEEE Journal on Selected Areas in Communications, VOL. 28, NO. 2, Feb. 2010. 35

36

BIBLIOGRAPHY

[12] K. V. Rashmi, Nihar B. Shah, P. V. Kumar and K. Ramchandran, “Explicit construction of optimal exact regenerating codes for distributed storage," to appear in Proc. Allerton Conf., Sep 2009. Available online at arXiv:0906.4913 [cs.IT]. [13] D. S. Lun, N. Ratnakar, M. Medard, D. Karger, T. Ho, E. Ahmad and F. Zhao, “Minimum-cost multicast over coded packet networks," IEEE Trans. Inform. Theory,vol.52, pp 2608-2623, June 2006. [14] Y. Wu and A. G. Dimakis, “Reducing repair traffic for erasure coding- based storage via interference alignment," In Proc. ISIT 2009. [15] N. B. Shah, K. V. Rashmi, P. V. Kumar and K. Ramchandran, “Explicit codes minimizing repair bandwidth for distributed storage," rrXiv:0908.2964v2, Sep.2009. [16] S. Y. R. Li, R. W. Yeung, and N. Cai, “ Linear network coding,´´ IEEE Trans. on Information Theory, vol. 49, pp.371-381, February 2003. [17] T. Ho, M. Medard, R. Koetter, D. R. Karger, M. Effros, J.Shi, and B. Leong, “ A random linear network coding approach to multicast,´´IEEE Trans. Inform. Theory, vol.52, pp.4413-4430, Oct. 2006. [18] S. Jaggi, P. Sanders, P. A. Chou, M. Effros, S. Egner, K. Jain, and L. Tolhuizen, “Polynomial time algorithms for network code construction," IEEE Trans. on Inform. Theory, vol.51, pp.1973-1982, June 2005. [19] T. HO, and D. S. Lun, Network Coding, Cambridge University Press, 2008. [20] Y. Wu, “ Network coding for wireless networks," Technical Report, July 2007. [21] N. B. Shah, K. V. Rashmi and P. V. Kumar, “A Flexible class of regenerating codes for distributed storage," ISIT 2010, Austin, Texas, USA, June 2010. [22] V. Venkatesan,“Distributed storage systems", IBM Zurich Research Lab. EPFL. [23] D. Leong, A. Dimakis and T. Ho, “Distributed storage allocation," rrXiv:1011.5287v1, Nov.2010 [24] D. Traskov, N. Ratnakar, D. S. Lun, R. Koetter, and M. Medard, “Network coding for multiple unicast: An Approach based on Linear Optimization," ISIT Seattle, USA, July 2006. [25] J. E. Wieselthier, G. D. Nguyen and A. Ephremides, “On the construction of energy-efficient broadcast and multicast trees in wireless Networks," Infocom 2000, Tel Aviv , Israel, March 2000.

BIBLIOGRAPHY

37

[26] R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995. [27] J. Lungren, M. Rönnqvist and P. Värbrand, Optimization. Studenlitterature, Sweden 2010. [28] T. Cui, L. Chen and T. Ho, “Energy efficient oppurtunstic network coding for wireless networks," Infocom 2008, Phonix, AZ, May 2008.

Chapter 6

Appendices 6.1

Appendix A

Claim 1:In tandem distributed storage system consisting n nodes where each node stores α fragments and every k nodes recover the original file of size M then the optimum repair cost equals to σc = k(M − (k − 1)α) where α ≥

M . k

(6.1)

Proof: We know every k nodes can reconstruct the original file. So, intuitively if k nodes is allowed to combine their fragments (possibly using SNC) then they can reconstruct the failed node and when the cost of transmission on each channel is equal, the k nearest neighbor to the failed node pose less transmission on the network so that might be the optimum solution. Now let us show it mathematically. We follow the induction approach. Step 1: Let’s start from k = 2 and n = 3. Consider that we assume n > k since if n ≤ k whenever a node fails, the reconstruction is not possible. Assume node 2 fails. As shown in the Figure 6.1 min-cut constraints are:

node 1 cut α ∞

z(12)

∞

DC

node 2 S

∞

α

α

∞ ∞

new node 2 z(32)

node 3 α

Figure 6.1. cut analysis.

39

40

CHAPTER 6. APPENDICES (

z(12) + α z(32) + α

≥M ≥M

(6.2)

Which yields: z(12) + z(32) ≥ 2(M − α)

(6.3)

=⇒ σc = 2(M − α) = k(M − (k − 1)α). Now by keeping k = 2, we see if n increases, it does not affect σc . Since the added nodes, in fact, does not affect feasible region. In Figure 6.2 one node is added then the new min-cut gives: z(12) + z(32) ≥ (M − α) clearly it already is satisfied by 6.3. So for every n > k,σc = 2(M − α) for the k = 2. Now if k increases, cut

node 1 α ∞

z(12) node 2

S

∞ ∞

DC

∞ α

α

∞

new node 3 z(32)

node 3

∞

α

z(43)

node 4 α


node 1

cut α

node 2

z(12) α

∞ ∞

∞

z(23)

node 3 S

∞

DC α

α

∞ ∞

new node 3 node 4

z(43) α


Step 2: For k = 3 and n = 4, and assuming failed node 3 the min-cut constraints

6.2. APPENDIX B

41

are (Figure 6.3):

   z(12) + 2α

z

+ 2α

(23)   z + 2α (43)

≥M ≥M ≥M

(6.4)

These yield: z(12) + z(23) + z(43) ≥ 3(M − 2α)

(6.5)

=⇒ σc = 3(M − 2α) = k(M − (k − 1)α). As before we see adding nodes n, does not change feasible region. Now in general case, Step 3: For general k, we first show for n = k + 1 minimum repair cost equals to σc = k(M − (k − 1)α).. Then adding more nodes do not affect the feasible region and consequently the minimum repair cost. So for n = k + 1 and assuming a failed node j the min-cut constraints are:   z(12) + (k − 1)α      z(23) + (k − 1)α     .  ..  z(j−1,j) + (k − 1)α     .   ..     

z(k,k−1) + (k − 1)α

≥M ≥M ≥M

(6.6)

≥M

⇒ σc = k(M − (k − 1)α). ¥

6.2

Appendix B

Claim 2: Assume (6,4)-MDS code distributed storage system on star platform of Figure 4.3 and a failed node 6 given cost matrix C2 . Optimum-cost in the repair of node 6 equals to 6 transmission costs with corresponding subgraph of (z15 , z25 , z35 , z45 , z56 ) = (1, 1, 1, 1, 2) Proof: given failed node on node 6 and cost matrix of C2 , we formulate linear optimization problem as the following: There are 0 0 6−1 = 10, min-cut constraints. by using z1 , · · · , z10 as slack variables the following 4−1 augmented form is resulted. minimize σc subject to:

(6.7)

42

CHAPTER 6. APPENDICES 



                 

1 1 0 1 0 0 1 0 0 0

1 0 1 0 1 0 0 1 0 0

0 1 1 0 0 1 0 0 0 0

0 0 0 1 1 1 0 0 0 0

0 0 0 0 0 0 1 1 1 1

1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 −1 0 0 0 0 0 0 0 0 0 0 −1

 

z15 1     z25  1     z   1  35    z45     1      z56  1  0    z      10  1       z2  1  0      z3  = 1·(M −3α)  0      z  1  4     0      z5  1  0      z6  1  0     z7   1   0     z  1  80      1   z9  0 1 z 10

Solving this linear optimization (for case of M = 8 and α = 2 in this example) by a simplex method (e.g., using linprog command in Matlab) gives the minimum repair cost (σc ) and corresponding optimum subgraph as: σc = 6; (z15 , z25 , z35 , z45 , z56 ) = (1, 1, 1, 1, 2)¥

6.3

Appendix C

Claim 2: In star distributed storage system consisting n nodes where each node stores α fragments and every k out of n nodes recover the original file of size M then the optimum repair cost equals to (

σc =

(1 + k2 )(M − (k − 1)α) k(M − (k − 1)α)

for k ≥ n + 2 for k = n + 1.

(6.8)

Proof: Same as proof of claim 1, let us follow induction. Step 1: For k = 2andn = 3 as shown in Figure 6.4 min-cut constraints are: (

z(12) + α z(23) + α

≥M ≥M

z(12) + z(23) ≥ 2(M − α)

(6.9) (6.10)

=⇒ σc = 2(M − α) = k(M − (k − 1)α). Now for k = 2 if n increases, for n=4, as Figure 6.5 shows min-cut constraints are:    z(13) + z(23) + α

z

+z

(13) (34)   z +α (34)

+α

≥M ≥M ≥M

(6.11)

6.3. APPENDIX C

43

node 1 cut α ∞

∞

z(12)

node 2 S

DC

α

∞

∞

z(23)

∞

node 3 α α

new node 3 Figure 6.4. cut analysis.

node 1

cut α

z(13)

node 2 α

∞

∞ ∞

z(23)

node 3 S

DC

α

∞

∞ ∞

z(34) node 4 α α


z(13) + z(23) + z(34) ≥ 3(M − α)

(6.12)

k =⇒ σc = 3(M − α) = (1 + )(M − (k − 1)α). 2 for n ≥ 5, we see all the min-cut are satisfied with the solution of n = 4. Step 2: now lets increase k. For k = 3 and n = 4 the min-cut constraints would be:    z(13) + 2α

z

+ 2α

(23)   z + 2α (34)

≥M ≥M ≥M

z(13) + z(23) + z(34) ≥ 3(M − 2α) =⇒ σc = 3(M − 2α) = k(M − (k − 1)α).

(6.13)

(6.14)

44

CHAPTER 6. APPENDICES

node 1

cut α

z(13)

node 2 α

∞

∞ ∞ ∞

z(23)

node 3 S

DC

α

∞

∞ ∞

z(34) node 4 α α


Now for k = 3 if n increases, for n=5, as Figure 6.6 shows min-cut constraints are:   z(14) + z(24) + 2α       z(14) + z(34) + 2α

z

+z

(24) (45)     z(45) + 2α    

+ 2α

≥M ≥M ≥M ≥M

(6.15)

Solving the LP problem (6.15) gives: z(14) = z(24) = z(34) = (M − 2α)/2 and z(45) = M − 2α then, σc = z(14) + z(24) + z(34) + z(34) = 3 k (1 + )(M − 2α) = (1 + )(M − (k − 1)α) 2 2 This solution is satisfied for all n ≥ 5. Same way we can continue for all k and n values.¥

6.4

Appendix D

Claim 3: Optimum-cost in a distributed storage system on 2 × 3 grid network (Figure 4.5) given a failed node on node 6 and cost matrix C3 equals to 7 transmission costs with corresponding subgraph of (z(12) , z(14) , z(23) , z(25) , z(36) , z(45) , z(56) ) = (0, 1, 0, 1, 1, 2, 2)

6.4. APPENDIX D

45



0 0  0  C3 =  0  0 0

1 0 0 0 0 0

0 1 0 0 0 0

1 0 0 0 0 0



0 1 0 1 0 0

0 0  1  . 0  1 0

(6.16)

node 1 cut α

z(12)

node 2 α

z(23)

∞

node 3 ∞

∞

z(14)

∞

node 4 ∞

S

∞

α

∞

DC

z(25) α ∞

∞

z(36)

z(45)

node 5

∞

z(56)

α

node 6

α

new node 6

α


Proof: With the failed node on node 6 given cost matrix of C3 , we formulate linear optimization problem as the following: There are 6−1 4−1 = 10, min-cut constraints. As shown in Figure 6.7 sketching the modified flow graph is more complicated than preceding examples. minimize σc subject to: H2 .Z2 ≥ Mα2 zij ≥0 Where H2 =

                 

0 0 0 0 1 0 0 0 0 0

0 0 0 1 1 1 0 0 0 0

0 1 1 0 0 1 0 0 1 0

0 1 0 0 0 1 0 0 1 0

1 1 0 1 0 0 1 0 0 0

0 0 0 0 0 0 1 0 1 1

1 0 1 0 1 0 0 1 0 0

(6.17)

     z12      z14      z23      and Z2 = z25      z    36      z45    z56 

Mα2 = Ones(10,1) · (M − 3α). Solving this linear optimization problem (e.g. by simplex method) for the M = 8 and α = 2 results in:

46

CHAPTER 6. APPENDICES

σc = 7; (z(12) , z(14) , z(23) , z(25) , z(36) , z(45) , z(56) ) = (0, 1, 0, 1, 1, 2, 2).¥

6.5

Appendix E

Claim 4: In grid distributed storage system consisting n nodes where each node stores α fragments (α ≥ M k ) and every k out of n nodes recover the original file of size M then the maximum of optimum repair cost equals to max{σc } = k(M − (k − 1)α).

(6.18)

Proof: Grid network in worst case looks like a tandem network so the maximum repair cost equals to repair cost in tandem network which is = k(M − (k − 1)α).¥

Minimum-Cost Coding for Distributed Storage Systems - DiVA portal

Minimum-Cost Coding for Distributed Storage Systems - DiVA portal

Suggest Documents

Minimum-Cost Coding for Distributed Storage Systems - DiVA portal

Network Coding for Distributed Storage Systems - arXiv

Coding Schemes for Distributed Storage Systems - Google Sites

Distributed Self-Triggered Control for Multi-Agent Systems - DiVA portal

Cold Storage Applications - DiVA portal

Aquaponic systems - DiVA portal

Distributed Storage Allocation Problems - Erasure Coding for ...

Architecture-aware Coding for Distributed Storage ...

Randomized Network Coding in Distributed Storage Systems with ...

Randomized Network Coding in Distributed Storage Systems with

Automated Computer Systems for Manufacturability ... - DiVA portal

DECOMPOSING PROGRAMMES. Re-Coding hospital ... - DiVA portal

Information Systems Actability - DiVA portal

Symbiotic Autonomous Systems - DiVA portal

Tradeoff for Heterogeneous Distributed Storage Systems ... - arXiv

Experiences Building Network-Coding-Based Distributed Storage ...

Fuel-Efficient Distributed Control for Heavy Duty Vehicle ... - DiVA portal

Network Coding for Distributed Storage in Wireless ... - CiteSeerX

mWeb: a framework for distributed presentations using ... - DiVA portal

Remote Data Checking for Network Coding-based Distributed Storage ...

Atomic Ring Maintenance for Distributed Hash Tables - DiVA portal

Decentralized Coding Algorithms for Distributed Storage in ... - arXiv

A Distributed Minimum Variance Estimator for Sensor ... - DiVA portal

A Pattern for Designing Distributed Heterogeneous ... - DiVA portal