2016 8th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT)
Adversarial Attacks in the Twin-code Framework Ninoslav Marina, Natasha Paunkoska and Aneta Velkoska
Abstract: Ensuring the security and reliability of the data stored in a Distributed Storage System (DSS) is a pivotal issue in a distributed storage network. The storage system must be secured against both passive and active attacks in order to protect the stored data effectively. The aim is to achieve a satisfactory level of security while maintaining the desired level of reliability in the presence of node failures. In this work we apply this concept to the Twin-code framework, in the presence of an intruder who can observe and modify the data stored on the network's nodes. We present a new code construction that is simultaneously secure against passive and active attacks and satisfies a particular bound for reliability. Moreover, we prove the level of security of this construction under the above-mentioned constraints.
Index Terms: Active attack, distributed storage system (DSS), information-theoretic secrecy, passive attack, security, twin-code framework.
I. INTRODUCTION

Distributed storage systems (DSS) operate in several different environments, such as peer-to-peer (P2P) systems and data centers, which constitute the backbone infrastructure of cloud computing. Although data centers and P2P storage systems differ in size, topology and network dynamics, they share some common features. Because the volume of data to be stored is huge, it is distributed over a large number of nodes, and the data stored in individual nodes occasionally becomes unavailable for various reasons. In P2P systems, besides the failure of a storage node or of a communication link to a node, a user operating a storage node may decide to take the node offline temporarily or permanently. Therefore, redundancy is needed to ensure data availability and long-term data resilience. Moreover, lost redundancy may also need to be restored.

The most popular way of achieving redundancy in a DSS is replication, its simplest form. However, due to the exponential growth of stored data [1], replication becomes very expensive and inefficient. Therefore, in recent years MDS (Maximum Distance Separable) erasure codes for DSS have attracted enormous interest among major commercial players such as Google and Microsoft [2].

A DSS consists of $N$ storage nodes, each with a storage capacity of $\alpha$. With MDS erasure codes, data of size $S$ symbols is stored across $n$ ($n < N$) nodes in the network in such a way that the entire message can be recovered by a data collector connecting to any $k$ ($k \le n$) nodes. This process of data recovery is referred to as data reconstruction. However, to repair a failed node in a DSS with erasure codes, the entire data stored in any $k$ live nodes must first be downloaded, and the data that was stored in the failed node must then be extracted from it, which wastes
978-1-4673-8817-7/16/$31.00 ©2016 IEEE
network resources [3]. Regenerating codes, which address this problem, were introduced by Dimakis et al. [4]. In a DSS with regenerating codes, a new replacement node (newcomer) can contact $d$ ($d \ge k$) helper nodes out of the remaining $n - 1$ nodes during the repair process and download $\beta$ ($\beta \le \alpha$) symbols from each of them. In the regenerating framework, the total amount $\gamma = d\beta$ of data downloaded during the repair process, known as the repair bandwidth, is less than the message size $S$. The parameters of a regenerating code that stores a file of size $S$ must satisfy the following condition, examined in [4]:

$$S \le \sum_{i=0}^{k-1} \min\left(\alpha, (d-i)\beta\right). \tag{1}$$
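As a small numerical illustration of bound (1), the following sketch evaluates its right-hand side for given parameters. The function name and the parameter values ($k = 3$, $d = 5$, $\beta = 1$) are illustrative assumptions and are not taken from [4].

```python
def regenerating_capacity(k: int, d: int, alpha: int, beta: int) -> int:
    """Right-hand side of bound (1): sum over i = 0..k-1 of min(alpha, (d - i) * beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


# Illustrative parameters (assumed): k = 3, d = 5, beta = 1.
# Choosing alpha = d * beta makes the repair bandwidth equal to the node storage.
mbr_like = regenerating_capacity(k=3, d=5, alpha=5, beta=1)

# Choosing alpha = 3 gives a file of size k * alpha, i.e. alpha = S / k.
msr_like = regenerating_capacity(k=3, d=5, alpha=3, beta=1)

print(mbr_like, msr_like)
```

With these values the first case stores more symbols per node and a larger file, while the second stores exactly $S/k$ symbols per node, which mirrors the storage versus repair-bandwidth tradeoff.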
More importantly, the authors of [4] consider a tradeoff between the storage space $\alpha$ and the repair bandwidth $d\beta$, in which equality in (1) is achieved for fixed parameters $S$, $k$ and $d$. In order to reconstruct data of size $S$ from $k$ nodes, the storage per node $\alpha$ must be at least $S/k$; when $\alpha = S/k$, the extreme point is termed the Minimum Storage Regeneration (MSR) point. At the other extreme, when the repair bandwidth $d\beta$ is equal to $\alpha$, the point is referred to as the Minimum Bandwidth Regeneration (MBR) point. Since both extreme conditions cannot be satisfied at the same time, Rashmi et al. [3] propose a new concept for DSS, called the Twin-code framework, in order to examine whether it is possible to further reduce storage and repair bandwidth, and whether erasure codes can be employed in a DSS while still enjoying the benefits of efficient node repair.

In this paper we go further by investigating the security of the Twin-code framework in the presence of an illegitimate intruder who can eavesdrop on some of the storage nodes and possibly alter the data stored on some of them in order to sabotage the system. After defining the intruder model, we construct a new code based on the Twin-code framework that is secure against passive and active attacks. The ability of the Twin-code framework to reduce the storage per node and the repair bandwidth simultaneously is thereby also exploited for security. The secrecy bound achieved by the new secure construction is compared with the most relevant existing results in the literature, i.e., with the secrecy bounds at the MBR point and at the MSR point.

The rest of the paper is organized as follows: Section II describes the system model, more precisely the operation of the Twin-code framework, the intruder model used and the
secrecy capacities of the relevant existing results. In Section III we construct a Twin MDS code framework that is secure against an intruder who can observe and modify the data stored on a subset of nodes, and we calculate the secrecy capacity under such an attack. Section IV concludes the paper.

II. SYSTEM MODEL

A. Twin-code framework concept

Following the Twin-code framework in [3], all nodes in the distributed network are partitioned into two groups: nodes of Type 1 and nodes of Type 2. The data, consisting of $S_{TW} = k^2$ message symbols arranged in a $(k \times k)$ message matrix, is stored in the system by encoding it with a linear code $\mathcal{C}_1$ and distributing it over the nodes of Type 1; simultaneously, the same message matrix is transposed, encoded with a linear code $\mathcal{C}_2$ and distributed over the nodes of Type 2. If the message size is less than $k^2$, we pad it with zeros so that it becomes $k^2$ symbols long. The two codes can be of different types. In this paper we consider the case where the linear codes $\mathcal{C}_1$ and $\mathcal{C}_2$ are MDS codes over $\mathbb{F}_q$, and we refer to this Twin-code framework as the Twin MDS code. In this case, the data collector can reconstruct the entire message by contacting any $k$ nodes of the same type. When a node fails, a newcomer of a given type must contact any $k$ nodes of the opposite type and download just a single symbol from each node, i.e., $\beta = 1$. Note that, in order to guarantee the availability of such a subset of $k$ nodes, the connectivity in the Twin-code must be at least $2k - 1$.

B. Intruder Model

By the intruder model we mean an illegitimate attacker in the DSS who can access the data stored on some nodes, or can pretend to be a newcomer when some node fails and thereby receive all the data downloaded during the repair process, and who can moreover modify some of the data.
The power of an intruder is characterized by two parameters, $l$ and $b$, where $l$ denotes the number of nodes that the intruder can eavesdrop on and $b$ denotes the number of nodes it can control by corrupting their data. We consider two categories of intruders:

a) Passive eavesdropper: The eavesdropper is passive and can only read the data on the $l$ observed nodes. In this work we consider the $(l_1, l_2)$ passive eavesdropper model, $l = l_1 + l_2 < k$, defined in [5], where an eavesdropper may gain access to the data stored on a subset of $l_1$ storage nodes and to the data downloaded during the repair of $l_2$ other nodes.

b) Active limited-knowledge adversary [6]: The active adversary has limited knowledge of the data stored in the system. In particular, it has a limited eavesdropping capability $l$ ($l < k$), which is not sufficient to learn all the stored data. In addition, it can control $b$ of these $l$ nodes of its choice and maliciously corrupt their data. In distributed storage systems, an intruder controlling a node will also observe its
data. Therefore, we assume that $b \le l$ and that these $b$ nodes are a subset of the $l$ eavesdropped nodes.

C. Secrecy capacities in DSS

In this subsection we calculate the secrecy capacities under the defined intruder model ($l$ eavesdropped nodes and $b$ modified nodes). For the passive part we consider the $(l_1, l_2)$ passive eavesdropper model, which generalizes the eavesdropper model considered by Pawar et al. in [7]. The authors of [7] provide an upper bound on the number of message symbols $S^{(s)}$ that can be securely stored in the system in the presence of $l$ eavesdropped nodes:

$$S^{(s)} \le \sum_{i=l}^{k-1} \min\left(\alpha, (d-i)\beta\right). \tag{2}$$
In their model, the case in which the eavesdropper gains access only to the data stored on the nodes, $l = l_1$, corresponds to an attack at the MBR point. In a DSS with MBR codes $d\beta = \alpha$, so the replacement node downloads only the data that was originally stored. Therefore, in this case the eavesdropper cannot obtain any extra information from the repair process, and without loss of generality it may be assumed that $l_2 = 0$. Hence, the upper bound for secure MBR codes can be obtained from (2) by substituting $\alpha = d\beta$ and replacing the inequality with equality. The bound then becomes

$$S^{(s)} = \left(kd - \binom{k}{2}\right)\beta - \left(ld - \binom{l}{2}\right)\beta. \tag{3}$$

When the repair bandwidth is strictly greater than the per-node storage, as at the MSR point, an eavesdropper gains more information from the data downloaded during a node repair than from the data stored on the node. At the MSR point, Goparaju et al. [8] have established an upper bound on the achievable secure file size in an $(l_1, l_2)$ eavesdropper model, and this bound is

$$S^{(s)} = (k - l_1 - l_2)\left(1 - \frac{1}{d-k+1}\right)^{l_2} \alpha. \tag{4}$$

Motivated by this, [5] considers an $(l_1, l_2)$ eavesdropper, $l_1 + l_2 < k$, which can access the data stored on any $l_1$ nodes and/or the data downloaded during any $l_2$ node repairs. The achievability of a secure file size in this $(l_1, l_2)$ eavesdropper model is formalized by the following definition:

Definition 1 (Security Against an $(l_1, l_2)$ Eavesdropper [5]). Consider a DSS in which an eavesdropper gains access to the data stored on some $l_1$ nodes and to the data downloaded during the repair of some other $l_2$ nodes. An $(l_1, l_2)$ secure DSS with secure file size $S^{(s)}$ is one in which the eavesdropper obtains no information about the message, i.e., $I(\mathbf{f}^s; \mathbf{e}) = 0$, where $\mathbf{f}^s$ is the secure file of size $S^{(s)}$ and $\mathbf{e}$ is the eavesdropper's observation.

In this paper we use the following lemma from [5] to show that the proposed frameworks satisfy the secrecy constraints.
Lemma 1. Consider a system with secure information $\mathbf{f}^s$, random symbols $\mathbf{r}$ (independent of $\mathbf{f}^s$), and an eavesdropper with observations given by $\mathbf{e}$. If $H(\mathbf{e}) \le H(\mathbf{r})$ and $H(\mathbf{r} \mid \mathbf{f}^s, \mathbf{e}) = 0$, then the mutual information leakage to the eavesdropper is zero, i.e., $I(\mathbf{f}^s; \mathbf{e}) = 0$.

The proof of the lemma is given in [5].

The secure Twin-code framework against passive attacks is examined in [9]. In Lemma 2 of [9] we prove that the eavesdropper can observe at most $l_1 k$ independent symbols when it has access to the data stored on $l = l_1$ nodes. The secrecy bound in this case is given by

$$S_{TW}^{(s)} = k(k - l). \tag{5}$$
If, in addition to the data stored on $l_1$ nodes, the eavesdropper observes the data downloaded by $l_2$ nodes during the repair process, the maximum number of independent symbols it can reveal is $k(l_1 + l_2)$, which gives the secrecy bound

$$S_{TW}^{(s)} = k(k - l_1 - l_2). \tag{6}$$
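The bounds (3), (4), (5) and (6) can be compared numerically. The sketch below is a minimal illustration; the function names and the parameter values ($k = 3$, with $d = k$ for the MBR case and $d = 2k - 1$ for the MSR case, $\beta = 1$) are assumptions chosen for the illustration.

```python
from fractions import Fraction
from math import comb


def mbr_secure_size(k: int, d: int, l: int, beta: int = 1) -> int:
    """Secure file size at the MBR point, bound (3)."""
    return (k * d - comb(k, 2)) * beta - (l * d - comb(l, 2)) * beta


def msr_secure_size(k: int, d: int, l1: int, l2: int, alpha: int) -> Fraction:
    """Secure file size at the MSR point, bound (4), kept as an exact fraction."""
    return (k - l1 - l2) * (1 - Fraction(1, d - k + 1)) ** l2 * alpha


def twin_secure_size(k: int, l1: int, l2: int = 0) -> int:
    """Secure file size of the Twin MDS code, bounds (5) and (6)."""
    return k * (k - l1 - l2)


k = 3  # illustrative value (assumed)
mbr = mbr_secure_size(k, d=k, l=2)                           # l = l1 = 2, l2 = 0
msr = msr_secure_size(k, d=2 * k - 1, l1=1, l2=1, alpha=k)   # (l1, l2) = (1, 1)
twin = twin_secure_size(k, l1=1, l2=1)

print(mbr, msr, twin)
```

For these assumed parameters the Twin MDS value is the largest of the three.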
Based on these results, in [9] we constructed a Twin MDS code framework that is secure against passive attacks, and we showed that at both the MBR and the MSR point the Twin MDS code framework gives better security results than regenerating codes, when $\beta = 1$ and $\alpha = d = k$ or $\beta = 1$ and $d = 2k - 1$, respectively.

Considering the active attack together with the passive one, Pawar et al. [6] provide an upper bound on the number of message symbols $C^{(s)}$ that can be securely stored in the system in the presence of $l$ eavesdropped nodes and $b$ modified nodes, when $b \le l$:

$$C^{(s)} \le \sum_{i=b+1}^{k} \min\left(\alpha, (d-i)\beta\right). \tag{7}$$
In summary, the adversary can eavesdrop on $l < k$ nodes and control a subset of size $b$ of these $l$ nodes ($b \le l$). For this case, (7) gives an upper secrecy bound on the resilience capacity of an $(n, k, d)$ distributed storage system using regenerating codes.

III. SECURE CODE CONSTRUCTION AGAINST ADVERSARY

This section gives the new code construction, together with its proof, which is simultaneously resilient against passive and active attacks, unlike the construction in [9], which covers only the passive attack case.

A. Secrecy capacities of active limited-knowledge adversary

In this subsection we explain the procedure for ensuring the security of the DSS in the presence of an active limited-knowledge adversary. We assume that the adversary can eavesdrop on $l = l_1$ nodes at the MBR point and on $l = l_1 + l_2$ nodes at the MSR point, where $l < k$, and can control some subset of $b$ of these $l$ nodes. The amount of stored data that the adversary knows is limited to what it can deduce from the observed nodes. Clearly, if $l \ge k$, the adversary has unlimited knowledge of the data stored in the system. As in the passive eavesdropper model, we assume that the adversary knows the coding and decoding schemes at every node in the system. Let $c = l - b$ denote the number of nodes that are observed but not modified by the intruder, $b$ the number of modified nodes and $l$ the total number of nodes affected by the adversary. The secrecy bounds for regenerating codes at the MBR and MSR points are then as follows.

At the MBR point, the maximum number of secure message symbols when an active limited-knowledge adversary eavesdrops on $l = l_1$ nodes and modifies $b$ of these $l$ nodes is given by (3) and, for the parameters $\alpha = d = k$, $\beta = 1$, equals

$$S^{(s)} = \frac{k - (b + c)}{2} \cdot \left(k + 1 - (b + c)\right). \tag{8}$$

The maximum number of secure message symbols of a DSS with an MSR regenerating code when an active limited-knowledge adversary eavesdrops on $l = l_1 + l_2$ nodes and modifies $b$ of these $l$ nodes is given by (4) and, for the parameters $\alpha = k$, $\beta = 1$ and $d = 2k - 1$, equals

$$S^{(s)} = \left(k - (b + c)\right)\left(\frac{k-1}{k}\right)^{(b+c)-l_1} k. \tag{9}$$
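For completeness, the substitutions that lead from (3) and (4) to (8) and (9) can be written out step by step. This is a short worked derivation using only $l = b + c$, $l_2 = l - l_1$ and the stated parameter choices.

```latex
\begin{align*}
% MBR point: substitute \alpha = d = k, \beta = 1 into (3)
S^{(s)} &= \left(k \cdot k - \binom{k}{2}\right) - \left(l k - \binom{l}{2}\right)
         = \frac{k(k+1)}{2} - \frac{l(2k - l + 1)}{2} \\
        &= \frac{(k-l)^2 + (k-l)}{2}
         = \frac{k-l}{2}\,(k + 1 - l), \qquad l = b + c, \\[6pt]
% MSR point: substitute \alpha = k, \beta = 1, d = 2k - 1 into (4)
S^{(s)} &= (k - l_1 - l_2)\left(1 - \frac{1}{(2k-1) - k + 1}\right)^{l_2} k
         = \bigl(k - (b+c)\bigr)\left(\frac{k-1}{k}\right)^{(b+c) - l_1} k .
\end{align*}
```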
B. Code construction

Next, we construct a Twin MDS code in the presence of an active limited-knowledge adversary that achieves a secure file size greater than that of a DSS with a regenerating code.

Our coding scheme is built on the Twin MDS code in such a way that the message matrix is first modified by $S_{TW} - S_{TW}^{(s)}$ random symbols, where, from [9], $S_{TW} = k^2$ and $S_{TW}^{(s)}$ equals $k(k - l)$ or $k(k - l_1 - l_2)$, the number of message symbols that can be securely stored using a Twin MDS code under the $(l_1, l_2)$ eavesdropper model. The following lemma is similar to Lemma 2 of [9]:

Lemma 2. Assume that an active limited-knowledge adversary:
1) has access to the data stored on any $l_1$ nodes;
2) observes the data downloaded by $l_2$ other nodes that are under repair; and
3) controls a subset of $b \le l$ of these $l = l_1 + l_2$ nodes.
If these conditions are fulfilled, the adversary can observe at most $kl$ or $k(l_1 + l_2)$ independent symbols and modify at most $bk$ independent symbols, respectively.

Proof. The proof is analogous to the proof of Lemma 2 in [9] and is omitted for lack of space.

We now explain our coding scheme in the presence of an active limited-knowledge adversary. There are $n_1$ Type 1 nodes and $n_2$ Type 2 nodes, where $n = n_1 + n_2$ is the total number of storage nodes in the network. The storage nodes of both types have the same characteristics.

Let $\mathbf{f}^s$ be the secure information of size $S_{TW}^{(s)} = k(k - l)$ or $S_{TW}^{(s)} = k(k - l_1 - l_2)$ over $\mathbb{F}_q$, i.e., $\mathbf{f}^s = (a_1, a_2, \dots, a_{k(k-l)})$ or
$\mathbf{f}^s = (a_1, a_2, \dots, a_{k(k-l_1-l_2)})$. We take $kl$ or $k(l_1 + l_2)$ i.i.d. random symbols $\mathbf{r} = (r_1, \dots, r_{kl})$ or $\mathbf{r} = (r_1, \dots, r_{k(l_1+l_2)})$, distributed uniformly at random over $\mathbb{F}_q$, and prepend $\mathbf{r}$ to obtain $\mathbf{f} = (\mathbf{r}, \mathbf{f}^s) \in \mathbb{F}_q^{k^2}$, which is encoded in the following manner:

∙ Arrange the message $\mathbf{f} = (f_1, \dots, f_{k^2})$, row by row, into a $(k \times k)$ matrix $A_1$, called the message matrix:

$$A_1 = \begin{bmatrix}
r_1 & r_2 & \dots & r_k \\
\vdots & \vdots & & \vdots \\
r_{(l-1)k+1} & r_{(l-1)k+2} & \dots & r_{lk} \\
a_1 & a_2 & \dots & a_k \\
a_{k+1} & a_{k+2} & \dots & a_{2k} \\
\vdots & \vdots & & \vdots \\
a_{(k-l-1)k+1} & a_{(k-l-1)k+2} & \dots & a_{k(k-l)}
\end{bmatrix}.$$

Let $A_2 \triangleq A_1^T$, where the superscript $T$ denotes the transpose of a matrix.

∙ For $i = 1, 2$, each node of Type $i$ stores the $k$ symbols of the corresponding column of the $(k \times n_i)$ matrix $A_i G_i$, where $G_i$ are the generator matrices; that is, node $j$ ($1 \le j \le n_i$) of Type $i$ stores the $j$-th column of the matrix $A_i G_i$, given by $A_i \mathbf{g}_{(i,j)}$.

∙ For a more intuitive representation, we introduce the following notation for the symbols stored in the nodes: the first node of Type $i$ is denoted by $\mathbf{x}_{(i,1)} = (x^i_1, \dots, x^i_k)$, the second node of Type $i$ by $\mathbf{x}_{(i,2)} = (x^i_{k+1}, \dots, x^i_{2k})$, and so on up to the last, $n_i$-th node of Type $i$, $\mathbf{x}_{(i,n_i)} = (x^i_{(n_i-1)k+1}, \dots, x^i_{n_i k})$, where $x^i_m \in \mathbb{F}_q$ for $m = 1, \dots, n_i k$ and each $i \in \{1, 2\}$.

With this encoding algorithm every node stores $k$ symbols, and each node $j$ of Type $i$ is associated with a different column $\mathbf{g}_{(i,j)}$ of $G_i$. This allows the data to be encoded and mapped onto the network. The data collector can reconstruct the entire message by contacting any $k$ nodes of the same type. When a node fails, the newcomer of a given type must contact any $k$ nodes of the opposite type. To recover the lost data, just a single symbol is downloaded from each node, that is, $\beta = 1$. For a successful repair, $k$ nodes of Type 1 and $k$ nodes of Type 2 must be alive at any moment.
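The encoding and reconstruction steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the field size `Q = 7`, the small generator matrices `G1` and `G2` (with $k = 2$, $n_1 = n_2 = 3$), the fixed values standing in for the random symbols, and all function names are hypothetical choices made for the example.

```python
Q = 7  # prime field size (assumed for the illustration)


def message_matrix(r, fs, k):
    """Prepend the random symbols r to the secure symbols fs and fill A1 row by row."""
    f = list(r) + list(fs)
    assert len(f) == k * k
    return [f[i * k:(i + 1) * k] for i in range(k)]


def encode(A, G, q=Q):
    """Node j stores the j-th column of A G over F_q; returns one list per node."""
    k, n = len(A), len(G[0])
    return [[sum(A[row][t] * G[t][j] for t in range(k)) % q for row in range(k)]
            for j in range(n)]


def solve_mod(M, rhs, q=Q):
    """Solve M x = rhs over F_q (q prime) by Gauss-Jordan elimination.
    Raises StopIteration if M is singular."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c] % q)
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], -1, q)  # modular inverse (Python 3.8+)
        aug[c] = [v * inv % q for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                factor = aug[r][c]
                aug[r] = [(a - factor * b) % q for a, b in zip(aug[r], aug[c])]
    return [aug[i][n] for i in range(n)]


def reconstruct(nodes, G, J, q=Q):
    """Data collector: recover the message matrix from the k nodes indexed by J."""
    k = len(G)
    M = [[G[t][j] for t in range(k)] for j in J]  # one equation per contacted node
    return [solve_mod(M, [nodes[j][row] for j in J], q) for row in range(k)]


# Hypothetical (3, 2) MDS generator matrices over F_7.
G1 = [[1, 0, 1], [0, 1, 1]]
G2 = [[1, 0, 1], [0, 1, 2]]

k, l = 2, 1
r = [3, 5]    # k*l symbols; fixed here, drawn uniformly at random in practice
fs = [1, 6]   # k*(k-l) secure message symbols
A1 = message_matrix(r, fs, k)
A2 = [list(col) for col in zip(*A1)]  # transpose of A1

type1 = encode(A1, G1)
type2 = encode(A2, G2)
print(reconstruct(type1, G1, [1, 2]) == A1, reconstruct(type2, G2, [0, 2]) == A2)
```

Solving one small linear system per row of the message matrix is only one way to perform the reconstruction; any erasure decoder for the chosen MDS code would serve the same purpose.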
For example, if node $m$ of Type 1 fails, the newcomer (a replacement node) must recover the $k$ symbols $A_1 \mathbf{g}_{(1,m)}$. To do so, the newcomer contacts $k$ helper nodes of Type 2. The $j_r$-th helper node ($1 \le j_r \le n_2$), for all $r$, $1 \le r \le k$, sends the product of the encoding vector $\mathbf{g}_{(1,m)}$ with its own $k$ symbols $A_2 \mathbf{g}_{(2,j_r)}$, i.e., $\mathbf{g}_{(1,m)}^T A_2 \mathbf{g}_{(2,j_r)}$. The replacement node thus obtains the $k$ symbols

$$\mathbf{g}_{(1,m)}^T A_2 \left[\mathbf{g}_{(2,j_1)} \ \dots \ \mathbf{g}_{(2,j_k)}\right]. \tag{10}$$

Defining

$$\mathbf{v}^T \triangleq \mathbf{g}_{(1,m)}^T A_2, \tag{11}$$

the newcomer has access to

$$\mathbf{v}^T \left[\mathbf{g}_{(2,j_1)} \ \dots \ \mathbf{g}_{(2,j_k)}\right] \tag{12}$$

and $\mathbf{v}^T$ is recovered by erasure decoding of the MDS code $\mathcal{C}_2$. Therefore, the $k$ symbols that have to be recovered at the newcomer are the symbols contained in the vector

$$A_1 \mathbf{g}_{(1,m)} = \left(\mathbf{g}_{(1,m)}^T A_2\right)^T = \mathbf{v}. \tag{13}$$

Thus, the repair of a Type 1 node reduces to erasure decoding of the code $\mathcal{C}_2$.

∙ To each data packet $x^i_m$, $i = 1, 2$, we append a hash vector $\mathbf{h}^i_m = (h^i_{(m,1)}, \dots, h^i_{(m,q)})$ from $\mathbb{F}_q$. The values of these hashes are computed as

$$h^i_{(m,p)} = x^i_m \, {x^i_p}^T, \tag{14}$$

for all $m = 1, \dots, n_i k$ and $p = 1, \dots, q$. These hashes are stored on the nodes in the network, in addition to the data. For simplicity, we assume that the hashes stored on the nodes are secured against the intruder operating as an active limited-knowledge adversary, so that it can neither observe nor corrupt them. However, although the adversary cannot directly observe the hash values, it can generate some of them from the data packets observed on the $l$ compromised nodes, since it knows the coding scheme. Therefore, an active adversary that corrupts the data stored on the subset of $b$ out of the $l$ compromised nodes can use the computed hash values to introduce errors into the data symbols that remain consistent with these hash values. The main advantage of the hashes is that they allow the data collector to determine whether the data stored on some nodes has been maliciously changed. How the data collector becomes aware of possibly corrupted data by using the hashes is explained later in the paper.

Next, we present the following theorem for the general coding scheme described above.

Theorem 1. 1) The code based on the Twin MDS code that modifies the message matrix with $kl$ random symbols and appends hash values to each data packet, as explained above, achieves a secure file size

$$C_{TW}^{(s)} = k(k - (b + c)) \tag{15}$$

if $c = l - b$, when an active limited-knowledge adversary has access to the data stored on any $l_1$ nodes and controls a subset of $b \le l$ of these $l = l_1$ nodes at the MBR point with $\alpha = k$ and $\beta = 1$.

2) The code based on the Twin MDS code that modifies the message matrix with $k(l_1 + l_2)$ random symbols and appends hash values to each data packet, as explained above, achieves a secure file size

$$C_{TW}^{(s)} = k(k - (b + c)) \tag{16}$$
if $c = (l_1 + l_2) - b$, when an active limited-knowledge adversary has access to the data stored on any $l_1$ nodes, observes the data downloaded by $l_2$ nodes that are under repair, and controls a subset of $b \le l$ of these $l = l_1 + l_2$ nodes at the MSR point with $\alpha = k$, $\beta = 1$ and $d = 2k - 1$.

Type 1 nodes
Node 1 | Node 2 | Node 3 | Node 4
$x^1_1 = r_1$ | $x^1_4 = r_2$ | $x^1_7 = r_3$ | $x^1_{10} = r_1 + r_2 + r_3$
$x^1_2 = r_4$ | $x^1_5 = r_5$ | $x^1_8 = r_6$ | $x^1_{11} = r_4 + r_5 + r_6$
$x^1_3 = a_7$ | $x^1_6 = a_8$ | $x^1_9 = a_9$ | $x^1_{12} = a_7 + a_8 + a_9$
$h^1_1, h^1_2, h^1_3$ | $h^1_4, h^1_5, h^1_6$ | $h^1_7, h^1_8, h^1_9$ | $h^1_{10}, h^1_{11}, h^1_{12}$

Type 2 nodes
Node 1 | Node 2 | Node 3 | Node 4 | Node 5
$x^2_1 = r_1$ | $x^2_4 = r_2$ | $x^2_7 = r_1 + r_2 + r_3$ | $x^2_{10} = r_1 + 3r_2 + 2r_3$ | $x^2_{13} = r_1 + 2r_2 + 3r_3$
$x^2_2 = r_4$ | $x^2_5 = r_5$ | $x^2_8 = r_4 + r_5 + r_6$ | $x^2_{11} = r_4 + 3r_5 + 2r_6$ | $x^2_{14} = r_4 + 2r_5 + 3r_6$
$x^2_3 = a_7$ | $x^2_6 = a_8$ | $x^2_9 = a_7 + a_8 + a_9$ | $x^2_{12} = a_7 + 3a_8 + 2a_9$ | $x^2_{15} = a_7 + 2a_8 + 3a_9$
$h^2_1, h^2_2, h^2_3$ | $h^2_4, h^2_5, h^2_6$ | $h^2_7, h^2_8, h^2_9$ | $h^2_{10}, h^2_{11}, h^2_{12}$ | $h^2_{13}, h^2_{14}, h^2_{15}$

Fig. 1: Security in the Twin (4, 3) MDS - (5, 3) MDS code in the presence of an active attack with $b = 1$ and $(l_1, l_2) = (1, 1)$. a) Blue-edged square: symbols obtained by the repair of Node 2 of Type 2. b) Yellow square: symbols controlled by the active limited-knowledge adversary. c) Red square: compromised symbols, i.e., the observed symbols of Node 1 of Type 1, which can be modified, and the symbols gained from a). d) The last row of each table lists the hash values of the corresponding stored symbols.

Proof. 1) The repair and data reconstruction properties of the proposed code follow from the code construction in [3]. We use Lemma 1 to prove the security of this code against an $(l_1, l_2)$ eavesdropper with $l_2 = 0$, $l = l_1$ and $b$ modified nodes. Since $\mathbf{e}$ denotes the symbols observed by the eavesdropper and, by the assumption $b \le l$, the number of observed symbols is never smaller than the number of modified symbols, we need to show: (i) $H(\mathbf{e}) \le H(\mathbf{r})$ and (ii) $H(\mathbf{r} \mid \mathbf{f}^s, \mathbf{e}) = 0$. It follows from Lemma 2, 1) that the eavesdropper observes $kl$ independent symbols, and since $|\mathbf{e}| = kl$, it follows that $H(\mathbf{e}) = H(\mathbf{r})$, which is the first requirement for establishing the security claim. It remains to show that $H(\mathbf{r} \mid \mathbf{f}^s, \mathbf{e}) = 0$, i.e., that given the message symbols as side information, the eavesdropper can decode all the random symbols. To this end, without loss of generality we assume that the eavesdropper gains access to the data stored on $l$ nodes of Type 1 and modifies $b$ nodes, with $b \le l$. Now, define $A_1^{(s)}$ as the $(k \times k)$ matrix obtained by setting all symbols of the secure file $\mathbf{f}^s$ in $A_1$ to zero. Thus $A_1^{(s)}$ has its first $l$ columns identical to those of $A_1$, and zeros elsewhere. Let $\tilde{\mathbf{e}} = A_1^{(s)} \left[\mathbf{g}_{(1,1)} \ \dots \ \mathbf{g}_{(1,l)}\right]$ be the $lk$ symbols that the eavesdropper has access to, given the secure message symbols as side information. The MDS property of the code $\mathcal{C}_1$ guarantees the linear independence of the corresponding $l$ columns of the generator matrix, $\left[\mathbf{g}_{(1,1)} \ \dots \ \mathbf{g}_{(1,l)}\right]$.
So, recovering the random symbols $\mathbf{r}$ from $\tilde{\mathbf{e}}$ is identical to data reconstruction in the original code $\tilde{\mathcal{C}}_1$ designed for $(n_1, k = l)$ with no eavesdroppers. Thus, given the secure message symbols, the eavesdropper can decode all the random symbols, i.e., $H(\mathbf{r} \mid \mathbf{f}^s, \mathbf{e}) = 0$.
2) The proof is similar to that of Theorem 1, 1).
Corollary 2. The secure file size in the presence of an active limited-knowledge adversary, when $l = b$, becomes

$$C_{TW}^{(s)} = k(k - b). \tag{17}$$
Proof. Since $c = l - b$, substituting $b = l$ gives $c = 0$. Substituting this into (15) and (16), we obtain a secure file size of $k(k - b)$.

C. Comparison of the secrecy capacities

In Fig. 1 we present an example of a secure Twin MDS framework with parameters $n_1 = 4$, $n_2 = 5$ and $k = 3$ in the presence of a $(2, 0)$ and a $(1, 1)$ active limited-knowledge adversary that controls and can modify $b = 1$ node. Moreover, this adversary eavesdrops on the data downloaded during the repair of Node 2 of Type 2 and at the same time controls and can modify Node 1 of Type 1. Using (8) and (9), we derive the number of secure message symbols in a DSS with a regenerating code for the parameters given above. By (15) and (16) we construct a secure Twin MDS framework that achieves 3 secure symbols. First, we fill the message matrix of 9 symbols with six random symbols and three secure message symbols. The new message is $[r_1, r_2, r_3, r_4, r_5, r_6, a_7, a_8, a_9]$, and it is stored by the Twin (4, 3) - (5, 3) MDS code on 4 nodes of Type 1 and 5 nodes of Type 2. The Twin MDS code operates over $\mathbb{F}_7$ and its generator matrices are

$$G_1 = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \quad
G_2 = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 \\ 0 & 1 & 1 & 3 & 2 \\ 0 & 0 & 1 & 2 & 3 \end{bmatrix}.$$

Finally, to each symbol in both types of nodes we append hash values defined by (14).

Fig. 1 shows all the compromised symbols obtained by eavesdropping on the data stored on Node 1 of Type 1 and on the data downloaded during the repair of Node 2 of Type 2. From the directly observed Node 1 of Type 1, the eavesdropped symbols are $\{r_1, r_4, a_7\}$, and from the data downloaded for repairing Node 2 of Type 2 the symbols are $\{r_4, r_6, r_4 + r_5 + r_6\}$. So, the eavesdropped symbols in total are $\{r_1, r_4, r_5, r_6, a_7\}$, and therefore the intruder cannot reveal the whole message. In this example the active limited-knowledge adversary also controls and can modify Node 1 of Type 1, so the intruder has access to the three symbols $\{r_1, r_4, a_7\}$.
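The repair step of this example can be traced numerically. The sketch below follows the encoding rule stated in the text (node $j$ of Type $i$ stores $A_i \mathbf{g}_{(i,j)}$) with the generator matrices given above; the numeric values assigned to $r_1, \dots, r_6$ and $a_7, a_8, a_9$, the choice of Nodes 1, 3 and 4 of Type 1 as helpers, and the function names are assumptions made for the illustration.

```python
Q = 7  # the example operates over F_7

G1 = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]
G2 = [[1, 0, 1, 1, 1], [0, 1, 1, 3, 2], [0, 0, 1, 2, 3]]


def column(G, j):
    """j-th column of G, i.e. the encoding vector of node j (0-based index)."""
    return [row[j] for row in G]


def dot(u, v, q=Q):
    """Inner product over F_q."""
    return sum(a * b for a, b in zip(u, v)) % q


def solve_mod(M, rhs, q=Q):
    """Solve M x = rhs over F_q (q prime) by Gauss-Jordan elimination."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c] % q)
        aug[c], aug[piv] = aug[piv], aug[c]
        inv = pow(aug[c][c], -1, q)  # modular inverse (Python 3.8+)
        aug[c] = [v * inv % q for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                factor = aug[r][c]
                aug[r] = [(a - factor * b) % q for a, b in zip(aug[r], aug[c])]
    return [aug[i][n] for i in range(n)]


# Assumed numeric values: rows are (r1 r2 r3), (r4 r5 r6), (a7 a8 a9).
A1 = [[2, 5, 1], [4, 6, 3], [1, 0, 5]]
A2 = [list(col) for col in zip(*A1)]

type1 = [[dot(row, column(G1, j)) for row in A1] for j in range(4)]
type2 = [[dot(row, column(G2, j)) for row in A2] for j in range(5)]

failed = 1           # Node 2 of Type 2
helpers = [0, 2, 3]  # Nodes 1, 3 and 4 of Type 1 (assumed helper set)

# Each helper sends one symbol: the failed node's encoding vector times its own content.
g_failed = column(G2, failed)
downloads = [dot(g_failed, type1[j]) for j in helpers]

# Erasure decoding of code C1: solve v^T [g_(1,j1) ... g_(1,jk)] = downloads for v.
repaired = solve_mod([column(G1, j) for j in helpers], downloads)
print(downloads, repaired, repaired == type2[failed])
```

With these values the three downloaded symbols are $r_4$, $r_6$ and $r_4 + r_5 + r_6$ (modulo 7), which is exactly what an eavesdropper observing this repair would see.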
There is a risk that the data collector downloads a corrupted message if it decides to contact $k = 3$ nodes of Type 1 that include Node 1. In the worst-case scenario, the data collector contacts three nodes, Nodes 1, 2 and 4 of Type 1, and observes 9 symbols with their hash values, 3 of which are corrupted. Moreover, the intruder maliciously changes the data stored on Node 1 of Type 1 from $x^1_m$ to $y^1_m$ for all $m = 1, 2, 3$, where the symbol $y^i_m$, $i = 1, 2$, denotes the possibly corrupted version of $x^i_m$. Therefore, the data collector computes its own hash values, similarly to (14), $\hat{\mathbf{h}}^i_m = (\hat{h}^i_{(m,1)}, \dots, \hat{h}^i_{(m,q)})$, by

$$\hat{h}^i_{(m,p)} = y^i_m \, {y^i_p}^T, \tag{18}$$

and then compares them to the obtained hashes. Table I is an example of such a comparison table, where a "✓" indicates that the computed hash and the observed hash match, and a "×" that they do not match.

TABLE I: Comparison of the real and estimated hash value.
Hash ($H$) | $h^1_1$ | $h^1_2$ | $h^1_3$ | $h^1_4$ | $h^1_5$ | $h^1_6$ | $h^1_{10}$ | $h^1_{11}$ | $h^1_{12}$
Match | × | × | × | ✓ | ✓ | ✓ | ✓ | ✓ | ✓
Hash ($\hat{H}$) | $\hat{h}^1_1$ | $\hat{h}^1_2$ | $\hat{h}^1_3$ | $\hat{h}^1_4$ | $\hat{h}^1_5$ | $\hat{h}^1_6$ | $\hat{h}^1_{10}$ | $\hat{h}^1_{11}$ | $\hat{h}^1_{12}$

The data collector now notices that the first three hash values do not match the computed ones, so it decides to contact a node other than the corrupted Node 1 of Type 1 in order to reconstruct the complete and valid information.

In the repair process, a subset of $k = 3$ nodes of the type opposite to that of the failed node is contacted. If Node 1 of Type 1 is part of the contacted subset, it poses a possible difficulty due to the modified $r_4$ symbol. After the repair process, the data repaired on the newcomer will differ from the data previously stored on Node 2 of Type 2. Table II compares the new and the old values of the downloaded data.

TABLE II: Comparison of the real and estimated hash value.
Hash ($H$) | $h_2$ | $h_8$ | $h_{11}$
Match | × | ✓ | ✓
Hash ($\hat{H}$) | $\hat{h}_2$ | $\hat{h}_8$ | $\hat{h}_{11}$

In Fig. 2 we compare the secrecy capacity of the Twin MDS code storage system, (15) and (16), with that of a DSS with regenerating codes at the MBR point (8) and at the MSR point (9) in the presence of an active limited-knowledge adversary. Again, we notice that the secrecy capacity of the Twin MDS code is greater than that of the regenerating code in both cases, i.e., when an MBR code and when an MSR code is used.

[Figure 2: plot of the number of secure message symbols (vertical axis, 0 to 900) versus the number of read-access eavesdroppers (horizontal axis, 0 to 30). Curves: Twin MDS code at MBR point, MBR code, Twin MDS code at MSR point, MSR code.]
Fig. 2: Comparison of the secrecy capacity in Twin MDS (squares) and regenerating code (circles) at MBR and MSR points under active attack.

IV. CONCLUSION

In this paper we considered the problem of securing a distributed storage system in the presence of an illegitimate attacker in two scenarios: first as a passive eavesdropper, and second as an active limited-knowledge adversary. For both categories of intruders, passive eavesdropper and active limited-knowledge adversary, our Twin MDS frameworks are similar. The only difference is that small hash values are stored on the nodes in the network, in addition to the data, when the intruder can corrupt or modify the stored data. By comparing these hash values with the hash values that it computes after contacting $k$ nodes to reconstruct the information, and finding any mismatches, the data collector becomes aware of a possible intruder in the system. Therefore, in both scenarios our framework is reliable and secure. Moreover, we show that these Twin MDS frameworks require a larger total storage, but provide better secrecy and reliability than regenerating codes.

REFERENCES
[1] T. James, "IBMs big data platform and decision management," May 2011. [Online]. Available: http://www-01.ibm.com/software/data/bigdata/
[2] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin, "Erasure coding in windows azure storage," in Proc. of USENIX ATC, Jun. 2012.
[3] K. V. Rashmi, N. B. Shah, and P. V. Kumar, "Enabling node repair in any erasure code for distributed storage," in Proc. IEEE ISIT, Jul. 2011, pp. 1235–1239.
[4] A. G. Dimakis, P. B. Godfrey, Y. Wu, and K. Ramchandran, "Network coding for distributed storage systems," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4539–4551, Sep. 2010.
[5] N. B. Shah, K. V. Rashmi, and P. V. Kumar, "Information-theoretically secure regenerating codes for distributed storage," in Proc. of IEEE Globecom, Mar. 2011.
[6] S. Pawar, S. El Rouayheb, and K. Ramchandran, "Securing dynamic distributed storage systems against eavesdropping and adversarial attacks," IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6734–6753, 2011.
[7] S. Pawar, S. El Rouayheb, and K. Ramchandran, "On secure distributed data storage under repair dynamics," in Proc. IEEE ISIT, Jun. 2010.
[8] S. Goparaju, S. El Rouayheb, R. Calderbank, and H. V. Poor, "Data secrecy in distributed storage systems under exact repair," in Proc. Symp. Netw. Coding, 2013, pp. 1–6.
[9] N. Marina, A. Velkoska, N. Paunkoska, and L. Baleski, "Security in twin-code framework," in ICUMT 2015, 2015.