PRIVATE INFORMATION RETRIEVAL FROM CODED DATABASES WITH COLLUDING SERVERS∗
arXiv:1611.02062v1 [cs.IT] 7 Nov 2016
RAGNAR FREIJ-HOLLANTI† , OLIVER GNILKE‡ , CAMILLA HOLLANTI§ , AND DAVID KARPUK¶ Abstract. We present a general framework for Private Information Retrieval (PIR) from arbitrary coded databases, that allows one to adjust the rate of the scheme according to the suspected number of colluding servers. If the storage code is a generalized Reed-Solomon code of length n and dimension k, we design PIR schemes which simultaneously protect against t colluding servers , for all t such that 1 ≤ t ≤ n − k. This interpolates between the and provide PIR rate 1 − k+t−1 n previously studied cases of t = 1 and k = 1 and asymptotically achieves the known capacity bounds in both of these cases, as the size of the database grows. Key words. Coded Private Information Retrieval, Distributed Storage Systems, Generalized Reed-Solomon Codes AMS subject classifications. 68P20, 68P30, 94B27, 14G50
1. Introduction. Private information retrieval (PIR) addresses the question of how to retrieve data items from a database without disclosing information about the identity of the data items retrieved, and was introduced by Chor, Goldreich, Kushilevitz and Sudan in [4, 5]. The classic PIR model of [5] views the database as an n-bit binary string x = (x1 , . . . , xn ) ∈ {0, 1}n , and assumes that the user wants to retrieve a single bit xi without revealing any information about the index i. The rate of a PIR scheme is measured as the ratio between the size of the desired data item and the number of downloaded symbols, while upload costs of the requests are usually ignored. The trivial solution is to download the entire database. This, however, incurs a significant communication overhead whenever the database is large, and is therefore not useful in practice. While the trivial solution is the only way to guarantee information-theoretic privacy in the case of a single server [5], this problem can be remedied by replicating the database onto k servers that do not communicate, as in [5, 8, 7] and the references therein. The study of PIR recently received renewed attention, when Shah, Rashmi, Ramchandran and Kumar introduced a model of coded private information retrieval (cPIR) [12, 11]. Here, all files are distributed over the servers according to a storage code, so there is no assumption of the contents of all the servers being identical. It is shown in [11] that for a suitably constructed storage code, privacy can be guaranteed by downloading a single bit more than the size of the desired file. However, this requires exponentially many servers in terms of the number of files. Later, Blackburn, Etzion and Paterson achieved the same low download complexity with a linear number of servers [3]. This is a vast improvement upon [11], but still far from applicable storage systems where the number of files tends to dwarf the number of servers. ∗ Submitted
to the editors November 8, 2016. Funding: D. Karpuk is supported by Academy of Finland grant 268364. C. Hollanti is supported by Academy of Finland grants 276031, 282938, and 303819. All authors are with the Department of Mathematics and Systems Analysis, Aalto University, Finland. † Email:
[email protected] ‡ Email:
[email protected] § Email:
[email protected] ¶ Email:
[email protected] 1
While [11] effectively answered the question on how low the communication cost of a cPIR scheme can be, it highlighted another cost parameter that should not be neglected in the era of big data, namely the storage overhead. We define the storage overhead as the ratio of the total number of coded bits stored on all servers to the total number of uncoded bits of data. Fazeli, Vardy and Yaakobi showed in [9] that it is possible to reduce the storage overhead significantly. More precisely, they show that all known k-server PIR protocols can be efficiently emulated, while preserving their privacy and communication complexity (up to a constant), while reducing the storage overhead to 1 + ε. However, this requires subpacketizing the file and distributing it over a number of servers that grows like ε−1 . Subsequent work improving the storage overhead when the fraction of the database stored on each node is kept fixed, appears in [2, 18]. In contrast to the schemes in [11] and [9], whose strengths appear as the number of servers tends to infinity, we are considering the setting where we are given a storage system with a fixed number of servers. Moreover, we allow some subsets of the servers to collude, by which we mean that they may inform each other of their interaction with the user. This is very natural in a distributed storage system where communication between servers is required to recover data in the case of node failures. PIR over fixed maximum distance separable (MDS)-coded storage systems were considered in [16]. There, two PIR schemes were presented for arbitrary [n, k, d]-MDS codes, one of which had rate n1 and protected against t = n − k colluding servers, and the other of which had rate n−k and t = 1. In Theorem 8 we present these as special instances of a n scheme that can handle any number 1 ≤ t ≤ n − k of colluding servers. Curiously, neither the description nor the performance of our scheme depends on the number of files stored. The rate of our scheme depends on the existence of certain orthogonal Schur products, and in the case where the storage code is a generalized Reed Solomon (GRS) code (Theorem 9), the rate is n−k−t+1 . Our techniques extend nicely to n the case of non-responsive and Byzantine server models, which have previously been studied in the case of replication storage [10, 6]. The details of this extension is work in progress. The maximum possible rate of a PIR scheme for a replicated storage system was derived in [14] without collusion, and in [15] with colluding servers. The corresponding capacity of a coded storage system was given in [1], in the case of non-colluding servers. We conjecture (Conjecture 1) a formula for the capacity of coded PIR with colluding servers, that gives the capacity bounds of [15, 1] as special cases. Assuming our conjecture, our schemes are asymptotically capacity-achieving as the number of files goes to infinity. 2. Coding-Theoretic Preliminaries. In this section we briefly collect some standard coding-theoretic definitions and results that we will need in the following sections. As is commonplace, we will use Fq to denote the field with q elements, where q is a prime power, and [n] for the set {1, 2, . . . , n}. For any two vectors v, w ∈ Fnq , we denote their standard inner product by hv, wi. If V ⊆ Fnq , then we denote its perpendicular complement by (1)
V ⊥ = {w ∈ Fnq | hv, wi = 0 for all v ∈ V }
and we write V ⊥ W if W ⊆ V ⊥ . For a code C ⊆ Fnq we use CI to denote its projection to the coordinates in I ⊆ [n]. The support of a codeword c ∈ Fnq is supp(c) = {i ∈ [n] : ci 6= 0}, and the support of 2
a code C ⊆ Fnq is supp(C) = ∪c∈C supp(c). The minimum distance of C ⊆ Fnq is (2)
d = dC = min{|I| : |C[n]\I | < |C|}.
For linear codes, this can alternatively be written as (3)
d = min{| supp(c)| : c ∈ C}.
A linear code C ⊆ Fnq of dimension k and with minimum distance d is called an [n, k, d]-code, or an [n, k, d]q -code if we wish to emphasize the field of definition. By an elementary result that is usually attributed to Singleton [13], if C is an [n, k, d]-code, then (4)
d + k − 1 ≤ n.
A code that satisfies (4) with equality is called a maximum distance separable (MDS) code. The repetition code Rep(n)q ⊆ Fnq is the one-dimensional code generated by the all-ones vector. It is an MDS code with parameters [n, 1, n]. Definition 1 (GRS Codes). Let α = (α1 , . . . , αn ) ∈ Fnq satisfy αi 6= αj for i 6= j, n and let v = (vi ) ∈ F× q . We define the generalized Reed-Solomon code of dimension k associated to these n-tuples to be (5)
GRSk (α, v) = {(vi f (αi ))1≤i≤n | f ∈ Fq [x], deg(f ) < k} .
The canonical generator matrix for this code is given by 1 ··· 1 α1 ··· αn 2 α1 · · · αn2 (6) G(α, v) := · diag(v), .. .. .. . . . k−1 k−1 α1 ··· αn where diag(v) is the diagonal matrix with the values vi on the diagonal. In data storage applications, it is often desirable to have an explicit encoding matrix that is systematic, i.e. having an identity submatrix in the first k columns. For this purpose, define f1 (α1 ) · · · f1 (αn ) .. · diag(v), .. ˜ (7) G(α, v) := ... . . fk (α1 ) · · ·
fk (αn )
where (8)
Y
fi (x) = vi−1
j∈[k]\{i}
x − αj αi − αj
for i = 1, . . . , k. Note that fi (αi ) = vi−1 and fi (αj ) = 0 for j ∈ [k] \ {i}, hence this is a systematic generator matrix for GRSk (α, v). GRS codes are maximum distance separable (MDS) with parameters [n, k, n−k + 1]. From the Lagrange interpolation formula, it follows that the dual of GRSk (α, v) is given by GRSn−k (α, u) where Y (9) ui = (vi (αi − αj ))−1 . j6=i
3
Definition 2. Let V, W be sub-vector spaces of Fnq . We define the star (or Schur) product V ? W to be the vector space generated by the Hadamard products v ? w = (v1 w1 , . . . , vn wn ) for all pairs v ∈ V, w ∈ W . The following proposition collects some basic properties of the star product that will prove useful in the coming sections. Proposition 3. The star product satisfies the following properties: (i) If C is any linear code in Fnq and Rep(n)q ⊆ Fnq is the repetition code of length n over Fq , then C ? Rep(n)q = C. (ii) If C, D, and H are any linear codes in Fnq then (C ? D) ⊥ H if and only if (C ? H) ⊥ D. (iii) If C and D are any linear codes in Fnq with supp(C) = supp(D) = [n], and (C ? D)⊥ = H, then dH ≥ dC ⊥ + dD⊥ − 2. (iv) If C ⊆ Fnq is any MDS code, then (C ? C ⊥ )⊥ = Rep(n)q . (v) The star product of two generalized Reed-Solomon codes in Fnq with the same parameter α is again a generalized Reed-Solomon code with parameter α. More specifically, GRSk (α, v) ? GRS` (α, w) = GRSmin{k+`−1,n} (α, v ? w) for any parameters v, w. Proof. Properties (i) and (ii) follow immediately from the definition of the star product. Property (iii) is Theorem 5 of [17]. To see that property (iv) holds, let H = (C ? C ⊥ )⊥ . The containment Rep(n)q ⊆ H is obvious for any code C. If C is MDS with parameters [n, k, n−k+1] then C ⊥ is MDS with parameters [n, n−k, k+1], so property (iii) implies that dH ≥ dC + dC ⊥ − 2 = n. Hence by the Singleton bound the dimension of H is 1 and therefore H = Rep(n)q . To see that property (v) holds, consider some arbitrary codewords (vi f (αi )) ∈ GRSk (α, v) and (wi g(αi )) ∈ GRS` (α, w). We clearly have (10)
(vi f (αi )) ? (wi g(αi )) = (vi wi (f g)(αi )) ∈ GRSmin{k+`−1,n} (α, v ? w)
hence the containment GRSk (α, v) ? GRS` (α, w) ⊆ GRSmin{k+`−1,n} (α, v ? w) holds. To see the reverse containment, note that GRSmin{k+`−1,n} (α, v ? w) is generated as an Fq -vector space by codewords of the form (vi wi fm (αi )) where fm (x) = xm is a monomial of degree m < k + ` − 1. We can clearly decompose such a codeword as (11)
(vi wi fm (αi )) = (vi fa (αi )) ? (wi fb (αi ))
where fa (x) = xa and fb (x) = xb for any a, b such that a < k, b < `, and a + b = m. This shows the reverse inclusion and completes the proof. 3. Coded Storage and Private Information Retrieval. Let us describe the distributed storage systems we consider; this setup follows that of [16, 1]. To provide clear and concise notation, we have consistently used superscripts to refer to files, subscripts to refer to servers, and parenthetical indices for entries of a vector. So, for example, the query qji is sent to the j th server when downloading the ith file, and yj (a) is the ath entry of the vector stored on server n. Suppose we have files x1 , . . . , xm ∈ Fkq . Data storage proceeds by arranging the files into an m × k matrix 1 1 x x (1) · · · x1 (k) .. . .. (12) X = ... = ... . . xm
xm (1) 4
···
xm (k)
Each file xi is encoded using a linear [n, k, d]-code C with generator matrix GC into an encoded file y i = xi GC . In matrix form, we encode the matrix X into a matrix Y by right-multiplying by GC : 1 y .. (13) Y = XGC = . = y1 · · · yn ym
th The j th column yj ∈ Fm server. q of the matrix Y is stored by the j Such a storage system allows dC − 1 many servers to fail while still allowing users to successfully access any of the files xi . In particular, if C is an MDS code, the resulting distributed storage system is maximally robust against server failures. Private Information Retrieval (PIR) is the process of downloading a file from a distributed storage system without revealing which file is being downloaded [5]. Definition 4. Suppose we have a distributed storage system as described above. A linear PIR scheme over Fq for such a storage system consists of: 1. For each index i ∈ [m], a probability space (Qi , µi ) of queries. When the user wishes to download xi , a query q i ∈ Qi is selected randomly according to the probability measure µi . Each q i is itself a tuple q i = (q1i , . . . , qni ), where th server. qji ∈ Fm q is sent to the j i i 2. Responses rj = hyj , qj i which the servers compute and transmit to the user. 3. A reconstruction function h : Fnq → Fcq which takes as inputs the rji and returns some c coordinates of the file xi . Definition 5. A PIR scheme protects against t colluding servers if for every set J = {j1 , . . . , jt } ⊆ [n] of size t, we have
(14)
I(qji1 , . . . , qjit ; i) = 0
where I(· ; ·) denotes the mutual information of two random variables. In other words, for every set J ⊆ [n] of size |J| ≤ t, there exists a probability distribution (QJ , µJ ) such that, for all i ∈ [m], the projection of (Qi , µi ) to the coordinates in J is (QJ , µJ ). Hence, no subset of servers of size ≤ t will learn anything about the index i of the file that is being requested. Definition 6. Let ρ be the average number of symbols downloaded by the PIR scheme P in order to retrieve a file of size k. Then we define the rate of P to be kρ . Proposition 7. The rate of a linear PIR scheme as defined in Definition 4 is c/n. Proof. After subdividing the file if necessary, we may assume that c divides k = αc, and that we can retrieve the desired file by applying the PIR scheme α times, downloading a new set of c desired symbols from n response symbols each time. Then αc the rate can be written as kρ = αn = nc . With a small modification of an arbitrary linear PIR scheme, we would have a rate c equal to E[w(q i )] . Indeed, we could replacing the number of servers with the expected weight of the overall query vector, since subqueries qji that are the zero vector can be ignored. For comparison with earlier work on PIR, where every server is assumed to respond in each round, we use Proposition 7 for the remainder of this paper. Note that Definition 6 ignores the cost to the user of uploading the queries to the servers. This can be justified by considering xi ∈ V k where V is some finitedimensional vector space over Fq with Fq -linear encoding. In this setting the size of the queries is easily seen to be minimal in comparison to download costs when dim V 1. 5
4. A General PIR Scheme for Coded Storage with Colluding Servers. Our goal is to find high-rate PIR schemes which protect against many colluding servers. To that end, the following theorem provides a general PIR scheme for coded databases which protects against a flexible number of colluding servers. Theorem 8. Let C be a linear [n, k, d]q code with generator matrix G, and consider the distributed storage system Y = XGC as in Section 3. For any linear codes d −1 D and H such that (C ? D) ⊥ H, there exists a linear PIR scheme with rate H ⊥n which protects against dD⊥ − 1 colluding servers. Proof. We will describe an explicit data retrieval strategy. Select m codewords d` = (d` (1), . . . , d` (n)) uniformly at random from D for ` = 1, · · · , m. For j ∈ [n], let th dj = (d1 (j), d2 (j), . . . , dm (j)) ∈ Fm coordinates q be the vector consisting of all the j ` of the codewords d . Suppose we query server j with the vector dj , who then responds with rj = hyj , dj i. The vector consisting of the responses from all servers can be written as Pm r1 hy1 , d1 i m `=1 y1 (`)d1 (`) X ` .. .. (15) r0 = ... = = = y ? d` ∈ C ? D . . Pm `=1 rn hyn , dn i `=1 yn (`)dn (`) and hence r0 is in H ⊥ . The idea now is to “perturb” such a request, to a point where the code C ? D still can correct the “errors” that have been introduced. Suppose we want to retrieve the ith file. Let S be a generator matrix for H, let J ⊆ [n] be an index set of servers of size |J| = dH ⊥ − 1, and let SJ be the restriction of S to the columns indexed by J. Then SJ has full rank by construction. Let the dj be as above and create the queries qji by (16)
qji = dj + ei
for all j ∈ J
where ei is the ith standard basis vector. The same calculation as in the previous paragraph shows that the new total response vector is i y˜ (1) hy1 , q1i i i m X y (j) j ∈ J .. .. ` ` 0 i y ? d + . = r + y˜, y˜ (j) = (17) r = . = 0 j 6∈ J `=1 y˜i (n) hyn , qni i Since r0 is in H ⊥ we see that Sr = S y˜, and since SJ is full rank we can decode all of the |J| symbols in y˜ from the vector S y˜. Hence we can decode |J| of the symbols y i (j), or equivalently, dH ⊥ − 1 coordinates of xi . Thus provided we have privacy, the d −1 rate of this PIR scheme is clearly seen to be H ⊥n . It remains to see that the scheme protects against dD⊥ −1 colluding servers. Since t ≤ dD⊥ − 1, every t columns of the generator matrix of D are linearly independent, t so the distribution of the vectors {dj | j ∈ T } is uniform over Fm for all subsets q i i T ⊆ [n] of size t. Hence also {qj1 , . . . , qjt } = {dj + ei | j ∈ T } is uniformly distributed for all i. Since the distribution is independent of the index i of the desired file, the mutual information condition (14) is satisfied and thus our scheme protects against dD⊥ − 1 colluding servers. This completes the proof. Example 1. Let C be any [n, k, d]q MDS code and set D = C ⊥ . Let H = Rep(n)q . Clearly we have (C ? D) ⊥ H. Since dH ⊥ = 2 and dD⊥ = n − k + 1, the above-outlined scheme has rate n1 and provides privacy against any n − k colluding servers. This is the scheme of [16, Theorem 2]. 6
Example 2. Let C be as in Example 1, but now set H = C ⊥ and let D = Rep(n)q . We again clearly have (C ?D) ⊥ H. Since dH ⊥ = n−k+1 and dD⊥ = 2, the above-outlined scheme has rate 1 − nk , but only provides privacy against non-colluding servers (t = 1). This is the scheme of [16, Theorem 1]. Example 3. Let C = Rep(n)q , let D be any [n, k, d]q MDS code, and let H = D⊥ . It is again clear that (C ? D) ⊥ H, and that dH ⊥ = n − k + 1 and dD⊥ = k + 1. The above-outlined scheme thus has rate 1− nk and provides privacy against any k colluding servers. The technique in Theorem 8 can be adapted to the setting where a bounded number of servers may fail to respond, or even may respond untruthfully, in a socalled Byzantine error model. The details of this extension is work in progress. 5. Private Information Retrieval from GRS Codes. In [16], two PIR schemes were presented for arbitrary [n, k, d]-MDS codes, one of which had rate n1 and protected against t = n − k colluding servers, and the other of which had rate n−k n and t = 1. These schemes are essentially variations of Example 1 and Example 2 in this paper. The authors asked if one can adapt their schemes in the “intermediate regime” where 1 < t < n − k for arbitrary n and k. In this section, we show how to do this via Theorem 8 for some suitably chosen [n, k, n − k + 1]-MDS storage codes, namely for GRS codes. By Proposition 3 we know that the class of all GRS codes associated to a fixed n-tuple α ∈ Fnq is closed under taking star products and duals. Moreover, while the dimension of a star product C ? D of two generic codes can be as high as dim(C) · dim(D), in the case of GRS codes it is only dim(C) + dim(D) − 1, which is useful when we want to maximize minimum distances. The following theorem instantiates Theorem 8 in the case where the storage code C is GRS. Theorem 9. Let C = GRSk (α, v) and consider the distributed storage system Y = XGC as in Section 3. Then for all t such that 1 ≤ t ≤ n − k there exists a linear PIR scheme for Y with rate 1 − k+t−1 which protects against any t colluding servers. n Proof. We will give linear codes D and H satisfying (C ? D) ⊥ H such that dD⊥ − 1 = t and dH ⊥ − 1 = n − (k + t) + 1; the theorem then follows immediately n from Theorem 8. Let D = GRSt (α, u) for an arbitrary vector u ∈ F× q , since its dual is also MDS we see that dD⊥ = t + 1. Define H = (C ? D)⊥ = GRSn−k−t+1 (α, w) with w as given by (9). It follows that dH ⊥ = dC?D = n − k − t + 1 as desired. When k = 1, that is, the data is stored via a replication system, our scheme provides a rate of 1 − t/n. It is known by [15] that the capacity for PIR in this 1−t/n case is 1−(t/n) m . Thus our schemes are asymptotically capacity-achieving, in that the resulting rates approach capacity as the number of files m → ∞. Similarly, when t = 1, that is, without server collusion, our scheme provides a 1−k/n rate of 1 − k/n. By [1] the capacity for PIR in this case is 1−(k/n) m , thus our schemes are again asymptotically capacity-achieving. There are no known capacity results for colluding coded PIR, i.e. when k > 1 and t > 1. With intuition from the results in [15, 1], we conjecture the following. Conjecture 1. Let C be an [n, k, d] code that stores m files via the distributed storage system Y = XGC , and fix 1 ≤ t ≤ n − k. Any PIR scheme for Y that protects against any t colluding servers has rate at most
1− k+t−1 n m. 1−( k+t−1 ) n
The cases k = 1 and t = 1 are the main theorems from [15] and [1] respectively. If Conjecture 1 is true, then our scheme in Theorem 9 is asymptotically capacityachieving as m → ∞. Moreover, our rate converges to the conjectured capacity at a 7
0.9
k=1 k=3 k=6 k=8 k=9 Capacity for k = 1 Capacity for t = 1
0.8
PIR rate
0.7 0.6 0.5 0.4 0.3 0.2 0.1 2
4
6
8
10
t = number of colluding servers Fig. 1. Achievable PIR rates for n = 12 servers and m = 8 files, as a function of t = the number of colluding servers, for various storage code rates. The black curve represents the PIR capacity for the case of k = 1 (data stored via a replication system) as computed from [15]. The asterisks show the capacity for the non-colluding case t = 1 as given in [1]. The PIR capacity is unknown when t ≥ 2 and k ≥ 2.
rate that is exponential in the number of files. Observe that the rate of our schemes does not depend on the number of files stored. In Fig. 1 we plot for n = 12 servers the achievable PIR rates as a function of the number of colluding servers t, for various code rates k/n. The black curve represents the capacity for the case of k = 1, that is, when the data is stored using a replication system [15]. The asterisks represent the capacity bound obtained in [1] for the noncolluding case t = 1 at different code rates. One can see that even for a relatively small number of files, our scheme is quite close to capacity. Lastly, note that one can always achieve PIR with rate 1/m by downloading all of the files from the database. Thus for the scheme of Theorem 9 to outperform 1 the trivial scheme, one must have 1 − k+t−1 ≥ m . But this is easily seen to hold n whenever m ≥ n and is thus a rather mild condition, as in practice m dwarfs the other parameters. 6. An example in the intermediate regime. We will illustrate our scheme with an explicit example in the case when t = 2, c = 2 and k > 1. This is the first case not covered by our Examples 1–3 and the storage code has parameters [5, 2, 4]. Example 4. Let α = (0, 1, 2, 3, 4) ∈ F55 and let 1 ∈ F55 be the all-ones vector. Consider the storage code C = GRS2 (α, 1) over F5 , encoded by its systematic generator matrix
(18)
GC :=
1 0
0 1
4 2
3 3
2 4
as in (7). So each file xi is divided into two blocks xi (1) and xi (2), and distributed 8
onto the five servers as follows: i x (1) xi (2) 4xi (1) + 2xi (2) (19) 3xi (1) + 3xi (2) 2xi (1) + 4xi (2)
on on on on on
server server server server server
1 2 3 4 5.
The random codewords d1 , . . . , dm used to query the servers will be drawn from D = GRS2 (α, 1), for which we choose the canonical generator matrix 1 1 1 1 1 (20) GD := . 0 1 2 3 4 Note that D⊥ is an [5, 3, 3]-MDS code, so our scheme protects against t = dD⊥ −1 = 2 colluding servers. The reason we choose different generator matrices for C and D is practical, as the systematic generator matrix is better for decoding, while the canonical generator matrix is preferable for computations. ⊥ Observe that C ? D = Q GRS3 (α, 1). −1We compute its dual H = (C ? D) = GRS2 (α, u), where ui = ( j6=i (αi − αj )) for i = 1, . . . , 5. Since αi runs over the entire field F5 , these products are unusually easy to evaluate, indeed we have ui = −1 for all i = 1, . . . , 5. Thus H = GRS2 (α, −1) = GRS2 (α, 1) so H is identical to C and we use the same generator matrix GH = GC . For each file index ` ∈ [m], we sample uniformly at random from D by multiplying GD on the left by a uniform random vector z ` = (z ` (1), z ` (2)) ∈ F25 , so that d` = z ` GD and dj = (d1 (j), . . . , dm (j)) for j ∈ [m]. We let z1 = (z ` (1), . . . , z m (1)) and z2 = (z 1 (2), . . . , z m (2)) which are independent and uniformly distributed over Fm 5 . Suppose we want to retrieve the file xi for some i ∈ [m]. We select dH ⊥ − 1 = d(C?D) − 1 = 2 servers from which to download blocks from xi , and for simplicity we here choose the systematic nodes. The queries q i sent to the servers will now be the following vectors in Fm 5 :
(21)
q1i = d1 + ei
= z1 + ei
q2i q3i q4i q5i
= d2 + ei
= z1 + z2 + ei
= d3
= z1 + 2z2
= d4
= z1 + 3z2
= d5
= z1 + 4z2
where ei is the ith standard basis vector. Observe that for each pair of servers, the 2 corresponding joint distribution of queries is the uniform distribution over (Fm 5 ) . The servers now respond by projecting their stored data onto the query vector, whence we obtain a response vector i Pm ` d (1)x` (1) + xi (1) x (1) Pm `=1 ` ` ` i xi (2) (d (1) + d (2))x (2) + x (2) `=1 Pm ∈ C ? D + 0 . (d` (1) + 2d` (2))(4x` (1) + 2x` (2)) (22) ri = P`=1 m (d` (1) + 3d` (2))(3x` (1) + 2x` (2)) 0 P`=1 m ` ` ` ` 0 `=1 (d (1) + 4d (2))(2x (1) + 4x (2)) To finally decode the desired symbols, we now compute the matrix product GH · ri . 9
One calculates that indeed
(23)
i x (1) xi (2) i x (1) GH · ri = GH · 0 = xi (2) . 0 0
We have therefore extracted the c = 2 desired data blocks using 5 queries, while maintaining privacy against t = 2 colluding servers. Acknowledgment. The authors thank Ferdinand Blomqvist for assistance with the introduction, and Salim El Rouayheb, Razan Tajeddine, Rawad Bitar, Serge Kas ´ C´ Hanna and Padraig O athain for fruitful discussions. REFERENCES [1] K. Banawan and S. Ulukus, The capacity of private information retrieval from coded databases, (2016), https://arxiv.org/abs/1609.08138. [2] S. Blackburn and T. Etzion, PIR array codes with optimal PIR rate, (2016), https://arxiv. org/abs/1607.00235. [3] S. Blackburn, T. Etzion, and M. Paterson, PIR schemes with small download complexity and low storage requirements, (2016), https://arxiv.org/abs/1609.07027. [4] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, Private information retrieval, in IEEE Annual Symposium on Foundations of Computer Science, 1995, pp. 41–50. [5] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, Private information retrieval, Journal of the ACM (JACM), 45 (1998), pp. 965–981. [6] C. Devet, I. Goldberg, and N. Heninger, Optimally robust private information retrieval, in USENIX Security Symposium, 2012, pp. 269–283. [7] Z. Dvir and S. Gopi, 2-server PIR with sub-polynomial communication, (2014), https://arxiv. org/abs/1407.6692. [8] K. Efremenko, 3-query locally repairable codes of subexponential length, in ACM Symposium on the Theory of Computing, 2009, pp. 39–44. [9] A. Fazeli, A. Vardy, and E. Jaakobi, PIR with low storage overhead: Coding instead of replication, (2015), https://arxiv.org/abs/1505.06241. [10] I. Goldberg, Improving the robustness of private information retrieval, in IEEE Symposium on Security and Privacy, 2007, pp. 131–148. [11] N. B. Shah, K. V. Rashmi, and K. Ramchandran, One extra bit of download ensures perfectly private information retrieval, in 2014 IEEE International Symposium on Information Theory, 2014, pp. 856–890. [12] N. B. Shah, K. V. Rashmi, K. Ramchandran, and P. V. Kumar, Privacy-preserving and secure distributed storage privacy-preserving and secure distributed storage codes, (2012), http://people.eecs.berkeley.edu/∼nihar/publications/privacy security.pdf. [13] R. C. Singleton, Maximum distance q-nary codes, IEEE Transactions on Information Theory, 10 (1964), pp. 116–118. [14] H. Sun and S. Jafar, The capacity of private information retrieval, (2016), https://arxiv.org/ abs/1602.09134. [15] H. Sun and S. Jafar, The capacity of robust private information retrieval with colluding databases, (2016), https://arxiv.org/abs/1605.00635. [16] R. Tajeddine and S. E. Rouayheb, Private information retrieval from MDS coded data in distributed storage systems, in 2016 IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 1411–1415. See http://www.ece.iit.edu/∼salim/ for an extended version. [17] J. H. van Lint and R. M. Wilson, On the minimum distance of cyclic codes, IEEE Transactions on Information Theory, 32 (1986), pp. 23–40. [18] Y. Zhang, X. Wang, H. Wei, and G. Ge, On private information retrieval array codes, (2016), https://arxiv.org/abs/1609.09167.
10