Private Query Processing on Inverted Index

Wee Keong Ng
School of Computer Engineering, NTU, Singapore

Yonggang Wen
School of Computer Engineering, NTU, Singapore

Huafei Zhu
CAS, I2R, A*STAR, Singapore

978-1-4673-0279-1/11/$26.00 ©2011 IEEE

Abstract: A private query criterion $Q_K$ is a Boolean logic expression in ∧, ∨ and ¬ over an input keyword set $K$. A private query processing protocol takes as input a private query criterion $Q_K$ and a public data set $d_C$ and outputs a document $d \in d_C$ such that $Q_K(d) = 1$. This paper studies private query processing protocols in the context of inverted index programs and makes three contributions. 1) A new notion of private query processing protocols defined over inverted index programs within the Mapping-Reducing-Filtering framework is introduced and formalized. Our formalization is general and can also be applied to other scenarios such as private searching on streams, data processing on large clusters and compressing term positions in web indexes. 2) A new implementation of private query processing protocols based on $(m, n)$-Bloom filters with storage and additively homomorphic public-key encryption is proposed. The idea behind our implementation is that a map function Map is invoked to generate a matrix $M$ of the form (document$_j$ : word$_{j,1}$, ..., word$_{j,n}$), $j = 1, \ldots, m$. The reduce function Reduce is then invoked to generate an inverted index $\hat{M}$ of the form (keyword$_i$, document$_{i,1}$, ..., document$_{i,\alpha_i}$). Finally, an $(m, n)$-Bloom filter with storage is used to generate the matched documents according to the specified query criterion. 3) We show that the proposed query processing protocol on the inverted index is semantically secure, assuming that the underlying additively homomorphic public-key encryption is semantically secure. To the best of our knowledge, this is the first semantically secure query processing protocol defined over inverted index programs, and we expect more applications to be deployed within this framework.

I. INTRODUCTION

The task of retrieving commercial data in the presence of malicious adversaries falls into the general field of private information retrieval (PIR), which has been well studied [12], [13], [3], [4]. For example, Ostrovsky and Skeith [12], [13] proposed solutions for private searching on streaming data, where a client $P$ queries whether a server stores data containing a keyword key and, if some data contains key, $P$ obliviously retrieves this data such that the corrupted server learns nothing about which keyword was specified or which data was retrieved. We argue, however, that the Ostrovsky-Skeith (OS) protocols may not work efficiently in certain applications. For mining data sets in a pet identification scenario, a retriever may be interested in the owner information of a pet rather than in other information. As a result, many words in a stored document $d$ can be ignored during the course of PIR. (We do not claim that general PIR does not work for mining data sets; rather, we emphasize that the general PIR technique may not work efficiently. This argument applies to the results presented in [3], [4] as well.) One can therefore expect more efficient solutions to these problems than the general methods presented in [12], [13], [3], [4].

Since no known result addresses the motivating problem above, we formalize the following research problem: on input a private query criterion $Q_K$ and a public set $d_C$ stored in a possibly corrupted server $C$, how can one implement an efficient query processing protocol that outputs a subset of documents $d \subseteq d_C$ satisfying $Q_K(d) = 1$, while $C$ learns nothing about the specified criterion $Q_K$ or about which $d$ is retrieved from $d_C$?

A. This work

This paper provides an efficient implementation of private query processing protocols in the context of inverted index programs. To help the reader understand the idea of our implementation, we first sketch the basic notions of MapReduce and inverted index and then provide a high-level description of our implementation.

MapReduce: MapReduce, introduced by Dean and Ghemawat [6], [7], [8], is a programming model whose programs are automatically parallelized and executed on a large cluster of commodity machines. A MapReduce program, which supports distributed computing on large data sets, consists of two functions: a map function and a reduce function. A map function Map transforms a piece of data into (key, value) pairs, whereas a reduce function Reduce merges the emitted values of the same key into a single result (key: value$_1$, ..., value$_n$). The MapReduce model is general and many interesting programs can easily be expressed as MapReduce computations: distributed grep, count of URL access frequency, reverse web-link graph, term-vector per host, distributed sort and inverted index [6], [7], [8]. Applications of MapReduce in cloud computing scenarios are discussed in [1], [10], [11].
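As a minimal, single-machine sketch of the map/reduce model described above, the following Python fragment groups the pairs emitted by a map function by key and hands each group to a reduce function. It uses the inverted index program from the list of examples as the workload. The helper names (`map_fn`, `reduce_fn`, `map_reduce`) and the toy documents are assumptions made for illustration; nothing here is distributed, and it is not the MapReduce implementation of [6].

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit one (word, doc_id) pair per distinct word of the document."""
    for word in sorted(set(text.lower().split())):
        yield word, doc_id


def reduce_fn(word, doc_ids):
    """Reduce: merge all values emitted for one key into a sorted list."""
    return word, sorted(set(doc_ids))


def map_reduce(documents, map_fn, reduce_fn):
    """Run map over every input, group the pairs by key, reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Hypothetical toy collection (pet identification flavour).
documents = {
    "d1": "rex owner alice",
    "d2": "tom owner bob",
    "d3": "rex vaccine record",
}
inverted_index = map_reduce(documents, map_fn, reduce_fn)
```

Here `inverted_index` maps each word to the sorted list of document IDs that contain it, which is the (key: value$_1$, ..., value$_n$) shape produced by Reduce.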
Inverted index: An inverted index program is an instance of MapReduce that stores, for each keyword occurring somewhere in the collection, information about the locations where it occurs. The map function in the inverted index problem parses each document and emits a sequence of (word, document ID) pairs. The reduce function accepts all pairs of a given word, sorts the corresponding document IDs and emits a (word, list of document IDs) pair. The set of all output pairs forms an inverted index. Numerous applications of inverted index programs have been introduced so far. We refer the reader to [5], [17], [18], [15], [16] for further reference.

A high-level description of our implementation: Let $W = \{0, 1\}^*$ be a universe of words and $D \subseteq W$ be a dictionary such that $|D| = \alpha < \infty$. Let $K = (k_1, \ldots, k_\gamma)$ be a set of keywords selected by a query processor, and let $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ be a set of keywords selected by the inverted index program. We assume that $K \subseteq \hat{K} \subseteq D$. Let $d_C = (d_1, \ldots, d_l)$ be a document set stored in a server $C$. Our implementation is sketched below.

1) Inverted index generation procedure: on input $d_C$, the query processor invokes the MapReduce function to perform the following computations:
∙ invoking a map function Map with input $(d_1, \ldots, d_l)$ and $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ to generate a matrix $M$ of the form (document, keyword) below:

$$M = \begin{pmatrix} d_1 : & \hat{k}_{1,1}, \ldots, \hat{k}_{1,\alpha_1} \\ d_2 : & \hat{k}_{2,1}, \ldots, \hat{k}_{2,\alpha_2} \\ \ldots & \ldots \\ d_l : & \hat{k}_{l,1}, \ldots, \hat{k}_{l,\alpha_l} \end{pmatrix}$$

∙ invoking a reduce function Reduce with input matrix $M$ to generate the inverted index $\hat{M}$:

$$\hat{M} = \begin{pmatrix} \hat{k}_1 : & d_{1,1}, \ldots, d_{1,\beta_1} \\ \hat{k}_2 : & d_{2,1}, \ldots, d_{2,\beta_2} \\ \ldots & \ldots \\ \hat{k}_\lambda : & d_{\lambda,1}, \ldots, d_{\lambda,\beta_\lambda} \end{pmatrix}$$

2) Document filtering procedure: on input $\hat{M}$, the query processor invokes an additively homomorphic encryption scheme (say, Paillier's encryption scheme) to filter all documents containing keywords in $K$ as follows:
∙ the query processor encodes $\hat{K}$ by invoking Paillier's homomorphic encryption scheme $E_{pk}()$ to generate ciphertexts $\{\hat{w}_i\}_{i=1}^{\lambda}$ of the set $\hat{K}$. For a keyword $k$, let $\hat{w} = E_{pk}(1, r_k)$ if $k \in K$ and $\hat{w} = E_{pk}(0, r_k)$ if $k \in \hat{K} \setminus K$.
∙ Let $\hat{W} = \{\hat{w}_i\}_{i=1}^{\lambda}$. Let $\hat{d}_i = (d_{i,1}, \ldots, d_{i,\beta_i})$ for $i = 1, \ldots, \lambda$. The query processor generates $c_i \leftarrow \hat{w}_i^{\hat{d}_i}$ and then randomly distributes $m$ copies of $c_i$ into the $n$ bins of the $(m, n)$-Bloom Filter. Note that we can adjust the system parameter $s \geq 1$ in Paillier's encryption to guarantee that the message space is sufficiently large for encrypting each plaintext $\hat{d}_i$. See Section III for more details.

3) Basis retrieving procedure: the query processor retrieves the matched documents $\hat{d}_K$ from the $n$ bins of the $(m, n)$-Bloom Filter, where $\hat{d}_K = \{\hat{d}_i \mid (\hat{k}_i, \hat{d}_i) \in \hat{M}, \hat{k}_i \in K\}$. By the correctness of the protocol in Section IV, the basis is retrievable from the $(m, n)$-Bloom Filter with storage with overwhelming probability.

4) Off-line document processing procedure: given $\hat{d}_K$, the query processor computes $d$ from the specified Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$ by substituting each $k_i$ in $Q_K$ with the corresponding document set $\hat{d}_i$.

This ends a brief description of our protocol. Clearly, the query processor in our model only needs to generate a set of ciphertexts $\{\hat{w}_i\}_{i=1}^{|\hat{K}|}$ of the common reference keyword set $\hat{K}$ projected on the specified private keyword set $K$. The retrieved basis $\hat{d}_K$ for the Boolean criterion $Q_K$ is sufficient for the query processor to compute $d = Q_K(\hat{d}_K)$. Consequently, the computational complexity of the query protocol is significantly reduced (i.e., the size of the ciphertexts generated during private query processing is reduced to $|\hat{K}|$ rather than $|d|$ as in the OS protocols [12], [13]) in case that $|\hat{K}| \ll |d|$.

B. The novelty of our query protocol

The mapping-reducing-filtering model is general in the following sense. If the MapReduce keyword set $\hat{K}$ is the whole dictionary $D$ and the map function Map and the reduce function Reduce are both dummy, then the mapping-reducing-filtering model reduces to the MapReduce-free, filtering-only model [12], [13]. If only the reduce function is dummy, then the reduced model is equivalent to the mapping-filtering model. As a result, our framework can also be applied to other scenarios such as private searching on streams, data processing on large clusters and compressing term positions in web indexes (see [5], [17], [18], [15], [16] for details). The mapping-reducing-filtering model also allows an inverted index to select a common reference word set independently of the selection of a private keyword set and of the dictionary $D$.
As a result, such a model allows us to avoid encrypting every word in the dictionary, as the protocols presented in [12], [13] must, and thus a private query processing protocol defined in the mapping-reducing-filtering model is much more efficient and flexible than one defined in the MapReduce-free, filtering-only model or in the mapping-filtering model.

C. The result

We remark that the private query processing protocol described above guarantees that non-matching documents are not collected, while matching documents are collected with overwhelming probability. We show that the query processing protocol proposed in this paper is privacy-preserving, assuming that the underlying additively homomorphic public-key encryption is semantically secure. To the best of our knowledge, this is the first private query processing protocol defined over the inverted index program in the Mapping-Reducing-Filtering framework. We therefore expect more applications to be deployed within this framework.

Roadmap: The rest of this paper is organized as follows. The syntax and security definitions of query processing protocols are introduced and formalized in Section II. In Section III, the building blocks (an additively homomorphic public-key encryption scheme, say Paillier's encryption scheme, and $(m, n)$-Bloom Filters with Storage) are sketched. An implementation of the private query processing protocol is described and analyzed in Section IV. We conclude this work in Section V.

II. PRIVACY-PRESERVING QUERY PROCESSING ON INVERTED INDEX

A. Syntax

Let $W = \{0, 1\}^*$ be a universe of words and $D \subseteq W$ be a dictionary. Let $K = (k_1, \ldots, k_\gamma)$ be a set of keywords selected by a query processor $P$. $P$ keeps $K$ secret (hence $K$ is called a private keyword set). Let $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ be a set of keywords selected by the inverted index. The keyword set $\hat{K}$ is publicly known (hence $\hat{K}$ is called a common reference keyword set). We assume that $K \subseteq \hat{K} \subseteq D$.

Let $\mathcal{Q}$ be a class of query types. A query type $\mathcal{Q}$ could be a class of logical expressions in ∧, ∨ and ¬. Let $d_C = (d_1, \ldots, d_l)$ be $l$ documents stored in a server $C$. Given a set of keywords $K \subset D$ and a query $Q \in \mathcal{Q}$, we define $Q_K : d \to \{0, 1\}$, which takes a subset of documents $d$ as input and returns 1 if and only if $d$ matches the criterion.

Definition 1: (Syntax of query processing) For a query $Q_K$ over a set of keywords $K$, and for a subset $d \subseteq d_C$, we say that $d$ matches query $Q_K$ if and only if $Q_K(d) = 1$.

To compute $d$ from a given set $d_K = \{d_1, \ldots, d_\gamma\}$, the query processor first maps the private query criterion $Q_K$ to $d_K$ by the following procedure:
∙ $k_i \wedge k_j$ if and only if $d_i \wedge d_j$;
∙ $k_i \vee k_j$ if and only if $d_i \vee d_j$;
∙ $\neg k_i$ if and only if $\neg d_i$. Note that $\neg k_i = K \setminus \{k_i\}$; it follows that $\neg d_i = d_K \setminus \{d_i\}$.
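The mapping from keyword expressions to document sets can be sketched as a small recursive evaluator. This is a hedged illustration only: the tuple encoding of queries, the function name `evaluate`, and the toy basis are hypothetical. It assumes that ∧ and ∨ act as set intersection and union on the retrieved document sets. Negation is taken as the complement within a caller-supplied `universe` of documents, which is an assumption of this sketch and only one possible reading of the rule for ¬ given above.

```python
def evaluate(query, basis, universe):
    """Evaluate a Boolean keyword query over per-keyword document sets.

    query    -- a keyword string, or ("and", q1, q2), ("or", q1, q2), ("not", q)
    basis    -- dict mapping each keyword k_i to its document set d_i
    universe -- set used for negation (ASSUMPTION: complement is relative to it)
    """
    if isinstance(query, str):
        return set(basis.get(query, set()))
    op, *args = query
    if op == "and":
        return evaluate(args[0], basis, universe) & evaluate(args[1], basis, universe)
    if op == "or":
        return evaluate(args[0], basis, universe) | evaluate(args[1], basis, universe)
    if op == "not":
        return universe - evaluate(args[0], basis, universe)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical retrieved basis: keyword -> set of document IDs.
basis = {
    "rex": {"d1", "d3"},
    "owner": {"d1", "d2"},
    "vaccine": {"d3"},
}
universe = set().union(*basis.values())

# "rex AND NOT vaccine"
matched = evaluate(("and", "rex", ("not", "vaccine")), basis, universe)
```

Because this step runs entirely on the query processor's side after retrieval, it needs no cryptographic operations; only the choice of `universe` affects how ¬ behaves.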

The query processor then substitutes each $k_i$ in $Q_K$ with the corresponding document set $\hat{d}_i$. Finally, the query processor obtains $d$ from the Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$.

B. The correctness

The correctness of a query processing protocol means that matched documents must be saved with overwhelming probability and non-matched documents are saved only with negligible probability. That is, the buffer decryption algorithm can distinguish collisions in the buffer from valid documents.

Definition 2: (Correctness of query processing protocol) Let $\nu(k)$ be a negligible function and $k$ be a security parameter. Let $d_C$ be the available documents stored at a server $C$. Let $B^*$ be a subset of the matching documents. We say that a query processing protocol is correct if

$$\Pr[B^* = \{d \in d_C \mid Q_K(d) = 1\}] > 1 - \nu(k).$$

C. The security

Since the off-line document processing procedure is performed by the query processor itself, the definition of security of a private query processing protocol should be isolated from the off-line processing procedure. We therefore consider the following game between an adversary $\mathcal{A}$ and a challenger $\mathcal{S}$.
∙ $\mathcal{S}$ first invokes the key generation algorithm of an additively homomorphic encryption scheme to obtain $(pk, sk)$, and then sends $pk$ to $\mathcal{A}$;
∙ $\mathcal{A}$ chooses two queries for two sets of keywords $K_0, K_1 \subset \hat{K}$, and sends $(Q_{K_0}, Q_{K_1})$ to $\mathcal{S}$;
∙ $\mathcal{S}$ chooses a bit $b$ and invokes the encryption scheme to generate $\hat{W}_b$, where $\hat{W}_b = \{\hat{w}_{b,1}, \ldots, \hat{w}_{b,\lambda}\}$ and $\hat{w}_{b,j} = E_{pk}(1, r_{b,j})$ if $\hat{k}_j \in K_b$ and $\hat{w}_{b,j} = E_{pk}(0, r_{b,j})$ if $\hat{k}_j \in \hat{K} \setminus K_b$, $j = 1, \ldots, \lambda$;
∙ $\mathcal{S}$ creates an instance of the filtering algorithm with $\hat{W}_b$, and sends the state $B$ of the underlying Bloom filter to $\mathcal{A}$;
∙ $\mathcal{A}$ can experiment with the code of $B$ and finally outputs $b' \in \{0, 1\}$.

The adversary $\mathcal{A}$ wins the game if $b' = b$ and loses otherwise. We define the adversary's advantage in this game to be

$$\mathrm{Adv}_{\mathcal{A}}(k) = |\Pr(b' = b) - 1/2|.$$

Definition 3: (Semantic security of query processing protocols) A query processing protocol is semantically secure if for any adversary $\mathcal{A}$ described in the above game, $\mathrm{Adv}_{\mathcal{A}}(k)$ is a negligible function, where the probability is taken over the coin tosses of the challenger and the adversary.

III. BUILDING BLOCKS

This section sketches the building blocks that will be used to construct private query processing protocols on an inverted index program: $(m, n)$-Bloom Filters and additively homomorphic public-key schemes.

A. $(m, n)$-Bloom filter with Storage

An $(m, n)$-Bloom Filter consists of an array of $n$ bits $B[1], \ldots, B[n]$, initially set to 0, together with $m$ independent random hash functions $h_1, \ldots, h_m$ with range $[1, \ldots, n]$. This work uses a variation of a Bloom Filter, called an $(m, n)$-Bloom Filter with Storage, first introduced and formalized in [2].

Definition 4: An $(m, n)$-Bloom Filter with Storage is a collection $\{h_i\}_{i=1}^{m}$ of functions together with a collection of sets $\{B_j\}_{j=1}^{n}$, where $h_i : \{0, 1\}^* \to [1, \ldots, n]$. To insert a pair $(u, v)$ into this structure, $v$ is added to $B_{h_i(u)}$ for all $i \in [m]$, where $[m] = [1, \ldots, m]$. To determine whether or not a value $v$ is stored for an element $u$ of a set $U$, one checks that $v \in B_{h_i(u)}$ for every $i \in [m]$ and returns true if all checks are valid.

As usual, we model the $h_i$ as uniform, independent randomness. For each $u \in U$, we define $H_u = \{h_i(u) \mid i \in [m]\}$. The correctness of our construction relies on the following lemma due to Boneh et al. [2].

Lemma 1: Let $(\{h_i\}_{i=1}^{m}, \{B_j\}_{j=1}^{n})$ be an $(m, n)$-Bloom Filter with Storage. Suppose the filter has been initialized to store some set $U$ of size $|U|$ and associated values. Suppose also that $n = \lceil c\,m\,|U| \rceil$, where $c > 1$ is a constant. Denote the relationship of element-value associations by $R(\cdot, \cdot)$. Then for any $u \in U$, the following statements hold true with probability $1 - \mathrm{neg}(k)$, where the probability is over the uniform randomness used to model the $h_i$ and $\mathrm{neg}(k)$ is a negligible function:
1) $u \in U$ if and only if $B_{h_i(u)} \neq \emptyset$ for all $i \in [m]$;
2) $\bigcap_{i \in [m]} B_{h_i(u)} = \{v \mid R(u, v) = 1\}$.

B. Additively homomorphic encryption scheme

Paillier investigated a novel computational problem called the composite residuosity class problem (CRS), and its applications to public-key cryptography, in [14]. The decisional composite residuosity class problem is the following: given $z \in_r Z^*_{N^2}$, decide whether $z$ is an $N$-th residue or a non-$N$-th residue. The decisional composite residuosity class assumption states that there exists no polynomial-time distinguisher for $N$-th residues modulo $N^2$.

Paillier's encryption scheme: The public key is a $2k$-bit RSA modulus $N = pq$, where $p, q$ are two large safe primes of length $k$, and the secret key is $(p, q)$. The plaintext space is $Z_N$ and the ciphertext space is $Z^*_{N^2}$. To encrypt a message $m \in Z_N$, one chooses $r \in Z^*_N$ uniformly at random and computes the ciphertext as

$$E_{PK}(m, r) = g^m r^N \bmod N^2,$$

where $g = (1 + N)$ has order $N$ in $Z^*_{N^2}$. To decrypt a ciphertext $c = (1 + N)^m r^N \bmod N^2$ with the help of the trapdoor information $(p, q)$, one first computes $c_1 = c \bmod N$, and then computes $r$ from the equation $r = c_1^{N^{-1} \bmod \phi(N)} \bmod N$. Finally, one can compute $m$ from the equation $c\,r^{-N} \bmod N^2 = 1 + mN$. Paillier's public-key cryptosystem is homomorphic, i.e.,

$$E_{PK}(m_1, r_1) \times E_{PK}(m_2, r_2) \bmod N^2 = E_{PK}(m_1 + m_2 \bmod N,\; r_1 \times r_2 \bmod N),$$

and it is semantically secure if the decisional composite residuosity class problem is hard. We refer the reader to [14] for more details.
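To make the encryption, the $r$-recovering decryption and the homomorphic property above concrete, here is a toy Python sketch. It follows the equations of this subsection with $g = 1 + N$, but the function names are hypothetical and the primes 47 and 59 are illustrative assumptions that offer no security whatsoever. It needs Python 3.8 or later for modular inverses via `pow`. The last two lines show the consequence of the homomorphism that the filtering step relies on: raising an encryption of a bit $b$ to a power $d$ yields an encryption of $b \cdot d$.

```python
import math
import secrets


def keygen(p, q):
    """Return (N, phi(N)) for primes p, q; decryption needs gcd(N, phi(N)) = 1."""
    N, phi = p * q, (p - 1) * (q - 1)
    if math.gcd(N, phi) != 1:
        raise ValueError("need gcd(N, phi(N)) = 1")
    return N, phi


def random_unit(N):
    """Sample r uniformly from Z_N^*."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            return r


def encrypt(N, m, r=None):
    """E(m, r) = (1 + N)^m * r^N mod N^2, for m in Z_N."""
    N2 = N * N
    r = random_unit(N) if r is None else r
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2


def decrypt(N, phi, c):
    """Recover r from c mod N, strip r^N, then read m off 1 + m*N."""
    N2 = N * N
    c1 = c % N                          # c = r^N (mod N)
    r = pow(c1, pow(N, -1, phi), N)     # r = c1^(N^-1 mod phi(N)) mod N
    u = (c * pow(r, -N, N2)) % N2       # u = 1 + m*N
    return (u - 1) // N


N, phi = keygen(47, 59)                 # toy safe primes, N = 2773
N2 = N * N

# Additive homomorphism: multiplying ciphertexts adds plaintexts.
c_sum = (encrypt(N, 1000) * encrypt(N, 1500)) % N2

# Exponentiation by d: E(1)^d encrypts d, E(0)^d encrypts 0.
hit = pow(encrypt(N, 1), 1234, N2)
miss = pow(encrypt(N, 0), 1234, N2)
```

With these values, `decrypt(N, phi, c_sum)` is 2500, `decrypt(N, phi, hit)` is 1234 and `decrypt(N, phi, miss)` is 0, so a ciphertext of 0 erases the payload while a ciphertext of 1 preserves it.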
The Damgård and Jurik [9] public-key encryption scheme, a length-flexible variant of Paillier's encryption scheme, will be used when the size of $\hat{d}_j \in \hat{d}$ is large (e.g., $|\hat{d}_j| > N$):
∙ The public key is a $2k$-bit RSA modulus $N = PQ$, where $P$, $Q$ are two large safe primes. The plaintext space is $Z_{N^s}$ and the ciphertext space is $Z^*_{N^{s+1}}$. The private key is $(P, Q)$ and the public key is $(N, s)$, where $s \geq 1$.
∙ To encrypt $m \in Z_{N^s}$, one chooses $r \in Z^*_N$ uniformly at random and computes the ciphertext $c$ as $E_{PK}(m, r) = (1 + N)^m r^{N^s} \bmod N^{s+1}$.
∙ To decrypt a ciphertext $c = (1 + N)^m r^{N^s} \bmod N^{s+1}$, the decryption algorithm $\mathcal{D}_{sk}$ first computes $c' = c \bmod N$ and then uses the secret key $\phi(N)$ to calculate $r \in Z^*_N$. Once given $r$, $\mathcal{D}_{sk}$ outputs the message $m \in Z_{N^s}$ accordingly.

The plaintext length of Damgård and Jurik's public-key encryption scheme is flexible and thus enables us to encrypt documents of any length, say $N < |\hat{d}_j| < N^s$. We will not distinguish Paillier's encryption from the Damgård and Jurik length-flexible encryption scheme. Both schemes are denoted by $(E_{pk}(), D_{sk}())$ uniformly throughout the paper.
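The other building block of this section, the $(m, n)$-Bloom Filter with Storage of Definition 4, can be sketched as a list of $n$ sets indexed by $m$ hash functions. The class below is an illustrative assumption rather than the construction of [2]: the $h_i$ are simulated with salted SHA-256 instead of truly random functions, bins are numbered from 0 instead of 1, and the names `insert`, `contains` and `values` are hypothetical.

```python
import hashlib


class BloomFilterWithStorage:
    """n bins (sets) addressed by m hash functions h_1, ..., h_m."""

    def __init__(self, m, n):
        self.m, self.n = m, n
        self.bins = [set() for _ in range(n)]

    def locations(self, u):
        """H_u = {h_i(u) : i in [m]}, with h_i simulated by salted SHA-256."""
        return {
            int.from_bytes(hashlib.sha256(f"{i}:{u}".encode()).digest(), "big") % self.n
            for i in range(self.m)
        }

    def insert(self, u, v):
        """Add v to B_{h_i(u)} for every i in [m]."""
        for j in self.locations(u):
            self.bins[j].add(v)

    def contains(self, u):
        """Statement 1 of Lemma 1: every bin addressed by u is non-empty."""
        return all(self.bins[j] for j in self.locations(u))

    def values(self, u):
        """Statement 2 of Lemma 1: intersect the bins addressed by u."""
        return set.intersection(*(self.bins[j] for j in self.locations(u)))


# Hypothetical usage with |U| = 3, m = 4 and n = ceil(c * m * |U|) for c = 2.
bf = BloomFilterWithStorage(m=4, n=24)
for keyword, payload in [("rex", "c1"), ("owner", "c2"), ("vaccine", "c3")]:
    bf.insert(keyword, payload)
```

An inserted value is always found again by `values`. What Lemma 1 adds is that, for $n = \lceil c\,m\,|U| \rceil$, the intersection contains nothing else except with small probability, since a foreign value would have to collide with every bin addressed by $u$.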

IV. PRIVATE QUERY PROCESSING PROTOCOL

In this section, an implementation of the private query processing protocol in the mapping-reducing-filtering framework is introduced and formalized. We show that the proposed query protocol is semantically secure if the underlying homomorphic public-key encryption scheme is semantically secure.

A. The description

1) The input and output: The input of a query processor $P$ is a private keyword set $K = (k_1, \ldots, k_\gamma)$. The input of a possibly corrupted server $S$ is a document set $d_C = (d_1, \ldots, d_l)$. The output of $P$ is a document set $d_K \subset d_C$, where $d_K = \{\hat{d}_i \mid (\hat{k}_i, \hat{d}_i) \in \hat{M}, \hat{k}_i \in K\}$. The output of $S$ is $\perp$.

2) Common reference keyword set: The common reference string of the query processing protocol consists of a common reference keyword set $\hat{K} = (\hat{k}_1, \ldots, \hat{k}_\lambda)$ and a public description of an inverted index program. We assume that $K \subset \hat{K} \subset D$.

3) An initialization algorithm: The initialization algorithm $\mathcal{I}$ comprises two PPT algorithms: an additively homomorphic encryption key generation algorithm $\mathcal{I}_1$ (an instance of Paillier's encryption scheme throughout the paper) and a keyword hiding algorithm $\mathcal{I}_2$. The details of the algorithms are described below:
∙ on input a security parameter $1^k$, $P$ invokes $\mathcal{I}_1$ to generate two large safe prime numbers $p$ and $q$ such that $|p| = |q| = k$. Let $N = pq$, $pk = (N, s)$ ($s \geq 1$) and $sk = (p, q)$. Let $E_{pk}()$ be Paillier's encryption scheme defined over $pk$ and $D_{sk}()$ be the corresponding decryption algorithm.
∙ on input $\hat{K}$ and $K$, $P$ invokes $\mathcal{I}_2$ to generate a ciphertext set $\hat{W}$ of $\hat{K}$ projected on $K$, where $\hat{W} = \{\hat{w}_1, \ldots, \hat{w}_\lambda\}$ and $\hat{w}_j = E_{pk}(1, r_w)$ if $\hat{k}_j \in K$ and $\hat{w}_j = E_{pk}(0, r_w)$ if $\hat{k}_j \in \hat{K} \setminus K$, $j = 1, \ldots, \lambda$.

4) An inverted index program: On input $d_C$, $P$ invokes the inverted index program to generate an inverted index $\hat{M}$ below:

$$\hat{M} = \begin{pmatrix} \hat{k}_1 : & d_{1,1}, \ldots, d_{1,\beta_1} \\ \hat{k}_2 : & d_{2,1}, \ldots, d_{2,\beta_2} \\ \ldots & \ldots \\ \hat{k}_\lambda : & d_{\lambda,1}, \ldots, d_{\lambda,\beta_\lambda} \end{pmatrix}$$

5) A filtering algorithm: The filtering algorithm $\mathcal{F}$ comprises the following three algorithms: a collection algorithm $\mathcal{F}_0$, a buffer encoding algorithm $\mathcal{F}_1$ and a buffer decoding algorithm $\mathcal{F}_2$. On input a query $Q_K$, the filtering algorithm performs the following computations:
∙ For $j = 1, \ldots, \lambda$, $P$ invokes the collection algorithm $\mathcal{F}_0$ to construct a temporary collection $c(j) = \hat{w}_j^{\hat{d}_j}$, where $\hat{d}_j = (d_{j,1}, \ldots, d_{j,\beta_j})$.
∙ For $j = 1, \ldots, \lambda$, let $u(j) \leftarrow \hat{k}_j$ and $v(j) \leftarrow c(j)$. Given $(u(j), v(j))$, $P$ invokes the encoding algorithm $\mathcal{F}_1$ to throw $m$ copies of $v(j)$ into the $n$ bins of the $(m, n)$-Bloom Filter at locations $\{h_i(u(j))\}_{i=1}^{m}$. Let $B$ be the current state of the $(m, n)$-Bloom Filter.
∙ Given $B$, $P$ invokes $\mathcal{F}_2$ to compute the locations $h_i(u(j))$ ($1 \leq i \leq m$, $1 \leq j \leq \lambda$) and then checks that each specified location stores some data. If some specified location is empty, $\mathcal{F}_2$ outputs 0, indicating the failure of the buffer storage. If the output is 1, $\mathcal{F}_2$ checks that $u(j) = \hat{k}_j$. If the check is valid, $\mathcal{F}_2$ decrypts $c(j)$ to obtain $\hat{d}_j$.
∙ Given $\hat{d}_K$, the query processor computes $d$ from the Boolean logic expression $Q_K(\hat{d}_1, \ldots, \hat{d}_\gamma)$.

This ends the description of the query processing protocol.

B. The correctness

Before proving the security of the scheme, we show the correctness of the private query processing protocol. Let $\lambda$ be the number of documents $(\hat{d}_1, \ldots, \hat{d}_\lambda)$ generated by the inverted index program. Each document has $m$ copies that are thrown into the $m$-out-of-$n$ bins randomly specified by the hash functions $\{h_i\}_{i=1}^{m}$. Thus we have in total $m\lambda$ documents thrown into $n$ bins. Borrowing the notation from [12], [13], we call a document $\hat{d}_j$ a color $C_j$ ($j = 1, \ldots, \lambda$) and call a copy of the color $C_j$ a ball $B(j, k)$, where $k = 1, \ldots, m$. Thus, we have in total $m\lambda$ balls that are thrown into $n$ bins. We say that a color $C_j$ survives if at least one ball of color $C_j$ survives. We say that the color-survival game succeeds if all $\lambda$ colors survive; otherwise, we say that it fails.

Let $E$ be the event that a single specified ball survives this process. Then

$$\Pr[E] = \left(\frac{n-1}{n}\right)^{m\lambda - 1} > \frac{1}{\sqrt{e}},$$

assuming that $n \geq 2m\lambda$. Let $E_j$ be the event that the $j$-th ball of a certain color does not survive. Then the probability that none of the $m$ balls of this color survives is

$$\Pr\Big[\bigcap_{j=1}^{m} E_j\Big] \leq \Big(1 - \frac{1}{\sqrt{e}}\Big)^m < (1/2)^m.$$

Let $E^*$ be the event that at least one of the colors does not survive and $E_j^*$ be the event that the color $C_j$ does not survive. Then

$$\Pr[E^*] \leq \Pr\Big[\bigcup_{j=1}^{\lambda} E_j^*\Big] \leq \sum_{j=1}^{\lambda} \Pr[E_j^*] \leq \frac{\lambda}{2^m},$$

which is clearly negligible in $m$. This means that all colors survive with overwhelming probability, and hence all $\lambda$ documents $(\hat{d}_1, \ldots, \hat{d}_\lambda)$ are retrievable from our Bloom Filter with storage with overwhelming probability.

C. The proof of security

Theorem 1: The query processing protocol described above is semantically secure assuming that the underlying Paillier's encryption is semantically secure.

Proof: Suppose there exists an adversary $\mathcal{A}$ that can gain a non-negligible advantage $\epsilon$ in our semantic security game from Definition 3. We will show that $\mathcal{A}$ can be used to gain an advantage in breaking the semantic security of the underlying public-key encryption scheme.

A challenger $\mathcal{S}$ is first given an encryption $c$ of a message $m_b \in \{0, 1\}$ chosen uniformly at random, i.e., $c = E_{pk}(m_b)$ (note that the challenger $\mathcal{S}$ is also given the public key $pk$ but not the secret key $sk$ of the underlying Paillier's encryption scheme). The challenger $\mathcal{S}$ is also given two sets of keywords $K_0 \subset \hat{K}$ and $K_1 \subset \hat{K}$ such that $K_0 \neq K_1$. The challenger $\mathcal{S}$ now generates a ciphertext set $\hat{W}$ of the given reference keyword set $\hat{K}$ by the following procedure: a re-randomized encryption $E_{pk}(0)$ if $w \in \hat{K} \setminus K_b$ and $E_{pk}(0)\,c$ if $w \in K_b$.
∙ If $m_b = 1$, then the construction of the MapReduce Filter is exactly the same as in the real protocol described above; hence in this case the adversary returns $b'$ such that $b' = b$ with probability $1/2 + \epsilon$.
∙ If $m_b = 0$, then the simulated MapReduce Filter searches for nothing; hence in this case the adversary returns $b'$ such that $b' = b$ with probability $1/2$.

The challenger $\mathcal{S}$ now derives its guess from the adversary's output: it guesses $m_b = 1$ if $b' = b$ and $m_b = 0$ otherwise. As a result, $\mathcal{S}$ guesses correctly with probability $1/2 + \epsilon/2$, i.e., it obtains the non-negligible advantage $\epsilon/2$ in breaking the semantic security of Paillier's encryption.
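The success probability at the end of the proof follows by conditioning on the challenge bit. The short derivation below is a sketch under two stated assumptions: $m_b$ is uniform on $\{0, 1\}$, and $\mathcal{S}$ guesses $m_b = 1$ exactly when the adversary's output satisfies $b' = b$.

```latex
\begin{align*}
\Pr[\mathcal{S} \text{ guesses } m_b \text{ correctly}]
  &= \Pr[m_b = 1] \cdot \Pr[b' = b \mid m_b = 1]
   + \Pr[m_b = 0] \cdot \Pr[b' \neq b \mid m_b = 0] \\
  &= \tfrac{1}{2}\left(\tfrac{1}{2} + \epsilon\right) + \tfrac{1}{2} \cdot \tfrac{1}{2}
   = \tfrac{1}{2} + \tfrac{\epsilon}{2}.
\end{align*}
```

The advantage of $\mathcal{S}$ over random guessing is therefore $\epsilon/2$, which is non-negligible whenever $\epsilon$ is.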

V. CONCLUSION

We have implemented a private query processing protocol on an inverted index program and have shown that the proposed protocol is semantically secure if the underlying homomorphic public-key encryption scheme is semantically secure. The formalization of the private query processing protocol is general and we expect that more applications can be deployed within the proposed framework.

REFERENCES

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica and M. Zaharia: Above the Clouds: A Berkeley View of Cloud Computing. Technical Report No. UCB/EECS-2009-28
[2] D. Boneh, E. Kushilevitz, R. Ostrovsky and W. E. Skeith III: Public Key Encryption That Allows PIR Queries. CRYPTO 2007: 50-67
[3] J. Bethencourt, D. X. Song and B. Waters: New Constructions and Practical Applications for Private Stream Searching (Extended Abstract). IEEE Symposium on Security and Privacy 2006: 132-139
[4] J. Bethencourt, D. X. Song and B. Waters: New Techniques for Private Stream Searching. ACM Trans. Inf. Syst. Secur. 12(3) (2009)
[5] S. Ding, J. Attenberg and T. Suel: Scalable techniques for document identifier assignment in inverted indexes. WWW 2010: 311-320
[6] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150
[7] J. Dean and S. Ghemawat: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008)
[8] J. Dean and S. Ghemawat: MapReduce: a flexible data processing tool. Commun. ACM 53(1): 72-77 (2010)
[9] I. Damgård and M. Jurik: A Generalisation, a Simplification and Some Applications of Paillier's Probabilistic Public-Key System. Public Key Cryptography 2001: 119-136
[10] R. L. Grossman: The Case for Cloud Computing. IT Professional 11(2): 23-27 (2009)
[11] R. L. Grossman, Y. Gu, M. Sabala and W. Zhang: Compute and storage clouds using wide area high performance networks. Future Generation Comp. Syst. 25(2): 179-183 (2009)
[12] R. Ostrovsky and W. E. Skeith III: Private Searching on Streaming Data. CRYPTO 2005: 223-240
[13] R. Ostrovsky and W. E. Skeith III: Private Searching on Streaming Data. J. Cryptology 20(4): 397-430 (2007)
[14] P. Paillier: Public-Key Cryptosystems Based on Composite Degree Residuosity Classes. EUROCRYPT 1999: 223-238
[15] S. Petrovic and P. Brown: Large Scale Analysis of the eDonkey P2P File Sharing System. INFOCOM 2009: 2746-2750
[16] H. Wan, C. Tan and Q. Li: Snoogle: A Search Engine for the Physical World. INFOCOM 2008: 1382-1390
[17] H. Yan, S. Ding and T. Suel: Compressing term positions in web indexes. SIGIR 2009: 147-154
[18] H. Yan, S. Ding and T. Suel: Inverted index compression and query processing with optimized document ordering. WWW 2009: 401-410