Relationship Privacy: Output Perturbation for Queries with Joins∗


Vibhor Rastogi†
CSE, University of Washington
Seattle, WA, USA 98105-2350
vibhor@cs.washington.edu

Michael Hay
CS, UMass Amherst
Amherst, MA, USA 01003-9264
[email protected]

Gerome Miklau
CS, UMass Amherst
Amherst, MA, USA 01003-9264
[email protected]

Dan Suciu
CSE, University of Washington
Seattle, WA, USA 98105-2350
[email protected]

ABSTRACT

We study privacy-preserving query answering over data containing relationships. A social network is a prime example of such data, where the nodes represent individuals and edges represent relationships. Nearly all interesting queries over social networks involve joins, and for such queries, existing output perturbation algorithms severely distort query answers. We propose an algorithm that significantly improves utility over competing techniques, typically reducing the error bound from polynomial in the number of nodes to polylogarithmic. The algorithm is, to the best of our knowledge, the first to answer such queries with acceptable accuracy, even for worst-case inputs. The improved utility is achieved by relaxing the privacy condition. Instead of ensuring strict differential privacy, we guarantee a weaker (but still quite practical) condition based on adversarial privacy. To explain precisely the nature of our relaxation in privacy, we provide a new result that characterizes the relationship between ε-indistinguishability (a variant of the differential privacy definition) and adversarial privacy, which is of independent interest: an algorithm is ε-indistinguishable iff it is private for a particular class of adversaries (defined precisely herein). Our perturbation algorithm guarantees privacy against adversaries in this class whose prior distribution is numerically bounded.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Statistical databases; G.3 [Probability and Statistics]: Distribution functions

General Terms
Algorithms, Security, Theory

Keywords
private data analysis, privacy preserving data mining, output perturbation, sensitivity, social networks, join queries

∗ Supported in part by the National Science Foundation under the grant IIS-0627585.
† Supported in part by the National Science Foundation under the grant IIS-0415193.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PODS'09, June 29–July 2, 2009, Providence, Rhode Island, USA. Copyright 2009 ACM 978-1-60558-553-6/09/06 ...$5.00.

1. INTRODUCTION

Significant advances in privacy-preserving query answering have focused primarily on datasets describing properties of independent entities, but not relationships. Queries over data containing relationships typically involve joins. For such queries, many existing techniques cannot be applied, and others yield severely distorted answers. We are driven to consider data with relationships by the compelling application of social network analysis.

Application: Analysis of private social networks. A social network is a graph representing a set of entities and the relationships between them. We model a social network as a simple directed graph, described by a relationship table R(u, v). Figure 1 shows a simple social network, (a), and its relationship table, (b). Each edge of the graph G occurs as a tuple in R. Social network analysis is a rapidly growing discipline focused on the structure and function of networks. Many of the most interesting social network analyses involve topological properties of the graph, which require self-joins on the relationship table. We list a number of such queries in Fig. 1(c), each of which counts the occurrences in the graph of some pattern or structure. For example, the clustering coefficient [15] of a graph is the likelihood that a friend of one's friend is also a friend. The clustering coefficient can be computed from two of the counting queries shown in the figure: ∆ (the number of triangles) and ∨ (the number of two-length paths). Analysts also commonly count the occurrences of interesting patterns (sometimes called motif analysis) and the distribution of path lengths in the graph. Queries c_i, d_{i,v}, K_i, and h_{i,j} are examples of such queries. The collection and release of social network data is highly constrained by privacy concerns. Relationships in a social network are often sensitive, as they may represent personal contacts, flows of information, collaborations, employment relationships, etc.

Figure 1: An example social network represented as a directed graph and edge table, along with selected social network queries.

(a) Social network G, given as (b) the relationship table R:

    Node 1 | Node 2
    -------+-------
    Alice  | Bob
    Bob    | Carol
    Bob    | Dave
    Ed     | Bob
    Ed     | Dave
    Greg   | Dave
    Ed     | Greg

(c) Social network queries:

    ∆       : # of triangles in G
    ∨       : # of 2-length paths in G
    c_i     : # of cycles of length i in G
    d_{i,v} : # of nodes at distance i from v
    K_i     : # of i-node cliques
    h_{i,j} : # of matches for connected subgraph h with i nodes and j edges

Our goal is to permit accurate social network analysis while protecting against the threat of edge disclosure. Researchers have begun to investigate threats against social networks [1, 11], and a number of anonymization schemes have been proposed [11, 12, 20, 21, 22, 3]. These techniques do not, however, provide quantitative guarantees of privacy and utility. Existing query-answering techniques that do provide guarantees are difficult to adapt to data containing relationships. Rastogi et al. [17] publish a randomized version of a database that supports accurate estimates of counting queries, but does not support joins. Dwork et al. [6] propose an output perturbation mechanism that can be used to answer queries with joins; however, the noise added is unacceptably high for some join queries, such as those in Fig. 1(c). To address the shortcomings of existing techniques, we make the following three contributions:

• We propose an output perturbation algorithm, similar to the differentially private algorithm of Dwork et al. [6]. For a large subclass of conjunctive queries with disequality, we show that the algorithm meets a strong information-theoretic standard of adversarial privacy that bounds the worst-case change in the adversary's belief about tuples in the database.

• We show that the algorithm significantly improves utility over competing techniques. In particular, for social network queries, such as those in Fig. 1(c), the algorithm improves the error bound from polynomial in the number of nodes to polylogarithmic. This algorithm is (to the best of our knowledge) the first to answer such queries with acceptable accuracy, even for worst-case inputs.

• The improved utility is achieved by relaxing the privacy condition. To provide insight into the relaxation in privacy, we characterize the class of adversaries that mechanisms satisfying ε-indistinguishability (a variant of the differential privacy definition) protect against. This result is of independent interest, as it describes a fundamental connection between two widely studied notions of privacy: differential [6, 5] and adversarial [7, 14, 17, 8] privacy.

We note that all our results apply to arbitrary relational databases, and are not restricted to social networks. The next section provides an overview of all the main results in the paper. Section 3 provides technical details for the privacy and equivalence results. In Sections 4 and 5 we explain extensions and discuss related work.

2. MAIN RESULTS

In this section we describe our two main results and their significance. We postpone several technical details to Sec. 3. All proofs are deferred to the Appendix.

2.1 Background

Each of the social network queries in Fig. 1(c) can be represented as a counting conjunctive query with disequality, expressed in CQ(≠). In this paper the value of a conjunctive query is defined to be a number, namely the number of satisfying assignments for the variables in the query head. For example, the query ∆(x, y, z) = R(x, y)R(y, z)R(z, x), x ≠ y, y ≠ z, z ≠ x returns the number of triples x, y, z that form a triangle in the graph. In the general case, we let R_1, R_2, . . . , R_l be the relations in the database. We assume that each attribute of the relations takes values from the same domain D and denote by m = |D| the size of the domain. In practice one chooses D as the active domain of the database; for instance, in Fig. 1(b) D is the set of all nodes in the graph, and m the number of such nodes. We denote by Tup_i = D^{arity(R_i)} the set of all possible tuples in the relation R_i and by Tup = ∪_i Tup_i the set of all possible tuples in the database. A database instance I ⊆ Tup is an instantiation of the relations R_i, where |I| is the number of tuples in the database.
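To make the counting semantics concrete, here is a minimal Python sketch (ours, not part of the paper) that evaluates ∆ over the relationship table of Fig. 1(b) by enumerating satisfying assignments; the identifiers are our own.

```python
from itertools import permutations

# Relationship table R from Fig. 1(b), as a set of directed edges.
R = {("Alice", "Bob"), ("Bob", "Carol"), ("Bob", "Dave"),
     ("Ed", "Bob"), ("Ed", "Dave"), ("Greg", "Dave"), ("Ed", "Greg")}

def triangle_query(R):
    """Count satisfying assignments of
    ∆(x, y, z) = R(x, y) R(y, z) R(z, x), x ≠ y, y ≠ z, z ≠ x."""
    nodes = {n for edge in R for n in edge}
    # permutations already enforces the three disequalities.
    return sum((x, y) in R and (y, z) in R and (z, x) in R
               for x, y, z in permutations(nodes, 3))

print(triangle_query(R))  # 0: the graph of Fig. 1 has no directed 3-cycle
```

Note that each directed 3-cycle would contribute three satisfying assignments (its rotations); the Fig. 1 graph happens to contain none.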

2.2 Algorithm

Algorithm 2.1 takes as input the database instance I and a query q and returns an output A(I). Our goal is to provide both privacy (A(I) should not disclose specific tuples in I) and utility (A(I) should be a good estimate of the true query answer, q(I)). For simplicity, we assume a single query; the generalization to multiple queries is discussed in Sec. 4.3. The algorithm also takes as input two privacy parameters γ and ε. On a request for a query q, the algorithm does one of two things: with some probability, it ignores the database and outputs a random number chosen uniformly from the range of q (Step 3). This event occurs with probability p/γ, where p (defined in Step 2) is a function of the query q and m, and is typically small, while γ is the first privacy parameter. Otherwise (Step 4) the algorithm returns the query answer after adding random noise sampled from a Laplace distribution with mean 0 and scale λ/ε, where λ (defined in Step 2) is a function of the query and m, and ε is the second privacy parameter.

Algorithm 2.1 Output perturbation (Inputs: database I, query q, privacy parameters γ, ε)

1: Let g be the number of subgoals in query q and v the number of variables.
2: Set λ = (8(v + 1)g^2 log m)^{g−1} and p = g/m.
3: With probability p/γ, output A(I) chosen uniformly at random from {0, 1, . . . , m^v}.
4: Otherwise (with probability 1 − p/γ), output A(I) = q(I) + Lap(λ/ε).

Dwork et al. [6] introduced a perturbation algorithm that, for a given query q, adds Laplace noise with scale λ/ε and guarantees ε-indistinguishability. The λ for that algorithm is determined by the global sensitivity of the query, which roughly denotes the worst-case change in the query answer on changing one tuple in any possible database instance. The key difference in our algorithm is the choice of λ. Returning a completely random response with small probability in Step 3, along with the relaxed notion of adversarial privacy, allows us to choose a much smaller value for λ in Step 4, and as a consequence, our algorithm has much better utility. Roughly speaking, λ for our algorithm is determined by the adversary's prior expectation of the query sensitivity instead of the worst-case sensitivity over all possible database instances. See Sec. 3.2 for the formal details. For example, for any query in Fig. 1(c), the random noise that our algorithm adds results in quite acceptable utility, while the amount of noise needed to ensure ε-indistinguishability renders the answers useless to a data analyst (we give a detailed comparison in Sec. 2.4). The novelty of our approach lies in the privacy analysis. Instead of guaranteeing ε-indistinguishability, we guarantee a somewhat weaker adversarial privacy. While these two approaches to privacy are quite distinct and hard to compare, we will show in Sec. 2.5 a remarkable connection between the two definitions. Using this connection we are able to claim that our algorithm guarantees adversarial privacy against the same adversaries as ε-indistinguishability, as long as they are bounded quantitatively. We explain this below, but first we briefly discuss the utility of the algorithm.

2.2.1 Utility

Our definition of utility is based on how close the output A(I) is to the actual answer q(I).

Definition 2.1 ((Σ, Γ)-usefulness). An algorithm A is called (Σ, Γ)-useful w.r.t. query q if for all inputs I:

    Pr[ |q(I) − A(I)| ≥ Σ ] ≤ Γ/m

Here the probability Pr is computed solely over the random coin tosses of the algorithm A. We prove in the appendix (Sec. D) the following theorem.

Theorem 2.2 (Utility). Let q be any query with g subgoals. Then Algorithm 2.1 is (Σ, Γ)-useful w.r.t. q for Σ = O(log^g m/ε) and Γ = O(1/γ).
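The theorem follows by combining the two branches of the algorithm with the Laplace tail bound; a sketch of the calculation (ours, treating g and v as constants of the query):

```latex
% Step 3 fires with probability p/\gamma; otherwise the error is Laplace.
\Pr\big[\,|q(I)-A(I)| \ge \Sigma\,\big]
  \;\le\; \frac{p}{\gamma} + \Pr\big[\,|\mathrm{Lap}(\lambda/\epsilon)| \ge \Sigma\,\big]
  \;=\; \frac{g}{\gamma m} + e^{-\epsilon\Sigma/\lambda}.
% Choosing \Sigma = (\lambda/\epsilon)\ln m makes the second term 1/m,
% so \Gamma = O(1/\gamma); and \lambda = (8(v+1)g^2\log m)^{g-1}
% = O(\log^{g-1} m) gives \Sigma = O(\log^{g} m/\epsilon).
```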

Example. Let us fix the privacy parameters ε and γ as constants independent of m. Consider again the query ∆ that counts the number of triangles in R. Then g, the number of subgoals of ∆, is 3, and hence Algorithm 2.1 guarantees (Σ, Γ)-usefulness for Σ = O(log^3 m) and Γ = O(1). This means that the algorithm outputs a response with error O(log^3 m), except with probability O(1/m). Recall that m is the number of nodes in R.

2.3 Adversarial Privacy

In the adversarial definition of privacy an adversary is a probability distribution: it defines the adversary's prior belief. An algorithm is private if the a posteriori distribution (after seeing the algorithm's output) is close to the prior. Moreover, this property should hold for any reasonable adversary. In our setting, the adversary's prior knowledge is a probability distribution P on database instances. We denote P(I) the probability of an instance I, and P(t) the marginal probability of a tuple, P(t) = Σ_{I: t∈I} P(I). After seeing the output O = A(I) of the algorithm, the adversary changes his belief to the a posteriori P(t | O). Privacy means that the a posteriori should be close to the prior. Since we do not know P (only the attacker knows it) we consider an entire class of distributions 𝒫: the adversary can be any P ∈ 𝒫.

Definition 2.3 (Adversarial Privacy). An algorithm A is (ε, γ)-adversarially private w.r.t. 𝒫 iff for all outputs O of A, tuples t ∈ Tup, and prior distributions P ∈ 𝒫 the following holds:

    P(t | O) ≤ e^ε P(t) + γ

Note that for a fixed O, the probability P(t|O) depends on the prior P and the definition of A, but not on the coin tosses (randomness) of any particular run of A. We emphasize that the privacy guarantee is given for every output O and thus for every input instance I, and is not an average-case guarantee. If an algorithm is (ε, 0)-adversarially private, then we simply say that it is ε-adversarially private. Related definitions in the literature include (ρ_1, ρ_2)-breach [7], (d, γ)-privacy [17], and SafeΠ [8]. Like the latter, we do not protect against negative leakages (but see Sec. 4.3), which is acceptable in many, but not all, applications.

The main difficulty in applying adversarial privacy is choosing the class of adversaries 𝒫. If one chooses 𝒫 to be the set of all possible adversaries, then no algorithm can be both useful and adversarially private. Paraphrasing an example from Dwork [5], suppose a malicious adversary doesn't know how many friends Alice has, but knows that Alice has a number of friends that is exactly 0.01% of the total number of triangles in the data. Then, if an algorithm returns the total number of triangles, a significant amount of information has been leaked. We cannot hope to protect adversarial privacy against such an omnipotent adversary, and the common practice is for a security manager to assume a weaker threat. The ability to adjust the threat model, hence the class of adversaries 𝒫, is of key importance in practice, yet no principled way to define the class 𝒫 has been proposed in the literature.

In this paper we propose a principled definition of the power of the adversary, as quantitative restrictions of the class PTLM (defined formally in Sec. 2.5). The importance of the class PTLM is that adversarial privacy with respect to this class is precisely ε-indistinguishability [6] (reviewed in Sec. 2.5). In other words, ε-indistinguishability can be described precisely in terms of a particular class of adversaries, namely PTLM adversaries, which justifies our choice. Once we have this yardstick, we introduce the following quantitative restrictions on adversaries:

Definition 2.4 ((δ, k)-bounded). Let δ ∈ [0, 1] and k ∈ N. An adversary P is (δ, k)-bounded if (i) at most k tuples have marginal probability P[t] = 1, and (ii) for all other tuples t ∈ Tup the marginal probability P[t] is less than δ.

Algorithm 2.1 guarantees privacy for all queries in a certain subset of CQ(≠), which we call stable queries. We postpone the formal definition of stable queries until Sec. 3.1, but state here (see Sec. C.3 in the appendix for a proof) that the following important class of queries is stable.

Proposition 2.5 (Subgraph query stability). Let q be the query that counts the number of occurrences of a subgraph h in G. If h is connected (in the traditional graph-theoretic sense), then q is stable.

All queries in Fig. 1(c) with the exception of d_{i,v} correspond to queries that count connected subgraphs, and hence are stable. The class of stable queries is in fact much more general and includes the query d_{i,v} as well. We can now state our first main result (the proof technique is in Sec. 3.2):

Theorem 2.6 (Privacy). Let δ = log m/m and k = log m be the parameters bounding the adversary's power. Then if the input query q is stable, Algorithm 2.1 ensures (ε, γ)-adversarial privacy w.r.t. any (δ, k)-bounded PTLM adversary.

Finally, we comment on choosing the parameters for bounding the adversary quantitatively. In many applications, assuming a (δ, k) bound on the adversary is quite reasonable. For example, in online social networking applications, users can only access a limited subgraph of the actual social network. Facebook allows users to know only their friends, and the friends of their friends. Thus, a bound of k = log m is a reasonable assumption on the power of the adversary. Another example is the active attackers considered in [1], who can create an arbitrary subgraph of k = O(√(log m)) edges. For edges other than the k known edges, we assume an upper bound of δ = log m/m, which is also appropriate for the sparse networks observed in practice [15]. The value is not critical for the functioning of our algorithm, and we consider larger δ in Sec. 4.3.
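For intuition, Definitions 2.3 and 2.4 can be checked by brute force on toy distributions. The Python sketch below (ours, with hypothetical names) does exactly that for a single prior P and a mechanism tabulated as explicit probability tables; the full definition quantifies over every prior in the class.

```python
import math

def is_adv_private(prior, mech, tuples, eps, gamma):
    """Check Definition 2.3 for one prior P.
    prior: dict frozenset(instance) -> probability;
    mech:  dict (instance, output) -> Pr[A(I) = O]."""
    outputs = {O for (_, O) in mech}
    for O in outputs:
        pr_O = sum(prior[I] * mech.get((I, O), 0.0) for I in prior)
        if pr_O == 0.0:
            continue
        for t in tuples:
            p_t = sum(prior[I] for I in prior if t in I)           # P(t)
            p_t_O = sum(prior[I] * mech.get((I, O), 0.0)
                        for I in prior if t in I) / pr_O           # P(t | O)
            if p_t_O > math.exp(eps) * p_t + gamma + 1e-12:
                return False
    return True

def is_bounded(prior, tuples, delta, k):
    """Check Definition 2.4: at most k sure tuples, all others below delta."""
    marginals = [sum(prior[I] for I in prior if t in I) for t in tuples]
    sure = sum(1 for p in marginals if p >= 1.0 - 1e-12)
    return sure <= k and all(p < delta for p in marginals if p < 1.0 - 1e-12)
```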

2.4 Discussion

To put our results in perspective, we compare Algorithm 2.1 to the algorithm of Dwork et al. [6] (call it DMNS) in terms of both privacy and utility. These algorithms are modeled on different notions of privacy, but our equivalence result (Section 2.5) allows us to relate them more easily. DMNS is ε-indistinguishable, which is equivalent to (ε, 0)-adversarial privacy w.r.t. any adversary in the PTLM class. In contrast, Algorithm 2.1 only protects against a subclass of PTLM, namely those adversaries that are (δ, k)-bounded. It also tolerates slightly more disclosure: it is (ε, γ)-adversarially private where γ > 0.

To compare utility, we fix ε to be the same and γ to be some positive constant. In terms of utility, both algorithms offer guarantees that depend on the particular query. We compare the social network queries in Figure 1(c) in terms of (Σ, Γ)-usefulness. We fix Γ = O(1), so both algorithms have a O(1/m) probability of returning a useless output, and compare Σ, the bound on error. For Algorithm 2.1, the bound on the error follows from Theorem 2.2. For DMNS, the bound comes from the fact that its error is distributed exponentially with mean GS_q/ε, where GS_q is the global sensitivity of q. Global sensitivity is formally defined in Section 3.1, but intuitively it is the worst-case change in the query answer given the change of a single tuple in any possible database instance. Table 1 compares the error bounds of the two algorithms. As additional context, it also shows a lower bound on the error of an optimal ε-indistinguishable algorithm, denoted OPT_ind; this bound is derived in the appendix (Sec. E). In order to achieve ε-indistinguishability for the example queries, one needs to introduce error that is linear or polynomial in m (the number of nodes in the graph), which makes the answers useless for social network analysts. By contrast, our algorithm guarantees error bounds that are polylogarithmic in m. (The d_{i,v} query has a range of {0, . . . , m}, so the only practically useful algorithm is Algorithm 2.1 for small i. For K_i, Algorithm 2.1 is useful for small i, but no algorithm is useful for large i.)

Table 1: The error bound (Σ) of Algorithm 2.1 compared with ε-indistinguishable alternatives for the queries of Figure 1(c). Recall that m = the number of nodes in the graph.

    Query    | OPT_ind    | DMNS [6]            | Algorithm 2.1
    ---------+------------+---------------------+--------------------
    ∆        | Ω(m)       | Θ(m log m/ε)        | Θ(log^3 m/ε)
    ∨        | Ω(m)       | Θ(m log m/ε)        | Θ(log^2 m/ε)
    c_i      | Ω(m^{i−2}) | Θ(m^{i−2} log m/ε)  | Θ(log^i m/ε)
    d_{i,v}  | Ω(m)       | Θ(m log m/ε)        | Θ(log^i m/ε)
    K_i      | Ω(m^{i−2}) | Θ(m^{i−2} log m/ε)  | Θ(log^{i^2} m/ε)
    h_{i,j}  | Ω(m^{i−2}) | Θ(m^{i−2} log m/ε)  | Θ(log^j m/ε)

Note that all of these queries have joins. If a query does not have joins (such as counting the number of edges in the graph), it has low global sensitivity, and DMNS has a similar error rate (with better constants). However, these examples illustrate that joins can lead to high global sensitivity, and Algorithm 2.1 can guarantee much more accurate answers than an ε-indistinguishable algorithm. Queries with high global sensitivity were also considered by Nissim et al. [16], who propose adding noise proportional to a smooth upper bound on the local sensitivity (i.e., the sensitivity of the instance rather than the worst-case global sensitivity). A direct comparison with our approach is difficult: the smooth bounds were derived for a few specific queries (e.g., median), and it is not known how to compute them in general, nor in particular for the social network queries we consider here. (See Section 5.2 for further discussion.)

2.5 Adversarial Privacy vs. Indistinguishability

In this section, we state our second main result: ε-indistinguishability is the same as ε-adversarial privacy w.r.t. PTLM adversaries. Indistinguishability was defined in [6] and we review it briefly. Consider a randomized algorithm A that, on a given input instance I, returns a randomized answer A(I). Given an answer O, Pr(A(I) = O) denotes the probability (over the random coin tosses of A) that the algorithm returns exactly O on input I. For simplicity, we denote I + t the instance obtained by adding a tuple t which is not in I to the set I.

Definition 2.7 (ε-indistinguishability). An algorithm A is ε-indistinguishable iff for all instances I and I′ that differ on a single tuple (i.e., |I| = |I′| and there exist tuples t, t′ such that I + t = I′ + t′), and for any output O, the following holds:

    Pr[A(I′) = O] ≤ e^ε Pr[A(I) = O]    (1)

A related notion is that of ε-differential privacy [5], which guarantees condition (1) for all I and I′ such that one contains an additional tuple compared to the other (i.e., there exists a tuple t such that either I + t = I′ or I′ + t = I). The distinction between these two notions of privacy is sometimes blurred in the literature, and the term differential privacy is used to denote indistinguishability. However, we use the term indistinguishability throughout the paper, since this is the precise definition for which our main results apply.

Indistinguishability does not mention an adversary, but was shown to protect against informed adversaries [6], who know all but one tuple in the database. However, informed adversaries do not have tuple correlations. Against adversaries with arbitrary correlations (such as "Alice's height is 5 inches shorter than the average" [5]), ε-indistinguishability does not guarantee adversarial privacy. An important question is: for which adversaries does ε-indistinguishability guarantee adversarial privacy? Our result shows that ε-indistinguishability corresponds precisely to ε-adversarial privacy for a certain class PTLM. The class PTLM has never been defined before and requires us to provide the following background on log-submodular and planar distributions.

While well known in probability theory [2], log-submodular (LM) distributions have been considered only recently in the context of data privacy, in [8].

Definition 2.8 (Log-submodularity). A probability distribution P is log-submodular (denoted LM) if for all instances I, I′ the following condition holds:

    P(I)P(I′) ≥ P(I ∩ I′)P(I ∪ I′)    (2)

To understand the intuition behind a log-submodular distribution, recall that a distribution P is called tuple-independent if for any two tuples t_1, t_2, P(t_1, t_2) = P(t_1)P(t_2); we write IND for the class of all tuple-independent distributions. One can check that P is tuple-independent iff for any two instances¹ P(I)P(I′) = P(I ∩ I′)P(I ∪ I′). Hence, IND ⊆ LM. Most prior work on adversarial data privacy, including our own [14, 17], has considered only tuple-independent

¹ By direct expansion of all four probabilities, e.g. Pr(I) = ∏_{t∈I} P(t) · ∏_{t∉I} (1 − P(t)), etc.

adversaries. The LM adversaries, considered first in [8], are more powerful since they allow correlations. However, they turn out to be too powerful to capture ε-indistinguishability because, as we explain next, they are not closed under marginalization.

Marginalization. Let Q ⊆ Tup be any set of tuples. Given an algorithm A, its marginalization w.r.t. Q is A_Q(I) = A(I ∩ Q). Thus, A_Q just ignores the tuples outside the set Q. One can easily check that if A is ε-indistinguishable and Q is a fixed set, then A_Q is also ε-indistinguishable. Thus, if ε-indistinguishability is to be captured by some class of adversaries, that class must be closed under the operation of marginalization, defined as follows:

Definition 2.9 (Marginal distribution). Let P be any probability distribution over possible databases. Then the marginal distribution over Q, denoted P_Q, is a probability distribution over the subsets S of Q defined as P_Q(S) = Σ_{I: I∩Q=S} P(I).

LM distributions are not closed under marginalization, hence they cannot capture ε-indistinguishability. This justifies our definition of total log-submodular distributions.

Definition 2.10 (Total log-submodularity). A probability distribution P is total log-submodular (TLM) if for every set Q ⊆ Tup, the marginal distribution P_Q is log-submodular. Formally:

    ∀Q, ∀S, S′ ⊆ Q,  P_Q(S)P_Q(S′) ≥ P_Q(S ∩ S′)P_Q(S ∪ S′)

The TLM class contains IND and all distributions with negative correlations. Intuitively, by negative correlations we mean that P(t_1|t_2) ≤ P(t_1) for all tuples t_1, t_2. We formalize the notion of negative correlations in Sec. 3.3.

Planarity. Finally, our last definition is motivated by the following observation. It is implicit in the definition of indistinguishability that the size of the database is made public. To see that, consider the following algorithm A: given an instance I, the algorithm returns the cardinality of I. A is trivially ε-indistinguishable, for every ε > 0. In terms of adversarial privacy, this means that the adversary knows the size of the database. The corresponding notion in probability theory is called planarity [18].

Definition 2.11 (Planar distributions). A distribution P is planar if there exists a number n such that for all I, if |I| ≠ n then P(I) = 0. We call n the "size of the database" for the distribution P.

We denote by PTLM the class of planar TLM adversaries. At last, we show our second main result: a complete equivalence of ε-indistinguishability and adversarial privacy w.r.t. PTLM adversaries. The proof technique is in Sec. 3.3.

Theorem 2.12 (Equivalence). Let A be any algorithm and ε > 0 be any parameter. Then the following two statements are equivalent: (i) A is ε-indistinguishable; (ii) A is ε-adversarially private w.r.t. PTLM adversaries.

This equivalence result has interesting implications for both privacy definitions. In Sec. 4, we propose a relaxation of ε-indistinguishability and also show how adversarial privacy provides protection of boolean properties and resistance to some positive correlations.
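Definitions 2.9 through 2.11 are easy to verify by exhaustive enumeration on small domains; the Python sketch below (our illustration, not part of the paper) checks planarity and total log-submodularity for a distribution tabulated as a dict over instances.

```python
from itertools import combinations

def powerset(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def marginal(P, Q):
    """P_Q(S) = sum of P(I) over instances I with I ∩ Q = S (Definition 2.9)."""
    PQ = {}
    for I, p in P.items():
        PQ[I & Q] = PQ.get(I & Q, 0.0) + p
    return PQ

def is_planar(P):
    """Definition 2.11: all probability mass on instances of a single size."""
    return len({len(I) for I, p in P.items() if p > 0}) <= 1

def is_tlm(P, tup):
    """Definition 2.10: every marginal P_Q must be log-submodular."""
    for Q in powerset(tup):
        PQ = marginal(P, Q)
        for S in powerset(Q):
            for S2 in powerset(Q):
                lhs = PQ.get(S, 0.0) * PQ.get(S2, 0.0)
                rhs = PQ.get(S & S2, 0.0) * PQ.get(S | S2, 0.0)
                if lhs < rhs - 1e-12:
                    return False
    return True

# A planar prior: exactly one of t1, t2 is in the database.
tup = frozenset({"t1", "t2"})
P = {frozenset({"t1"}): 0.5, frozenset({"t2"}): 0.5}
print(is_planar(P), is_tlm(P, tup))  # True True: P is a PTLM adversary
```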

Figure 2: Graphs showing high sensitivity. (a) Triangles; (b) length-2 paths.

3. TECHNICAL RESULTS

In this section, we define and characterize the class of stable queries, the queries for which Algorithm 2.1 protects privacy. We also explain the technical machinery behind our two main results: the privacy result (Theorem 2.6) and the equivalence result (Theorem 2.12).

3.1 Stable Queries

The intuition behind the definition of stable queries is based on the notion of local sensitivity. This notion was defined in [16] and denotes the maximum change in the query answer on changing any one tuple of the particular input instance.

Definition 3.1 (LS_q [16]). The local sensitivity of a query q on instance I is the smallest number LS_q(I) such that for all database instances I′ that differ on a single tuple from I (i.e., |I| = |I′| and there exist tuples t, t′ such that I + t = I′ + t′),

    |q(I) − q(I′)| ≤ LS_q(I)

Global sensitivity [6] is the worst-case local sensitivity over all possible instances on the domain. Dwork et al. [6] propose an algorithm that guarantees good utility for all queries with low global sensitivity. The queries listed in Fig. 1(c), however, have high global sensitivity. For instance, the global sensitivity of ∆ defined in Figure 1(c) is m − 2, where m is the number of nodes in G. This is illustrated by the graph shown in Figure 2(a). Suppose the graph contains the edge (Alice, Bob), but does not contain the edge (Bob, Alice). Then the number of triangles in the graph is 0. If we change the edge (Alice, Bob) to the edge (Bob, Alice), the number of triangles becomes m − 2, showing that the sensitivity of ∆ is m − 2. Similarly, Figure 2(b) illustrates the high global sensitivity of ∨.

Stable queries are those that have low local sensitivity in expectation, where the expectation is computed over database instances drawn from the adversary's prior distribution. Since we do not know the adversary's prior, we give a syntactic criterion for queries that guarantees low expected local sensitivity based only on the quantitative bounds δ, k of the adversary (see Def. 2.4). We define two notions: dense queries, which are queries that have low expected value w.r.t. the adversary's prior distribution, and the derivative set of a query q, which contains queries that upper bound the local sensitivity of q. Roughly speaking, a stable query is a query whose derivative set contains only dense queries.
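The Figure 2(a) construction can be reproduced directly; the short Python sketch below (ours) builds the worst-case graph and shows the triangle count jumping from 0 to m − 2 when a single edge is flipped.

```python
from itertools import permutations

def triangles(R):
    """Distinct directed triangles; each 3-cycle contributes its three
    rotations as satisfying assignments, so divide by 3."""
    nodes = {n for e in R for n in e}
    return sum((x, y) in R and (y, z) in R and (z, x) in R
               for x, y, z in permutations(nodes, 3)) // 3

m = 50
others = [f"n{i}" for i in range(m - 2)]  # the m - 2 nodes besides Alice, Bob
G = ({("Alice", w) for w in others} | {(w, "Bob") for w in others}
     | {("Alice", "Bob")})
print(triangles(G))                                  # 0: no directed 3-cycle
G2 = (G - {("Alice", "Bob")}) | {("Bob", "Alice")}   # flip one edge
print(triangles(G2))                                 # m - 2 = 48 triangles
```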

Dense queries. Denote goals(q) the set of subgoals in the query q. Also denote vars(q) (resp. const(q)) the set of variables (resp. constants) in q. We define the density of q as the ratio |goals(q)|/|vars(q)|. We say a query is dense if its density, and the density of all of its possible homomorphic images (defined below), is at least one.

Definition 3.2 (Homomorphism). A homomorphism is a function h : vars(q) → vars(q) ∪ const(q) that maps each variable in vars(q) to either a variable in vars(q) or a constant in const(q), while ensuring that the inequality constraints of the query are satisfied. For a homomorphism h, denote q_h the query obtained by replacing, for all variables x ∈ vars(q), each occurrence of x in q by the symbol h(x). q_h is called the h-homomorphic image of q.

We can now define dense queries formally.

Definition 3.3 (Dense query). Let H denote the set of all homomorphisms for a query q. We say q is dense if all the homomorphic images of q (i.e., queries in the set {q_h | h ∈ H}) have density ≥ 1.

Examples. The query ∆ of Fig. 1(c) is dense. It has a density of 1, and all of its homomorphic images have higher densities. The query ∨, defined as R(x, y)R(y, z), x ≠ y, y ≠ z, z ≠ x, is not dense, as its density is 2/3. The query q(x, y, z) :− R(x, y, z)R(x, z, y)R(x, y, y) is not dense, even though its density is 1. This is because its homomorphic image q_h(x, z) :− R(x, z, z) (for h(y) = z) has density 1/2.

Derivative. Now we define the notion of derivative used for stable queries. Define a set of new constants dummy with |dummy| = |vars(q)| and the set D_0 = dummy ∪ const(q). Let x = {x_1, . . . , x_l} be any l variables of q and a = {a_1, . . . , a_l} be any l constants from D_0. Denote q[x/a] the query obtained by substituting all occurrences of the variable x_i in q by the constant a_i, for all i ∈ {1, . . . , l}.

Definition 3.4 (Derivative). Let g be a subgoal of q with l variables x = {x_1, . . . , x_l} and let a = {a_1, . . . , a_l} be a set of l constants. The derivative of q w.r.t. (g, a), denoted ∂q/∂(g, a), is the query obtained by removing the subgoal g from the query q[x/a].

Intuitively, the query ∂q/∂(g, a) represents the change in the answer to q if the tuple a = {a_1, . . . , a_l} is added to the relation g. For instance, ∂∆/∂(R(x, y), (a, b)) is the query R(b, z)R(z, a), which measures the change in the number of triangles on adding the edge R(a, b). Next we define the derivative set, which contains all possible derivatives of the query.

Definition 3.5 (Derivative set). The derivative set of q, denoted ∂^1(q), is the set of queries {∂q/∂(g, a) | g ∈ goals(q), a ∈ D_0^{|vars(g)|}}.

For technical reasons, we also need to define higher-order derivative sets. The i-th order derivative set of q, denoted ∂^i(q), is recursively defined as the set of derivative queries of the (i − 1)-th order derivative set. That is, ∂^i(q) = ∪_{q′ ∈ ∂^{i−1}(q)} ∂^1(q′). We can now define stable queries.

Definition 3.6 (Stable). A query q is stable if for all i ≥ 1 and queries q′ ∈ ∂^i(q), q′ is dense.

Examples. Any single-subgoal query is automatically stable. Proposition 2.5 proves the stability of a large class of queries. d_{i,v}, defined as q(x_i) :− R(v, x_1)R(x_1, x_2) . . . R(x_{i−1}, x_i), is also stable. This is because all derivatives of d_{i,v} are dense.
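The density test of Definition 3.3 can be mechanized for small queries by enumerating all candidate homomorphisms; the following Python sketch (ours, with a hypothetical encoding of queries as sets of atoms) does so by brute force.

```python
from itertools import product

def is_var(term):
    return term.islower()  # convention: lowercase = variable, else constant

def density(goals):
    """|goals(q)| / |vars(q)|; a fully ground image is trivially dense."""
    vs = {t for atom in goals for t in atom if is_var(t)}
    return len(goals) / len(vs) if vs else float("inf")

def is_dense(goals, ineq, consts=()):
    """Definition 3.3: every homomorphic image must have density >= 1.
    goals: set of atoms (tuples of terms); ineq: pairs that must differ."""
    vs = sorted({t for atom in goals for t in atom if is_var(t)})
    for image in product(vs + list(consts), repeat=len(vs)):
        h = dict(zip(vs, image))
        if any(h.get(x, x) == h.get(y, y) for x, y in ineq):
            continue  # h violates an inequality constraint: not a homomorphism
        img = {tuple(h.get(t, t) for t in atom) for atom in goals}
        if density(img) < 1:
            return False
    return True

tri = {("x", "y"), ("y", "z"), ("z", "x")}        # ∆ over R
ineq = {("x", "y"), ("y", "z"), ("z", "x")}       # x≠y, y≠z, z≠x
print(is_dense(tri, ineq))                        # True: ∆ is dense
print(is_dense({("x", "y"), ("y", "z")}, ineq))   # False: ∨ has density 2/3
```

The brute-force search mirrors the coNP witness discussed next: a non-dense homomorphic image is exactly the certificate of non-stability.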

Checking for stable queries. The query complexity of checking whether a query is stable is in coNP. To see this, note that a witness showing a query is not stable is simply a derivative query and a homomorphism for the derivative. Checking whether the homomorphic image of the derivative has density < 1 can then be done in polynomial time. Despite the high query complexity, stability is a property of the query and does not depend on the data. This is an advantage compared to some techniques, such as the approach of Nissim et al. [16], which must look at the data to determine how to answer the query. (See Section 5 for more discussion of related work.)

3.2 Proof of Privacy (Theorem 2.6)

The proof relies on the property that the local sensitivity of a query is bounded w.h.p. according to the adversary's prior distribution. A similar notion was also considered in [10]. Denote P(LS_q > λ) = Σ_{I: LS_q(I) > λ} P(I), the probability that the local sensitivity of q is greater than λ according to the adversary's prior distribution P. Next we show the value of λ that makes this probability small for all stable queries.

Proposition 3.7 (Result-1). Let δ = log m/m and k = log m. Let P be any (δ, k)-bounded PTLM adversary and q be a stable query with g subgoals and v variables. Then for λ = (8(v + 1)g^2 log m)^{g−1}, we have

    P(LS_q > λ) ≤ g/m^{v+1}

The proof of Proposition 3.7 is in the appendix (Sec. C.1). It uses new and non-trivial concentration results, which are also discussed in the appendix (Sec. A). The concentration results extend those shown by Vu [19] for tuple-independent distributions to PTLM distributions. These results may be of independent interest, as they work for a general class of queries and probability distributions. Next we show the following privacy result for Algorithm 2.1. Its proof appears in the appendix (Sec. C.2) and exploits the connection between ε-indistinguishability and adversarial privacy w.r.t. PTLM adversaries.

Proposition 3.8 (Result-2). Let P be any PTLM adversary and let q be any query with v variables. Let ε, γ be the privacy parameters and let λ, p be the parameters used in Steps 3 and 4 of Algorithm 2.1. If P(LS_q > λ) ≤ p/m^v, then Algorithm 2.1 guarantees (ε, γ)-adversarial privacy w.r.t. P while answering q.

Result-1 (Prop. 3.7) shows that if P is a (δ, k)-bounded PTLM adversary, then any stable query q satisfies P(LS_q > λ) ≤ p/m^v for λ = (8(v + 1)g^2 log m)^{g−1} and p = g/m. These are exactly the values used in Steps 3 and 4 of Algorithm 2.1. Thus we can apply Result-2 (Prop. 3.8) to the adversary P and query q and obtain that Algorithm 2.1 preserves (ε, γ)-adversarial privacy w.r.t. P while answering q. Since this holds for any (δ, k)-bounded PTLM adversary P and stable query q, we get the privacy result of Theorem 2.6.

3.3 Proof of the Equivalence (Theorem 2.12)

In this section, we explain the technique behind the proof of Theorem 2.12. We treat each tuple t ∈ Tup as a boolean variable, which is true iff t belongs to the actual database instance. We also define boolean formulas over these boolean variables. For example, the formula (t_1 ∨ t_2) states that either t_1 or t_2 belongs to the database. We denote vars(b) the set of variables that occur in a formula b. A formula m is monotone if switching any variable t from false to true does not change the value of m from true to false. For example, the formula m = t_1 ∨ t_2 is monotone. Our proof proceeds by establishing a connection between the PTLM class and a class of distributions, called CNA, which have only negative correlations among tuples. Many notions of negative correlation exist in the literature, and our proof exploits the relationships among them. We discuss them in the appendix (Sec. B.1). Here we just define the most important one.

Definition 3.9 (NA). A distribution P is said to be negatively associated (NA) if for all tuples t and monotone formulas m s.t. t ∉ vars(m), P[t ∧ m] ≤ P[t]P[m].

For a distribution P, denote P_t the distribution obtained by conditioning P on tuple t (i.e., for all I, P_t(I) = P(I|t)). Similarly define P_{¬t}. A class of distributions 𝒫 is said to be closed under conditioning if for all distributions P ∈ 𝒫 and all tuples t, the distributions P_t and P_{¬t} also belong to 𝒫. One can show that the PTLM class is closed under conditioning. Define CNA to be the largest subclass of NA distributions that is closed under conditioning. Next we show an equivalence relationship between PTLM and CNA.

Proposition 3.10 (Neg. correlations). Let P be a planar distribution. Then the following two statements are equivalent: (i) P is CNA; (ii) P is PTLM.

The proof of the proposition appears in the appendix (Sec. B.1). It is based on the main lemma (Lemma 3.2) of [9], which establishes a connection between balanced matroids (see [9]) and CNA distributions. To complete the proof of Theorem 2.12, it suffices to prove the following proposition.

Proposition 3.11 (Equivalence-2). Let A be any algorithm and ε > 0 be any parameter. Then the following statements are equivalent: (i) A is ε-indistinguishable; (ii) A is ε-adversarially private w.r.t. planar CNA adversaries.

The proof appears in the appendix (Sec. B.2). We give here a brief comment. The proof of (ii) ⇒ (i) is simple and works by considering an extreme adversary P who knows all but one tuple in the database. The proof of (i) ⇒ (ii) is harder and uses an expansion property of balanced matroids. Further, we conjecture that the planar CNA class is maximal among all classes for which ε-indistinguishability coincides with adversarial privacy. For this we would ideally like to prove that for every adversary P that is not planar CNA, there exists an ε-indistinguishable algorithm that is not ε-adversarially private w.r.t. P. Since we restrict ourselves to classes that are planar and closed under conditioning, we would like to prove this result for any adversary P which is not NA. Next we state a result (the proof is in Sec. B.3 of the appendix) that makes progress in this direction. We consider an adversary P which is not NA, and thus P[t ∧ m] ≥ P[t]P[m] for some tuple t and monotone formula m. However, we further assume that P[t ∧ m] ≥ P[t]P[m] + P[t]^2. With this additional assumption on P, we show the existence of an ε-indistinguishable algorithm that is not ε-adversarially private w.r.t. P.

Theorem 3.12 (Maximal). Let P be any adversary. Suppose there exist a tuple t and a monotone formula m such that t ∉ vars(m) and P[t ∧ m] ≥ P[t]P[m] + P[t]^2. Then there exists an ε > 0 and an ε-indistinguishable algorithm that is not ε-adversarially private w.r.t. P.

4. EXTENSIONS

As shown in Section 2.5, ε-indistinguishability is precisely adversarial privacy w.r.t. PTLM adversaries. This result has some interesting implications, and we discuss two in this section. We then discuss some extensions to Algorithm 2.1.

4.1 Relaxation of Indistinguishability

From the equivalence result of Theorem 2.12, we know that ε-indistinguishability implies adversarial privacy w.r.t. PTLM adversaries. Below we give a sufficient condition that implies adversarial privacy w.r.t. (δ, k)-bounded PTLM adversaries. This sufficient condition is expressed as a property of the algorithm, similar to ε-indistinguishability, and can be thought of as its relaxation. We believe that this relaxation can support algorithms with better utility than ε-indistinguishability, especially in the input perturbation setting, where perturbed data is to be published. To give the relaxation, we define the notion of ε-distinguishable tuples. Let t, t′ be two tuples. Given an instance I that contains t but not t′, denote I(t, t′) the instance obtained by replacing t in I by t′.

Definition 4.1 (ε-distinguishable tuple). An algorithm A makes a tuple t′ ε-distinguishable w.r.t. tuple t if there exist an instance I that contains t but not t′, and an output O of the algorithm A, for which Pr[O|I] ≥ e^ε Pr[O|I(t, t′)].

Thus, for an algorithm A that is not ε-indistinguishable, the ε-distinguishable tuples are those for which A fails to guarantee ε-indistinguishability.

Definition 4.2 (Relaxed indistinguishability). Let l be any positive integer. An algorithm A is (l, e^ε)-relaxed indistinguishable if for every tuple t there are at most l ε-distinguishable tuples.

If l = 0, then the relaxed definition is the same as ε-indistinguishability. If l = |Tup|, then the relaxed definition gives no privacy guarantee. The larger l, the weaker the relaxed definition.

Theorem 4.3 (Bounded adversarial privacy). Let 0 < r < 1 be any number and let A be ((1−δ)/δ · (e^{ε(1−r)} − 1), re^ε)-relaxed indistinguishable. Then A is also ε-adversarially private w.r.t. (δ, k)-bounded PTLM adversaries.

Theorem 4.3, proved in the appendix (Sec. B.4), gives a justification for relaxed indistinguishability. It states that the relaxation is sufficient to guarantee adversarial privacy w.r.t. (δ, k)-bounded PTLM adversaries. Furthermore, since adversarial privacy is a worst-case guarantee made over all possible inputs I and outputs of the algorithm, so is this relaxation. Thus it is different from the average-case relaxations considered in the past (such as (ε, δ)-semantic privacy [10]).

4.2 Some Properties of Adversarial Privacy

We mention here some properties of adversarial privacy that follow from its relationship with ε-indistinguishability.

4.2.1 Protecting Boolean Properties

The adversarial privacy definition 2.3 is stated to protect against tuple disclosures. However, in some cases, we would like to protect general boolean properties, such as "Does John have a social security number (ssn) starting with the digits 541?". We explain here how adversarial privacy protects all monotone boolean properties², which, as defined in Sec. 3.3, are boolean formulas with only positive dependence on tuples. For example, consider the relation R(name, ssn). Suppose a PTLM adversary P is interested in the ssn of John and knows that one tuple of the form R(John, *) exists in the database. Then adversarial privacy w.r.t. P implies that the posterior of P will be bounded for a property like R(John, 541*) discussed above. This, in fact, holds for any monotone boolean property over the set of tuples of the form R(John, *). Formally, we can show the following.

Proposition 4.4 (Boolean properties). Let A be any ε-adversarially private algorithm w.r.t. PTLM. Let T be any set of tuples and P be any PTLM adversary that knows that exactly one tuple in T belongs to the database (P(|T ∩ I| = 1) = 1). Then for any monotone boolean property m over tuples in T and for all outputs O of A, P(m|O) ≤ e^ε P(m).

² The monotonicity condition is because adversarial privacy protects against only positive leakage and allows negative leakage.

4.2.2 Protecting against Positive Correlations

As mentioned in Sec. 2.5, the PTLM class contains distributions with only negative correlations among tuples. Thus adversarial privacy w.r.t. PTLM seems to protect only against adversaries with negative correlations. However, we can show that protection against negative correlations automatically guarantees protection against adversaries with limited positive correlations, albeit with a weaker privacy parameter. We define below a class PTLM^k that contains distributions in which any tuple can have positive correlations with at most k − 1 other tuples, but must have negative correlations with all other tuples. We then show that ε-adversarial privacy w.r.t. PTLM adversaries implies kε-adversarial privacy w.r.t. PTLM^k adversaries. To define PTLM^k, we first need to consider the following superclass of LM distributions. Denote I \ I′ the set of tuples in I that are not in I′.

Definition 4.5 (LM^k). A probability distribution P is LM^k if for all instances I, I′ such that |I \ I′| ≥ k the following condition holds:

    P(I)P(I′) ≥ P(I ∩ I′)P(I ∪ I′)

Thus LM^1 is precisely the LM class. Further, LM^{k−1} ⊆ LM^k for all k. Now, analogously to TLM and PTLM, define the classes TLM^k, the set of distributions which are LM^k under every possible marginalization, and PTLM^k, the set of planar TLM^k distributions. Obviously, PTLM^{k−1} ⊆ PTLM^k for all k, and PTLM^1 = PTLM. Next we show the following result.

Proposition 4.6 (Positive correlations). Let A be any ε-adversarially private algorithm w.r.t. PTLM. Then A is a kε-adversarially private algorithm w.r.t. PTLM^k adversaries.

Note that the parameter ε appears as e^ε in the privacy definition (see Def. 2.3). Thus kε is actually exponentially worse than ε in terms of the privacy guarantee, and Proposition 4.6 makes practical sense only if we apply it for small k. The proof of Proposition 4.6 appears in the appendix (Sec. F) and exploits the collusion-resistance behavior of ε-indistinguishability.

4.3 Extensions to Algorithm 2.1

Multiple queries. Algorithm 2.1 as stated can answer only a single query. We now show how it can be extended to answer a sequence of queries (a sketch in code follows at the end of this subsection). Consider a sequence of input queries q_1, . . . , q_l. As in Step 2 of Algorithm 2.1, we compute the parameters λ and p. The parameter p is computed as before, i.e., p = g/m, where g is the total number of subgoals in all queries. However, a separate λ_i is computed for each q_i as (8g_i^2(v + 1) log m)^{g_i−1}. Here g_i is the number of subgoals in q_i and v is the total number of variables in all queries. After this, the algorithm is the same as Algorithm 2.1. It decides with probability p/γ to ignore the database and give a completely random answer to each query. Otherwise, with probability 1 − p/γ, it returns A_i(I) = q_i(I) + Lap(λ_i/ε) for each query q_i. The algorithm has the same privacy guarantee, but a slightly worse utility guarantee: it guarantees (Σ_i, Γ_i)-usefulness for Σ_i = O(l log^{g_i} m/ε). This Σ_i is l times worse than the Σ guaranteed in Theorem 2.2, where l is the number of queries answered.

Larger δ. In Section 2, we chose δ = log m/m. Here we note that a larger δ, with value m^α(log m/m), can be used while guaranteeing the same privacy and the same utility for any query q. The α can be as large as 1 − density(q_h), where q_h is the homomorphic image of q with the largest density.

Negative leakage. So far we have considered protection against positive leakage only. Negative leakage, which allows one to infer the absence of a tuple (i.e., P(¬t|O) ≫ P(¬t)), is allowed under our adversarial privacy definition 2.3. Nonetheless, it is possible to extend Algorithm 2.1 to protect against negative leakage as well. This extension guarantees that P(¬t|O) ≤ e^ε P(¬t) + γ. The extension works for a subclass of stable queries, and does so by using a larger λ (the noise parameter): the value of λ has to be O(log^g m) rather than the O(log^{g−1} m) used in Algorithm 2.1. We now explain the query class for which the extension works. Recall that stable queries are those that have dense derivative queries. The extension works for queries that are both stable and dense. ∆ is such a query; however, ∨ is not, as it is stable but not dense. In general, if q is a query that counts the number of occurrences of a subgraph h, and h is a connected graph with cycles, then q is both stable and dense.
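A minimal Python sketch of the multiple-query extension described above (our code; the names are ours, and answering each query from the range {0, . . . , m^v} in the random branch is our simplifying assumption):

```python
import math
import numpy as np

def answer_query_sequence(q_values, g_list, v, m, gamma, eps, rng=None):
    """Multi-query extension of Algorithm 2.1 (Sec. 4.3).
    q_values[i]: true answer q_i(I); g_list[i]: # subgoals of q_i;
    v: total # variables across all queries; m: domain size."""
    rng = rng or np.random.default_rng()
    p = sum(g_list) / m          # p uses the total number of subgoals
    if rng.random() < p / gamma:
        # One shared coin: ignore the database for the whole sequence.
        return [int(rng.integers(0, m**v + 1)) for _ in q_values]
    lams = [(8 * gi**2 * (v + 1) * math.log(m)) ** (gi - 1) for gi in g_list]
    return [qi + rng.laplace(0.0, lam / eps)
            for qi, lam in zip(q_values, lams)]
```

The single shared coin in the random branch is what preserves the privacy guarantee across the whole sequence, at the cost of the factor-l loss in Σ_i noted above.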

5. DISCUSSION & RELATED WORK

In this section we recap our contributions, review related work, and note open questions.

5.1 Recap

We have described an output perturbation mechanism (Algorithm 2.1) that guarantees utility for all input databases and answers a large class of queries much more accurately than competing techniques. In particular, for a number of high-sensitivity social network queries, the accuracy bounds improve from polynomial in the number of nodes to polylogarithmic (Table 1). We have shown in the full version of the paper that any ε-indistinguishable algorithm that provides utility for all inputs must add noise that is proportional to the global sensitivity. Thus, to achieve our utility results, we must relax the privacy standard. Our relaxed privacy condition is based on adversarial privacy, which requires that the posterior distribution be bounded by the prior. For all queries that are stable (a class that includes a number of practically important queries), the adversaries we consider believe, with high probability, that the query has low local sensitivity, as formalized by Proposition 3.7. The adversary's belief that the local sensitivity is low, say no more than λ, means that we can return accurate answers, with noise proportional to λ rather than the global sensitivity, and still ensure privacy (Proposition 3.8). Even when the input has high local sensitivity, privacy is preserved because of the novel step of the algorithm that returns a uniformly random answer. This event is rare, but its probability is just large enough to thwart the adversary (see the details of the proof in the full version). Our first main result (Theorem 2.6) shows that for CQ(≠) queries that are stable, the algorithm guarantees (ε, γ)-adversarial privacy w.r.t. (δ, k)-bounded PTLM adversaries. Query stability is a syntactic property that we show is related to the expected local sensitivity for this class of adversaries. Our second main result (Thm. 2.12) justifies the choice of adversary and makes an important connection between the two notions of privacy: the PTLM adversaries are precisely the class of adversaries against which any ε-indistinguishable mechanism provides adversarial privacy. This clarifies the relaxation of privacy we consider, since our algorithm guarantees privacy only against PTLM adversaries whose prior distribution is numerically bounded by (δ, k). We believe the equivalence between adversarial privacy and ε-indistinguishability will be of independent interest and could be a fruitful source of new insights. For example, in Sec. 4.1 we use this connection to develop a relaxed version of ε-indistinguishability (which is independent of any assumptions about the adversary).

5.2 Related Differential Privacy Results

Ours is not the only approach to accurately answering high-sensitivity queries. Nissim et al. [16] recently proposed an output perturbation algorithm that adds instance-based noise. The noise depends on the local sensitivity, but to avoid leaking information about the instance, the noise must follow a smooth upper bound of the local sensitivity. The hope is that for many database instances (e.g., social networks that occur in practice), a smooth sensitivity bound can be found that is substantially lower than the global sensitivity. Smooth sensitivity is a different, and complementary, strategy to the one we propose here. For queries with high global sensitivity, one cannot guarantee both utility for worst-case inputs and privacy against a worst-case adversary (i.e., ε-indistinguishability). Nissim et al. attack the problem by relaxing the utility guarantee, motivated by the premise that for typical inputs, local sensitivity is low. We attack the problem by considering weaker (but realistic) adversaries, motivated by the premise that in practice adversaries will not have near-complete knowledge of the input. To date, relatively little is known about the smooth sensitivity of functions of interest. The authors show how to compute the smooth sensitivity for two specific functions (median, minimum spanning tree) and propose a sampling technique that can be applied to others. Nevertheless, the smooth sensitivity of stable queries such as h_{i,j} may be difficult to compute. In addition, the utility bounds for smooth sensitivity depend on the particular instance, and we are not aware of any general analysis that characterizes the set of inputs for which accurate results are possible. Therefore we cannot compare these techniques directly with the results in Table 1.

Ganta et al. [10] define a very different notion of adversarial privacy than we consider here. Roughly, the definition says an algorithm is private if, for all adversaries, the adversary's posterior belief does not change much given the change of a single tuple in the database. Privacy is defined as a posterior-to-posterior comparison, rather than a prior-to-posterior comparison, although they also show a relationship with differential privacy.

5.3 Other Related Work

Existing adversarial privacy techniques [7, 17] are designed for tabular data and do not yield accurate query answers for social network data (or for queries with joins more generally). The private analysis of social networks and other graph data has recently begun to receive attention. The insufficiency of simple anonymization (by removing node names) has been demonstrated: Backstrom et al. [1] proposed a number of attacks on online social networks, and Hay et al. [11] performed a systematic assessment of the threats of structural attacks. A number of anonymization algorithms have been proposed. Each publishes a transformed graph that provides anonymity by creating structurally similar groups [11, 22, 12, 3]. The empirical evaluation of utility looks promising, but no formal guarantees are shown. The primary distinction with the present work is the privacy condition on the output: anonymous data can still leak secret information (e.g., see [13]); whereas our techniques formally bound the information disclosure. Finally, Ying and Wu [20] show that randomly perturbing the graph can provide a notion of adversarial privacy, but empirical results show the utility of the data diminishes rapidly with increased privacy.

5.4 Open Questions and Future Work

Precisely characterizing the relationship between query structure and global sensitivity is an interesting open problem. Our algorithm adds noise less than the global sensitivity for a number of important queries, but is not guaranteed to do so for all queries. An improved algorithm could set λ to the minimum of GS_q and the current quantity. In addition, some queries of interest (e.g., d_{i,v}, K_i) have poor utility when i, the number of joins, is large. Acceptable utility for large i currently requires additional restrictions on the adversary, but initial results suggest these restrictions may be avoidable.

6. REFERENCES

[1] L. Backstrom, C. Dwork, and J. M. Kleinberg. Wherefore art thou R3579X?: Anonymized social networks, hidden patterns, and structural steganography. In WWW, 2007.
[2] H. W. Block, T. H. Savits, and M. Shaked. Some concepts of negative dependence. In Ann. of Prob., 1982.
[3] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social networks. In PinKDD, 2008.
[4] T. C. Christofides and E. Vaggelatou. A connection between supermodular ordering and positive/negative association. In J. of Multivariate Analysis, 2004.
[5] C. Dwork. Differential privacy. In ICALP, 2006.
[6] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
[7] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[8] A. V. Evfimievski, R. Fagin, and D. P. Woodruff. Epistemic privacy. In PODS, 2008.
[9] T. Feder and M. Mihail. Balanced matroids. In STOC, 1992.
[10] S. Ganta, S. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, 2008.
[11] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. In VLDB, 2008.
[12] K. Liu and E. Terzi. Towards identity anonymization on graphs. In SIGMOD, 2008.
[13] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, 2006.
[14] G. Miklau and D. Suciu. A formal analysis of information disclosure in data exchange. In SIGMOD, 2004.
[15] M. Newman. The structure and function of complex networks. SIAM Review, 2003.
[16] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, 2007.
[17] V. Rastogi, D. Suciu, and S. Hong. The boundary between privacy and utility in data publishing. In VLDB, 2007.
[18] J. G. Shanthikumar and H.-W. Koo. On uniform conditional stochastic order conditioned on planar regions. In Ann. of Probab., 1990.
[19] V. Vu. Concentration of non-Lipschitz functions and applications. RSA, 2002.
[20] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In SIAM, 2007.
[21] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. In PinKDD, 2007.
[22] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In ICDE, 2008.

APPENDIX
A. CONCENTRATION RESULT

In this section, we describe a very general concentration result. This result will be used in the proof of Prop. C.3, but it may also be of independent interest. We describe the concentration result in its full generality and thus need to introduce some new notation. Let $Tup = \{t_1, \ldots, t_M\}$ be the set of all possible tuples. Let $P : 2^{Tup} \to [0, 1]$ be any PTLM probability distribution over database instances. Let $T \subseteq Tup$ be any database instance. We can think of $T$ as a binary string $(t_1, t_2, \ldots, t_M)$ of length $M = |Tup|$, where a tuple $t_i = 1$ if $t_i \in T$ and $t_i = 0$ if $t_i \notin T$. Let $Y$ be any multivariate polynomial over these binary variables $t_i \in Tup$. Since the $t_i$'s are binary variables, $t_i^2 = t_i$ for all $t_i$; thus $Y$ is multilinear. An example of $Y$ is $t_1 t_2 + t_3 t_4$. In this section, we analyze the concentration of $Y$ under the distribution $P$.

Denote by $k$ the degree of $Y$, which is the maximum number of variables occurring in a monomial of $Y$. Let $Y_{t_i}$ be the sum of all monomials of $Y$ which have $t_i$ as a variable. Loosely speaking, $Y_{t_i}$ can be thought of as the partial derivative of $Y$ w.r.t. the variable $t_i$, and we also denote it as $\frac{\partial Y}{\partial t_i}$. This partial derivative notion extends to multiple tuples. Formally, let $A \subseteq Tup$ be any subset of tuples. Denote by $Y_A$ the partial derivative of $Y$ w.r.t. $A$, i.e. $Y_A$ is the sum of those monomials of $Y$ which contain all the tuples in $A$ as variables. Denote:

$E_j(Y) = \max_{|A| \geq j} E(Y_A)$
$M_j(Y) = \max_{t_1, \ldots, t_M, \, |A| \geq j} Y_A(t_1, \ldots, t_M)$

We also define two constants which depend only on $k$: $c_k = 2k^{k/2}$ and $d_k = k$.

Theorem A.1 (Concentration). Let $P$ be any PTLM distribution over possible database instances. Let $Y$ be a polynomial of degree $k$, with $E_j(Y)$ and $M_j(Y)$ defined as above. For any integer $1 \leq \tilde{k} \leq k$ such that $M_{\tilde{k}} \leq 1$, and any collection of positive numbers $E_0 > E_1 > \ldots > E_{\tilde{k}} = 1$ and $\lambda$ satisfying
• $E_j \geq E_j(Y)$ for $0 \leq j \leq \tilde{k} - 1$,
• $E_j / E_{j+1} \geq \lambda + 4j \log M$ for $0 \leq j \leq \tilde{k} - 1$,

the following holds:
$P(|Y - E(Y)| \geq c_k \sqrt{\lambda E_0 E_1}) \leq d_k e^{-\lambda/4}$

Theorem A.1 generalizes Theorem 4.2 of [19], which applied only to tuple-independent distributions, to PTLM distributions. The theorem is proven in two steps. In the first step, we prove a very general martingale lemma (similar to the one in [19]) that works for any real-valued function $Y$ and any probability distribution $P$ with arbitrary correlations (not necessarily PTLM). In the second step, we exploit the polynomial structure of $Y$ to obtain a concentration result for PTLM distributions.
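As an informal sanity check on the type of guarantee in Theorem A.1, the following minimal Python simulation (purely illustrative, not part of the formal development; all names and parameter choices are assumptions) estimates the tails of a degree-2 multilinear polynomial, the number of length-2 paths in a random graph, $Y = \sum_{i<k,\,j} t_{ij} t_{jk}$, under a tuple-independent distribution with marginals $\log m/m$.

```python
import math
import random

def sample_paths2(n, p, rng):
    # Sample independent edge indicators t_{ij} ~ Bernoulli(p) and return
    # the number of length-2 paths, sum_j C(deg(j), 2). This equals the
    # degree-2 multilinear polynomial Y = sum_{i<k, j} t_{ij} t_{jk}.
    deg = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                deg[i] += 1
                deg[j] += 1
    return sum(d * (d - 1) // 2 for d in deg)

rng = random.Random(0)
n, trials = 40, 2000
p = math.log(n) / n                      # the sparse regime used in the paper
samples = [sample_paths2(n, p, rng) for _ in range(trials)]
mean = sum(samples) / trials
# Deviations should be polylogarithmic in n, not polynomial.
for dev in (10, 20, 40):
    frac = sum(abs(s - mean) > dev for s in samples) / trials
    print(f"P(|Y - E(Y)| > {dev}) ~= {frac:.4f}")
```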

A.1 Vu's Martingale Lemma

In [19], Vu proved a martingale-based concentration inequality for functions $Y$ that have low sensitivity with high probability. The proof works only for tuple-independent probability distributions. Here we extend it to the case when $P$ may have arbitrary correlations. We define the following properties of the function $Y$, which help us analyze its concentration. Let $x$ be any value in $\{0,1\}$ and $i$ any index in $\{1, 2, \ldots, M\}$. Denote:

$C_i(x, T)[Y, P] = |E(Y \mid t_1, \ldots, t_{i-1}, t_i = x) - E(Y \mid t_1, \ldots, t_{i-1})|$ (3)

Thus $C_i(x, T)[Y, P]$ depends on $x$ and on the values of $t_1, \ldots, t_{i-1}$ assigned according to the instance $T$. It also depends on the function $Y$ and the distribution $P$ with respect to which the expectation is computed. When the function $Y$ and probability distribution $P$ are clear from context, we drop $[Y, P]$ from the notation and write simply $C_i(x, T)$. Denote $C(T) = \max_{i,x} C_i(x, T)$. Intuitively, $C(T)$ is the maximum effect of changing one tuple of $T$ on the value of $Y$. Also define:

$V_i(T) = \sum_{x \in \{0,1\}} C_i^2(x, T) \, P(t_i = x \mid t_1, \ldots, t_{i-1})$ and $V(T) = \sum_{i=1}^{M} V_i(T)$ (4)

$C_Y = \max_T C(T)$ and $V_Y = \max_T V(T)$ (5)

If the function $Y$ or the distribution $P$ is not clear from the context, we explicitly append $[Y, P]$ to the notation; for example, $V_i(T)$ is written as $V_i(T)[Y, P]$ to make its dependence on $Y$ and $P$ explicit. Next, we state and prove the martingale lemma.

Lemma A.2 (Martingale Lemma). Let $X : 2^{Tup} \to \mathbb{R}$ be any real-valued function over database instances. Assume that $C$, $V$, $\lambda$ are positive numbers satisfying $C_X \leq C$, $V_X \leq V$ and $\lambda \leq 4V/C^2$. Then
$P(|X - E(X)| \geq \sqrt{\lambda V}) \leq 2 e^{-\lambda/4}$

Proof: W.l.o.g. we assume $E(X) = 0$; this is because the property $V_X$ is invariant under shifting of $X$. So it suffices to show:
$P(|X| \geq \sqrt{\lambda V}) \leq 2 e^{-\lambda/4}$

To prove this, we first establish the following lemma.

Lemma A.3. Let $Z : 2^{Tup} \to \mathbb{R}$ be any real-valued function over database instances. If $u \leq 1/C_Z$, then $E(e^{uZ}) \leq e^{u^2 V_Z}$.

To see that Lemma A.3 implies the main lemma, set $u = \sqrt{\lambda/(4V)}$. Since $\lambda \leq 4V/C^2$, we have $u \leq 1/C \leq 1/C_X$. Lemma A.3 then yields:
$E(e^{uX}) \leq e^{u^2 V_X} \leq e^{u^2 V}$

By Markov's inequality:
$P(X \geq \sqrt{\lambda V}) = P(e^{uX} \geq e^{u\sqrt{\lambda V}}) \leq e^{u^2 V - u\sqrt{\lambda V}} = e^{\lambda/4 - \lambda/2} = e^{-\lambda/4}$

Using the same argument, we can bound $P(-X \geq \sqrt{\lambda V})$. This completes the proof, modulo the proof of Lemma A.3, which is given below.

Proof of Lemma A.3: First we note the following proposition (the proof is not given here, but it follows directly from the proof of Proposition 3.3 in [19]).

Proposition A.4 (Proposition 3.3 of [19]). Let $x \in \{0,1\}$ be any binary random variable with probability $p$ (i.e. $\Pr(x = 1) = p$). Let $f(x)$ be any function on $\{0,1\}$ which has absolute value at most 1 and mean 0 (i.e. $E_x(f(x)) = 0$). Then
$E_x(e^{f(x)}) \leq e^{E_x f^2(x)}$

The proof of Lemma A.3 uses induction on $M$, the size of $|Tup|$. If $M = 1$, then for all $x \in \{0,1\}$, $C_1(x, T) = |Z(x)|$ (as we assume w.l.o.g. $E(Z) = 0$) and $V(x) = \sum_{x \in \{0,1\}} Z^2(x) P(t_1 = x) = E_x Z^2(x)$. Now $|uZ(x)| \leq u C_Z$ (as $|Z(x)| = C_1(x, T)$ and $C_Z$ by definition is $\max_{i,x,T} C_i(x, T)$). Further, by the hypothesis of Lemma A.3, $u C_Z \leq 1$. Thus $|uZ(x)| \leq 1$ for all $x \in \{0,1\}$, and the statement of the lemma follows from Proposition A.4 by substituting $f(x) = uZ(x)$.

Now consider a generic $M$. First notice that $C_1(x, T) = |E(Z \mid t_1 = x)|$ (as $E(Z) = 0$). Thus $C_1(x, T)$ does not depend on $T$. Consequently, $V_1(T) = \sum_{x \in \{0,1\}} C_1^2(x, T) P(t_1 = x)$ is also a constant and we can set $V_1 = V_1(T)$. Fix any value $x$ for $t_1$, where $x \in \{0,1\}$. Let $P_{|x}$ denote the distribution obtained by conditioning $P$ on $t_1 = x$. Similarly, if $Y$ is a function of $t_2, \ldots, t_M$, denote by $E_{|x}(Y)$ the expectation of $Y$ taken according to the distribution $P_{|x}$. Define $Y_x(t_2, \ldots, t_M)$, a function only over $t_2, \ldots, t_M$, as $Z(x, t_2, \ldots, t_M)$, and define $Z_x$, another function only over $t_2, \ldots, t_M$, as:
$Z_x(t_2, \ldots, t_M) = Y_x(t_2, \ldots, t_M) - E(Z \mid t_1 = x)$

Now the following two statements are true:

1. $E_{|x} Z_x = 0$. This is simply because
$E_{|x} Y_x(t_2, \ldots, t_M) = \sum_{t_2, \ldots, t_M} Y_x(t_2, \ldots, t_M) P_{|x}(t_2, \ldots, t_M) = \sum_{t_2, \ldots, t_M} Z(x, t_2, \ldots, t_M) P(t_2, \ldots, t_M \mid t_1 = x) = E(Z \mid t_1 = x) = E_{|x} E(Z \mid t_1 = x)$

2. $V_{Z_x}[P_{|x}] = V_{Y_x}[P_{|x}] \leq V_Z - V_1$. Here $V_{Z_x}$ is defined as in definitions (4) and (5), but using the probability distribution $P_{|x}$ instead of $P$. For the proof, note that $V_{Z_x}[P_{|x}] = V_{Y_x}[P_{|x}]$, as the function $Z_x$ is obtained by shifting $Y_x$ by a constant $E(Z \mid t_1 = x)$ independent of $t_2, \ldots, t_M$, and the property $V$ is invariant under shifting. So we only need to show that $V_{Y_x}[P_{|x}] \leq V_Z - V_1$. Let $T_x$ be any database instance that entails the assignment $t_1 = x$. It is easy to see from definition (3) that $C_i(x', T_x)[Y_x, P_{|x}] = C_i(x', T_x)[Z, P]$. Similarly, $V_i(T_x)[Y_x, P_{|x}] = \sum_{x' \in \{0,1\}} C_i^2(x', T_x) P_{|x}(t_i = x' \mid t_2, \ldots, t_{i-1}) = V_i(T_x)[Z, P]$. Thus, from definition (5), we have $V(T_x)[Y_x, P_{|x}] = \sum_{i=2}^{M} V_i(T_x)[Y_x, P_{|x}] = \sum_{i=2}^{M} V_i(T_x)[Z, P] = V(T_x)[Z, P] - V_1$. Further, $\max_{T_x} V(T_x)[Z, P] \leq V_Z$. Thus $V_{Z_x}[P_{|x}] \leq V_Z[P] - V_1$.

By applying the inductive hypothesis to $Z_x$ and the distribution $P_{|x}$ (we have fixed $t_1 = x$ and thus can remove it from $Tup$ and apply induction on the smaller set), we get $E_{|x}(e^{u Z_x}) \leq e^{u^2 V_{Z_x}[P_{|x}]} \leq e^{u^2 (V_Z[P] - V_1)}$.

On the other hand, by conditional expectations we get:
$E(e^{uZ}) = E_x E(e^{uZ} \mid t_1 = x) = E_x \Big[ \sum_{t_2, \ldots, t_M} e^{u Z(x, t_2, \ldots, t_M)} P_{|x}(t_2, \ldots, t_M) \Big] = E_x \Big[ \sum_{t_2, \ldots, t_M} e^{u Y_x(t_2, \ldots, t_M)} P_{|x}(t_2, \ldots, t_M) \Big] = E_x E_{|x} e^{u Y_x} = E_x \big( E_{|x} e^{u Z_x} \big) e^{u E(Z \mid t_1 = x)} \leq e^{u^2 (V_Z[P] - V_1)} \, E_x e^{u E(Z \mid t_1 = x)}$

Set $f(x) = u E(Z \mid t_1 = x)$. By the assumption on $u$, $f(x)$ satisfies the conditions of Proposition A.4, so
$E_x e^{u E(Z \mid t_1 = x)} \leq e^{E_x (u^2 C_1^2(x))} = e^{u^2 V_1}$,
completing the proof.
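For intuition, the tail bound of Lemma A.2 can be checked numerically in the simple tuple-independent case $X = \sum_i t_i$ with $t_i \sim \mathrm{Bernoulli}(q)$, where $C_X = 1$ and $V_X = Mq(1-q)$. The following small script (purely illustrative; all parameter choices are assumptions) compares the empirical tail with the bound $2e^{-\lambda/4}$.

```python
import math
import random

def tail_check(M=200, q=0.1, lam=8.0, trials=20000, seed=1):
    # X = sum of M independent Bernoulli(q) tuples: C_X = 1, V_X = M*q*(1-q).
    rng = random.Random(seed)
    V = M * q * (1 - q)
    thresh = math.sqrt(lam * V)          # the sqrt(lambda * V) deviation
    assert lam <= 4 * V                  # the condition lambda <= 4V/C^2
    exceed = sum(
        abs(sum(rng.random() < q for _ in range(M)) - M * q) >= thresh
        for _ in range(trials)
    )
    print(f"empirical tail {exceed / trials:.5f}  vs  "
          f"bound {2 * math.exp(-lam / 4):.5f}")

tail_check()
```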

A.2 Concentration for Polynomials

In this section, we complete the proof of Theorem A.1. Let $Y$ be a polynomial defined over the binary variables $t_i \in Tup$. Let $P : 2^{Tup} \to [0,1]$ be any PTLM probability distribution. For $x \in \{0,1\}$, denote by $Y_{t_i, x}$ the polynomial $Y(t_1, \ldots, t_{i-1}, t_i = x, \ldots, t_M)$. Then $Y_{t_i}$, the partial derivative of $Y$ defined earlier, is simply $Y_{t_i,1} - Y_{t_i,0}$. Recall the expressions $C_i(x, T)$ and $V(T)$ defined in the previous section. The martingale concentration result is stated in terms of these expressions. However, these expressions are difficult to analyze, and so we define upper bounds $W_i(T)$ and $W(T)$ as follows.

• $W_i(T) = E(Y_{t_i} \mid t_1, \ldots, t_{i-1})$. We show that $W_i(T)$ is an upper bound for $C_i(x, T)$.

Proof of $C_i(x, T) \leq W_i(T)$: We know that
$C_i(x, T) = |E(Y \mid t_1, \ldots, t_{i-1}, t_i = x) - E(Y \mid t_1, \ldots, t_{i-1})|$
$= P(t_i = 1 - x \mid t_1, \ldots, t_{i-1}) \, |E(Y \mid t_1, \ldots, t_{i-1}, t_i = 1) - E(Y \mid t_1, \ldots, t_{i-1}, t_i = 0)|$
$\leq |E(Y \mid t_1, \ldots, t_{i-1}, t_i = 1) - E(Y \mid t_1, \ldots, t_{i-1}, t_i = 0)|$
$= |E(Y_{t_i,1} \mid t_1, \ldots, t_{i-1}, t_i = 1) - E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 0)|$

Since $P$ is a PTLM distribution, we know it is CNA. Thus $E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 1) \leq E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 0)$. We can rewrite $E(Y_{t_i,1} \mid t_1, \ldots, t_{i-1}, t_i = 1) - E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 0)$ as:
$\big(E(Y_{t_i,1} \mid t_1, \ldots, t_{i-1}, t_i = 1) - E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 1)\big) + \big(E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 1) - E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 0)\big)$

Since $E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 1) - E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 0) \leq 0$, the above expression is at most $E(Y_{t_i,1} \mid t_1, \ldots, t_{i-1}, t_i = 1) - E(Y_{t_i,0} \mid t_1, \ldots, t_{i-1}, t_i = 1)$, which is equal to $E(Y_{t_i} \mid t_1, \ldots, t_{i-1}, t_i = 1)$. Further, $E(Y_{t_i} \mid t_1, \ldots, t_{i-1}, t_i = 1) \leq E(Y_{t_i} \mid t_1, \ldots, t_{i-1})$ as $P$ is CNA. Thus $C_i(x, T) \leq W_i(T)$.

• $W(T) = \sum_{i=1}^{M} W_i(T)$. We show that $W(T)$ upper-bounds $V(T)$, in the sense that $C_Y W(T) \geq V(T)$.

Proof of $V(T) \leq C_Y W(T)$: Recall that $V(T) = \sum_{i=1}^{M} V_i(T)$. Further,
$V_i(T) = \sum_{x \in \{0,1\}} C_i^2(x, T) P(t_i = x \mid t_1, \ldots, t_{i-1})$
$\leq C_Y \sum_{x \in \{0,1\}} C_i(x, T) P(t_i = x \mid t_1, \ldots, t_{i-1})$ (as $C_i(x, T) \leq C_Y$)
$\leq C_Y \sum_{x \in \{0,1\}} W_i(T) P(t_i = x \mid t_1, \ldots, t_{i-1}) = C_Y W_i(T)$

Thus $V(T) \leq \sum_{i=1}^{M} C_Y W_i(T) = C_Y W(T)$.

The following lemma shows how the partial derivative and expectation operations can be exchanged for PTLM distributions.

Lemma A.5. Let $Y$ be any polynomial and $P$ be any PTLM distribution. The following holds for all tuples $t_i$:
$\frac{\partial}{\partial t_i} E(Y \mid t_i) \leq E\Big(\frac{\partial Y}{\partial t_i}\Big)$

Proof: Let $Z(t_i) = E(Y \mid t_i)$. Then
$\frac{\partial}{\partial t_i} E(Y \mid t_i) = \frac{\partial Z}{\partial t_i} = Z(1) - Z(0)$
$= E(Y_{t_i,1} \mid t_i = 1) - E(Y_{t_i,0} \mid t_i = 0)$
$= E(Y_{t_i,1} \mid t_i = 1) - E(Y_{t_i,0} \mid t_i = 1) + E(Y_{t_i,0} \mid t_i = 1) - E(Y_{t_i,0} \mid t_i = 0)$
$\leq E(Y_{t_i,1} \mid t_i = 1) - E(Y_{t_i,0} \mid t_i = 1) = E(Y_{t_i} \mid t_i = 1)$
$\leq E(Y_{t_i}) = E\Big(\frac{\partial Y}{\partial t_i}\Big)$ (as $E(Y_{t_i} \mid t_i = 1) \leq E(Y_{t_i} \mid t_i = 0)$)

Next, we establish certain properties of $W_i(T)$ and $W(T)$ that are required for the proof of Theorem A.1.

Lemma A.6. For any $j = 0, 1, \ldots, k-1$ and any $i$, we have (i) $E_j(W_i) \leq E_{j+1}(Y)$, and (ii) $E_j(W) \leq k E_j(Y)$. Further, for any $i$ and $j \geq \tilde{k} - 1$, we have (iii) $M_j(W_i) \leq 1$, and (iv) $M_{j+1}(W) \leq k$.

Proof: For (i), denote $T_i = \{t_1, \ldots, t_{i-1}\}$. Thus,
$E_j(W_i) = \max_{|A| \geq j} E\Big[\frac{\partial}{\partial A} E(Y_{t_i} \mid t_1, \ldots, t_{i-1})\Big]$
$\leq \max_{|A| \geq j} E\Big[E\Big(\frac{\partial Y_{t_i}}{\partial A} \,\Big|\, T_i \setminus A\Big)\Big] = \max_{|A| \geq j} E\Big(\frac{\partial Y_{t_i}}{\partial A}\Big)$
$= \max_{|A| \geq j} E\Big(\frac{\partial Y}{\partial (A \cup \{t_i\})}\Big)$
$\leq \max_{|B| \geq j+1} E\Big(\frac{\partial Y}{\partial B}\Big) = E_{j+1}(Y)$,

which completes the proof of (i). The proof of (iii) follows the same line:
$M_j(W_i) = \max_{t_1, \ldots, t_{i-1}, \, |A| \geq j} \frac{\partial}{\partial A} E(Y_{t_i} \mid t_1, \ldots, t_{i-1})$
$\leq \max_{t_1, \ldots, t_{i-1}, \, |A| \geq j} E\Big(\frac{\partial Y_{t_i}}{\partial A} \,\Big|\, T_i \setminus A\Big)$
$= \max_{t_1, \ldots, t_{i-1}, \, |A| \geq j} E\Big(\frac{\partial Y}{\partial (A \cup \{t_i\})} \,\Big|\, T_i \setminus A\Big)$
$\leq \max_{t_1, \ldots, t_M, \, |A| \geq j} \frac{\partial Y}{\partial (A \cup \{t_i\})}$ (as an expectation is at most the maximum)
$\leq \max_{|B| \geq j+1} \frac{\partial Y}{\partial B} = M_{j+1}(Y)$.

Since $M_{j+1}(Y) \leq 1$ for all $j \geq \tilde{k} - 1$, the result of (iii) follows. For (ii) and (iv), we make arguments identical to those for (i) and (iii), with the additional observation that each monomial of $Y$ has at most $k$ (the degree of $Y$) non-vanishing first-order partial derivatives and hence can occur at most $k$ times in $W$. This completes the proof.

Proof of Theorem A.1: Given Lemma A.6, the proof of Theorem A.1 follows exactly as described in Sec. 5.2 of [19]. Moreover, since $Y$ is a multilinear polynomial, we obtain better values for the constants $c_k$ and $d_k$ than in [19].

B. PROVING THE EQUIVALENCE RESULT

B.1 Proof of Proposition 3.10

To prove Prop. 3.10, we define some notions of negative correlation that have been studied before in the literature; our proof will use these notions. We shall treat each tuple $t \in Tup$ as a boolean variable, which is true iff $t$ belongs to the actual database instance, and define boolean formulae over these variables. For example, the formula $(t_1 \vee t_2) \wedge \neg t_3$ states that either $t_1$ or $t_2$ belongs to the database and that $t_3$ does not belong to the database. We denote by $vars(b)$ the set of variables that occur in $b$. A literal is a boolean formula over a single variable. A formula $m$ is monotone if switching any variable $t$ from false to true never changes the value of $m$ from true to false; for example, the formula $m = t_1 \vee t_2$ is monotone. A formula $c$ is called a conjunct if it can be rewritten as a conjunction of literals. For example, $c_1 = t_1 \wedge \neg t_2 \wedge \neg t_3 \wedge t_4$ is a conjunct, but $c_2 = t_1 \vee t_2$ is not. A conjunct $c$ is called positive (negative) if it contains only positive (negative) literals. Let $P : 2^{Tup} \to [0, 1]$ be any probability distribution over possible databases. Recall that $P(t)$ denotes the marginal probability that $t$ occurs in the database, or equivalently that $t$ is true. For a given formula $b$, denote by $P(b)$ the probability that $b$ is true, and by $P(b_1 \mid b_2)$ the probability that $b_1$ is true conditioned on the event that $b_2$ is true. Denote by $P_{|b}$ the probability distribution obtained by conditioning on the boolean formula $b$, i.e. $P_{|b}(I) = P(I \mid b)$ for all instances $I$.

Definition B.1 ((Conditional) Negative Dependence). A probability distribution $P$ over possible database instances is said to be negatively dependent if for all tuples $t$ and $t'$, $P(t \mid t') \leq P(t)$. $P$ is said to be conditionally negatively dependent if $P_{|c}$ is negatively dependent for all conjuncts $c$.

Intuitively, negative dependence (ND for short) says that every pair of tuples $t, t'$ is in some sense negatively correlated. Conditional negative dependence (CND for short) says that $t, t'$ are negatively dependent even if we condition on any conjunct $c$; intuitively, $t, t'$ are negatively correlated under any possible assignment to the other tuples ($c$ represents this assignment). Next, we restate the class of CNA distributions defined in Sec. 3.3.

Definition B.2 ((Conditional) Negative Association). A probability distribution $P$ over possible database instances is said to be negatively associated if for all tuples $t$ and monotone formulae $m$ for which $t \notin vars(m)$, $P[t \mid m] \leq P[t]$. $P$ is said to be conditionally negatively associated if $P_{|c}$ is negatively associated for all conjuncts $c$.

Remark. It is easy to see that NA implies ND, as NA states that $t$ and $m$ are negatively correlated for any monotone formula $m$, which includes the special case $m = t'$. Similarly, CNA implies CND. On the other hand, it is easy to construct probability distributions that are CND but not CNA. However, for planar distributions (distributions for which the cardinality of the database is fixed) we can prove the equivalence shown in Lemma B.3, which establishes Proposition 3.10.

Lemma B.3 (Negative Correlation Equivalence). Let $P$ be a planar distribution. Then the following statements are equivalent: (i) $P$ is CNA, (ii) $P$ is CND, (iii) $P$ is PTLM.

Proof: We first show that (ii) ⇔ (iii). Let $P$ be any PTLM distribution. Fix any tuples $t$ and $t'$, and fix any conjunct $c$. We shall show $P(t \mid t', c) \leq P(t \mid c)$. Let $T_p$ ($T_n$) be the set of tuples that appear as positive (negative) literals in $c$. Define $Q = T_p \cup T_n \cup \{t, t'\}$, $S = T_p \cup \{t, t'\}$ and $S' = T_p \cup \{t\}$; thus $S' \subset S$. By recursive sub-modularity we have:
$\frac{P_Q(S)}{P_Q(S - t)} \leq \frac{P_Q(S')}{P_Q(S' - t)} \iff \frac{P_Q(S)}{P_Q(S) + P_Q(S - t)} \leq \frac{P_Q(S')}{P_Q(S') + P_Q(S' - t)}$

Now $\frac{P_Q(S)}{P_Q(S) + P_Q(S - t)}$ is exactly the probability of $t$ conditioned on the event that all the tuples in $S - t$ are true and the tuples in $Q \setminus S$ are false. As $T_p \cup \{t'\} = S - t$ and $T_n = Q \setminus S$, the probability $\frac{P_Q(S)}{P_Q(S) + P_Q(S - t)}$ is just $P(t \mid t', c)$. Similarly, $\frac{P_Q(S')}{P_Q(S') + P_Q(S' - t)}$ is just $P(t \mid \neg t', c)$. Thus, the recursive sub-modularity condition is equivalent to the condition $P(t \mid t', c) \leq P(t \mid \neg t', c)$, which is itself equivalent to $P(t \mid t', c) \leq P(t \mid c)$. This completes the proof of (ii) ⇔ (iii).

Proof of (i) ⇒ (ii): This is trivial from the two definitions.

Proof of (ii) ⇒ (i): We first show that if $P$ is planar CND then $P$ is NA (i.e. $P(t \mid m) \leq P(t)$). The proof is similar to that of Lemma 3.2 in [9]; we restate it in our notation. The argument is by induction on the database size (which is fixed for a planar distribution $P$). The case of size 1 is easy to verify (and this is also the only point where the monotonicity of $m$ is used). For the inductive step, note that $P(m \mid t) = P(t' \mid t)P(m \mid t \wedge t') + P(\neg t' \mid t)P(m \mid t \wedge \neg t')$, and $P(m) = P(t')P(m \mid t') + P(\neg t')P(m \mid \neg t')$. Note further that (a) $P(t' \mid t) \leq P(t')$ from the fact that $P$ is CND, (b) $P(m \mid t \wedge t') \leq P(m \mid t')$ by applying the inductive hypothesis to the distribution obtained by conditioning $P$ on $t' = 1$, and (c) $P(m \mid t \wedge \neg t') \leq P(m \mid \neg t')$ by applying the inductive hypothesis to the distribution obtained by conditioning $P$ on $t' = 0$. If, in addition, it were the case that (d) $P(m \mid t \wedge t') \geq P(m \mid t \wedge \neg t')$, then the lemma would follow by averaging principles: $P(m \mid t) \leq P(t')P(m \mid t \wedge t') + P(\neg t')P(m \mid t \wedge \neg t')$ by (a) and (d), which is $\leq P(t')P(m \mid t') + P(\neg t')P(m \mid \neg t') = P(m)$ by (b) and (c). We argue that there always exists some tuple $t'$ such that (d) holds. In particular, note that $\sum_{t' \neq t} P[t' \mid m \wedge t] = n - 1 = \sum_{t' \neq t} P[t' \mid t]$, where $n$ is the fixed database size. Hence for some $t'$, $P(t' \mid m \wedge t) \geq P(t' \mid t)$ (with $P(t' \mid t) > 0$), which is equivalent to $P(m \mid t \wedge t') \geq P(m \mid t)$, which is finally equivalent to (d). This shows that if $P$ is planar CND then $P$ is NA.

The proof of (ii) ⇒ (i) is now immediate: if $P$ is planar CND then so is $P_{|c}$ for any conjunct $c$, and thus $P_{|c}$ is also NA. Thus $P$ is CNA. Having shown (i) ⇔ (ii) and (ii) ⇔ (iii), the proof of Lemma B.3 is complete.
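To build intuition for these definitions, the following brute-force checker (purely illustrative, not part of the formal development; the four-tuple domain is an assumption) tests negative dependence for a small planar distribution given as an explicit table. Extending the `cond` argument to arbitrary conjuncts yields a CND check; the uniform distribution over fixed-size subsets (sampling without replacement) is a standard example that passes.

```python
import itertools

TUP = ['t1', 't2', 't3', 't4']

def cond_prob(P, t, cond=lambda I: True):
    # P(t | cond): P is a dict mapping frozensets of tuples to probabilities.
    den = sum(p for I, p in P.items() if cond(I))
    num = sum(p for I, p in P.items() if t in I and cond(I))
    return num / den if den > 0 else None

def is_neg_dependent(P):
    # ND (Definition B.1): P(t | t') <= P(t) for all pairs of tuples t != t'.
    for t, t2 in itertools.permutations(TUP, 2):
        pt = cond_prob(P, t)
        pt_given = cond_prob(P, t, cond=lambda I: t2 in I)
        if pt_given is not None and pt_given > pt + 1e-12:
            return False
    return True

# The uniform planar distribution over all 2-subsets of TUP.
P_uniform = {frozenset(S): 1 / 6 for S in itertools.combinations(TUP, 2)}
print(is_neg_dependent(P_uniform))   # True: P(t|t') = 1/3 <= P(t) = 1/2
```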

B.2 Proof of Proposition 3.11

To prove Proposition 3.11, we will use the notion of balanced distributions, defined below. Let $S_t$ be any set of database instances, all of which contain a common tuple $t$. Denote by $nbrs_t(I)$ the set of instances $I'$ obtained by replacing tuple $t$ in $I$ with some tuple $t' \notin I$, and let $nbrs_t(S_t) = \cup_{I \in S_t} nbrs_t(I)$.

Definition B.4 ((Conditionally) balanced). A probability distribution $P$ over database instances is balanced if for every set of instances $S_t$ that contain a common tuple $t$ the following holds:
$\frac{\sum_{I \in S_t} P[I]}{\sum_{I' \in nbrs_t(S_t)} P[I']} \leq \frac{P(t)}{1 - P(t)}$

$P$ is conditionally balanced if $P_{|c}$ is balanced for all positive conjuncts $c$. Next we show the following lemma relating CNA and conditionally balanced distributions.

Lemma B.5. Let $P$ be any planar distribution. Then the following statements are equivalent: (i) $P$ is CNA, (ii) $P$ is conditionally balanced.

Proof of (i) ⇒ (ii): Let $P$ be any planar CNA distribution. We first show that $P$ is balanced. Let $S_t$ be any set of instances that contain the common tuple $t$. Let $I \in S_t$ be any instance from the set $S_t$. Denote by $c_I$ the conjunction of all tuples in $I$ except $t$, i.e. $c_I = \wedge_{t' \in I, t' \neq t} \, t'$. Denote by $m$ the monotone formula obtained as the disjunction of the conjuncts $c_I$ for each $I \in S_t$, i.e. $m = \vee_{I \in S_t} c_I$. Clearly $t \notin vars(m)$.

Now, $P(t \wedge m) = \sum_{I \in S_t} P(t \wedge c_I) = \sum_{I \in S_t} P[I]$. Similarly, $P(\neg t \wedge m) = \sum_{I' \in nbrs_t(S_t)} P[I']$. Thus,
$\frac{\sum_{I \in S_t} P[I]}{\sum_{I' \in nbrs_t(S_t)} P[I']} = \frac{P(t \wedge m)}{P(\neg t \wedge m)} = \frac{P(t \mid m)}{P(\neg t \mid m)} \leq \frac{P(t)}{1 - P(t)}$,
where the last inequality follows from the CNA property: we know by CNA that $P(t \mid m) \leq P(t)$, which means $P(\neg t \mid m) \geq 1 - P(t)$ and thus $\frac{P(t \mid m)}{P(\neg t \mid m)} \leq \frac{P(t)}{1 - P(t)}$. This shows that if $P$ is planar CNA then $P$ is balanced. To complete the proof of (i) ⇒ (ii), we note that if $P$ is planar CNA then so is $P_{|c}$ for any conjunct $c$, and thus $P_{|c}$ is also balanced. Thus $P$ is conditionally balanced, which completes the proof.

Proof of (ii) ⇒ (i): Let $P$ be any conditionally balanced distribution. We first show that $P$ is NA. Let $t$ be any tuple and $m$ any monotone formula s.t. $t \notin vars(m)$. We need to show that $P(t \mid m) \leq P(t)$. Let $S_t$ be the set of all instances which contain $t$ and make $m$ true. It is easy to see that $P(t \wedge m) = \sum_{I \in S_t} P[I]$. Moreover, as $m$ is a monotone formula not dependent on $t$, $m$ remains true on all instances $I' \in nbrs_t(S_t)$. Further, any instance that makes $m$ true and does not contain $t$ lies in $nbrs_t(S_t)$. Thus $P(\neg t \wedge m) = \sum_{I' \in nbrs_t(S_t)} P[I']$, and
$\frac{P(t \mid m)}{P(\neg t \mid m)} = \frac{P(t \wedge m)}{P(\neg t \wedge m)} = \frac{\sum_{I \in S_t} P[I]}{\sum_{I' \in nbrs_t(S_t)} P[I']} \leq \frac{P(t)}{1 - P(t)}$,
where the last inequality follows from the fact that $P$ is balanced. Thus $\frac{P(t \mid m)}{P(\neg t \mid m)} \leq \frac{P(t)}{1 - P(t)}$, which implies that $P(t \mid m) \leq P(t)$. Thus $P$ is NA. To complete the proof of (ii) ⇒ (i), we note that if $P$ is conditionally balanced then so is $P_{|c}$ for any conjunct $c$, and thus $P_{|c}$ is also NA. Thus $P$ is CNA, which completes the proof.
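The balanced condition of Definition B.4 can likewise be verified exhaustively on tiny planar distributions. The sketch below (illustrative only; the domain and distribution are assumptions) enumerates every set $S_t$ of instances containing a fixed tuple $t$ and checks the defining inequality; by Lemma B.5, it succeeds exactly because the uniform 2-subset distribution is CNA.

```python
import itertools

TUP = ['t1', 't2', 't3', 't4']
P = {frozenset(S): 1 / 6 for S in itertools.combinations(TUP, 2)}

def nbrs_t(S_t, t):
    # Instances obtained by replacing t in some I in S_t by a tuple not in I.
    out = set()
    for I in S_t:
        for tb in TUP:
            if tb not in I:
                out.add((I - {t}) | {tb})
    return out

t = 't1'
Pt = sum(p for I, p in P.items() if t in I)        # = 1/2 here
containing_t = [I for I in P if t in I]
for r in range(1, len(containing_t) + 1):
    for combo in itertools.combinations(containing_t, r):
        S_t = set(combo)
        lhs = sum(P[I] for I in S_t) / sum(P[J] for J in nbrs_t(S_t, t))
        assert lhs <= Pt / (1 - Pt) + 1e-9
print("balanced condition holds for every S_t")
```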

Next we state and prove the following lemma. We denote by $Pr(O \mid I)$ the probability that the algorithm publishes an output $O$ on input $I$; this probability is based only on the random coin flips of the algorithm. We use $P$, on the other hand, to represent the prior distribution (over database instances) of the adversary. For an instance $I$, $P(I)$ represents the probability of instance $I$; $P(I)$ is independent of the coin flips (the randomness) of the algorithm.

Lemma B.6. Let $J_t$ be any set of instances that contain $t$ and let $J = J_t \cup nbrs_t(J_t)$. Let $A$ be $\epsilon$-indistinguishable w.r.t. $J$, meaning that for any $I, I' \in J$ such that $I \in nbrs(I')$, $Pr(O \mid I) \leq e^\epsilon Pr(O \mid I')$. If $P$ is CNA, then
$\frac{\sum_{I \in J_t} Pr(O \mid I) P(I)}{\sum_{I' \in nbrs_t(J_t)} Pr(O \mid I') P(I')} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$

Proof. Denote $O(I)$ as shorthand for $Pr(O \mid I)$, the probability that $A$ outputs $O$ on instance $I$. Let $l$ denote the number of distinct non-zero values of $O(I)$ on the set of instances $I \in J_t$. Our proof uses induction on $l$. The base case is $l = 1$. Let the non-zero value be $o$. Let $S_t \subseteq J_t$ be the set of all instances $I$ such that $O(I) = o$ (since $l = 1$, instances $I \in J_t \setminus S_t$ have $O(I) = 0$). By the $\epsilon$-indistinguishability condition, we know that $O(I') \geq o/e^\epsilon$ for any $I' \in nbrs_t(S_t)$. Further, since $P$ is CNA, by Lemma B.5 we know that $P$ is conditionally balanced. By the definition of conditionally balanced, we know that
$\frac{\sum_{I \in S_t} P[I]}{\sum_{I' \in nbrs_t(S_t)} P[I']} \leq \frac{P(t)}{1 - P(t)}$ (6)

Now,
$\frac{\sum_{I \in J_t} O(I) P(I)}{\sum_{I' \in nbrs_t(J_t)} O(I') P(I')} = \frac{\sum_{I \in S_t} O(I) P(I)}{\sum_{I' \in nbrs_t(J_t)} O(I') P(I')} \leq \frac{\sum_{I \in S_t} O(I) P(I)}{\sum_{I' \in nbrs_t(S_t)} O(I') P(I')}$

And,
$\frac{\sum_{I \in S_t} O(I) P(I)}{\sum_{I' \in nbrs_t(S_t)} O(I') P(I')} \leq \frac{\sum_{I \in S_t} o \, P[I]}{\sum_{I' \in nbrs_t(S_t)} \frac{o}{e^\epsilon} P[I']} = e^\epsilon \frac{\sum_{I \in S_t} P[I]}{\sum_{I' \in nbrs_t(S_t)} P[I']} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$ (follows from (6))

Thus,
$\frac{\sum_{I \in J_t} O(I) P(I)}{\sum_{I' \in nbrs_t(J_t)} O(I') P(I')} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$

which proves the base case. The inductive argument has a similar flavor. Denote by IH($l$) the inductive hypothesis that the required property holds when $O(I)$ has at most $l$ distinct non-zero values on the set $J_t$. For induction, we shall show IH($l+1$). Let $o_1 = \max_{I \in J_t} O(I)$. Denote by $S_t \subseteq J_t$ the set of instances such that $O(I) = o_1$; by definition of $o_1$, $S_t$ is non-empty. Further, denote by $S_t' \subseteq J_t$ the set of instances that have $O(I) \neq o_1$. Let $o_2 = \max_{I \in S_t'} O(I)$. By the $\epsilon$-indistinguishability condition, we know that $O(I') \geq o_1/e^\epsilon$ for any $I' \in nbrs_t(S_t)$. We write:
$\frac{\sum_{I \in J_t} O(I) P[I]}{\sum_{I \in nbrs_t(J_t)} O(I) P[I]} \leq \frac{\sum_{I \in S_t} o_1 P[I] + \sum_{I' \in S_t'} O(I') P[I']}{\sum_{I \in nbrs_t(S_t)} \frac{o_1}{e^\epsilon} P[I] + \sum_{I' \in nbrs_t(S_t') \setminus nbrs_t(S_t)} O(I') P[I']}$
$\leq \frac{\sum_{I \in S_t} (o_1 - o_2) P[I] + \sum_{I' \in J_t} O'(I') P[I']}{\sum_{I \in nbrs_t(S_t)} \frac{o_1 - o_2}{e^\epsilon} P[I] + \sum_{I' \in nbrs_t(J_t)} O'(I') P[I']} = \frac{A + B}{C + D}$ (say)

In the last expression, $O'$ is defined as follows: $O'(I) = O(I)$ if $I \notin S_t \cup nbrs_t(S_t)$, $O'(I) = o_2$ if $I \in S_t$, and $O'(I) = o_2/e^\epsilon$ if $I \in nbrs_t(S_t)$. It is easy to check that IH($l$) holds for $O'$. Thus the ratio $\frac{B}{D} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$. Further, $\frac{A}{C} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$ as $P$ is conditionally balanced. Thus $\frac{A + B}{C + D} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$, completing the proof.

Finally, we can show Proposition 3.11. We first restate it.

Proposition B.7 (Proposition 3.11). Let $A$ be any algorithm and $\epsilon > 0$ be any parameter. Then the following statements are equivalent: (i) $A$ is $\epsilon$-indistinguishable, (ii) $A$ is $\epsilon$-adversarially private w.r.t. planar CNA adversaries.

Proof: To show (ii) ⇒ (i), let us assume $A$ is $\epsilon$-adversarially private w.r.t. planar CNA adversaries. We shall show that $A$ is also $\epsilon$-indistinguishable. Fix any pair of instances $I, I'$ s.t. $I' \in nbrs(I)$, and let $O$ be any output of $A$. We shall show that $\frac{Pr[O \mid I]}{Pr[O \mid I']} \leq e^\epsilon$; the symmetric direction follows by exchanging the roles of $I$ and $I'$. Let $t$ be the tuple such that $t \in I \setminus I'$.

Consider the distribution $P$ s.t. $P(I) = \delta$ and $P(I') = 1 - \delta$. It is easy to see that $P$ is planar CNA. Further, the adversarial privacy requirement implies that $\frac{P[t \mid O]}{P[t]} \leq e^\epsilon$. Denote by $E_t$ the event $t \wedge O$, i.e. $E_t$ is true when $t$ occurs in the original database instance and $O$ is published, and by $E_{\neg t}$ the event $\neg t \wedge O$. Thus $Pr[E_t] = \sum_{I : t \in I} Pr(O \mid I) P(I)$ and $Pr[E_{\neg t}] = \sum_{I : t \notin I} Pr(O \mid I) P(I)$. We know by Bayes' rule that
$P[t \mid O] = \frac{Pr[E_t]}{Pr[E_t] + Pr[E_{\neg t}]}$

Since $P(I) = \delta$, $P(I') = 1 - \delta$, and $P(I_0) = 0$ for all other instances $I_0$, we get $Pr[E_t] = \delta \, Pr[O \mid I]$ and $Pr[E_{\neg t}] = (1 - \delta) Pr[O \mid I']$. Also $P[t] = \sum_{I : t \in I} P(I) = \delta$. From the requirement of adversarial privacy, we know that $\frac{P(t \mid O)}{P[t]} \leq e^\epsilon$. Substituting the expression for $P(t \mid O)$ and $P[t] = \delta$, we get
$\frac{Pr(O \mid I)}{\delta \, Pr(O \mid I) + (1 - \delta) Pr(O \mid I')} \leq e^\epsilon$

This gives
$\frac{Pr(O \mid I)}{Pr(O \mid I')} \leq \frac{(1 - \delta) e^\epsilon}{1 - \delta e^\epsilon}$

Note that $\lim_{\delta \to 0} \frac{(1 - \delta) e^\epsilon}{1 - \delta e^\epsilon} = e^\epsilon$. Since $\delta$ can be arbitrarily small, we get $\frac{Pr(O \mid I)}{Pr(O \mid I')} \leq e^\epsilon$. This proves the first part.

For the proof of (i) ⇒ (ii), let us assume that $A$ is $\epsilon$-indistinguishable. We shall show that $A$ is also $\epsilon$-adversarially private w.r.t. planar CNA adversaries. Fix any distribution $P$ that belongs to the planar CNA class. Let $t$ be any tuple and $O$ any output of the algorithm $A$. We shall show that $\frac{P[t \mid O]}{P[t]} \leq e^\epsilon$. Recall the events $E_t = t \wedge O$ and $E_{\neg t} = \neg t \wedge O$, and that $P[t \mid O] = \frac{P[E_t]}{P[E_t] + P[E_{\neg t}]}$. Let $I_t$ be the set of instances that contain $t$. Note that $P[E_t] = \sum_{I \in I_t} Pr(O \mid I) P(I)$ and $P[E_{\neg t}] = \sum_{I \in nbrs_t(I_t)} Pr(O \mid I) P(I)$. We can then apply Lemma B.6 to show that:
$\frac{P[E_t]}{P[E_{\neg t}]} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$

which implies $P[t \mid O] \leq e^\epsilon P(t)$, completing the proof.

B.3 Proof of Theorem 3.12

We first restate it below.

Theorem B.8 (Theorem 3.12). Let $P$ be any adversary. Suppose there exists a tuple $t$ and a monotone formula $m$ such that $t \notin vars(m)$ and $P[t \wedge m] \geq P[t]P[m] + P[t]^2$. Then there exists an $\epsilon > 0$ and an $\epsilon$-indistinguishable algorithm that is not $\epsilon$-adversarially private w.r.t. $P$.

Proof: Denote $a = P[t \wedge m]$, $b = P[\neg t \wedge m]$, $c = P[t \wedge \neg m]$, and $d = P[\neg t \wedge \neg m]$. It is easy to see that $P[t \wedge m] - P[t]P[m] = ad - bc$ and $P[t] = a + c$ (just expand each probability in terms of $a$, $b$, $c$, and $d$). Thus the hypothesis $P[t \wedge m] \geq P[t]P[m] + P[t]^2$ reduces to the condition
$ad - bc \geq (a + c)^2$ (7)

Also note that by definition $a + b + c + d = 1$. Now, for any given $\epsilon$, construct the following algorithm $A_\epsilon$: $A_\epsilon$ publishes either the output $O$ or the output $O'$, and the probability that $A_\epsilon$ publishes $O$ is chosen as:
$Pr(O \mid I) = 1/2$ if $t \wedge m$ is true in $I$; $\;(1/2)e^{-\epsilon}$ if $\neg t \wedge m$ is true in $I$; $\;(1/2)e^{-\epsilon}$ if $t \wedge \neg m$ is true in $I$; $\;(1/2)e^{-2\epsilon}$ if $\neg t \wedge \neg m$ is true in $I$.

Then $A_\epsilon$ is $\epsilon$-indistinguishable. In proof, assume the contrary. Then there exist two instances $I$ and $I'$ differing in exactly one tuple such that either $Pr(O \mid I) > e^\epsilon Pr(O \mid I')$ or $Pr(O' \mid I) > e^\epsilon Pr(O' \mid I')$. This can only happen if $t \wedge m$ is true in $I$ and $\neg t \wedge \neg m$ is true in $I'$ (otherwise it is easy to check from the definition of $A_\epsilon$ above that $Pr(O \mid I) \leq e^\epsilon Pr(O \mid I')$ and $Pr(O' \mid I) \leq e^\epsilon Pr(O' \mid I')$). Since $m$ is monotone and $m$ does not depend on $t$, there must be at least one tuple $t' \neq t$ that is in $I$ but not in $I'$ (otherwise $m$ cannot change from true to false between $I$ and $I'$). Thus both $t$ and $t'$ occur in $I$ but do not occur in $I'$, and hence $I$ and $I'$ disagree on at least two tuples. This contradicts our initial assumption that $I$ and $I'$ disagree on exactly one tuple, which proves that $A_\epsilon$ is $\epsilon$-indistinguishable.

Recall that $E_t$ is the event $t \wedge O$, i.e. $E_t$ is true when $t$ occurs in the original database instance and $O$ is published; also recall $E_{\neg t}$, the event $\neg t \wedge O$. Then $Pr(E_t) = \sum_{I : t \wedge m} P[I] Pr(O \mid I) + \sum_{I' : t \wedge \neg m} P[I'] Pr(O \mid I')$, which simplifies to $a(1/2) + c(1/2)e^{-\epsilon}$. Similarly, $Pr(E_{\neg t}) = b(1/2)e^{-\epsilon} + d(1/2)e^{-2\epsilon}$. Now
$P[t \mid O] = \frac{Pr(E_t)}{Pr(E_t) + Pr(E_{\neg t})} = \frac{a + c e^{-\epsilon}}{a + (b + c)e^{-\epsilon} + d e^{-2\epsilon}}$

Let $x = e^\epsilon$. Then we get that $P[t \mid O] = \frac{ax^2 + cx}{ax^2 + (b+c)x + d}$. We will now show that there exists an $\epsilon > 0$ such that $P[t \mid O]/e^\epsilon > P[t]$. In other words, we will show that there exists an $x > 1$ for which the following holds (as $P[t] = a + c$ and $e^\epsilon = x$):
$\frac{ax + c}{ax^2 + (b + c)x + d} > a + c$

Denote $f(x) = \frac{ax + c}{ax^2 + (b+c)x + d}$. Note that $f(1) = a + c$, as $a + b + c + d = 1$. Further, the derivative satisfies $f'(1) = ad - bc - (a + c)^2$. By condition (7), we know that $ad - bc \geq (a + c)^2$, and thus $f'(1) \geq 0$. Thus $f$ is increasing at 1, and hence there exists $x > 1$ s.t. $f(x) > f(1)$. For that $x$, we get $\frac{ax + c}{ax^2 + (b+c)x + d} > a + c$. This shows that there exists an $\epsilon > 0$ such that $P[t \mid O]/e^\epsilon > P[t]$, which gives $P[t \mid O] > e^\epsilon P[t]$, implying that $A_\epsilon$ is not $\epsilon$-adversarially private w.r.t. $P$ even though, as shown above, $A_\epsilon$ is $\epsilon$-indistinguishable. This completes the proof.
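The construction in this proof is easy to check numerically. The script below (purely illustrative; the specific probabilities are assumptions chosen to satisfy condition (7)) evaluates the posterior $P[t \mid O]$ of the $\epsilon$-indistinguishable algorithm $A_\epsilon$ for several $\epsilon$, confirming that adversarial privacy is violated for small $\epsilon$.

```python
import math

# Adversary with positive t-m correlation: a = P[t^m], b = P[~t^m],
# c = P[t^~m], d = P[~t^~m].
a, b, c, d = 0.30, 0.20, 0.00, 0.50
assert abs(a + b + c + d - 1) < 1e-12
assert a * d - b * c >= (a + c) ** 2        # hypothesis (7)

Pt = a + c
for eps in (0.05, 0.1, 0.25, 0.5, 1.0):
    x = math.exp(eps)
    post = (a * x**2 + c * x) / (a * x**2 + (b + c) * x + d)  # P[t | O]
    flag = "violates adversarial privacy" if post > x * Pt else ""
    print(f"eps={eps:<5} P[t|O]={post:.4f}  e^eps*P[t]={x * Pt:.4f}  {flag}")
```

Running this shows the violation for $\epsilon$ up to roughly 0.5, after which the noise of $A_\epsilon$ dominates the correlation advantage.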

B.4 Proof of Theorem 4.3

We first restate it below.

Theorem B.9 (Theorem 4.3). Let $0 < r < 1$ be any number and let $A$ be $(l, \epsilon')$-relaxed indistinguishable with $l = \frac{1-\delta}{\delta}(1 - r) - 1$ and $e^{\epsilon'} = r e^\epsilon$. Then $A$ is also $\epsilon$-adversarially private w.r.t. $(\delta, k)$-bounded PTLM adversaries.

Proof of Theorem 4.3: Let $l = \frac{1-\delta}{\delta}(1 - r) - 1$ and $e^{\epsilon'} = r e^\epsilon$. Let $A$ be any $(l, \epsilon')$-relaxed indistinguishable algorithm. Let $P$ be any $(\delta, k)$-bounded PTLM adversary (here $k$ can be any number). Fix $O$ as some output of algorithm $A$ and $t$ as some tuple. To show the theorem, we need to show that $P(t \mid O) \leq e^\epsilon P(t)$. Denote $O(I)$ as shorthand for $Pr(O \mid I)$, the probability that $A$ outputs $O$ on instance $I$. For a set of instances $S$, denote $P(S) = \sum_{I \in S} P(I)$. We first show that for any set $S$ of instances that all contain the tuple $t$ and have $O(I) = o$ (i.e. $S = \{I : t \in I \wedge O(I) = o\}$), the following holds:
$\frac{\sum_{I \in S} o \cdot P[I]}{\sum_{I' \in nbrs_t(S)} O[I'] P[I']} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$ (8)

Let $I \in S$ be any instance from the set $S$. Denote by $c_I$ the conjunction of all tuples in $I$ except $t$, i.e. $c_I = \wedge_{t' \in I, t' \neq t} \, t'$. Denote by $m$ the monotone formula obtained as the disjunction of the conjuncts $c_I$ for each $I \in S$, i.e. $m = \vee_{I \in S} c_I$. Clearly $t \notin vars(m)$. Now:
$\frac{\sum_{I \in S} o \cdot P[I]}{\sum_{I' \in nbrs_t(S)} O[I'] P[I']} = \frac{\sum_{I \in S} o \cdot P[I]/P(m)}{\sum_{I' \in nbrs_t(S)} O[I'] P[I']/P(m)} = \frac{N}{D}$ (say)

Note that the numerator in the above expression is $N = o \frac{\sum_{I \in S} P[I]}{P(m)} = o \frac{P(t \wedge m)}{P(m)} = o \, P(t \mid m)$. This is because $t \wedge m$ is exactly the event that the database instance is one from the set $S$. Since $P$ is PTLM, we know from Lemma B.3 that $P$ is also CNA. Thus $P(t \mid m) \leq P(t)$ (as $t \notin vars(m)$). Further, we know that $P(t) \leq \delta$ as $P$ is $\delta$-bounded. Thus the numerator satisfies $N \leq o\delta$. Next we show that the denominator satisfies $D \geq \frac{o(1-\delta)}{e^\epsilon}$. If we show this, the proof of Eq. (8) will be complete ($N/D$ will then be at most $e^\epsilon \frac{\delta}{1-\delta}$).

Now $D = \sum_{I' \in nbrs_t(S)} O[I'] P[I'] / P(m)$. Let $Bad(t)$ be the set of $\epsilon'$-bad tuples for $t$. Denote by $nbrs_{t,t_b}(S)$ the set of instances $\{I(t, t_b) \mid I \in S\}$. We shall bound $P(nbrs_{t,t_b}(S))$. Denote by $m_{t_b}$ the monotone formula $\vee_{I \in S, t_b \notin I} \, c_I$; thus $P(m_{t_b}) \leq P(m)$. Now
$P(nbrs_{t,t_b}(S) \mid m_{t_b}) = P(t_b \mid m_{t_b}) \leq P(t_b) \leq \delta$ (9)

Thus,
$\frac{P(nbrs_{t,t_b}(S))}{P(m)} \leq \frac{P(nbrs_{t,t_b}(S))}{P(m_{t_b})} \leq \delta$ (10)

Here the last inequality follows from Eq. (9). Now we know that $\frac{P(nbrs_t(S))}{P(m)} \geq 1 - \delta$, and $nbrs_t(S) = \cup_{t'} nbrs_{t,t'}(S)$. Thus
$\frac{\sum_{t' \notin Bad(t)} P(nbrs_{t,t'}(S))}{P(m)} = \frac{P(nbrs_t(S))}{P(m)} - \frac{\sum_{t' \in Bad(t)} P(nbrs_{t,t'}(S))}{P(m)} \geq (1 - \delta) - |Bad(t)| \, \delta \geq 1 - (l + 1)\delta$

For tuples $t' \notin Bad(t)$, relaxed indistinguishability gives $O[I'] \geq o e^{-\epsilon'}$ for every $I' \in nbrs_{t,t'}(S)$. Thus $D$ is at least $o e^{-\epsilon'} (1 - (l + 1)\delta)$, where $l \leq \frac{1-\delta}{\delta}(1 - r) - 1$ and $e^{\epsilon'} = r e^\epsilon$. Substituting these values yields $D \geq \frac{o(1-\delta)}{e^\epsilon}$, completing the proof of Eq. (8). The proof of the theorem now follows from the same inductive argument on the number of distinct values of $O(I)$ as in Lemma B.6.

C. PROVING THE PRIVACY THEOREM

C.1 Proof of Proposition C.3

The proof will use the concentration results of Appendix A. First observe that every counting CQ($\neq$) query $q$ without projection can be rewritten as a multilinear polynomial $Y_q$ over the tuples in $Tup$. To construct $Y_q$ from $q$, consider any assignment $a : vars(q) \to D$ of the variables in $q$; in other words, given any variable $x \in vars(q)$, $a$ assigns it the constant $a(x)$. Let $t_1, t_2, \ldots, t_g$ (not necessarily distinct) be the tuples obtained by grounding the $g$ subgoals of $q$, substituting each $x \in vars(q)$ by $a(x)$. We call $\{t_1, \ldots, t_g\}$ the ground set corresponding to $a$, and denote it $ground(a)$. If all the tuples in $ground(a)$ are true (i.e. they belong to the database), then the assignment $a$ contributes exactly once (as there is no projection) to the answer of $q$. This is simulated in $Y_q$ as follows: $Y_q$ has a term of the form $t_1 t_2 \cdots t_g$, which contributes one to the value of $Y_q$. It is easy to see that the degree of $Y_q$ is at most the number of subgoals of $q$. Further, the number of terms is at most $m^{|vars(q)|}$, and the total number of tuples $M$ in $Y_q$ is at most $g \, m^{|vars(q)|}$.

Now consider a CQ($\neq$) query $q$ that has some projections, i.e. some variables in $vars(q)$ do not appear in the head of $q$. Denote by $q_{np}$ the query obtained by adding all the variables of $vars(q)$ to the head of $q$; thus $q_{np}$ is the query $q$ without any projections. It is easy to see that $LS_q(I) \leq LS_{q_{np}}(I)$ for all database instances $I$. Thus, to show that the local sensitivity of $q$ is bounded, it is sufficient to show that the local sensitivity of $q_{np}$ is bounded. For the purposes of this section, we therefore consider only queries without projections. For such a query $q$, we denote by $Y_q$ the polynomial corresponding to $q$. Denote by $B_i$ the $i$th Bell number. We next show the following lemma.

Lemma C.1. Let $\delta = \log m/m$ and $k = \log m$. Let $q$ be any stable query with $g$ subgoals and $P$ be any $(\delta, k)$-bounded PTLM distribution. Then $E(Y_q) \leq B_{g+1} \log^g m$.

Proof: First we show the result for $k = 0$ and a tuple-independent distribution $P$. Denote $v = |vars(q)|$. Consider all possible assignments to $vars(q)$; there are at most $m^v$ of them. Consider those assignments $a$ that have $|ground(a)| = g$, i.e. $a$ does not unify any subgoals of $q$. For such an assignment $a$ to contribute to the query result, all $g$ tuples $t_1, \ldots, t_g \in ground(a)$ need to be true. Thus, the expected number of such assignments in the query result is less than $m^v (\frac{\log m}{m})^g = \log^g m \cdot m^{v-g} \leq \log^g m$.

Now consider those assignments that unify one or more subgoals of $q$. Consider any partitioning $\mathcal{P}$ of the $g$ subgoals into $i$ partitions. Let $h$ be the most general homomorphism that unifies the subgoals of $q$ to create the $i$ distinct subgoals corresponding to the partitioning $\mathcal{P}$. Thus, for a query $q(x, y, z) :- R(x, y), R(z, a)$ and the partitioning which places $R(x, y)$ and $R(z, a)$ in the same partition, the most general homomorphism would be $x = z$ and $y = a$. Denote by $q_h$ the query obtained by substituting the variables of $q$ by their $h$-images; thus $q_h(z) :- R(z, a)$ is the $q_h$ for the above example. Since $h$ is the most general homomorphism, any assignment of $vars(q)$ that results in the partitioning $\mathcal{P}$ corresponds to exactly one assignment of $vars(q_h)$. In the above example, the assignment $x = z = b$ and $y = a$ to the variables $x, y, z$ of $q$ corresponds to the assignment $z = b$ to the variables of $q_h$. Since $q$ is stable, it is also dense, and thus $q_h$ has density at most 1. Thus the expected number of assignments of $vars(q_h)$ in the result of $q_h$ is at most $\log^g m$. This means that the expected number of assignments of $vars(q)$ resulting in the partitioning $\mathcal{P}$ is at most $\log^g m$. Since the number of partitionings is the Bell number $B_g$, we get $E(Y_q) \leq B_g \log^g m$.

Now we consider the case of $k = \log m$, generalizing the case of $k = 0$ considered above. Thus the set of known tuples has size at most $k$. Consider those assignments $a$ such that $ground(a)$ has exactly $i$ subgoals grounded to known tuples. The expected number of such assignments is at most $\binom{g}{i} B_{g-i} \log^{g-i} m$. To see this, note that in the worst case, any $i$ of the $g$ subgoals can be grounded to the known tuples, which gives the factor $\binom{g}{i}$. Once the subgoals grounded to the known tuples are fixed, the remaining query $q'$ is a query that lies in the set $\partial^i(q)$ and, by definition, is also dense. Now, because of the restriction that no other subgoal grounds to known tuples, we can assume that $k = 0$ and evaluate $E(Y_{q'})$, which, as shown above, is at most $B_{g-i} \log^{g-i} m$, leading to the required expression. Now, the $i$ subgoals can ground to any $i$ known tuples (not necessarily distinct) from the set of $k$ known tuples. Thus the expected number of assignments using some $i$ tuples from the known set is at most $k^i \binom{g}{i} B_{g-i} \log^{g-i} m$, which, as $k = \log m$, is exactly $\binom{g}{i} B_{g-i} \log^g m$. Summing over all $i \in \{0, \ldots, g\}$, we get $E(Y_q) \leq \sum_{i=0}^{g} \binom{g}{i} B_{g-i} \log^g m = \log^g m \sum_{i=0}^{g} \binom{g}{g-i} B_{g-i} = \log^g m \sum_{i=0}^{g} \binom{g}{i} B_i$. Now, as $\sum_{i=0}^{g} \binom{g}{i} B_i = B_{g+1}$, we get $E(Y_q) \leq B_{g+1} \log^g m$.

Finally, the expected number of assignments for any PTLM distribution is less than or equal to the expected number for a tuple-independent distribution with the same marginal tuple probabilities. This follows since $Y_q$ is a supermodular function (see supermodular ordering [4]). Thus the result follows.

Proposition C.2 (Expected assignments). Let $\delta = \log m/m$ and $k = \log m$. Let $q$ be any stable query with $g = |goals(q)|$ subgoals and $P$ be any $(\delta, k)$-bounded PTLM distribution. Then for all $i \in \{1, \ldots, g\}$, $E_i(Y_q) \leq (2(g+1) \log m)^{g-i}$.

Proof: Let $q'$ be the query that has the maximum number of expected answers among all queries in the set $\partial^i(q)$. It is clear that $q'$ has $g - i$ subgoals. Further, $q'$ is also stable: from the definition of stable queries, it is easy to see that the derivative of a stable query is also stable. From Lemma C.1, we know that $E(Y_{q'}) \leq B_{g+1-i} \log^{g-i} m$. Now $E_i(Y_q) \leq \binom{g}{i} E(Y_{q'})$. This is because the derivative of $Y_q$ w.r.t. any set of $i$ tuples is at most $\binom{g}{i}$ times the value of the query $q'$. Thus, $E_i(Y_q) \leq \binom{g}{i} E(Y_{q'}) \leq \binom{g}{i} B_{g+1-i} \log^{g-i} m$. Now, using Comtet's formula, the Bell number $B_k$ can be bounded as $B_k \leq \frac{1}{e} \sum_{j=1}^{2k} \frac{j^k}{j!}$. With simple arithmetic manipulations, we can show that $\binom{g}{i} B_{g+1-i}$ is less than $(2(g+1))^{g-i}$, which proves the result.

Next we restate Proposition C.3.

Proposition C.3. Let $\delta = \log m/m$ and $k = \log m$. Let $P$ be any $(\delta, k)$-bounded PTLM adversary and $q$ be a stable query with $g$ subgoals and $v$ variables. Then for $\lambda = (8(v+1)g^2 \log m)^{g-1}$, we have
$P(LS_q > \lambda) \leq \frac{g}{m^{v+1}}$

Proof: Let $q'$ be any derivative of the query $q$ w.r.t. one subgoal. For the proof, we apply Theorem A.1 to each such query $q'$ with parameters $E_i = (4(v+1)g\sqrt{g} \log m)^{g-1-i}$, $\lambda = 4(v+1) \log m$, $\tilde{k} = g - 1$, and $M = g m^v$. As the number of subgoals of $q'$ is $g - 1$, by Prop. C.2 we get $E_i(Y_{q'}) \leq (2g \log m)^{g-1-i}$. Thus, it is easy to see that $E_i \geq E_i(Y_{q'})$. Further, the ratio
$\frac{E_i}{E_{i+1}} = 4(v+1)g\sqrt{g} \log m = 4(v+1)\sqrt{g} \log m + 4(v+1)(g-1)\sqrt{g} \log m \geq 4(v+1) \log m + 4(g-1)(\log g + v \log m) \geq \lambda + 4(g-1) \log(g m^v) = \lambda + 4(g-1) \log M \geq \lambda + 4i \log M$

Thus these values of $E_i$ satisfy the two requirements of Theorem A.1. Now set $c_g = 2g^{g/2}$ and $d_g = g$. Applying Theorem A.1 with these values, the inequality
$P(|Y_{q'} - E(Y_{q'})| \geq c_g \sqrt{\lambda E_0 E_1}) \leq d_g e^{-\lambda/4}$
reduces to $P(LS_q \geq (8(v+1)g^2 \log m)^{g-1}) \leq g \, m^{-(v+1)}$. This proves the result.
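The construction of $Y_q$ described at the start of this appendix is mechanical, and a small script makes it concrete. The following sketch (purely illustrative; the query, the tiny domain, and the relation name are assumptions) builds the monomials of $Y_q$ for the projection-free query $q(x,y,z) :- R(x,y), R(y,z)$ by enumerating assignments and collapsing duplicate tuples in each ground set (using $t_i^2 = t_i$), then evaluates $Y_q$ on a concrete instance.

```python
import itertools
from collections import Counter

DOMAIN = range(3)   # a tiny active domain; m = 3

def build_Yq():
    # One monomial per assignment a: vars(q) -> DOMAIN; the monomial is the
    # ground set {R(a(x),a(y)), R(a(y),a(z))} (duplicate tuples collapse).
    monomials = Counter()
    for x, y, z in itertools.product(DOMAIN, repeat=3):
        ground = frozenset({('R', x, y), ('R', y, z)})
        monomials[ground] += 1
    return monomials

def eval_Yq(monomials, db):
    # Y_q(db) = number of assignments whose ground set lies inside db,
    # i.e. the answer of the counting query q on instance db.
    return sum(cnt for mono, cnt in monomials.items() if mono <= db)

Yq = build_Yq()
db = {('R', 0, 1), ('R', 1, 2), ('R', 2, 2)}
print(eval_Yq(Yq, db))   # 3: the assignments (0,1,2), (1,2,2) and (2,2,2)
```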

C.2 Proof of Proposition 3.8

First, we restate the proposition.

Proposition C.4 (Proposition 3.8). Let $\mathcal{P}$ be any subclass of PTLM adversaries and let $q$ be $(\lambda, p)$-bounded w.r.t. $\mathcal{P}$. Let $\epsilon, \gamma, \lambda, p$ be the parameters of Algorithm 2.1. Then Algorithm 2.1 guarantees $(\epsilon, \gamma)$-adversarial privacy w.r.t. $\mathcal{P}$ while answering query $q$.

Proof. Let $P$ be any prior distribution from $\mathcal{P}$. Let $t$ be any tuple and $O$ any output of Algorithm 2.1. To show $(\epsilon, \gamma)$-adversarial privacy, we must show that the posterior belief is bounded by the prior as follows: $P[t \mid O] \leq e^\epsilon P[t] + \gamma$. Let $I_t$ refer to the set of inputs that contain tuple $t$. We can rewrite $P[t \mid O]$ as
$P[t \mid O] = \frac{\sum_{I \in I_t} Pr(O \mid I) P(I)}{P[O]}$
where $P[O] = \sum_{I'} Pr(O \mid I') P(I')$. To bound the posterior probability, we partition $I_t$ into two sets: $good_t$ are those instances with low local sensitivity (i.e., $LS_q(I) \leq \lambda$) and $bad_t$ is the complement of $good_t$ within $I_t$. For the instances in $good_t$ the algorithm behaves like a differentially private algorithm, and so the adversary's posterior is bounded. By Lemma B.6, we know that
$\frac{\sum_{I \in good_t} Pr(O \mid I) P(I)}{\sum_{I' \in nbrs_t(good_t)} Pr(O \mid I') P(I')} \leq e^\epsilon \frac{P(t)}{1 - P(t)}$

We can rearrange terms and get
$\frac{\sum_{I \in good_t} Pr(O \mid I) P(I)}{\sum_{I' \in good_t \cup nbrs_t(good_t)} Pr(O \mid I') P(I')} \leq e^\epsilon P(t)$

which implies
$\frac{\sum_{I \in good_t} Pr(O \mid I) P(I)}{P[O]} \leq e^\epsilon P(t)$

since $P[O] \geq \sum_{I' \in good_t \cup nbrs_t(good_t)} Pr(O \mid I') P(I')$.

As for the instances in $bad_t$: since $P(LS_q > \lambda) \leq \frac{p}{m^v}$, the probability that $I \in bad_t$ is bounded by $\frac{p}{m^v}$. Furthermore, $P[O]$ is at least as large as the probability that the algorithm ignored the input and sampled uniformly at random; according to Algorithm 2.1, this happens with probability $\frac{p/\gamma}{m^v}$. Therefore,
$\frac{\sum_{I \in bad_t} Pr(O \mid I) P(I)}{P[O]} \leq \frac{p/m^v}{(p/\gamma)/m^v} = \gamma$

Finally, we combine the two results to get the desired bound:
$P(t \mid O) = \frac{\sum_{I \in I_t} Pr(O \mid I) P(I)}{P[O]} = \frac{\sum_{I \in good_t} Pr(O \mid I) P(I)}{P[O]} + \frac{\sum_{I \in bad_t} Pr(O \mid I) P(I)}{P[O]} \leq e^\epsilon P(t) + \gamma$

C.3 Proof of Proposition 2.5

We restate it below.

Proposition C.5 (Proposition 2.5). Let $q$ be the query that counts the number of occurrences of a subgraph $h$ in $G$. If $h$ is a connected graph (in the traditional graph-theoretic sense), then $q$ is stable.

Let $q$ be any query over binary relations. Denote $symbols(q) = consts(q) \cup vars(q)$. Construct an undirected graph $G_q$ as follows: the set of nodes is the set $symbols(q)$, and an edge exists between two nodes if the corresponding symbols occur together in some subgoal of $q$. We first show the following lemma.

Lemma C.6. Let $q$ be any query such that the graph $G_q$ is connected. Then $|symbols(q)| \leq |goals(q)| + 1$.

Proof: The lemma follows directly from the properties of connected graphs: in a connected graph, the number of edges is at least the number of nodes minus one.

Proof of the proposition: Now we prove the proposition stated above. Let $q$ be a query counting the occurrences of a connected subgraph $h$. Then it is easy to see that $G_q$ is the same as an undirected version of $h$; thus $G_q$ is connected. Let $q_i$ be any query in the set $\partial^i(q)$. To prove that $q$ is stable, it is enough to prove that $q_i$ is dense. Let $G_{q_i}$ be the graph corresponding to the query $q_i$. $G_{q_i}$ can be thought of as obtained from $G_q$ by removing $i$ edges. Further, for each edge that is removed, its nodes (which correspond to the symbols in the corresponding subgoal) are frozen into constants. In general, the graph $G_{q_i}$ may be disconnected. Let $G_{q_i}$ be the union of connected components $c_1 \cup \ldots \cup c_k$, where $k$ is the number of connected components of $G_{q_i}$. Then $vars(q_i) = \cup_j vars(c_j)$ and $goals(q_i) = \cup_j goals(c_j)$.

From Lemma C.6, we know that $|symbols(c_j)| \leq |goals(c_j)| + 1$. Further, each $c_j$ contains at least one constant. This is because, for $c_j$ to have been broken off from the other components, one of the edges of $G_q$ going into $c_j$ must have been removed; as noted earlier, removing an edge freezes its nodes into constants, so there is at least one constant in $c_j$. This means $|vars(c_j)| \leq |goals(c_j)|$. Since the $c_j$'s are disconnected from each other, they have non-overlapping nodes and edges. Thus $|vars(q_i)| = \sum_j |vars(c_j)|$ and $|goals(q_i)| = \sum_j |goals(c_j)|$. This means that $|vars(q_i)| \leq |goals(q_i)|$, i.e., the density of $q_i$ is at most 1. Additionally, there can be no other homomorphic images of $q_i$, because no subgoals of $q_i$ can unify. Thus $q_i$ is dense. This completes the proof.
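The combinatorial argument above can be spot-checked mechanically for a concrete subgraph query. The sketch below (purely illustrative) takes the triangle query $q() :- R(x,y), R(y,z), R(z,x)$, removes every possible set of $i$ subgoals while freezing their variables into constants, and verifies that each resulting derivative has density at most 1 ($|vars| \leq |goals|$); as the proof notes, no subgoals of these derivatives unify, so this check suffices for denseness.

```python
import itertools

SUBGOALS = [('x', 'y'), ('y', 'z'), ('z', 'x')]   # edges of the triangle query

def dense_after_removal(removed):
    remaining = [g for i, g in enumerate(SUBGOALS) if i not in removed]
    frozen = {v for i in removed for v in SUBGOALS[i]}   # frozen to constants
    variables = {v for g in remaining for v in g} - frozen
    return len(variables) <= len(remaining)

for i in range(1, len(SUBGOALS) + 1):
    ok = all(dense_after_removal(set(r))
             for r in itertools.combinations(range(len(SUBGOALS)), i))
    print(f"every {i}-derivative of the triangle query is dense: {ok}")
```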

D. PROOF OF UTILITY

The theorem is restated here.

Theorem D.1 (Theorem 2.2). Let $q$ be any query with $g = |goals(q)|$. Then Algorithm 2.1 is $(\Sigma, \Gamma)$-useful w.r.t. $q$ for $\Sigma = O(\log^g m / \epsilon)$ and $\Gamma = O(1/\gamma)$.

We assume that the size of the query can be treated as a constant: $g = O(1)$ and $v = O(1)$, where $v$ is the number of variables in the query. We must show that the probability of an error larger than $O(\log^g m / \epsilon)$ is bounded by $O(1/(m\gamma))$. If Step 3 occurs, the error can be large, but the probability of Step 3 occurring is equal to $g/(m\gamma) = O(1/(m\gamma))$. It remains to bound the error of Step 4. If Step 4 is executed, then $A(I)$ is a Laplace random variable with mean $q(I)$ and scale parameter $\lambda/\epsilon$. The error $|q(I) - A(I)|$ follows an exponential distribution with density $p(x) = \frac{\epsilon}{\lambda} e^{-x\epsilon/\lambda}$. The probability that the error exceeds $\lambda \log(m)/\epsilon$ is at most $1/m$. According to Step 2 of the algorithm, $\lambda = (8(v+1)g^2 \log m)^{g-1} = O(\log^{g-1} m)$. Therefore the error exceeds $O(\log^g m / \epsilon)$ with probability $O(1/m)$. Since the output of both Step 3 and Step 4 has bounded error with the required probability, the algorithm has bounded error.
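For concreteness, the following is a minimal Python sketch of the output-perturbation scheme whose steps the proof walks through: Step 2 sets the noise scale $\lambda$, Step 3 ignores the input and answers uniformly at random, and Step 4 adds Laplace noise of scale $\lambda/\epsilon$. This is one plausible reading of Algorithm 2.1, not a definitive implementation; the fallback probability (standing in for $(p/\gamma)/m^v$ from the proof of Prop. 3.8), the output range, and all names are assumptions.

```python
import random

def answer_query(q_value, local_sensitivity, lam, eps, p_fallback,
                 out_range, rng=random.Random(0)):
    """Illustrative sketch of the output-perturbation algorithm.

    q_value           -- the true answer q(I)
    local_sensitivity -- LS_q(I) on the given instance
    lam               -- noise scale lambda = (8(v+1)g^2 log m)^(g-1) (Step 2)
    p_fallback        -- assumed probability of ignoring the input, playing
                         the role of (p/gamma)/m^v in the privacy proof
    out_range         -- assumed bound on the output range
    """
    if local_sensitivity > lam or rng.random() < p_fallback:
        # Step 3: ignore the input; sample uniformly from the output range.
        return rng.uniform(0, out_range)
    # Step 4: Laplace noise with scale lam/eps; the absolute error is then
    # exponential with density (eps/lam) * exp(-x * eps / lam), as above.
    magnitude = rng.expovariate(eps / lam)
    return q_value + (magnitude if rng.random() < 0.5 else -magnitude)
```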

E. LOWER BOUND ON UTILITY OF ANY DIFFERENTIALLY PRIVATE ALGORITHM

We characterize the utility of an $\epsilon$-differentially private algorithm. Usefulness is intimately related to the notion of the query's global sensitivity, which measures how much the query answer can change given the change of a single tuple in the database. See Sec. 3.1 for the formal definition and Figure 2 for examples of queries with high global sensitivity. The global sensitivity determines the noise distribution for the algorithm of Dwork et al. [6] (denoted DMNS). Specifically, that algorithm adds Laplace noise with scale $GS_q/\epsilon$. Therefore, DMNS is $(\Sigma, \Gamma)$-useful with $\Sigma = \Theta(GS_q \log(m)/\epsilon)$. The following proposition shows that the utility of any differentially private algorithm depends fundamentally on the sensitivity.

Proposition E.1 (Utility lower bound). Let $A_q$ be an $\epsilon$-differentially private algorithm for $q$. If $A_q$ is $(\Sigma, \Gamma)$-useful for any $\Gamma < m$, then $\Sigma \geq GS_q/2$.

Proof. We prove by contradiction. Assume $\Sigma < GS_q/2$. We say output $O$ is useful for $I$ if it is within $\Sigma$ of $q(I)$; otherwise it is useless. Consider worst-case neighboring inputs $I, I'$ such that $|q(I) - q(I')| = GS_q$. Let $\mathcal{O}$ be the set of useful outputs for $I$, specifically the set of outputs $O$ such that $|O - q(I)| \leq \Sigma$. Because $A$ is $(\Sigma, \Gamma)$-useful,
$P(A(I) \in \mathcal{O}) \geq 1 - \Gamma/m$

Now consider $I'$. Let $p_{bad}$ be the probability that $A(I')$ is useless for $I'$. Since the algorithm is $(\Sigma, \Gamma)$-useful, $p_{bad} \leq \Gamma/m$. Suppose $A(I') = O$. By the triangle inequality,
$|O - q(I')| \geq |q(I) - q(I')| - |q(I) - O|$

If $O \in \mathcal{O}$, then $|O - q(I')| \geq GS_q - GS_q/2 = GS_q/2 > \Sigma$. Therefore if $O \in \mathcal{O}$, it is useless for $I'$. The differential privacy condition states that for any set $S$, $P(A(I') \in S) \geq e^{-\epsilon} P(A(I) \in S)$. Therefore,
$p_{bad} \geq P(A(I') \in \mathcal{O}) \geq e^{-\epsilon} P(A(I) \in \mathcal{O}) \geq e^{-\epsilon}(1 - \Gamma/m)$

So we have $e^{-\epsilon}(1 - \Gamma/m) \leq p_{bad} \leq \Gamma/m$, and hence
$e^{-\epsilon}(1 - \Gamma/m) \leq \Gamma/m$
$e^{-\epsilon} \leq \Gamma/m + e^{-\epsilon}\,\Gamma/m = (1 + e^{-\epsilon})\,\Gamma/m$
$\frac{e^{-\epsilon}}{1 + e^{-\epsilon}} \leq \Gamma/m$

which is a contradiction for fixed $\epsilon$ and any $\Gamma < m$.

Furthermore, to parallel Proposition 2.5, some important queries have high global sensitivity.

Proposition E.2 (Subgraph query sensitivity). Let $q$ be the query that counts the number of occurrences of a subgraph $h$ in $G$. If $h$ is a connected graph, then $GS_q = \Omega(m^{k-2})$, where $k$ is the number of vertices in $h$.
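A small worked example (purely illustrative; the graph construction is an assumption) shows where the $\Omega(m^{k-2})$ bound comes from for the simplest connected $h$, the triangle ($k = 3$): if two nodes share $m - 2$ common neighbors, adding the single edge between them creates $m - 2$ new triangles, so one tuple changes the answer by $\Omega(m)$.

```python
import itertools

def triangle_count(nodes, edges):
    e = {frozenset(p) for p in edges}
    return sum(1 for a, b, c in itertools.combinations(nodes, 3)
               if {frozenset((a, b)), frozenset((b, c)),
                   frozenset((a, c))} <= e)

m = 50
nodes = range(m)
# Nodes 0 and 1 are both adjacent to every node in 2..m-1, but not to
# each other, so the graph is triangle-free.
edges = [(0, v) for v in range(2, m)] + [(1, v) for v in range(2, m)]
before = triangle_count(nodes, edges)
after = triangle_count(nodes, edges + [(0, 1)])
print(before, after, after - before)   # 0 48 48: one edge adds m-2 triangles
```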

F. PROTECTING AGAINST POSITIVE CORRELATIONS

We first need to consider the following superclass of the LM distributions considered in Sec. 2.5.

Definition F.1 (LM$^k$). A probability distribution $P$ is LM$^k$ if for all instances $I, I'$ such that $|I \setminus I'|$ is greater than $k$, the following condition holds:
$P(I)P(I') \geq P(I \cap I')P(I \cup I')$

Thus LM$^1$ is precisely the LM class. Further, LM$^{k-1} \subseteq$ LM$^k$ for all $k$. Now, analogously to TLM and PTLM, define the classes TLM$^k$, the set of distributions which are LM$^k$ under every possible marginalization, and PTLM$^k$, the set of planar TLM$^k$ distributions. Obviously, PTLM$^{k-1} \subseteq$ PTLM$^k$ for all $k$ and PTLM$^1$ = PTLM. Next we show the following result.

Proposition F.2 (Positive correlations). Let $A$ be any $\epsilon$-adversarially private algorithm w.r.t. PTLM adversaries. Then $A$ is a $k\epsilon$-adversarially private algorithm w.r.t. PTLM$^k$ adversaries.

Note that the parameter $\epsilon$ appears as $e^\epsilon$ in the privacy definition. Thus $k\epsilon$ is actually exponentially worse than $\epsilon$ in terms of the privacy guarantee, and Prop. 4.6 makes practical sense only if we apply it for small $k$. The proof of Prop. 4.6 exploits the collusion-resistance behaviour of differential privacy.
