Efficiently Anonymizing Social Networks with Reachability Preservation

Xiangyu Liu

College of Information Science and Engineering Northeastern University Liaoning 110819, China

[email protected]

Bin Wang

College of Information Science and Engineering Northeastern University Liaoning 110819, China

[email protected] [email protected]

ABSTRACT

Reachability is an important graph query: reachability queries are not only common on graph databases, but also serve as fundamental operations for many other graph queries. For instance, SNS (Social Networking Services) generally support queries on the relationship path between two users. In practice, graph anonymization distorts graph reachability nontrivially and incurs the accompanying information loss. The reason is that existing anonymization algorithms use the number of modified edges as the only metric of information loss, neglecting the fact that different edge modifications affect graph reachability to different degrees.

The goal of graph anonymization is to avoid disclosure of privacy in social networks through graph modifications while preserving the data utility of the anonymized graph for social network analysis. Graph reachability is an important data utility, as reachability queries are not only common on graph databases but also serve as fundamental operations for many other graph queries. However, graph reachability is severely distorted after anonymization. In this paper, we solve this problem by designing a reachability preserving anonymization (RPA for short) algorithm. The main idea of RPA is to organize vertices into groups and greedily anonymize each vertex with low anonymization cost on reachability. We propose the reachable interval to efficiently measure the anonymization cost incurred by an edge addition, which guarantees the high efficiency of RPA. Extensive experiments illustrate that anonymized social networks generated by our methods preserve high utility on reachability.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications - Data Mining

[Figure 1(a): drawing of the example graph G over the vertices a-l and p, q, u, v, w; the drawing is not recoverable from the extracted text.]

(b) Degrees

din   dout   vertices
2     1      b, f, i, k, p
1     3      c, l
1     2      g, j, u
1     1      e, h
1     0      d, q, v, w
0     2      a

Figure 1: A social network graph and the degrees.

Xiaochun Yang

College of Information Science and Engineering Northeastern University Liaoning 110819, China

Keywords

Social networks; Privacy; Anonymization; Reachability

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. CIKM’13, Oct. 27–Nov. 1, 2013, San Francisco, CA, USA. Copyright 2013 ACM 978-1-4503-2263-8/13/10 ...$15.00. http://dx.doi.org/10.1145/2505515.2505731.

1. INTRODUCTION

Social networks usually contain individuals’ sensitive information, which makes it an important concern to avoid compromising individual privacy when releasing social network data. Graph anonymization, as a valid method for privacy protection, has been extensively studied in the past few years [3, 6, 12, 13, 14]. One of the fundamental issues when anonymizing social network data is avoiding disclosure of individuals’ sensitive information while still permitting certain analyses and queries on the network.

1.1 Motivation

Figure 1 describes a toy microblog network G and the degrees of the vertices in G, where din and dout refer to the in-degree and out-degree of a vertex, respectively. A directed edge (a, b) indicates that a is a follower of b. Assume the adversary knows that Alice has no follower and follows two other individuals, i.e., din(Alice) = 0 and dout(Alice) = 2. The adversary can then re-identify vertex a as Alice with 100% confidence, since a is the only vertex whose in-degree and out-degree match those of Alice. To avoid vertex re-identification using degrees as background knowledge, k-degree anonymization has been proposed in [7]. The general idea is to modify the network so that for each vertex v there exist at least k − 1 other vertices having the same degrees as v; the probability that the adversary re-identifies a vertex is then at most 1/k. For instance, G1 and G2 in Figure 2 are two 2-degree anonymized versions of G. We say ⟨u, v⟩ is a reachable pair if vertex v is reachable from u. Specifically, G1 is obtained by inserting edge (g, a) into G, resulting in 31 new reachable pairs, such as ⟨g, a⟩


The remainder of the paper is organized as follows. Section 2 summarizes related work. Section 3 gives the problem definition, and Section 4 presents the RPA algorithm. Section 5 devises a technique to accelerate RPA. We evaluate our methods in Section 6, and Section 7 concludes the paper.

2. RELATED WORK

In previous work, privacy attacks in social networks are mainly classified into two categories: vertex re-identification attacks and link re-identification attacks. In vertex re-identification attacks (a.k.a. identity disclosure), an adversary identifies the identities of vertices in the published network using the subgraphs associated with target individuals as background knowledge. Liu et al. [7] propose k-degree anonymity to prevent privacy attacks that use vertex degree as adversary knowledge. Zhou et al. [13] protect identity privacy by anonymizing the 1-neighborhood subgraph of each vertex. To resist subgraph-based vertex re-identification attacks, Campan et al. [2] and Hay et al. [6] propose to cluster vertices into super nodes, so that vertices in a super node are indistinguishable from each other. Zou et al. [14] present a privacy preserving model, K-Automorphism, for protecting identity privacy. In link re-identification (a.k.a. link disclosure) attacks, an adversary aims at identifying sensitive relationships among individuals in a social network. Ying et al. [11] study graph randomization through adding/removing and switching edges randomly while preserving the spectrum of the network. Liu et al. [8] propose inference security to prevent link inference attacks in social networks. Bhagat et al. [1] and Cormode et al. [4] study graph anonymization to protect link privacy in bipartite graphs. Fard et al. [5] propose subgraph randomization to protect sensitive relationships in directed social network graphs.

Figure 2: Two anonymized versions of G in Figure 1(a): (a) G1 and (b) G2. [The graph drawings are not recoverable from the extracted text.]

and ⟨h, p⟩. Correspondingly, only one new reachable pair, ⟨j, a⟩, is introduced in G2. Let R(G) denote the set of all reachable pairs in G. For simplicity, we define that a vertex can reach itself. For graph G and its k-anonymized version Gk, we define the anonymization cost Cost(G, Gk) to measure the utility of Gk on reachability, which is calculated as:

$$\mathrm{Cost}(G, G_k) = |R(G) - R(G_k)| + |R(G_k) - R(G)|. \tag{1}$$

Equation 1 counts the reachable pairs that are lost and gained in Gk. A higher Cost(G, Gk) implies a lower utility of Gk on reachability. By Equation 1, we get Cost(G, G1) = 31 and Cost(G, G2) = 1, indicating that the reachability of G is better preserved in G2 than in G1. To preserve reachability, for graph G we want to find an anonymized version Gk for which Cost(G, Gk) is minimized.
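As an illustration of Equation 1, the following minimal sketch computes R(G) by one breadth-first search per vertex and then takes the two set differences. The function names and the four-vertex chain are hypothetical and are not the graph of Figure 1; this is the brute-force reading of the definition, not the efficient method developed later in the paper.

```python
from collections import deque

def reachable_pairs(vertices, edges):
    """Return R(G): all pairs (u, v) such that v is reachable from u.
    Every vertex reaches itself, following the convention in the text."""
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
    pairs = set()
    for source in vertices:
        seen = {source}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        pairs.update((source, target) for target in seen)
    return pairs

def cost(vertices, edges, vertices_k, edges_k):
    """Equation 1: lost pairs plus gained pairs."""
    r_g = reachable_pairs(vertices, edges)
    r_gk = reachable_pairs(vertices_k, edges_k)
    return len(r_g - r_gk) + len(r_gk - r_g)

# Hypothetical chain a -> b -> c -> d.
V = ["a", "b", "c", "d"]
E = [("a", "b"), ("b", "c"), ("c", "d")]
cost_shortcut = cost(V, E, V, E + [("a", "c")])  # a already reaches c
cost_cycle = cost(V, E, V, E + [("d", "a")])     # closes a cycle
```

Both modifications add exactly one edge, yet the shortcut creates no new reachable pair while the back edge makes every vertex reach every other one. This is the effect that counting modified edges alone cannot capture.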

1.2 Challenges and Contributions

It is important to preserve the reachability of anonymized graphs, yet little existing work [1, 6, 7, 12, 13, 14] pays enough attention to it. Anonymizing social networks with reachability preservation is much more challenging. Given a graph G with n vertices and m edges, how can we efficiently measure the anonymization cost on reachability due to graph modifications? As we will show in this paper, measuring the anonymization cost incurred by an edge addition is a frequent operation in our reachability preserving anonymization algorithm. Unfortunately, this operation is expensive. For example, to measure the impact on reachability of adding edge (g, a) into G, a straightforward method is to calculate the reachable vertex set of each vertex in G before and after the addition and then compare the two. A breadth-first search is needed for each vertex to obtain its reachable vertex set, which requires O(n + m) time. Thus, it takes O(2n(n + m)) time to measure the impact of a single edge addition, which is impractical for massive graphs. Since this measurement is a fundamental operation in graph anonymization, performing it efficiently is the main challenge.

Our contributions can be summarized as follows:

(1) To retain the reachability in social networks, we develop a reachability preserving anonymization (RPA for short) algorithm.

(2) We propose the reachable interval to efficiently measure the anonymization cost incurred by an edge addition, which guarantees the high efficiency of RPA.

(3) Our extensive empirical results illustrate that anonymized social networks generated by our methods preserve high data utility on reachability.

3. PRELIMINARIES AND PROBLEM DEFINITION

Since many real-life social networks (e.g., Twitter, e-mail networks) are directed graphs, in this work we represent a social network as a directed graph G = (V, E) with |V| = n and |E| = m, where V is the vertex set and E ⊆ V × V is the edge set. We also use V(G) and E(G) to denote the vertex set and edge set of G. We use (u, v) to represent the edge from vertex u to vertex v; we then say v is an out-neighbor of u and u is an in-neighbor of v. We use din(u) and dout(u) to denote the in-degree and out-degree of u, i.e., the number of edges coming into or going out of u. For vertex u ∈ V, we represent the degree of u in the form (din(u), dout(u)). We say that we link u to v when adding the edge (u, v) into E.

In this work, we assume that the adversary uses both the in-degrees and out-degrees as background knowledge to reveal the identities of individuals. Based on the privacy model of k-degree anonymity in [7], we formally present our privacy model in social networks.

Definition 1. (k-Degree Anonymity) Given graph G(V, E) and integer k, if for every v ∈ V there exist m (m ≥ k − 1) other vertices u1, u2, ..., um that satisfy din(v) = din(ui) and dout(v) = dout(ui) (1 ≤ i ≤ m), we say graph G is k-degree anonymous.

For instance, both G1 and G2 in Figure 2 are 2-degree anonymous graphs. In this paper, we focus on reachability preserving anonymization in directed graphs. Our approach also applies to undirected graphs without difficulty, since each undirected edge can be regarded as a pair of directed edges, one in each direction.
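Definition 1 can be checked directly on the degree table of Figure 1(b). The sketch below is illustrative only: the helper names are hypothetical, the degree map is transcribed from Figure 1(b), and the assumption that G2 is obtained by inserting edge (j, a) is inferred from the single new reachable pair ⟨j, a⟩ reported in Section 1.1.

```python
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    """degrees maps each vertex to its (in-degree, out-degree) pair.
    Definition 1 holds iff every occupied degree pair is shared by >= k vertices."""
    group_sizes = Counter(degrees.values())
    return all(size >= k for size in group_sizes.values())

def add_edge(degrees, u, v):
    """Return a copy of the degree map after inserting edge (u, v)."""
    updated = dict(degrees)
    d_in, d_out = updated[u]
    updated[u] = (d_in, d_out + 1)
    d_in, d_out = updated[v]
    updated[v] = (d_in + 1, d_out)
    return updated

# Degree table of Figure 1(b): (d_in, d_out) -> vertices.
FIGURE_1B = {(2, 1): "bfikp", (1, 3): "cl", (1, 2): "gju",
             (1, 1): "eh", (1, 0): "dqvw", (0, 2): "a"}
degrees_g = {v: deg for deg, names in FIGURE_1B.items() for v in names}

g_ok = is_k_degree_anonymous(degrees_g, 2)                          # a is unique
g1_ok = is_k_degree_anonymous(add_edge(degrees_g, "g", "a"), 2)     # G1
g2_ok = is_k_degree_anonymous(add_edge(degrees_g, "j", "a"), 2)     # G2 (assumed edge)
```

Under these assumptions G fails the check because only vertex a has degree (0, 2), while both single-edge insertions move a into the (1, 2) group and yield a 2-degree anonymous degree sequence.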

Definition 2. (Reachability) Given graph G(V, E) and vertices u, v ∈ V, we say vertex v is reachable from vertex u if there is a path starting from u and ending at v, and we define ⟨u, v⟩ to be a reachable pair of G, denoted as u ⇝ v. The reachability of graph G refers to the set of all reachable pairs in G, denoted as R(G).

Given graph G(V, E) and integer k, we expect to construct a k-degree anonymous graph Gk(Vk, Ek) through a set of graph-modification operations on G such that Gk preserves the reachability of G as much as possible. As described in [12, 14], for the purpose of k-anonymity the output graph need not have the same set of vertices as the original one, i.e., Vk may differ from V. Moreover, we restrict the graph-modification operations to vertex and edge additions, that is, V ⊆ Vk and E ⊆ Ek. We formally state the Reachability Preserving Anonymization problem as follows.

Problem 1. (Reachability Preserving Anonymization) Given graph G = (V, E) and integer k, find a k-degree anonymous graph Gk(Vk, Ek) such that Cost(G, Gk) is minimized.

As only vertex and edge additions are allowed in graph anonymization, R(G) ⊆ R(Gk) holds. Equation 1 then reduces to Cost(G, Gk) = |R(Gk) − R(G)|, which counts the incremental reachable pairs in Gk.

Theorem 1. The problem of Reachability Preserving Anonymization is NP-hard.

4. REACHABILITY PRESERVING GRAPH ANONYMIZATION

Algorithm 1: Reachability Preserving Anonymization
Input: a social network graph G(V, E) and the anonymity parameter k;
Output: an anonymized graph Gk;
 1  initialize Gk = G;
 2  mark all vertices in Gk as “unanonymized”;
 3  Setua = {unanonymized vertices in Gk};
 4  repeat
 5      Seed = s ∈ Setua with maximal din(s) + dout(s);
 6      if |Setua| ≥ 2k then
 7          vSet = k nearest vertices to Seed in Setua;
 8      else
 9          vSet = the remaining vertices in Setua;
10      din = max over v ∈ vSet of din(v);
11      dout = max over v ∈ vSet of dout(v);
12      for each vertex u ∈ vSet do
13          anonymizeOutDegree(Gk, u, dout, vSet);
14          anonymizeInDegree(Gk, u, din, vSet);
15          mark u as “anonymized” and remove u from Setua;
16  until Setua == ∅;
17  return Gk;

Algorithm 2: anonymizeOutDegree
Input: graph Gk, vertex u ∈ V(Gk), integer dout and vertex set vSet;
 1  candNeighbors = {candidate out-neighbors in Gk} − vSet;
 2  while dout(u) < dout and candNeighbors ≠ ∅ do
 3      u′ = v in candNeighbors with minimum Cost(u, v);
 4      add (u, u′) into E(Gk);
 5      remove u′ from candNeighbors;
 6  if t = dout − dout(u) > 0 then
 7      add t vertices into V(Gk) marked as “anonymized”;
 8      link u to these newly added vertices;
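The grouping step of Algorithm 1 (lines 5-11) can be sketched as follows, using the Manhattan distance on (din, dout) points that this section adopts. The function names are hypothetical, the degree map is transcribed from Figure 1(b), and two details are assumptions: the seed counts as one of its own k nearest vertices, and ties are broken alphabetically.

```python
def pick_group(degrees, unanonymized, k):
    """Lines 5-9 of Algorithm 1: choose the seed and the group vSet."""
    candidates = sorted(unanonymized)  # fixed order so ties break predictably
    seed = max(candidates, key=lambda v: degrees[v][0] + degrees[v][1])
    if len(candidates) < 2 * k:
        return seed, set(candidates)   # fewer than 2k left: take them all

    def manhattan(v):
        return (abs(degrees[v][0] - degrees[seed][0])
                + abs(degrees[v][1] - degrees[seed][1]))

    return seed, set(sorted(candidates, key=manhattan)[:k])

def target_degrees(degrees, group):
    """Lines 10-11: the group is raised to its maximal in- and out-degree."""
    return (max(degrees[v][0] for v in group),
            max(degrees[v][1] for v in group))

# Degree table of Figure 1(b): (d_in, d_out) -> vertices.
FIGURE_1B = {(2, 1): "bfikp", (1, 3): "cl", (1, 2): "gju",
             (1, 1): "eh", (1, 0): "dqvw", (0, 2): "a"}
degrees = {v: deg for deg, names in FIGURE_1B.items() for v in names}

seed, group = pick_group(degrees, set(degrees), k=2)
target = target_degrees(degrees, group)
```

With k = 2 the first group consists of c and l, which already share degree (1, 3), so no edge has to be added for them. Taking the group maximum as the target is what makes additions alone sufficient, since degrees can only grow.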

In this section, we introduce our reachability preserving anonymization algorithm (RPA for short), which is described in Algorithm 1. The main idea of RPA is to group vertices with similar degrees and equalize their degrees. Given the target in-degree and out-degree, the algorithm greedily anonymizes each vertex with low anonymization cost on reachability. Due to the power law distribution of vertex degrees in social networks, the algorithm starts with the vertices of high in-degree and out-degree, as the remaining ones are easy to anonymize.

First, the algorithm marks all vertices in the network as “unanonymized” and stores them in Setua (lines 2-3). Iteratively, the vertex in Setua with the maximal din + dout value is picked as Seed (line 5). If the size of Setua is no smaller than 2k, the algorithm selects a subset vSet that contains the k nearest vertices to Seed (lines 6-7). Otherwise, all vertices in Setua are collected into vSet (line 9). For vertex u, we map its degree (din(u), dout(u)) to a point in 2-dimensional space and use the Manhattan distance to measure the distance between two points. Since only vertex and edge additions are allowed in graph anonymization, in-degrees and out-degrees never decrease. Thus, the algorithm anonymizes each vertex in vSet to the maximal in-degree and out-degree within the same set (lines 10-15).

As shown in Algorithm 2, given the target out-degree dout and vertex u, anonymizeOutDegree greedily selects dout − dout(u) vertices and links u to them. Each selected vertex v should satisfy the following conditions:

• (u, v) ∉ E(Gk);
• Vertex v is unanonymized.

The first one is a necessary condition for v to become a new out-neighbor of u. The second one guarantees that anonymizing u does not affect the anonymity of already anonymized vertices. We call these two conditions “the Candidate Neighbor Condition (CNC)”, and vertices satisfying CNC are candidate out-neighbors of u. For a candidate in-neighbor v of u, the first condition is modified to (v, u) ∉ E(Gk). Notice that, to keep the in-degrees no larger than din, no vertex in vSet is selected as a candidate out-neighbor of u. In the loop from line 2 to line 5, Algorithm 2 greedily selects the vertex v with minimum Cost(u, v) as the new out-neighbor of u, where Cost(u, v) refers to the anonymization cost incurred by adding (u, v) into Gk and is calculated as:

$$\mathrm{Cost}(u, v) = |R(G_k \cup \{(u, v)\})| - |R(G_k)|. \tag{2}$$

The vertices we link u to are removed from candNeighbors (line 5). If candNeighbors is empty and dout is still larger than dout(u), Algorithm 2 adds dout − dout(u) vertices into Gk and links u to them (lines 6-8). According to the power law distribution, the quantity of vertices having degree (1, 0) in Gk is much larger than the anonymity parameter k. These newly added vertices also have degree (1, 0), so we can safely mark them as “anonymized” (line 7). The algorithm anonymizeInDegree raises din(u) to the target in-degree din with the same heuristic strategy as Algorithm 2, and we omit the details here. In practice, the impact of the added vertices can be neglected as their quantity is small, which will be shown in Section 6.

Recall from Section 1.2 that the naive method takes O(2n(n + m)) time to calculate Cost(u, v) for each candidate out-neighbor v, which is impractical for massive social networks. In order to reduce the time complexity of Algorithm 2, we focus on how to efficiently measure Cost(u, v).

5. MEASURING ANONYMIZATION COST WITH HIGH EFFICIENCY

In this section, we introduce how to efficiently measure Cost(u, v). We propose the definition of the reachable interval, based on which we present the technique to measure the anonymization cost. As the size of the reachable intervals is key to the time complexity of measuring Cost(u, v), we extend the dual labeling in [10] to generate smaller reachable intervals, which guarantees the high efficiency of the measurement.

Given graph G(V, E) and vertex p ∈ V, we use R(p) to denote the set of vertices that are reachable from p. Let R−1(p) represent the set of vertices from which p is reachable, i.e., the reachable vertex set of p in the reverse graph G−1. For simplicity of discussion, we assume a vertex can reach itself, i.e., p ∈ R(p) and p ∈ R−1(p). As only vertex and edge additions are allowed in the anonymization, we use ΔR(p) to denote the set of vertices newly added to R(p) by the addition of edge (u, v). When edge (u, v) is inserted into E, not all vertices’ reachabilities are affected.

Theorem 2. The addition of edge (u, v) into graph G only affects the reachability of the vertices in R−1(u).

According to Theorem 2, the anonymization cost Cost(u, v) can be calculated by checking ΔR(p) for each vertex p ∈ R−1(u), and Equation 2 is equivalent to:

$$\mathrm{Cost}(u, v) = \sum_{p \in R^{-1}(u)} |\Delta R(p)|. \tag{3}$$

We propose a data structure, named reachable interval, to efficiently calculate ΔR(p). We first assign each vertex an integer id. Then, we generate a reachable interval for each vertex based on these ids.

Definition 3. (Reachable Interval) For vertex u, we represent R(u) as a union of intervals over vertex ids, and we name it the reachable interval of u, denoted as Rl(u).

Let pid denote the id of vertex p. For vertices u and p, pid ∈ Rl(u) iff u ⇝ p. Then, for vertex p, we have |ΔR(p)| = |ΔRl(p)|. We calculate ΔRl(p) in place of ΔR(p) in Equation 3.

Theorem 3. When edge (u, v) is added into G, for each vertex p ∈ R−1(u), we have ΔRl(p) = Rl(v) − Rl(p).

To calculate ΔRl(p) efficiently, we extend the dual labeling coding scheme in [10] to generate reachable intervals, on which ΔRl(p) can be calculated in O(t) time, where t ≪ n. The dual labeling consists of two main steps. The first step constructs a spanning tree of the input graph. Then, the dual labeling assigns interval-based labels to the vertices and keeps track of the non-tree edges so that the reachability information is complete.

Unlike DAGs (directed acyclic graphs), social networks generally contain cycles. Given graph G(V, E) with |V| = n and |E| = m, we first find the strongly connected components of G and represent each component with a super node. It takes O(n + m) time to find these super nodes using Tarjan’s algorithm in [9]. For graph G in Figure 1(a), a spanning tree T and the super nodes are presented in Figure 3(a), where the solid arrows form a spanning tree and the dotted arrows are non-tree edges. All vertices in a super node are equivalent to each other concerning reachability. In the remainder of the paper, the term ‘node’ refers to a maximal set of strongly connected vertices. We assume that the graph consists of only one connected component; disjoint components can be hooked together by creating a virtual root node.

[Figure 3(a): a spanning tree T and the super nodes, with S1 = {b, c, f} and S2 = {k, g, i, h}; the drawing is not recoverable from the extracted text.]

(b) The interval labels.

node   label
a      [0, 17)
s1     [1, 7)
d      [4, 5)
e      [5, 7)
w      [6, 7)
j      [7, 17)
s2     [8, 12)
l      [12, 17)
u      [13, 15)
v      [14, 15)
p      [15, 17)
q      [16, 17)

Figure 3: The extended dual labeling coding scheme.
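As a reference point for the interval-based computation, Theorem 2 and Equation 3 can first be checked with plain vertex sets. The sketch below is a hedged illustration on a hypothetical five-vertex graph, with hypothetical function names: `cost_eq3` only visits the ancestors R−1(u) and takes the difference R(v) − R(p) for each, while `cost_eq2` recomputes every reachable set before and after the insertion.

```python
from collections import deque

def reach(adj, source):
    """Vertices reachable from source (including source itself)."""
    seen = {source}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

def cost_eq3(edges, u, v):
    """Equation 3: sum of |R(v) - R(p)| over p in R^-1(u)."""
    adj, radj = {}, {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        radj.setdefault(b, []).append(a)
    r_v = reach(adj, v)
    return sum(len(r_v - reach(adj, p)) for p in reach(radj, u))

def cost_eq2(vertices, edges, u, v):
    """Equation 2: |R(G + (u, v))| - |R(G)| by full recomputation."""
    def total(edge_list):
        adj = {}
        for a, b in edge_list:
            adj.setdefault(a, []).append(b)
        return sum(len(reach(adj, s)) for s in vertices)
    return total(edges + [(u, v)]) - total(edges)

# Hypothetical graph: a -> b -> c and d -> e; candidate edge (c, d).
V = ["a", "b", "c", "d", "e"]
E = [("a", "b"), ("b", "c"), ("d", "e")]
restricted = cost_eq3(E, "c", "d")
full = cost_eq2(V, E, "c", "d")
```

Both functions agree on this example: only a, b and c can reach c, and each of them gains exactly the two vertices d and e.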

To measure the anonymization cost accurately, we extend the dual labeling to assign ids to the nodes of T, modifying it to accommodate the case where multiple vertices are contained in one node. While traversing the spanning tree T, each node u is assigned an interval label [ustart, uend). Specifically, ustart is equal to u′start + cnt(u′), where u′ is the preorder predecessor of u and cnt(u′) is the number of vertices in u′ (ustart is 0 when u is the root node). As for uend, when u is a non-leaf node, uend equals u′end, where u′ is the postorder predecessor of u; when u is a leaf node, uend equals ustart + cnt(u). All vertices in a node u share the interval label of u. We consider ustart as the id of node u. Figure 3(b) shows a spanning tree T with the assigned interval labels. The transitive link table L maintains the edge transitive closure over the non-tree edges, and is shown as follows:

8 → [1, 7)
12 → [8, 12)
13 → [15, 17)
12 → [1, 7)

A reachability query u ⇝ v can be answered in O(1) time by using the interval labels of u and v and the transitive link table (see [10] for details).
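The labeling rule, the reachable intervals that Algorithm 3 builds from the link table, and the resulting cost of an edge addition can be sketched together on the running example. Several things here are assumptions: the tree shape and the node sizes (3 vertices in s1, 4 in s2) are inferred from the labels in Figure 3, all function names are hypothetical, and the difference Rl(v) − Rl(p) is taken over explicit id sets for readability rather than in O(t) time on the intervals.

```python
def assign_labels(children, cnt, root):
    """start = next free preorder id; end = end of the last descendant."""
    labels = {}
    def visit(node, next_id):
        start = next_id
        next_id += cnt.get(node, 1)      # a node consumes one id per vertex
        for child in children.get(node, []):
            next_id = visit(child, next_id)
        labels[node] = (start, next_id)
        return next_id
    visit(root, 0)
    return labels

def merge(intervals):
    """Coalesce overlapping or adjacent half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def reachable_interval(label, links):
    """Tree interval plus every link whose source id lies inside it."""
    extra = [target for source, target in links if label[0] <= source < label[1]]
    return merge([label] + extra)

def cost_of_edge(labels, cnt, links, u, v):
    """Sum of |Rl(v) - Rl(p)| over all vertices p that can reach node u."""
    ids = lambda ivs: {i for s, e in ivs for i in range(s, e)}
    target = ids(reachable_interval(labels[v], links))
    total = 0
    for node, label in labels.items():
        covered = ids(reachable_interval(label, links))
        if labels[u][0] in covered:      # node belongs to R^-1(u)
            total += cnt.get(node, 1) * len(target - covered)
    return total

# Tree and node sizes inferred from Figure 3; link table as listed in the text.
CHILDREN = {"a": ["s1", "j"], "s1": ["d", "e"], "e": ["w"],
            "j": ["s2", "l"], "l": ["u", "p"], "u": ["v"], "p": ["q"]}
CNT = {"s1": 3, "s2": 4}
LINKS = [(8, (1, 7)), (12, (8, 12)), (13, (15, 17)), (12, (1, 7))]

labels = assign_labels(CHILDREN, CNT, "a")
cost_g1 = cost_of_edge(labels, CNT, LINKS, "s2", "a")   # edge (g, a); g is in s2
cost_g2 = cost_of_edge(labels, CNT, LINKS, "j", "a")    # edge (j, a)
```

Under these assumptions the labels reproduce Figure 3(b), and the two costs come out as 31 and 1, in agreement with Cost(G, G1) and Cost(G, G2) reported in Section 1.1.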


Algorithm 3: Generate Reachable Interval
Input: a spanning tree node u, transitive link table L;
Output: the reachable interval of u;
 1  Rl(u) = [ustart, uend);
 2  for each entry s → [vstart, vend) ∈ L do
 3      if s ∈ [ustart, uend) then
 4          Rl(u) = Rl(u) ∪ [vstart, vend);
 5  return Rl(u);

For a node u in the spanning tree T, Algorithm 3 generates a reachable interval Rl(u), consisting of the intervals of nodes that are reachable from u through tree edges and non-tree edges. The reachable interval of each vertex in u is Rl(u). Assuming there are t non-tree edges in T, L consists of at most t(t+1)/2 entries, so the time complexity of Algorithm 3 is O(t²). Rl(u) contains at most t + 1 intervals. For instance, the reachable intervals of f and g in Figure 3(b) are as follows:

f → [1, 7)
g → [1, 7) ∪ [8, 12)

It is worth mentioning that, by finding the minimal equivalent graph [10], the number of non-tree edges is minimal. When adding edge (u, v) into E, for vertex p ∈ R−1(u), as Rl(v) and Rl(p) contain at most t + 1 intervals, it requires O(t) time to calculate ΔRl(p) = Rl(v) − Rl(p). Let r = |R−1(u)|; then calculating Equation 3 requires O(rt) time, where r ≪ n and t ≪ n in real social networks.

In the discussion above, we measure the anonymization cost without considering the added fake vertices. For vertex p, let fi(p) and fo(p) denote the numbers of fake in-neighbors and fake out-neighbors of p, respectively. The following equation extends Equation 3 and shows the calculation of Cost(u, v) when the fake vertices are considered:

$$\mathrm{Cost}(u, v) = \sum_{p \in R^{-1}(u)} \bigl(f_i(p) + 1\bigr)\Bigl(|\Delta R_l(p)| + \sum_{q_{id} \in \Delta R_l(p)} f_o(q)\Bigr). \tag{4}$$

In Equation 4, ΔRl(p) = Rl(v) − Rl(p). We classify the incremental reachable pairs into four categories, which are shown in the “Category” column of Table 1. A reachable pair ⟨v, v′⟩ of category ⟨fake, true⟩ indicates that v is a fake vertex and v′ is a true vertex.

Table 1: The incremental reachable pair categories represented by terms in Equation 4.

Category        # of pairs
⟨fake, true⟩    fi(p) · |ΔRl(p)|
⟨fake, fake⟩    fi(p) · Σ over qid ∈ ΔRl(p) of fo(q)
⟨true, true⟩    |ΔRl(p)|
⟨true, fake⟩    Σ over qid ∈ ΔRl(p) of fo(q)

6. EXPERIMENTAL EVALUATION

In this section, we provide extensive experiments to evaluate our methods. We have used two real social network datasets, which are both directed graphs (available at http://snap.stanford.edu/data/): Eu-Email and Epinions. Table 2 presents the statistics of these two datasets.

Table 2: Statistics of datasets.

Statistic                            Eu-Email   Epinions
Number of vertices                   265214     75879
Number of edges                      420045     508837
Maximum in-degree                    7631       3035
Maximum out-degree                   930        1801
Average in/out-degree                1.58       6.71
# of vertices with degree (0, 1)     170768     18328
# of vertices with degree (1, 0)     36922      11774
Average clustering coefficient       0.3093     0.2283
Number of triangles                  267313     1624481
Diameter (longest shortest path)     13         13

We implemented our RPA algorithm. For comparison, we modified and implemented the anonymization algorithm in [13] and named the modified version Neighbor-Greedy. An anonymization cost function is introduced in [13], which measures the similarity between two vertices on their neighborhoods. We employed the same function in Neighbor-Greedy, except that we used the degrees of vertices instead of the neighborhoods as the metric. We evaluate the running time, the anonymization cost, and the quantities of added edges and vertices of these two algorithms, with k = 10, 20, 30, 40, 50. All programs are implemented in Java and run on a 2.33GHz Intel Core 2 Duo CPU with 4GB DRAM running the Windows XP operating system.

A. Running Time

We show the running time of anonymization on both datasets with respect to different k values in Figure 4. The runtime grows as k increases. RPA has the higher runtime because it additionally preserves reachability during anonymization. With the help of reachable intervals, however, RPA is almost as efficient as Neighbor-Greedy.

Figure 4: The runtime of graph anonymization on (a) Eu-Email and (b) Epinions. [Plots not recoverable; x-axis: anonymity parameter k; y-axis: time (×10³ sec.); series: RPA and Neighbor-Greedy.]

B. Anonymization Cost

Given graph G and the corresponding k-degree anonymous version Gk, we use

incremental ratio = |R(Gk) − R(G)| / |R(Gk)| × 100%

to measure the anonymization cost on graph reachability. Notice that R(Gk) contains all reachable pairs in Gk, including the pairs containing fake vertices. We show the results in Figure 5. A higher incremental ratio indicates a higher anonymization cost on graph reachability. It can be observed that Neighbor-Greedy results in the highest incremental ratio, which is on average 50% larger than that of RPA on both datasets. This can be ascribed to Neighbor-Greedy neglecting the impact of edge additions on reachability. The incremental ratios of RPA are on average less than 2%, indicating that the anonymized graphs generated by our methods preserve high data utility on reachability.
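For concreteness, the incremental ratio used in this comparison can be written as a small function over sets of reachable pairs. The function name and the two-vertex example are hypothetical; the sketch only restates the definition and does not reproduce any measured value.

```python
def incremental_ratio(r_g, r_gk):
    """|R(Gk) - R(G)| / |R(Gk)| as a percentage, for sets of reachable pairs."""
    return 100.0 * len(r_gk - r_g) / len(r_gk)

# Hypothetical example: G has the single edge (a, b); Gk also contains (b, a).
r_g = {("a", "a"), ("b", "b"), ("a", "b")}
r_gk = r_g | {("b", "a")}
ratio = incremental_ratio(r_g, r_gk)
```

Because the denominator is |R(Gk)|, the ratio stays below 100% as long as R(G) is non-empty, which always holds under the convention that every vertex reaches itself.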

Figure 5: The incremental ratios on (a) Eu-Email and (b) Epinions. [Plots not recoverable; x-axis: anonymity parameter k; y-axis: incremental ratio (%); series: Neighbor-Greedy and RPA.]

C. The Quantities of Added Edges and Vertices

To measure the ratio of added edges in Gk, we define

adding edge ratio = |E(Gk) − E(G)| / |E(Gk)| × 100%,

and Figure 6 shows the results. As Neighbor-Greedy uses the number of added edges to measure the anonymization cost, it generally requires the fewest edge additions during the anonymization process. Nevertheless, RPA is close to Neighbor-Greedy on the adding edge ratio.

Figure 6: The edge adding ratios on (a) Eu-Email and (b) Epinions. [Plots not recoverable; x-axis: anonymity parameter k; y-axis: adding edge ratio (%); series: RPA and Neighbor-Greedy.]

In Table 3, we present how many fake vertices are inserted into the anonymized graphs, where N-G refers to Neighbor-Greedy. As shown in Table 3, neither algorithm inserts more than 70 vertices on either dataset, while there are thousands of vertices having degree (0, 1) or (1, 0) in the anonymized graph (the results are not shown due to space limitation). Thus, it would be infeasible for an adversary to obtain further private information by removing the fake vertices, along with their incident edges, from the anonymized graph.

Table 3: The number of added fake vertices.

        Eu-Email                  Epinions
k       10   20   30   40   50    10   20   30   40   50
N-G     0    0    18   48   58    8    20   30   26   42
RPA     0    0    15   46   61    12   23   28   30   36

7. CONCLUSIONS

In this paper, we propose the problem of reachability preserving anonymization. We solve the problem by designing the RPA algorithm. We present the reachable interval to efficiently measure the anonymization cost, which guarantees the high efficiency of RPA. Our extensive empirical studies illustrate that anonymized social networks generated by our methods preserve high data utility on reachability.

8. ACKNOWLEDGMENTS

The work is partially supported by the National Basic Research Program of China (973 Program) (No. 2012CB316201), the National Natural Science Foundation of China (Nos. 61173031, 61272178), the Joint Research Fund for Overseas Natural Science of China (No. 61129002), and the Fundamental Research Funds for the Central Universities (Nos. N120504001, N100704001).

9. REFERENCES

[1] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. In VLDB’09, pages 766–777, 2009.
[2] A. Campan and T. Truta. A clustering approach for data and structural anonymity in social networks. In PinKDD’08, 2008.
[3] J. Cheng, A. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. In SIGMOD’10, pages 459–470, 2010.
[4] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. In VLDB’08, pages 833–844, 2008.
[5] A. M. Fard, K. Wang, and P. S. Yu. Limiting link disclosure in social network analysis through subgraph-wise perturbation. In EDBT’12, 2012.
[6] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. In VLDB’08, pages 102–114, 2008.
[7] K. Liu and E. Terzi. Towards identity anonymization on graphs. In SIGMOD’08, pages 93–106, 2008.
[8] X. Liu and X. Yang. Protecting sensitive relationships against inference attacks in social networks. In DASFAA’12, pages 335–350, 2012.
[9] R. Paige and R. E. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing, 16(6):973–989, 1987.
[10] H. Wang, H. He, J. Yang, P. S. Yu, and J. X. Yu. Dual labeling: Answering graph reachability queries in constant time. In ICDE’06, pages 75–86, 2006.
[11] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In SDM’08, pages 739–750, 2008.
[12] M. Yuan, L. Chen, and P. Yu. Personalized privacy protection in social networks. In VLDB’10, pages 141–150, 2010.
[13] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In ICDE’08, pages 506–515, 2008.
[14] L. Zou, L. Chen, and M. Özsu. K-automorphism: A general framework for privacy preserving network publication. In VLDB’09, pages 946–957, 2009.