Label-bag based Graph Anonymization via Edge Addition

Chongjie Li
Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, 1–1–1 Tennodai, Tsukuba, Ibaraki 305–8573, Japan
[email protected]

Toshiyuki Amagasa
Faculty of Engineering, Information and Systems, University of Tsukuba, 1–1–1 Tennodai, Tsukuba, Ibaraki 305–8573, Japan
[email protected]

Hiroyuki Kitagawa
Faculty of Engineering, Information and Systems, University of Tsukuba, 1–1–1 Tennodai, Tsukuba, Ibaraki 305–8573, Japan
[email protected]

Gautam Srivastava

ABSTRACT
Privacy-preserving publishing of graph data, such as social networks, has been gaining much public attention in recent years due to the growing demand for publishing graph data containing private information. Most of the existing approaches to graph anonymization deal with unlabeled graphs, although labeled graphs have useful real-life applications. However, the k-anonymity problem on edge-labeled graphs has been proven to be computationally expensive. In this paper, we devise a greedy heuristic based approach to the k-anonymity problem over edge-labeled graphs. More precisely, we exploit several utility metrics to achieve better anonymization results. To show the effectiveness of the proposed schemes, we conduct a set of experimental evaluations using synthetic and real datasets. The results reveal that our proposed scheme can successfully anonymize edge-labeled graphs. We also assess how the utility of the anonymized graphs is affected by the proposed algorithms.
Categories and Subject Descriptors K.4.1 [Computers and Society]: Privacy; H.2 [Database Management]: Miscellaneous
General Terms Social graph, label bag, anonymization
1. INTRODUCTION
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. C3S2E ’14 August 04–07 2014, Montreal, Canada Copyright 2014 ACM 978-1-4503-2712-1/14/08 ...$15.00. http://dx.doi.org/10.1145/2641483.2641516.
Department of Computer Science University of Victoria PO Box 3055, STN CSC, Victoria, BC, Canada
[email protected]
Social networks (SNs) have developed remarkably in recent years, and a growing amount of information is being accumulated in various SNs. Typically, an SN can be modeled as a directed or an undirected graph, called a social graph, where each node represents a user and each edge represents the relationship between the users at both ends. By analyzing such social graphs, one can extract useful information using various data mining techniques [2, 1, 25]. An important fact is that social graphs contain private information by nature. On the other hand, the demand for publishing social graphs has been increasing; e.g., SN providers, such as Facebook and LinkedIn, often want to publish their data for various reasons, such as research collaboration and social contribution. To protect private information, many researchers have proposed techniques for privacy preservation [9]. Among them, differential privacy has been recognized as one of the most important techniques, where the accuracy of queries on statistical databases is maximized while the probability of identifying their records is minimized [6]. Another major technique is anonymization [19, 14, 24], where a dataset is transformed in such a way that there is a scientific guarantee that the individuals in the dataset cannot be re-identified while the data remain practically useful. Privacy-preserving graph publishing transforms a graph into another graph in such a way that adversaries with certain background knowledge cannot re-identify the entities and/or the relations between entities. It is worth mentioning that many graph anonymization techniques deal with unlabeled graphs [26, 3, 23], where all edges are assumed to be uniform. However, as can easily be imagined, edge-labeled graphs have a wide variety of real-life applications.
For example, an SN can be naturally modeled as an edge-labeled graph where 1) each node represents a user; and 2) each edge represents the relationship between the users at both ends, where the edge label shows the type of the relationship (friend, relative, etc.).
[Figure 1 panels: (a) Original, (b) Add f edge, (c) Add r edge. Edge labels: f = friend, r = relative.]
Figure 1: An edge-labeled graph and its anonymization.

Dealing with edge-labeled graphs poses a new challenge in graph anonymization. We need a new model of graph anonymization rather than the k-anonymity for unlabeled graphs. Consider the example in Figure 1. There are four users and two kinds of edges, namely, r (relative) and f (friend) (Figure 1 (a)). Under k-anonymity for unlabeled graphs, the graph is 2-anonymized by removing the users' identities and adding an f edge between Dave and Carol (Figure 1 (b)). However, if an adversary knows that Bob has two friends and one relative in the SN, then the adversary can identify Bob (node 2). If we add an r edge instead, the graph is 2-anonymized, because nodes 1 and 3 (2 and 4) cannot be discriminated by the labels of their connecting edges (Figure 1 (c)). To deal with this problem, Srivastava et al. [11] proposed the label-bag based anonymization of edge-labeled graphs (see footnote 1). In this work, they formally defined the problem and provided a complexity analysis of the problem. In fact, the problem is NP-hard for general graphs containing multiple edge labels. In addition, they proposed algorithms for simple cases, but did not propose ones for general cases. Nevertheless, it is important to develop a method that can perform the label-bag based anonymization for general cases.

To tackle this problem, we propose algorithms for anonymizing edge-labeled graphs based on the label-bag based model. Because the problem is known to be NP-hard [11], we propose greedy heuristic-based algorithms. More precisely, we only assume the addition of new edges, and do not consider the deletion or renaming of edge labels. One of the major contributions of this paper is that we exploit several utility metrics, such as degree/label distribution, shortest path distance, and clustering coefficient, to achieve better anonymization results, thereby allowing users to manage the utility of the resulting graphs. This feature is important, because users can control the characteristics of the anonymized data by taking into account the properties desired for the way the data is used. For example, a user who wishes to perform clustering analysis may use the clustering coefficient as the utility metric. We perform an intensive performance evaluation of the proposed scheme. The experimental results show that the proposed scheme can generate k-anonymized graphs based on the label-bag based model, and that it can do so according to various graph utility metrics.

The rest of this paper is organized as follows. Section 2 introduces some related works. We then formalize the problem in Section 3. Section 4 gives the proposed methods, and some improved algorithms that take utility into account are presented in Section 5. We show the results and analysis of the experiments in Section 6. Section 7 concludes this paper.

Footnote 1: In their original work, they use the terminology label-sequence instead of label-bag. However, in this work, we use label-bag, because we deal with a bag (multi-set) of labels.

2. RELATED WORK
In the area of graph anonymization, there have been many related works due to the growing demand for privacy-preserving graph publishing. The problem can be categorized into several classes according to basic assumptions regarding different aspects, such as the graph model, graph modification operations, and the adversary's background knowledge. The majority focus on anonymizing structural information, i.e., graph topology. These methods can be divided into unlabeled graph anonymization [19, 14, 25, 26, 3, 23, 21] and labeled graph anonymization [11, 24]. The latter can further be divided according to whether nodes or edges (or both) are labeled. Early works focus on unlabeled or node-labeled cases. Sweeney et al. [19] give the early idea of k-anonymity by replacing the identifiers of published data. Liu et al. [14] define k-degree anonymity, where the degrees of certain nodes are disclosed to the adversary. Zhou et al. [25] devise the neighborhood attack against certain nodes, which is extended by Tripathy et al. [21] in such a way that the adversaries have more information than the 1-neighborhood. Later, k-automorphism [26] and k-symmetry [23] were proposed as different privacy models that provide stronger privacy on the structural properties of the graph. Cheng et al. [3] further discuss k-isomorphism by considering link information. Notice that most of them deal with unlabeled graphs or node-labeled graphs. In recent years, attention has been paid to edge-labeled problems, as in Yuan [24] and Kapron [11]. In terms of the graph modification operation, there are proposals focusing on edge addition/deletion [14, 11, 25, 3, 22], node/edge addition [23, 4], vertex generalization [10], edge label generalization [24], and class/cluster-based methods [2, 21]. In terms of the attack model, there are models focusing on entity re-identification [19, 14, 23], link re-identification [20, 12], or both [24, 3].
Moreover, there are approaches related to active attacks, such as Backstrom [1], where adversaries attempt to modify the data. For anonymization of edge-labeled graphs, Fung et al. [8] proposed a scheme for k-anonymity based on frequent patterns. Our work differs from theirs in the definition of k-anonymity. In addition, we utilize some utility metrics for better anonymized results. Our work is inspired mainly by [11, 24]. However, unlike [24], which allows label generalization, we focus only on edge addition. Kapron [11] gives the problem definition of label-bag based edge-labeled graph anonymization, and proves that the problem is NP-hard when k > 2. In addition, they propose algorithms for anonymizing bipartite graphs where k = 2, but do not provide ones for general graphs where k > 2.
3. PROBLEM DEFINITION
Here, we give the formal definition of the label-bag (LB) based graph anonymization problem. In this work, we assume that a graph is undirected and simple, i.e., there are no self-loops and no multiple edges between two nodes. A graph G is defined as a quadruple (V, E, L, λ), where V is the set of nodes, E ⊆ V × V is the set of edges, L is the set of edge labels, and λ : E → L is the mapping from an edge to a label. Note that each node vi ∈ V has its identity (i). Then, the label bag is defined as follows.

Definition 1 (Label bag; LB). For an edge-labeled graph G, the label bag LBi of a node vi in G is the multi-set of edge labels LBi = {λ(e) | e ∈ E and vi is one of the endpoints of e}.

[Figure 2 panels: (a) Original graph, (b) Remove identity, (c) LB 4-anonymity (and LB 2-anonymity) after edge addition.]
Figure 2: LB k-anonymity example.

Let us consider the example in Figure 2. Figure 2 (a) is the original graph, and Figure 2 (b) is obtained by replacing the node names with identifiers. Here, LB1 = {a, b} and LB2 = {a, b, b}. Hereafter, we abbreviate a label bag as the concatenation of its labels, such as LB1 = ab and LB2 = abb. Next, we define the concept of label-bag based k-anonymity.

Definition 2 (Label-bag based (LB) k-anonymity). Given an edge-labeled graph G and an integer k, G is said to be k-anonymized if, for any node vi ∈ V, there exist at least k nodes (including vi) with the same label bag LBi.

For example, Figure 2 (b) is 2-anonymized, because v1 and v3 (v2 and v4) have the same LB ab (abb, resp.). Then, the label-bag based anonymization problem is defined as follows.

Definition 3 (LB k-anonymity problem [11]). Given an edge-labeled graph G and an integer k (≥ 2), the LB k-anonymity problem of G is to construct a graph G′ = (V, E ∪ ΔE, L, λ) such that G′ is LB k-anonymized.

As can be seen from the definition, in this problem, we only allow edge addition as the graph modification operation. Notice that the introduction of new labels is not allowed, either. In [11], Kapron proved that this problem is NP-hard when k > 2. Let us take another look at Figure 2. Figure 2 (b) can be further anonymized by adding an edge (1, 3) with label b. As a result, the graph is 4-anonymized, because all four nodes have the same LB abb (Figure 2 (c)). From a practical point of view, it is important to make G′ as similar to G as possible to minimize information loss from the original graph. To this end, |ΔE| should be minimized, and the edges to add (ΔE) should be chosen carefully in such a way that the utility of the graph is maintained as much as possible.
In this work, we exploit several graph utility metrics, and incorporate them in the anonymization algorithm (Section 5).
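Definitions 1 and 2 can be made concrete with a minimal Python sketch. The function names and the edge list below are assumptions for illustration: the edge list is one reconstruction of Figure 2 (b) that is consistent with the label bags stated in the text (LB1 = ab, LB2 = abb), and labels are assumed to be single characters so that a bag can be printed as a string.

```python
from collections import Counter

def label_bags(nodes, edges):
    """Return {node: Counter of incident edge labels} for an undirected graph."""
    bags = {v: Counter() for v in nodes}
    for u, v, label in edges:
        bags[u][label] += 1
        bags[v][label] += 1
    return bags

def bag_key(bag):
    """Canonical string form of a bag, e.g. Counter(a=1, b=2) -> 'abb'."""
    return "".join(sorted(bag.elements()))

def is_lb_k_anonymous(bags, k):
    """True if every label bag is shared by at least k nodes."""
    counts = Counter(bag_key(bag) for bag in bags.values())
    return all(c >= k for c in counts.values())

# Assumed edge list for Figure 2 (b); chosen to match LB1 = ab and LB2 = abb.
nodes = [1, 2, 3, 4]
edges = [(1, 2, "a"), (3, 4, "a"), (1, 4, "b"), (2, 3, "b"), (2, 4, "b")]
bags = label_bags(nodes, edges)
print({v: bag_key(b) for v, b in bags.items()})
print(is_lb_k_anonymous(bags, 2))

# Adding edge (1, 3) with label b, as in the example, gives every node LB abb.
anonymized = label_bags(nodes, edges + [(1, 3, "b")])
print(is_lb_k_anonymous(anonymized, 4))
```

Representing a bag as a `Counter` keeps multiplicities explicit, which is what distinguishes ab from abb.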
4. BASIC ALGORITHM FOR LB K-ANONYMIZATION
We propose a basic greedy heuristic based algorithm for LB k-anonymity, which shall later be improved by taking graph utility metrics into account. The algorithm can roughly be divided into two phases. Suppose that we are given a parameter k (≥ 2) and a graph G = (V, E, L, λ) such that |V| ≥ k. First, the algorithm divides the input node set into several anonymization groups, which form a non-overlapping partition of the node set V, such that each group contains at least k nodes and the LBs in the same group are identical or as similar as possible. Then, it checks whether the graph satisfies k-anonymity, and terminates if so; otherwise, it goes to the second phase, an iterative process that lasts until k-anonymity is satisfied. During this process, it greedily adds edges between two nodes in different groups whose LBs differ from their groups' ideal LBs. In the rest of this section, we elaborate on the proposed algorithm in detail.
4.1 Anonymous group generation
4.1.1 Target label bag and grouping strategy
The first phase partitions the node set V into a number of disjoint subsets called anonymous groups A = {A1, A2, ..., Aj, ..., A|A|} such that i) Aj ⊂ V, ii) Ai ∩ Aj = ∅ (i ≠ j), and iii) $V = \bigcup_{j=1,2,\ldots,|A|} A_j$. Before discussing specific grouping algorithms, we introduce a couple of concepts, target label bag (TLB) and grouping strategy, which are required in the subsequent discussion.
Target label bag; TLB. For each group Aj, we associate an objective LB, which is computed from the members' LBs, to represent the LB that the members are supposed to obtain after anonymization.
Definition 4 (Target label bag; TLB). For a group Aj (⊂ V), we set a target label bag $TLB_{A_j}$ such that $TLB_{A_j} = \bigcup_{v_i \in A_j} LB_i$, where $\cup$ is bag union.
Let us consider again the example in Figure 2 (b). Suppose that we have a group A1 = {v1, v2}. Then, TLBA1 = ab ∪ abb = abb.
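As a small illustration of Definition 4, bag union can be sketched with Python's `collections.Counter`, whose `|` operator takes the per-label maximum multiplicity. The function name `target_label_bag` is an assumption for illustration.

```python
from collections import Counter
from functools import reduce

def target_label_bag(member_bags):
    """Bag union (per-label maximum multiplicity) of the members' label bags."""
    return reduce(lambda acc, bag: acc | bag, member_bags, Counter())

# A1 = {v1, v2} from Figure 2 (b): ab ∪ abb = abb.
tlb_a1 = target_label_bag([Counter("ab"), Counter("abb")])
print("".join(sorted(tlb_a1.elements())))
```

Note that this is the maximum-based union, not bag sum: ab ∪ abb is abb, not aabbb.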
Grouping strategy. A grouping strategy defines 1) the number of anonymous groups and 2) the size of each group. It is denoted as an ordered list of integers, where each integer describes the size of the corresponding group and the length of the list corresponds to the number of groups being generated. As for the size of each group, there must exist at least k nodes, because at least k nodes must have the same LB to achieve k-anonymity. For example, let us consider a graph with 9 nodes (|V| = 9) that we want to make 3-anonymous (k = 3). Then, possible grouping strategies are S = ⟨3, 3, 3⟩, ⟨5, 4⟩, ⟨6, 3⟩, or simply ⟨9⟩. Here, let us denote the number of groups by |S| and the element in the i-th position by Si, e.g., |S| = 2, S1 = 5, and S2 = 4 for S = ⟨5, 4⟩. As can easily be conjectured, the number of possible grouping strategies grows exponentially with the number of nodes, partitions, and k, which makes the grouping computation difficult, particularly when |V|/k is large. However, the following lemma alleviates the complexity.
Lemma 1. For a given graph G and parameter k, the size of each anonymous group Aj can be kept within the range k ≤ |Aj| < 2k.
Proof sketch. Suppose that there is an anonymous group Aj such that |Aj| ≥ 2k. Then we can subdivide Aj into several smaller groups Aj1, Aj2, ..., Aji, ..., Ajm such that 1) k ≤ |Aji| < 2k and 2) TLBAji = TLBAj.
Algorithm 1 Feature-based grouping.
Input: Graph G = (V, E, L, λ), grouping strategy S = ⟨S1, S2, ..., S|S|⟩
Output: Anonymous groups A = {A1, A2, ..., A|S|}
  for j = 1 to |S| do
    vi ← randomly chosen node in V;
    Aj ← {vi}; TLBAj ← LBi; A ← A ∪ {Aj};
    V ← V − {vi};
    for l = 1 to Sj − 1 do
      vi ← a node in V whose LB (LBi) is most similar to TLBAj;
      TLBAj ← TLBAj ∪ LBi; Aj ← Aj ∪ {vi};
      V ← V − {vi};
    end for
  end for
  return A;
Let us consider again the example of a graph with 9 nodes. According to the above lemma, we only need to deal with S = ⟨3, 3, 3⟩ and ⟨5, 4⟩, and not with ⟨6, 3⟩ or ⟨9⟩. Even though the number of possible grouping strategies is reduced thanks to this lemma, an exhaustive search is still costly to perform. So, in this work, we introduce a couple of algorithms to compute anonymous groups.
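The effect of Lemma 1 on the 9-node example can be checked with a short Python sketch that enumerates grouping strategies whose group sizes all lie in [k, 2k). The generator below is an illustrative assumption, not part of the proposed algorithm; it lists each strategy once, with sizes in non-increasing order.

```python
def grouping_strategies(n, k, max_part=None):
    """Yield non-increasing lists of group sizes that sum to n, each in [k, 2k)."""
    if max_part is None:
        max_part = 2 * k - 1
    if n == 0:
        yield []
        return
    # Try the largest admissible size first, then recurse on the remainder.
    for size in range(min(n, max_part), k - 1, -1):
        for rest in grouping_strategies(n - size, k, size):
            yield [size] + rest

print(list(grouping_strategies(9, 3)))
```

For |V| = 9 and k = 3 this leaves only the two strategies named in the text, but the count still grows quickly with |V|/k, which is why the grouping is computed heuristically.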
4.1.2 Feature-based grouping
Feature-based grouping is a simple greedy approach to anonymous group computation. As for the grouping strategy S, we attempt to make |S| equal to ⌊|V|/k⌋ and make the size of each anonymous group as even as possible, i.e., S1 = S2 = ... = S|S|−1 = k and S|S| = |V| − k(|S| − 1). In the algorithm, for each anonymous group, we randomly choose a seed node, and set the initial TLB to the seed's LB. Then, we iteratively look for the node whose LB is most similar to the current TLB. Regarding the similarity between two LBs, we can use any similarity between two bags, such as Jaccard similarity. If two or more nodes have the same similarity to the current TLB, then we randomly choose one node out of the candidates. Next, we update the current TLB and put the chosen node in the current anonymous group. We repeat this process until all nodes are assigned to one of the anonymous groups. Algorithm 1 shows the concrete algorithm.
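The procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: bags are `Counter` objects, similarity is the multiset Jaccard similarity (one of the options the text allows), node identifiers are sortable, and the helper names (`bag_jaccard`, `even_strategy`, `feature_based_grouping`) are hypothetical. The input bags are the LB column of Figure 4 (A).

```python
import random
from collections import Counter

def bag_jaccard(x, y):
    """Multiset Jaccard similarity: |x ∩ y| / |x ∪ y| with multiplicities."""
    union = sum((x | y).values())
    return sum((x & y).values()) / union if union else 1.0

def even_strategy(n, k):
    """floor(n/k) groups: all of size k except the last, which takes the rest."""
    if n < k:
        raise ValueError("need at least k nodes")
    groups = n // k
    return [k] * (groups - 1) + [n - k * (groups - 1)]

def feature_based_grouping(bags, strategy, rng):
    """Greedy grouping in the spirit of Algorithm 1; returns (groups, tlbs)."""
    remaining = sorted(bags)
    groups, tlbs = [], []
    for size in strategy:
        seed = rng.choice(remaining)          # random seed node
        remaining.remove(seed)
        group, tlb = [seed], Counter(bags[seed])
        for _ in range(size - 1):
            best = max(bag_jaccard(bags[v], tlb) for v in remaining)
            ties = [v for v in remaining if bag_jaccard(bags[v], tlb) == best]
            pick = rng.choice(ties)           # break ties at random
            remaining.remove(pick)
            group.append(pick)
            tlb = tlb | bags[pick]            # bag union updates the TLB
        groups.append(group)
        tlbs.append(tlb)
    return groups, tlbs

# Label bags taken from Figure 4 (A).
bags = {1: "ab", 2: "bb", 3: "bb", 4: "bbb", 5: "bb", 6: "ab", 7: "aa", 8: "ab", 9: "a"}
bags = {v: Counter(s) for v, s in bags.items()}
groups, tlbs = feature_based_grouping(bags, even_strategy(len(bags), 3), random.Random(0))
print(groups)
```

Because the seeds and tie-breaks are random, different `rng` states can yield different groups, which is the order dependence discussed in Section 4.1.3.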
4.1.3 Clustering-based grouping
As can be seen, the result of feature-based grouping can be affected by the order of node selection. To address this problem, a more systematic approach is worth considering, i.e., we apply clustering over LBs to group similar LBs together and form better anonymous groups. More precisely, we exploit the prototype-based hierarchical clustering method, where each group is initialized as a single node, and the most similar pair of groups is merged to form a larger group. We repeat this process until all groups have size at least k. As for the distance (dissimilarity) between two groups, we can consider several variations as follows:
Algorithm 2 Clustering-based grouping.
Input: Graph G = (V, E, L, λ), integer k
Output: Anonymous groups A = {A1, A2, ..., A|A|}
  for i = 1 to |V| do
    Ai ← {vi}; A ← A ∪ {Ai};
  end for
  while ∃j : |Aj| < k ∧ |A| > 1 do
    Find Ai, Aj ∈ A with least Dx(Ai, Aj) such that i ≠ j ∧ |Ai| < k ∧ |Aj| < k;
    Ai ← Ai ∪ Aj; TLBAi ← TLBAi ∪ TLBAj; A ← A − {Aj};
  end while
  GroupAdjust(A);
  return A;
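$$D_1(A_i, A_j) = |TLB_{A_i} - TLB_{A_j}| + |TLB_{A_j} - TLB_{A_i}|$$
$$D_2(A_i, A_j) = D_1(A_i, A_j) \times (|A_i| + |A_j|)$$
$$D_3(A_i, A_j) = |TLB_{A_i} - TLB_{A_j}| \times |A_j| + |TLB_{A_j} - TLB_{A_i}| \times |A_i|$$
Here, − denotes bag difference, and |·| denotes the number of elements in a bag or a group.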
Intuitively, D1 takes into account the difference between the two TLBs, whereas D2 and D3 additionally take into account the sizes of the groups. Algorithm 2 shows the concrete algorithm. GroupAdjust in the last line handles the case where exactly one cluster with size < k is left. If so, we merge that cluster with the cluster at the least distance (Dx).
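The three distances and the merge loop can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in the experiments (which relies on Fastcluster): D3 is read with bag cardinalities around the bag differences, clusters are kept as (members, TLB) pairs, and all function names are assumptions. The input bags are again the LB column of Figure 4 (A).

```python
from collections import Counter
from itertools import combinations

def size(bag):
    return sum(bag.values())

def d1(ti, ni, tj, nj):
    return size(ti - tj) + size(tj - ti)

def d2(ti, ni, tj, nj):
    return d1(ti, ni, tj, nj) * (ni + nj)

def d3(ti, ni, tj, nj):
    # Labels each member of one group would need if it adopted the merged TLB.
    return size(ti - tj) * nj + size(tj - ti) * ni

def clustering_based_grouping(bags, k, dist=d3):
    """Merge the closest pair of undersized clusters until at most one is left."""
    clusters = [([v], Counter(bags[v])) for v in sorted(bags)]

    def gap(i, j):
        (mi, ti), (mj, tj) = clusters[i], clusters[j]
        return dist(ti, len(mi), tj, len(mj))

    def merge(i, j):  # requires i < j so that index i stays valid
        (mi, ti), (mj, tj) = clusters[i], clusters[j]
        clusters[i] = (mi + mj, ti | tj)
        del clusters[j]

    while True:
        small = [i for i, (m, _) in enumerate(clusters) if len(m) < k]
        if len(small) < 2:
            break
        i, j = min(combinations(small, 2), key=lambda p: gap(*p))
        merge(i, j)
    # GroupAdjust: fold a single leftover undersized cluster into its nearest one.
    small = [i for i, (m, _) in enumerate(clusters) if len(m) < k]
    if small and len(clusters) > 1:
        i = small[0]
        j = min((x for x in range(len(clusters)) if x != i), key=lambda x: gap(i, x))
        merge(min(i, j), max(i, j))
    return clusters

bags = {1: "ab", 2: "bb", 3: "bb", 4: "bbb", 5: "bb", 6: "ab", 7: "aa", 8: "ab", 9: "a"}
bags = {v: Counter(s) for v, s in bags.items()}
for members, tlb in clustering_based_grouping(bags, 3):
    print(members, "".join(sorted(tlb.elements())))
```

Unlike the feature-based sketch, this procedure is deterministic for a fixed node order, since ties are resolved by the first minimal pair.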
4.2 Greedy edge additions
Having generated anonymous groups, the next step is to add edges between two distinct anonymous groups, thereby making the graph LB k-anonymous. Algorithm 3 shows the concrete algorithm. First, we check whether the generated anonymous groups already satisfy k-anonymity. If so, we return the current graph as the output (lines 1–4); otherwise, we greedily add edges to make the graph k-anonymized (lines 5–19). Our edge addition algorithm makes use of the concept of residual label bag (RLB), which is defined for each node in an anonymous group.

Definition 5 (Residual label bag; RLB). For a node vi in a group Aj, the residual label bag (RLB) is defined as RLBi = TLBAj − LBi, where − is bag difference.

For example, in Figure 2 (b), suppose that we have a group A1 = {v1, v2} where TLBA1 = abb (= ab ∪ abb). Then, RLB1 = b (= abb − ab), while RLB2 = ∅. According to the definition of RLB, anonymizing a graph is equivalent to making every node's RLB empty. The edge-addition phase (lines 5–19) comprises two steps. First, we greedily add edges between two nodes such that 1) they have a common label in their RLBs; and 2) they are not directly connected. Then, we check whether the graph is LB k-anonymized, and the algorithm terminates if so. Otherwise, some nodes still have non-empty RLBs. To cope with them, in the next step, we take an anonymous group and augment its TLB by adding a label from the remaining RLBs, thereby allowing us to further reduce RLBs.

Let us take a look at the example in Figure 3, whose node information is shown in Figure 4. By looking at the RLB column in Figure 4, we find that node 9 has a non-empty RLB (b), and also that node 7 has b in its RLB. Because there is no edge between them, we can add an edge with label b, thereby removing b from both nodes' RLBs. We continue this until no edge can be added. In this example, it turns out that we cannot add any more edges, because nodes 4 and 8 (nodes 7 and 8) have a common label a (b, resp.), but are already connected. Next, we take an anonymous group and augment its TLB so that we can add additional edges. In this example, we take anonymous group A1, add a to its TLB, and add a to the RLBs of its members, namely, nodes 9, 1, and 6, as well. We continue these steps until all RLBs become empty.

Algorithm 3 Greedy heuristic based algorithm for LB k-anonymity.
Input: Graph G = (V, E, L, λ), integer k / grouping strategy S
Output: LB k-anonymized version of graph G
 1: A ← Result of feature-based or clustering-based grouping over G;
 2: if ∀vi ∈ V : RLBi = ∅ then
 3:   return G;
 4: end if
 5: while ∃i, j : vi ∈ Aj ∧ TLBAj ≠ LBi do
 6:   while ∃n, m : vm, vn ∈ V ∧ (vm, vn) ∉ E ∧ RLBm ∩ RLBn ≠ ∅ do
 7:     l ← a label in RLBm ∩ RLBn;
 8:     Add an edge (vm, vn) with label l;
 9:     RLBm ← RLBm − {l}; RLBn ← RLBn − {l};
10:   end while
11:   if ∀vi ∈ V : RLBi = ∅ then
12:     return G;
13:   end if
14:   vi ← a node having the largest RLB;
15:   l ← a label in RLBi;
16:   Aj ← an anonymous group which does not contain vi;
17:   TLBAj ← TLBAj ∪ {l};  (adds one occurrence of l)
18:   For each node vi in Aj, RLBi ← RLBi ∪ {l};  (adds one occurrence of l)
19: end while
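The two steps of the edge-addition phase can be sketched in Python on the Figure 4 data. This is a hedged sketch, not the authors' C implementation: the adjacency sets and RLBs are transcribed from Figure 4 (A), candidate pairs are scanned in sorted order (an assumption, since the text does not fix an order), and the function names are hypothetical. A single pass over all pairs suffices for the inner loop here, because adding an edge only shrinks RLBs and never makes a new pair eligible.

```python
from collections import Counter
from itertools import combinations

def add_matching_edges(adj, rlb):
    """Inner loop (lines 6-10): connect non-adjacent nodes sharing a residual label."""
    added = []
    for m, n in combinations(sorted(adj), 2):
        common = rlb[m] & rlb[n]
        if n in adj[m] or not common:
            continue
        label = min(common)                   # deterministic pick among shared labels
        adj[m].add(n)
        adj[n].add(m)
        rlb[m] = rlb[m] - Counter([label])
        rlb[n] = rlb[n] - Counter([label])
        added.append((m, n, label))
    return added

def augment(group, label, tlb, rlb):
    """Lines 17-18: add one occurrence of `label` to the TLB and to each member's RLB."""
    tlb[label] += 1
    for v in group:
        rlb[v][label] += 1

# State of Figure 4 (A): neighbors and residual label bags.
adj = {1: {2, 9}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5, 8}, 5: {4, 6},
       6: {5, 7}, 7: {6, 8}, 8: {4, 7}, 9: {1}}
rlb = {v: Counter() for v in adj}
rlb[9], rlb[4], rlb[7], rlb[8] = Counter("b"), Counter("aa"), Counter("bbb"), Counter("abb")

added = add_matching_edges(adj, rlb)          # Figure 4 (A) -> (B)
print(added)

tlb_a1 = Counter("ab")
augment([9, 1, 6], "a", tlb_a1, rlb)          # Figure 4 (B) -> (C)
print({v: "".join(sorted(b.elements())) for v, b in rlb.items() if b})
```

With this scan order the first pass adds only the edge between nodes 7 and 9 with label b, matching panel (B); nodes 4 and 8 and nodes 7 and 8 share residual labels but are already adjacent, which is why the augmentation step is needed.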
[Figure 3 panels: (A) Original, (B) Added 1 edge, (C) Added 2 edges, (D) Anonymized graph.]
Figure 3: An LB k-anonymization example (graph).

(A) Original
Grp  TLB    Node  LB   RLB  Neighbor
A1   ab     9     a    b    1
            1     ab   -    2, 9
            6     ab   -    5, 7
A2   bb     2     bb   -    1, 3
            3     bb   -    2, 4
            5     bb   -    4, 6
A3   aabbb  4     bbb  aa   3, 5, 8
            7     aa   bbb  6, 8
            8     ab   abb  4, 7

(B) Added 1 edge
Grp  TLB    Node  LB   RLB  Neighbor
A1   ab     9     ab   -    1, 7
            1     ab   -    2, 9
            6     ab   -    5, 7
A2   bb     2     bb   -    1, 3
            3     bb   -    2, 4
            5     bb   -    4, 6
A3   aabbb  4     bbb  aa   3, 5, 8
            7     aab  bb   6, 8, 9
            8     ab   abb  4, 7

(C) Adjust A1's TLB
Grp  TLB    Node  LB   RLB  Neighbor
A1   aab    9     ab   a    1, 7
            1     ab   a    2, 9
            6     ab   a    5, 7
A2   bb     2     bb   -    1, 3
            3     bb   -    2, 4
            5     bb   -    4, 6
A3   aabbb  4     bbb  aa   3, 5, 8
            7     aab  bb   6, 8, 9
            8     ab   abb  4, 7

Figure 4: An LB k-anonymization example (nodes).

4.2.1 Choosing a better anonymous group to augment TLB
In the second step of greedy edge addition, where the TLB of an anonymous group is augmented for further edge addition, choosing a better anonymous group is important for a better anonymization result. There are several factors to take into account: 1) average node degree, 2) average RLB size, 3) average RLB size of neighboring nodes, and 4) connectivity to other anonymous groups. If the average node degree is high, the group is less likely to be connected with other nodes. However, in a real graph, the edge density is usually low. As a consequence, we take into account the rest of the factors, namely, 2), 3), and 4). More precisely, we sort all anonymous groups, except for the ones with low connectivity, according to average RLB size in descending order, and choose the one at the top of the list. When choosing an anonymous group to augment, we need to take into account the following special case. Here, for a label l, let us denote the total number of occurrences of l in RLBs as $RLBNum(l) = \sum_{v \in V} |RLB_v|_l$, where $|RLB_v|_l$ is the number of occurrences of l in $RLB_v$.

Lemma 2. Let G and l be a graph and a label, respectively, and suppose that RLBNum(l) is an odd number. If we choose an anonymous group Aj such that |Aj| is an even number to augment with label l, we cannot make G LB k-anonymized.
Proof sketch. According to the assumption, RLBNum(l) can be represented as 2n + 1 (n ∈ N+), and, by augmenting Aj with label l, RLBNum(l) becomes 2n + 2m + 1 (n, m ∈ N+), because |Aj| is even. Meanwhile, an edge addition with label l reduces RLBNum(l) by 2. As a consequence, RLBNum(l) cannot become 0.

This property applies not only to greedy edge addition, but also to the grouping strategy S (Section 4.1.2). We change the grouping strategy S according to k, i.e., S1 = S2 = ... = S|S|−1 = k and S|S| = |V| − k(|S| − 1) if k is odd; otherwise, S1 = S2 = ... = S|S|−1 = k + 1 and S|S| = |V| − (k + 1) × (|S| − 1). This grouping strategy helps to avoid the situation where all group sizes are even and no group is appropriate to augment. By choosing an appropriate anonymous group to augment, this algorithm terminates in a finite number of steps, though it may fail to LB k-anonymize the graph.

4.2.2 Backtracking
As can be seen, the proposed greedy edge addition algorithm is order dependent. This means that the result heavily depends on the order of the edges to add and on the order of the anonymous groups to augment as well. In the worst case, the algorithm does not succeed in LB k-anonymization. Nevertheless, it is possible to find a better result by exhaustively searching over the solution space, though this is time consuming. For this purpose, we incorporate a backtracking mechanism into the above algorithm. In this algorithm, we make use of the following property.

Lemma 3. Let G = (V, E, L, λ) and vi ∈ V be a graph and a node in G, respectively, and let us denote the degree of vi by di. Then, to make G LB k-anonymized, the following condition needs to be satisfied for any node in V:
$$|RLB_i| \le |V| - d_i - 1$$
Proof sketch. For a node vi in G, its degree is at most |V| − 1 (attained when vi is connected to all other nodes), because we do not allow any self-loops or multiple edges. Given that the degree of vi is di, we can add at most |V| − di − 1 additional edges to vi. If |RLBi| > |V| − di − 1, we cannot make the graph LB k-anonymized.

According to this property, by keeping track of the degree and the size of the RLB of each node, we can tell whether it is still possible to make the current graph LB k-anonymized. If it turns out that the above property does not hold, we backtrack to the previous state and try another candidate, i.e., for edge addition, we remove the most recently added edge and attempt to add another edge; for anonymous group augmentation, we cancel the last augmentation and attempt to augment the next candidate. Notice that, to enable backtracking, we need extra memory space for recording intermediate process states. Its efficient implementation is a part of our future work.
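The conditions stated in Lemma 2 and Lemma 3 are cheap to test, which is what makes them usable as pruning checks during backtracking. A minimal Python sketch follows; the function names are assumptions, and the sample state is the one shown in Figure 4 (B).

```python
from collections import Counter

def rlb_num(rlb, label):
    """Total number of occurrences of `label` over all residual label bags."""
    return sum(bag[label] for bag in rlb.values())

def lemma2_blocks(rlb, label, group):
    """True in the case ruled out by Lemma 2: odd RLBNum(label), even group size."""
    return rlb_num(rlb, label) % 2 == 1 and len(group) % 2 == 0

def lemma3_holds(adj, rlb):
    """True if |RLB_i| <= |V| - d_i - 1 for every node, as Lemma 3 requires."""
    n = len(adj)
    return all(sum(rlb.get(v, Counter()).values()) <= n - len(adj[v]) - 1 for v in adj)

# State of Figure 4 (B): one edge (7, 9) has been added.
adj = {1: {2, 9}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5, 8}, 5: {4, 6},
       6: {5, 7}, 7: {6, 8, 9}, 8: {4, 7}, 9: {1, 7}}
rlb = {4: Counter("aa"), 7: Counter("bb"), 8: Counter("abb")}

print(rlb_num(rlb, "a"), rlb_num(rlb, "b"))
print(lemma2_blocks(rlb, "a", [9, 1, 6]))     # A1 has odd size, so Lemma 2 does not apply
print(lemma3_holds(adj, rlb))
```

A backtracking search could call `lemma3_holds` after every edge addition or augmentation and undo the last step as soon as it returns False.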
4.3 Complexity analysis
We provide here a complexity analysis of the proposed algorithm. The time complexity of the feature-based grouping algorithm is O(n²), where n is the size of the graph being anonymized, if we naively scan over the nodes to search for similar LBs. As for the space complexity, representing adjacency lists using dynamic lists requires O(n) space. Additionally, storing group information requires O(n/k) space. Clustering-based grouping is basically equivalent to the well-known agglomerative hierarchical clustering. Consequently, its time complexity is O(n³), which means that it cannot be applied to large graphs. For this reason, in this work, we exploit Fastcluster by Müllner [15], whose complexity is O(n²). Regarding the space complexity, we need to maintain the distance between every pair of nodes, which requires O(n²) space. In the greedy addition phase, the time complexity is O(mn²), where m is the average number of iterations. The average number of iterations m has a positive correlation with the sum of the RLB sizes. For this reason, to reduce the total processing time, it is important to find a good grouping in the previous phase.
5. IMPROVED ALGORITHM BASED ON UTILITY
The focus of the method introduced in Section 4 is to achieve k-anonymity based on greedy edge addition. Although this is a practical criterion, it is possible to adopt other criteria depending on the way that the anonymized graph is used. In general, the usefulness of an anonymized graph can be quantified using utility [10]. Based on this observation, in this section, we attempt to enhance the proposed scheme by taking utility measurements into account for better anonymized results.
5.1 Utilities
If we think about the typical uses of social network data, many users are considered to be interested in graph properties [10], such as degree, path length, and transitivity. Other uses are related to the answers to extent queries, as in [24] or [12]. Due to these interests, in this work, we deal with the following metrics.
5.1.1 Degree distribution and label distribution
In this case, we use the degree or label distribution of a graph as the utility metric. To do this, we need to quantify the difference between two distributions. Specifically, we exploit the Earth Mover's Distance (EMD) [18], which is defined over two signatures (distributions) P = {(p1, wp1), ..., (pm, wpm)} and Q = {(q1, wq1), ..., (qn, wqn)}, where pi (qi) is a feature and wpi (wqi) is the corresponding weight. The EMD can be calculated as
$$EMD(P, Q) = \frac{1}{m-1}\left(|r_1| + |r_1 + r_2| + \cdots + |r_1 + r_2 + \cdots + r_{m-1}|\right)$$
We adapt this to calculate the difference between two degree distributions (EMDD; EMD for degree) and two label distributions (EMDL; EMD for label). To calculate EMDD, we use the degree sequences sorted in ascending order. Similarly, to calculate EMDL, we use the distribution of distinct labels.
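The formula above can be sketched in Python under one explicit assumption: the text does not define r_i, so the sketch reads it as the difference between the i-th weights of two normalized histograms over the same m ordered bins (r_i = wp_i − wq_i), which is the usual ordered-distance form of EMD. The helper names are hypothetical, and the degree data come from the Figure 2 example (degrees 2, 3, 2, 3 before adding edge (1, 3) and 3, 3, 3, 3 after).

```python
def emd_ordered(p, q):
    """EMD between two histograms over the same m ordered bins (assumed r_i = p_i - q_i)."""
    if len(p) != len(q) or len(p) < 2:
        raise ValueError("histograms need the same length, at least 2 bins")
    running = total = 0.0
    for pi, qi in zip(p[:-1], q[:-1]):
        running += pi - qi        # r_1 + ... + r_i
        total += abs(running)
    return total / (len(p) - 1)

def degree_histogram(degrees, max_degree):
    """Fraction of nodes having each degree 0..max_degree."""
    counts = [0] * (max_degree + 1)
    for d in degrees:
        counts[d] += 1
    return [c / len(degrees) for c in counts]

before = degree_histogram([2, 3, 2, 3], 3)
after = degree_histogram([3, 3, 3, 3], 3)
print(emd_ordered(before, after))
```

A smaller value means the anonymized degree distribution stays closer to the original one; identical histograms give 0.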
5.1.2 Shortest path distance
The shortest path distance (SPD) is a widely used measurement in graph theory. More precisely, we exploit the average shortest path distance (ASPD), which can be computed as the average of all SPDs between two distinct nodes. To compute the SPDs, we exploit the Floyd-Warshall algorithm [5]. Having computed all possible SPDs, ASPD can be calculated as:
Table 1: Category Synthetic Real 1 Real 2 Real 3
Cost
Experimental dataset. Name Small World Graph [16] Speed Dating Data [7] arXiv E-print Archive [17] Enron Email Data [13]
1400 1200 1000
FB
800
C1
600
1 ASP D = |E|
X
C3
400
SP D(vi , vj )
200
vi ,vj ∈V
0
where E and V are the set of edges and nodes in graph G, respectively.
5
10
20
k
50
Cost
600
5.1.3 Clustering coefficient
This is a measure of how nodes in a graph tend to cluster together. It can be roughly divided into two categories, the global and the local clustering coefficient. Specifically, we make use of the average local clustering coefficient (ACC). Let $v_i$ be a node, and let us denote $v_i$'s neighborhood by $N_i = \{v_j \mid (v_i, v_j) \in E\}$. Then, the set of edges between $v_i$'s neighbors is defined as $E_{N_i} = \{(v_j, v_k) \in E \mid v_j, v_k \in N_i\}$. The local clustering coefficient for node $v_i$ and ACC can then be computed as:

$$CC_i = \frac{|E_{N_i}|}{|N_i|(|N_i| - 1)}, \qquad \mathrm{ACC} = \frac{1}{|V|} \sum_{v_i \in V} CC_i$$
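The clustering coefficient definitions in Section 5.1.3 can be sketched in Python as follows. The sketch assumes that the graph is undirected and stored as a dictionary `adj` mapping each node to the set of its neighbours, that $E$ contains both orientations of every undirected edge (so that $CC_i$ ranges over [0, 1]), and that nodes with fewer than two neighbours contribute 0. None of these choices is specified in the text, and the function name is hypothetical.

```python
def average_clustering(adj):
    """Average local clustering coefficient (ACC).

    adj: dict mapping each node to the set of its neighbours
         (undirected graph, no self-loops).
    """
    if not adj:
        raise ValueError("the graph must contain at least one node")
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # assumption: CC_i is taken as 0 when it is undefined
        # Count ordered pairs (v_j, v_k) of neighbours joined by an edge,
        # i.e. |E_{N_i}| with both orientations of each edge included.
        links = sum(1 for u in nbrs for w in adj[u] if w in nbrs)
        total += links / (k * (k - 1))
    return total / len(adj)
```

On a triangle every node has coefficient 1, so the ACC is 1.0; on a three-node path no two neighbours are connected and the ACC is 0.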
5.2 Modified algorithm
To incorporate these utility measurements, we need to modify the proposed algorithm. The only point to modify is the greedy edge addition step. In the original version, we find candidate node pairs that share at least one common label in their RLBs and have no edge between them. Then, we take the one with the largest RLB size. In the modified algorithm, on the other hand, we compute the utility values before and after edge addition for each candidate. Then, we choose the one with the highest utility gain.
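The modified greedy step can be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: `adj` maps each node to its neighbour set, `rlb` maps each node to a `collections.Counter` holding its RLB labels, and `utility` is a hypothetical caller-supplied function that scores a graph so that higher is better (for example, the negated distance between a metric's current value and its value in the original graph).

```python
from collections import Counter
from itertools import combinations


def candidate_edges(adj, rlb):
    """Yield non-adjacent node pairs whose RLBs share at least one label."""
    for u, v in combinations(sorted(adj), 2):
        # Counter & Counter keeps only the labels common to both bags.
        if v not in adj[u] and (rlb[u] & rlb[v]):
            yield u, v


def best_edge_by_utility(adj, rlb, utility):
    """Return (edge, gain) for the candidate edge with the highest utility gain."""
    before = utility(adj)
    best, best_gain = None, float("-inf")
    for u, v in candidate_edges(adj, rlb):
        adj[u].add(v)
        adj[v].add(u)          # tentatively add the edge
        gain = utility(adj) - before
        adj[u].discard(v)
        adj[v].discard(u)      # undo, leaving the graph unchanged
        if gain > best_gain:
            best, best_gain = (u, v), gain
    return best, best_gain
```

Each candidate requires one evaluation of `utility` on the whole graph, so this step is considerably more expensive than selecting by RLB size when the metric itself is costly, as ASPD is.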
6. EXPERIMENTS

6.1 Experimental environment
We conduct a series of experiments to evaluate the efficiency and effectiveness of the proposed algorithm. The experimental environment is a PC (Intel Core 2 Duo 2.26 GHz CPU, 4 GB memory), and the program is written in C and compiled with gcc 4.2.

6.2 Experimental dataset
The experimental datasets are shown in Table 1. We use synthetic data to test the proposed algorithm in a controlled setting, and we also test it on several real datasets.
6.3 Experimental results
Because social networks are well modeled by small world graphs, we generate a Small World Graph with 500 nodes as the synthetic data. Figure 5 shows the results when varying k (left) and when varying the number of labels L (right). The vertical axis (cost) is the number of edges added to anonymize the graph. We compare the feature-based grouping (FB) and the clustering-based grouping using different distance metrics (C1 to C3). We can observe that the total cost increases with both k and L. In Figure 5 (left), where k is varied (L = 3), the clustering-based method, in particular the one with distance metric 3 (C3), outperforms the others when k is small. In the meantime, the feature-based grouping (FB) becomes better when k is large. Likewise, when varying the number of labels (k = 5; Figure 5, right), C3 outperforms the others.

Figure 5: Experimental results: Synthetic, varying k (left; k = 5, 10, 20, 50) and varying L (right; L = 3, 4, 5, 6). Vertical axis: cost; series: FB, C1, C2, C3.

Table 2 shows the cost for anonymization using Real 1, which contains 551 nodes, 8,368 edges, and 2 kinds of labels. The max degree of this graph is 22 and the average degree is approximately 15. The result is similar to the previous one in the sense that the clustering-based algorithms generally outperform the feature-based grouping when k is small, whereas the feature-based algorithm performs best when k is large.

Table 2: Experimental result: Real 1 (cost, i.e., number of added edges).
k     Feature   Clustering-1   Clustering-2   Clustering-3
5     169       112            117            131
10    398       309            312            291
20    664       567            526            540
50    1259      1325           2239           1820

Figure 6 (left) compares three real datasets (Real 1, 2, and 3) with different k using feature-based grouping. More precisely, Real 2 consists of 16,726 nodes and 47,594 edges with 3 kinds of labels. Its average and max degree are 5.09 and 107, respectively. Real 3 contains 36,692 nodes and 367,662 edges. Notice that the labels are randomly generated. We can observe that the total cost is quite different depending on the dataset, because the graph size is different. Figure 6 (right) shows the time breakdown for each algorithm applied to Real 1. We can observe that the time for greedy edge addition is relatively small compared to that for grouping. As a consequence, from the viewpoint of running time, the feature-based grouping is more efficient than the clustering-based methods.

Figure 6: Experimental results: cost for different datasets (left) and time breakdown (right). Left: cost (logarithmic axis) versus k = 5, 10, 20, 50 for real1, real2, and real3. Right: time for edge addition and grouping for FB, C1, C2, and C3.

Table 3 compares the results of the utility-based methods using the Real 1 dataset. The row labeled "Original" shows the utilities for the k-anonymized data generated by the baseline method, where utility is not taken into account. The rows below show the respective utility values computed from the graphs anonymized considering the corresponding utility metrics. The results show that, with the proposed method, the values of ASPD (average shortest-path distance) and ACC (average clustering coefficient) changed (3.2840 to 2.7787 for ASPD and 0.0455 to 0.0051 for ACC) and became closer to the values in the non-anonymized graph. Unfortunately, the EMD-based methods (EMDD and EMDL) did not work well, and the values did not change. This is due to the fact that the number of labels was small. We plan to evaluate these metrics using larger datasets in the future.

Table 3: Experimental results: utility-based methods (Real 1).
                   Utility of anonymized graph
Utility    Cost    EMDD     EMDL     ACC      ASPD
Original   169     0.0198   0.0074   0.0455   3.2840
EMDD       169     0.0198   0.0074   0.0431   3.2954
EMDL       169     0.0198   0.0074   0.0431   3.2954
ACC        169     0.0198   0.0074   0.0051   2.5269
ASPD       169     0.0198   0.0074   0.0192   2.7787

7. CONCLUSION
In this paper, we discussed the k-anonymity problem in privacy-preserving graph data publishing. This is an extension from the unlabeled model and can be applied to many real-world situations. We provide a heuristic algorithm based on the label-bag model and realize it in two different ways. We evaluate the proposed scheme through experiments on both synthetic and real data. The results show it to be efficient and of good utility. We also investigate how the choice of parameters influences the cost. An improved method considering four utility metrics is proposed and is shown through experimental results to be of good utility. In the future, we plan to extend our model; utilizing edge deletion and label generalization operations are other interesting topics.

8. ACKNOWLEDGMENTS
This research was partly supported by the program Research and Development on Real World Big Data Integration and Analysis of the Ministry of Education, Culture, Sports, Science and Technology, Japan.

9. REFERENCES
[1] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th international conference on World Wide Web, WWW ’07, pages 181–190, New York, NY, USA, 2007. ACM.
[2] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. Proc. VLDB Endow., 2(1):766–777, Aug. 2009.
[3] J. Cheng, A. W.-c. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD ’10, pages 459–470, New York, NY, USA, 2010. ACM.
[4] S. Chester, B. M. Kapron, G. Ramesh, G. Srivastava, A. Thomo, and S. Venkatesh. k-anonymization of social networks by vertex addition. In ADBIS (2), pages 107–116, 2011.
[5] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2nd edition, 2001.
[6] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag.
[7] FlowingData. http://flowingdata.com.
[8] B. C. M. Fung, Y. Jin, and J. Li. Preserving privacy and frequent sharing patterns for social network data publishing. In Proc. ASONAM 2013, pages 479–485, 2013.
[9] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv., 42(4):14:1–14:53, June 2010.
[10] M. Hay, G. Miklau, D. Jensen, D. Towsley, and C. Li. Resisting structural re-identification in anonymized social networks. The VLDB Journal, 19(6):797–823, Dec. 2010.
[11] B. M. Kapron, G. Srivastava, and S. Venkatesh. Social network anonymization via edge addition. In Proc. ASONAM 2011, pages 155–162, 2011.
[12] A. Korolova, R. Motwani, S. U. Nabar, and Y. Xu. Link privacy in social networks. In ICDE, pages 1355–1357, 2008.
[13] J. Leskovec. Enron email network. http://snap.stanford.edu/data/email-Enron.html.
[14] K. Liu and E. Terzi. Towards identity anonymization on graphs. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD ’08, pages 93–106, New York, NY, USA, 2008. ACM.
[15] D. Müllner. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(9):1–18, 5 2013.
[16] NetworkX. http://networkx.lanl.gov/index.html.
[17] M. E. J. Newman. Scientific collaboration networks. ii. shortest paths, weighted networks, and centrality. Phys. Rev. E, 64:016132, Jun 2001.
[18] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.
[19] L. Sweeney. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 10(5):557–570, Oct. 2002.
[20] C.-H. Tai, P. S. Yu, D.-N. Yang, and M.-S. Chen. Privacy-preserving social network publication against friendship attacks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’11, pages 1262–1270, New York, NY, USA, 2011. ACM.
[21] B. Thompson and D. Yao. The union-split algorithm and cluster-based anonymization of social networks. In Proceedings of the 4th International Symposium on Information, Computer, and Communications Security, ASIACCS ’09, pages 218–227, New York, NY, USA, 2009. ACM.
[22] B. K. Tripathy and G. K. Panda. A new approach to manage security against neighborhood attacks in social networks. In ASONAM, pages 264–269, 2010.
[23] W. Wu, Y. Xiao, W. Wang, Z. He, and Z. Wang. k-symmetry model for identity anonymization in social networks. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT ’10, pages 111–122, New York, NY, USA, 2010. ACM.
[24] M. Yuan, L. Chen, and P. S. Yu. Personalized privacy protection in social networks. Proc. VLDB Endow., 4(2):141–150, Nov. 2010.
[25] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In Proc. ICDE 2008, pages 506–515, 2008.
[26] L. Zou, L. Chen, and M. T. Özsu. k-automorphism: a general framework for privacy preserving network publication. Proc. VLDB Endow., 2(1):946–957, Aug. 2009.