The VLDB Journal (2013) 22:543–559 DOI 10.1007/s00778-012-0304-8
REGULAR PAPER
Outsourcing shortest distance computing with privacy protection Jun Gao · Jeffrey Xu Yu · Ruoming Jin · Jiashuai Zhou · Tengjiao Wang · Dongqing Yang
Received: 16 April 2012 / Revised: 16 December 2012 / Accepted: 21 December 2012 / Published online: 18 January 2013 © Springer-Verlag Berlin Heidelberg 2013
Abstract With the advent of cloud computing, it has become desirable to outsource graphs to cloud servers so that complex operations can be performed efficiently without compromising sensitive information. In this paper, we take shortest distance computation as a case study to investigate the technical issues in outsourcing graph operations. We first propose a parameter-free, edge-based 2-HOP delegation security model (2-HOP delegation model for short), which can greatly reduce the chances of the structural pattern attack and the graph reconstruction attack. We then transform the original graph into a link graph $G_l$ kept locally and a set of outsourced graphs $\mathcal{G}_o$. Our objectives include (i) ensuring that each outsourced graph meets the requirement of the 2-HOP delegation model, (ii) allowing shortest distance queries to be answered using $G_l$ and $\mathcal{G}_o$, and (iii) minimizing the space cost of $G_l$. We devise a greedy method to produce $G_l$ and $\mathcal{G}_o$, which can exactly answer shortest distance queries. We also develop an efficient transformation method to support approximate shortest distance answering under a given average additive error bound. The experimental results illustrate the effectiveness and efficiency of our method.
Keywords Outsource · Shortest distance · Security · Transformation
J. Gao (B) · J. Zhou · T. Wang · D. Yang: Key Laboratory of High Confidence Software Technologies, EECS, Peking University, Beijing, China
J. X. Yu: Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Shatin, Hong Kong
R. Jin: Department of Computer Science, Kent University, Kent, OH, USA
1 Introduction
The rapid growth of graphs raises big challenges for the database community. Nowadays, graph-structured data are used in numerous applications (e.g., web graphs, social networks, biological and chemical pathways, transportation networks). High efficiency of graph operations is essential to these applications. However, even primitive operations on a graph can be very time-consuming due to the complexity of structural connectivity and the graph size [8]. Moreover, real graph datasets are growing rapidly in size, which makes high efficiency even harder to attain. This paper studies shortest distance computation (or querying) on a large edge-weighted graph. The shortest distance query is a key operation in many applications. In addition, it has more expressive power than the reachability query [4] and can be used as a primitive component of other complex graph operations, including the distance join [27] and the graph pattern query [10]. Classical methods, such as Dijkstra's algorithm [8], find the shortest distance in $O(n^2)$ time on general graphs or $O(m + n \log n)$ time on sparse graphs, where n and m are the numbers of nodes and edges, respectively. Such a high computational cost makes shortest distance computation a bottleneck in many applications.
The paradigm shift to cloud computing offers an attractive model for managing large graph data. An outsourced server (or cloud server) typically has sufficient storage and computational capability. We can store graph data on the cloud server and rely on its powerful distributed and parallel architecture to process complex graph operations economically. In addition, some expensive operations, such as building a 2-HOP-based distance index [4], can be done on the cloud server. However, we must not directly outsource the original graph to a cloud server when it contains sensitive information, such as the identifiers of nodes or the edges between two specific nodes. The cloud server cannot be assumed to be fully trusted. Once a part of the original graph is re-identified from the outsourced graph, sensitive information may be leaked. For example, the sensitive relationship between two re-identified nodes can be revealed. In addition, the outsourced server may yield untrusted answers related to the re-identified nodes. For instance, once a node u is identified by an outsourced server, the server may produce biased answers related to u [25]. An original graph therefore has to be transformed before being outsourced to a cloud server. A straightforward way is to remove node identifiers from the original graph. However, as many works show [1,15], this is not sufficient to protect the original graph. Attackers can still re-identify the target sub-part in the outsourced graph if they have learned the structure of the (partial) original graph. Even worse, the original graph might be reconstructed from the outsourced one, since the outsourced graph is expected to preserve certain properties, which restricts the modifications that can be made to the graph. It is a challenging task to strike a balance between the utility and the security of outsourced graphs.
On the one hand, graph transformation is inevitable to achieve the desired security requirement [2,19,26,28]. On the other hand, even the most basic modifications, such as the addition or removal of one node or edge, may impact a large number of shortest distances. In addition, most existing works handle unweighted graphs, whereas the edge weights in a transformed graph can provide more clues to attackers for re-identification. We also notice that there exist studies on graph transformation with feature preservation [7,24]. However, they either cannot provide an explicit security guarantee or cannot support exact shortest distance computation. In our previous work [11], we proposed a method to transform an original graph into a set of outsourced graphs, each of which meets the requirement of a security model named 1-neighborhood-d-radius. When an attacker has local knowledge of nodes' neighbors or of paths with length less than d, the outsourced graph can defend against the structural attack. However, it is difficult to choose the right value for d. A smaller d leads to security leakage when attackers have
more knowledge, while a larger d reduces the utility of outsourced graphs, since fewer nodes are allowed in one outsourced graph. In this paper, we introduce a stronger, parameter-free security model, which can greatly lower the chances of structural pattern attacks even if attackers know the global structure of the entire original graph. We then re-design the graph transformation methods to fit the new security model. Our contributions can be summarized as follows:
– We propose a new security model, named the edge-based 2-HOP delegation model, for outsourced graphs. We show that the 2-HOP delegation model can significantly reduce the chances of the structural pattern attack and the reconstruction attack.
– We formulate the graph transformation problem and propose a greedy method for it. Given an original graph G, we transform G into a link graph $G_l$ kept on the client side and a set of outsourced graphs $\mathcal{G}_o$. Our objective is to minimize the size of the link graph $G_l$ on the condition that each outsourced graph in $\mathcal{G}_o$ meets the requirement of the 2-HOP delegation model and shortest distances can be answered using $\mathcal{G}_o$ and $G_l$.
– We study how to answer approximate shortest distances in the same context with an average additive distance error bound. By allowing approximate shortest distances to be answered, we show that the original graph can be transformed efficiently.
– We conduct extensive experiments on both real and synthetic datasets. The results show that the client side achieves a significant cost saving in shortest distance computation with the aid of a cloud server. In addition, the transformation performance can be further improved when it is used to answer approximate shortest distances within a specified error bound.
The remainder of this paper is organized as follows: In Sect. 2, we present the security model and formulate our transformation problem. Sections 3 and 4 design methods to transform graphs with exact and approximate distance answering, respectively. Section 5 reports experimental results. Section 6 reviews the related work. Section 7 concludes the paper.
2 Security model and problem formulation
In this section, we first review graph-related notations. Then, we show the framework of graph outsourcing and discuss the security issues of outsourced graphs. Next, we give candidate graph transformation methods and present our new security model. Finally, we formulate our problem.
Fig. 1 Sample graph: (a) Original graph, (b) Weighted query pattern, (c) Unweighted query pattern

Table 1 Symbols
  $G(V, E)$            An original graph
  $G'(V', E')$         A transformed graph
  $G_l(V_l, E_l)$      A link graph
  $G_o(V_o, E_o)$      An outsourced graph
  $\mathcal{G}_o$      Outsourced graphs
  $n/m$                The number of nodes/edges in G
  $n_o$                The number of nodes in $G_o$
  $n_l$                The number of nodes in cluster
2.1 Notation
Let $G = (V, E)$ be an edge-weighted undirected graph, where V and E are its node set and edge set, respectively. Each edge $e \in E$ takes the form $e = (u, v)$ with $u, v \in V$. The weight of the edge e is denoted by $w(e)$ or $w(u, v)$. A path $u_0 \leadsto u_x$ is a sequence of edges $(u_0, u_1)(u_1, u_2) \ldots (u_{x-1}, u_x)$, where $u_i \in V$ $(0 \le i \le x)$. $edges(p)$ denotes the number of edges in a path p. The cost of a path p is the sum of the edge weights in p, denoted by $len(p)$. A shortest distance query in a graph G computes the minimal cost of any path from u to v in G, which is denoted by $\delta_G(u, v)$. Figure 1a shows a sample graph. The letter inside a circle represents the node's identifier, and the number annotated on an edge is its weight. The symbols used throughout the paper are listed in Table 1.
2.2 Framework of graph outsourcing
We present the overall framework of outsourcing shortest distance computation in Fig. 2. Given an original edge-weighted graph G, we transform G and obtain its transformed graph $G' = (V', E')$. We then distribute $G'$ into $G_o = (V_o, E_o)$ and $G_l = (V_l, E_l)$ with consideration of both security and utility, where $V' = V_o \cup V_l$ and $E' = E_o \cup E_l$. The first part, the outsourced graph(s) $G_o$, records the key information for online shortest distance answering, but does not contain any detailed information. The second part, a link graph $G_l$ on the local side, keeps the private information and the relationships between nodes in G and nodes in the outsourced graph(s) $G_o$. Any shortest distance query q is first rewritten into queries on the cloud server and queries on the local side, according to the transformation results. Then, the query results are combined to find the shortest distance.
Fig. 2 Framework of outsourcing shortest distance computation (the data owner transforms the original graph into outsourced graph(s) and a link graph; queries are rewritten and evaluated on the client side and on the outsourced server, which is exposed to the attacker)
To ensure that the transformed graph $G'$ can be used in shortest distance answering, the graph transformation method should have the following property:
Definition 1 (Shortest distance preserving graph transformation) Let $G = (V, E)$ be an original graph. A method transforms G into $G' = (V', E')$, in which each $u \in V$ is mapped into $M(u) \in V'$ by a node transformation mapping M. The transformation is called shortest distance preserving if, for any two nodes $u, v \in V$, $\delta_G(u, v) = \delta_{G'}(M(u), M(v))$. Here, $M(u) \in V'$ and $M(v) \in V'$ are the transformed nodes for u and v, respectively. $G'$ is a shortest distance equivalent graph to G.
2.3 Attacks on outsourced graphs
Below, we discuss different kinds of attacks on outsourced graphs when the attackers have knowledge of node identifiers, the local (or global) graph structure, and the transformation rules. The node identifier attack is the most straightforward. When attackers know the identifiers of target nodes, and the outsourced graph contains these node identifiers, it is easy for the attackers to locate the target nodes in the outsourced graph. The solution to such an attack is also simple. We can remove node identifiers before the graph is outsourced.
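As an illustration of Definition 1 in the simplest setting, the sketch below treats identifier removal as a random relabeling M and checks that it is shortest distance preserving on a toy graph. The function names, the adjacency-dictionary layout, and the toy graph are assumptions made for this example only, and relabeling alone does not resist the structural attacks discussed next.

```python
import heapq
import random

def dijkstra(adj, source):
    """Shortest distances from source; adj maps node -> {neighbor: weight}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def anonymize(adj, seed=0):
    """Replace node identifiers by shuffled integers; return (graph, mapping M)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    ids = list(range(len(nodes)))
    rng.shuffle(ids)
    mapping = dict(zip(nodes, ids))
    relabeled = {
        mapping[u]: {mapping[v]: w for v, w in nbrs.items()}
        for u, nbrs in adj.items()
    }
    return relabeled, mapping

def is_distance_preserving(adj, transformed, mapping):
    """Check delta_G(u, v) == delta_G'(M(u), M(v)) for every node pair."""
    for u in adj:
        d_orig = dijkstra(adj, u)
        d_new = dijkstra(transformed, mapping[u])
        for v in adj:
            if d_orig.get(v) != d_new.get(mapping[v]):
                return False
    return True

# Hypothetical toy graph (not the graph of Fig. 1).
G = {"a": {"b": 1, "c": 4}, "b": {"a": 1, "c": 2}, "c": {"a": 4, "b": 2}}
G_prime, M = anonymize(G)
print(is_distance_preserving(G, G_prime, M))
```

The mapping M plays the role of the information that stays on the client side, while only the relabeled graph would leave it.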
The structural pattern attack is related to the structural pattern query [10,27]. Intuitively, a structural pattern query finds matched sub-graphs in the original data graph. We illustrate a weighted pattern query in Fig. 1b and an unweighted one in Fig. 1c. We can see that the weighted pattern query is
more distinguishing than the unweighted one. For example, the results of the unweighted pattern in Fig. 1c contain 32 sub-graphs, while the results of the weighted pattern contain only 1 sub-graph, with nodes {i, l, v, p}. This paper focuses on the weighted pattern query, which is formally defined as follows:
Definition 2 (Structural pattern query) Given an original graph $G = (V, E)$, a structural pattern query $G_q = (V_q, E_q)$ finds a set of sub-graphs, $\{G_s\}$, in G. For each $G_s$, there exists a node mapping M from $G_q$ to $G_s$, under which for any edge $(u, v) \in G_q$ there exists an edge $(M(u), M(v)) \in G_s$ with the weight $w(M(u), M(v)) = w(u, v)$.
The structural pattern attack is launched when an attacker can compose a structural pattern query using his knowledge of the (partial) original graph. The attacker evaluates the query on the outsourced graph, attempts to re-identify the nodes from the query results [1,2,15,26,28], and infers other hidden information. Formally, the structural pattern attack can be described as:
Definition 3 (Structural pattern attack) Let G be the original graph, $G'$ be a transformed graph with the node transformation mapping M, a structural pattern query Q be any sub-graph in G, $M(Q)$ be the mapped sub-graph for Q in $G'$, and R be the results of the structural pattern query Q against $G'$. The structural pattern attack has a probability $prob(Q) = p_e \cdot p_d$ to infer the relationship between Q and $M(Q)$ from R, where
1. $p_e$ is the existing probability, which is the chance that $M(Q)$ is in R;
2. $p_d$ is the distinguishing probability, which is $1/|R|$.
Thus, a successful structural pattern attack depends on two factors. First, for an original sub-graph Q, its mapped sub-graph should be in R (the results of the structural pattern query Q against $G'$). Otherwise, the relationships inferred from R are useless. Second, attackers want R to be as small as possible, so that the distinguishing probability $1/|R|$ is larger.
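To make Definitions 2 and 3 concrete, the following sketch evaluates a structural pattern query by backtracking and reports $|R|$, from which the distinguishing probability $1/|R|$ follows. It assumes an injective node mapping and a symmetric adjacency-dictionary layout; the function names and the 4-cycle toy graph are hypothetical and are not taken from Fig. 1.

```python
def pattern_matches(graph, pattern):
    """Enumerate injective node mappings M of pattern into graph such that
    every pattern edge (u, v) maps to a graph edge with the same weight."""
    p_nodes = list(pattern)
    results = []

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            results.append(dict(mapping))
            return
        q = p_nodes[len(mapping)]
        for cand in graph:
            if cand in mapping.values():
                continue  # keep the mapping injective
            # Every already-mapped pattern neighbor must be matched by an
            # edge of equal weight (a missing edge yields None).
            if all(graph[cand].get(mapping[r]) == w
                   for r, w in pattern[q].items() if r in mapping):
                mapping[q] = cand
                extend(mapping)
                del mapping[q]

    extend({})
    return results

def matched_subgraphs(graph, pattern):
    """The result set R: distinct matched sub-graphs, as sets of edges."""
    return {
        frozenset(frozenset((m[u], m[v])) for u in pattern for v in pattern[u])
        for m in pattern_matches(graph, pattern)
    }

def drop_weights(g):
    """Unweighted view of a graph: every edge gets weight 1."""
    return {u: {v: 1 for v in nbrs} for u, nbrs in g.items()}

# Hypothetical 4-cycle with distinct weights, and a two-edge path pattern.
G = {"a": {"b": 1, "d": 4}, "b": {"a": 1, "c": 2},
     "c": {"b": 2, "d": 3}, "d": {"c": 3, "a": 4}}
Q = {"x": {"y": 1}, "y": {"x": 1, "z": 2}, "z": {"y": 2}}

weighted_R = matched_subgraphs(G, Q)
unweighted_R = matched_subgraphs(drop_weights(G), drop_weights(Q))
print(len(weighted_R), len(unweighted_R))
```

On this toy input the weighted pattern has a single match while its unweighted version has four, which mirrors the observation that weights make a pattern more distinguishing.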
Most existing graph publishing techniques [1,2,15,26,28] ensure that the distinguishing probability is less than a threshold 1/k by anonymizing graphs so that each node is locally or globally isomorphic to at least k other nodes. The reconstruction attack can be made when both the graph structure and the transformation rules used for outsourcing are known to attackers. Since certain properties are expected to be preserved in graph outsourcing, the graph transformations always follow some patterns, so there remains a chance that attackers may recover the (partial) original graph directly from the transformed graph. Once the original graph is recovered, the target part can be easily re-identified in the outsourced graph by the structural pattern attack.
Fig. 3 Candidate graph transformation approaches: (a) 4-isomorphism, (b) Hiding edges
2.4 Candidate graph transformation techniques
In the following, we show a variety of graph transformation methods and analyze the security issues of the transformed graphs. We can see the challenges in striking a balance between the utility and the security of transformed edge-weighted graphs.
2.4.1 k-isomorphism approach
k-isomorphism was proposed recently to protect published graphs from the structural pattern attack [2]. Unlike other models such as 1-neighborhood [26] and k-degree [19], which assume that attackers have local knowledge of the original graph, the k-isomorphism approach can counter structural pattern attacks even if the attackers have global knowledge of the graph. We illustrate a 4-isomorphism graph in Fig. 3a, which is transformed from the original graph in Fig. 1. A node u is mapped into $u'$ in the transformed graph. We can see that the original graph is split into 4 disjoint sub-graphs. Even if the attackers have global knowledge of the original graph, the distinguishing probability is at most 1/4 in the structural pattern attack, so the attackers have at most a 1/4 chance of re-identifying the target part in the transformed graph. The k-isomorphism approach has several limitations in graph outsourcing. First, the transformed graph cannot be used to answer shortest distance queries. Second, the existing methods only consider unweighted graphs. When the edges have weights, the graph requires more anonymization before outsourcing, which further lowers the utility of outsourced graphs. Finally, it is not trivial to choose the right value for k.
2.4.2 Basic utility preserving transformation approach
We observe that the techniques above achieve structural anonymization via arbitrary edge addition and removal operations, which inevitably affect shortest distance computation. Another viewpoint in graph transformation is to consider the utility of transformed graphs first.
One simple graph transformation is to compute the shortest distance $\delta_G(u, v)$ between any node pair (u, v) in an original graph G, build a direct edge between u and v with the weight $\delta_G(u, v)$, and outsource the entire transformed graph. Although the transformed graph preserves shortest distances, it is easy to see that such a simple adding-edge strategy cannot counter the structural pattern attack. An extension of the simple method is to keep the direct edges of the original graph $G = (V, E)$ on the client side, and outsource the graph with the newly added edges only, that is, each edge $e = (u, v)$ with the weight $\delta_G(u, v)$ where $e \notin E$. We illustrate a partial outsourced graph in Fig. 3b. The original edge (e, k) is kept on the client side, and we outsource the graph with the relationships from other nodes to e and k. Since there are no original edges in the outsourced graph, no mapped original edge can be found in the transformed graph, and the existing probability in the structural pattern attack is 0. However, this hiding-edge approach still suffers from security issues. First, if the attackers know the transformation rules, they can apply the same rules to transform the structural pattern query. The outsourced graph is then still exposed to the structural pattern attack. Second, there is a high chance that the attackers can reconstruct the original graph from the outsourced one. Based on the triangle inequality over $G_o$, the attackers can even infer that the weight of the edge (e, k) is likely to be 4, since $|w(x, e) - w(x, k)| = 4$ for most nodes {x} in the outsourced graph.
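The inference described above can be reproduced on a small example. The sketch below builds the hiding-edge outsourced graph (only non-original node pairs, weighted by shortest distance) and then guesses a hidden edge weight as the most frequent value of $|w(x, e) - w(x, k)|$. The helper names and the toy tree are assumptions for illustration and do not correspond to Fig. 3b.

```python
from collections import Counter
from itertools import combinations

INF = float("inf")

def all_pairs(adj):
    """Floyd-Warshall all-pairs shortest distances on a small graph."""
    nodes = list(adj)
    dist = {u: {v: INF for v in nodes} for u in nodes}
    for u in nodes:
        dist[u][u] = 0
        for v, w in adj[u].items():
            dist[u][v] = w
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def hide_edges(adj):
    """Outsource only node pairs that are NOT original edges,
    weighted by their shortest distance in the original graph."""
    dist = all_pairs(adj)
    out = {u: {} for u in adj}
    for u, v in combinations(adj, 2):
        if v not in adj[u] and dist[u][v] < INF:
            out[u][v] = out[v][u] = dist[u][v]
    return out

def guess_hidden_weight(outsourced, u, v):
    """Attacker's guess for w(u, v): the most common |w(x, u) - w(x, v)|
    over nodes x adjacent to both u and v in the outsourced graph."""
    diffs = Counter(
        abs(nbrs[u] - nbrs[v])
        for nbrs in outsourced.values()
        if u in nbrs and v in nbrs
    )
    return diffs.most_common(1)[0][0] if diffs else None

# Hypothetical tree whose hidden original edge (e, k) has weight 4.
G = {"e": {"k": 4, "a": 2}, "k": {"e": 4, "c": 1},
     "a": {"e": 2, "b": 3}, "b": {"a": 3},
     "c": {"k": 1, "d": 5}, "d": {"c": 5}}
G_o = hide_edges(G)
print(guess_hidden_weight(G_o, "e", "k"))
```

Here the edge (e, k) never appears in the outsourced graph, yet the nodes b and d both reveal the difference 4, which is exactly the hidden weight.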
2.4.3 d-radius approach
In the previous work, we proposed a security model named 1-neighborhood-d-radius graph [11]. 1-neighborhood means that an original edge cannot appear in outsourced graphs. d-radius requires that, for any node pair (u, v), u and v must not co-exist in the same outsourced graph if $\delta_G(u, v) < d$. Figure 4 shows a 2-radius graph of the original graph in Fig. 1. Take the nodes b and h as an example: they are not connected directly in the original graph, and their shortest distance is no smaller than 2. The d-radius approach provides strong protection when attackers only know paths with length at most d. Let Q be a sub-graph of the original graph whose longest simple path is shorter than d, and let M be the node transformation mapping. Since two nodes in the original graph with distance smaller than d are not allowed to appear in the same outsourced graph $G_o$, at most one node in Q is selected and mapped into $G_o$. Thus, even if an attacker can compose a structural pattern query $G_q$ from Q and evaluate $G_q$ against $G_o$, the mapped sub-graph $M(Q)$ is not in the query evaluation results. In other words, the existing probability of the structural pattern attack is 0. The main problem of the d-radius model lies in the difficulty of choosing the right value for d. A larger d, although strengthening the security of outsourced graphs, degrades their utility, since fewer nodes are allowed in one outsourced graph. In contrast, a smaller d may lead to security leakage of a d-radius graph. When attackers know a path longer than d in the original graph, they can apply the same transformation to the query pattern, and the outsourced graph is still exposed to the structural pattern attack. For example, the attackers can link the query nodes w and y directly to get another structural pattern query $Q'$ in Fig. 4c, and the evaluation results of $Q'$ include an induced sub-graph with nodes {v, i}, which is the transformed result of Q. Thus, there remain chances of the structural pattern attack.
Fig. 4 d-radius approach: (a) 2-radius graph, (b) Structural pattern, (c) Transformed structural pattern
2.4.4 Delegation node approach
Another strategy to improve the security of an outsourced graph is to break the transformation mapping M between the original graph and the outsourced graph. That is, for any node u in the original graph G, the transformed node $M(u)$ is kept only in the link graph. Thus, for any sub-graph Q in G, $M(Q)$ will not be in the results of the structural pattern query Q against the outsourced graph. To achieve this goal, we can generate delegation nodes on edges and only outsource these delegation nodes.
Definition 4 (Delegation node) Let $e = (u, v)$ be an edge in a graph, $w(e)$ be its weight, g be a parameter to control granularity, and $f_g(w(e))$ be a function that produces a random integer value in the range $[0, g\,w(e)]$. The delegation node $e'$ splits e into two sub-edges with $w(u, e') = f_g(w(e))/g$ and $w(e', v) = w(e) - w(u, e')$.
We annotate each edge e with one delegation node $e'$, which divides e into 2 sub-edges. The granularity of edge splitting relies on the parameter g. A larger g indicates a broader range. We have in total $g\,w(e) + 1$ candidate delegation nodes on the edge e, since we select a random integer value from the range $[0, g\,w(e)]$. For example, for an
edge $e = (u, v)$ with $w(e) = 1$ and $g = 4$, the candidate delegation nodes may be located on e at distances 0, 0.25, 0.5, 0.75, and 1 from u. We can choose different values of g for different edges. For brevity, we use $g = 1$ on all edges by default. The delegation node approach can be described as follows: for each edge $e = (u, v)$ in the original graph $G = (V, E)$, we add one delegation node $e'$ on e. For any two edges $e_0 = (u_0, u_1)$ and $e_1 = (v_0, v_1)$, we add one edge between $e'_0$ and $e'_1$ with the weight $\delta_G(e'_0, e'_1)$. The link graph on the client side records the relationships between original nodes and delegation nodes, and the outsourced graph keeps the delegation nodes and the edges among them. The outsourced graph containing delegation nodes is called the delegation outsourced graph in the following. Figure 5a shows the delegation nodes, in gray, added on edges, and Fig. 5b shows that the delegation nodes, instead of the original nodes, are selected into the delegation outsourced graph. We now discuss the structural pattern attack on the delegation graph. Suppose attackers know any sub-graph $G_s$ of the original graph and the transformation rules. The attackers can enumerate delegation nodes on the edges in $G_s$, compose structural pattern queries from the delegation nodes, and evaluate the queries against the outsourced graphs. We analyze the probability of the structural pattern attack below.
Theorem 1 Let $f_g(w)$ be the random function for delegation nodes. The probability of a structural pattern attack on a delegation graph is $\frac{1}{(w(e_0)g+1)(w(e_1)g+1)}$ in the worst case, where $e_0$ and $e_1$ are the two edges with the smallest weights in the original graph G.
Proof Let $Q = (V_Q, E_Q)$ be any sub-graph of G known to an attacker. The attacker enumerates delegation nodes on the edges in Q and composes possible structural pattern queries. Each composed structural pattern query $Q^*$ contains at least two delegation nodes.
We first study the case when there are 2 nodes $e_0^*$ and $e_1^*$ in $Q^*$. The number of candidate structural pattern queries is $(w(e_0)g + 1)(w(e_1)g + 1)$, since there are $w(e)g + 1$ candidate delegation nodes on an edge e, where $w(e)$ is the weight of e. Thus, the existing probability $p_e$ for $Q^*$ is $\frac{1}{(w(e_0)g+1)(w(e_1)g+1)}$. $p_e$ is maximal when $w(e_0)$ and $w(e_1)$ are the two smallest weights in G, since Q can be any sub-graph in G. When there are more than 2 nodes in $Q^*$, $p_e$ is lowered further. Therefore, $\frac{1}{(w(e_0)g+1)(w(e_1)g+1)}$ is the maximal existing probability, which is also the maximal probability of structural pattern attacks, since the distinguishing probability $p_d$ is no more than 1.
From Theorem 1, we can see that the chances of structural pattern attacks can be reduced significantly with the delegation node approach when g in the function $f_g(w)$ becomes large or when g is assigned randomly for each edge.
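Definition 4 and the bound of Theorem 1 can be sketched in a few lines. The code below assumes integer edge weights and a uniform draw for $f_g$ (the definition only requires a random integer in $[0, g\,w(e)]$); the function names are introduced here for illustration.

```python
import random
from fractions import Fraction

def candidate_offsets(weight, g=1):
    """All g*w(e) + 1 candidate distances from u to a delegation node on e."""
    return [Fraction(i, g) for i in range(g * weight + 1)]

def place_delegation_node(weight, g=1, rng=random):
    """Split an edge (u, v): return (w(u, e'), w(e', v)).
    Assumption: f_g draws uniformly from the integers in [0, g*w(e)]."""
    to_u = Fraction(rng.randint(0, g * weight), g)
    return to_u, weight - to_u

def worst_case_attack_probability(weights, g=1):
    """Theorem 1 bound: 1 / ((w(e0)*g + 1) * (w(e1)*g + 1)),
    where e0 and e1 are the two lightest edges."""
    w0, w1 = sorted(weights)[:2]
    return Fraction(1, (w0 * g + 1) * (w1 * g + 1))

# The w(e) = 1, g = 4 case from the text: offsets 0, 1/4, 1/2, 3/4, 1.
print(candidate_offsets(1, g=4))
# Hypothetical edge weights; a larger g lowers the worst-case bound.
print(worst_case_attack_probability([3, 1, 2, 6], g=1))
print(worst_case_attack_probability([3, 1, 2, 6], g=4))
```

Exact fractions are used instead of floats so that the two sub-edge weights always sum to the original weight.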
Fig. 5 Delegation node approach: (a) Original partial graph, (b) Outsourced graph
However, the delegation node approach cannot resist the reconstruction attack. Consider again the example in Fig. 5. For any original node u with degree d, its d related delegation nodes compose a complete sub-graph with $d(d - 1)/2$ edges in the outsourced graph, as illustrated in Fig. 5b. When attackers learn the transformation rules, they can first find a complete sub-graph $Q'$, and then build equations according to the edges in $Q'$. For example, they build the equation $x_0 + x_1 = w(b', e')$ for the edge $(b', e')$ in $Q'$, circled with a dotted line in Fig. 5b, where $x_0$ is the weight of the edge $(e', k)$ and $x_1$ is the weight of the edge $(k, b')$. In the end, they have $d(d - 1)/2$ equations with d variables in total. By solving these equations, the attackers can find the distances between original nodes and delegation nodes, such as the distance between $b'$ and k in Fig. 5a. In the worst case, the entire original graph can be recovered. We call such an attack the reconstruction attack with equations.
2.5 Security model: edge-based 2-HOP delegation graph
We introduce a new parameter-free security model named edge-based 2-HOP delegation (2-HOP delegation for short) to overcome the above limitations. The 2-HOP delegation model combines the advantages of the d-radius model and the delegation node approach. In the following, a graph meeting the requirement of the 2-HOP delegation model is termed a 2-HOP delegation graph. We give its formal definition below.
Definition 5 (2-HOP delegation graph) Let $G = (V, E)$ be an original graph. For any edge $e \in E$, we add one delegation node $e'$ on e. A 2-HOP delegation graph $G' = (V', E')$ meets the following requirements:
1. Any node $u' \in V'$ is a delegation node $e'$ on an original edge e. (Delegation)
2. For any two nodes $e'_0 \in V'$ and $e'_1 \in V'$, their corresponding edges $e_0$ and $e_1$ are not connected directly in G. (2-HOP)
3. For any edge $(e'_0, e'_1) \in E'$, its weight is $\delta_G(e'_0, e'_1)$. (Shortest Distance Preservation)
We illustrate a 2-HOP delegation graph in Fig. 6. For each edge in the original graph in Fig. 1, we randomly add its delegation node, filled with gray in Fig. 6a. The delegation nodes
may be the same as the ending nodes. A 2-HOP delegation graph is shown in Fig. 6b. We can see that the edges corresponding to the delegation nodes in Fig. 6b are not connected directly in the original graph in Fig. 1. When the shortest path between two delegation nodes $e'_0$ and $e'_2$ goes via another delegation node $e'_1$, the edge between $e'_0$ and $e'_2$ can be omitted.
Fig. 6 2-HOP delegation graph: (a) Original graph with delegation nodes, (b) 2-HOP delegation graph
We formally analyze the security strength of a 2-HOP delegation graph. The chance of the structural pattern attack on a 2-HOP delegation graph is the same as that on a delegation graph in Theorem 1, since only delegation nodes are allowed in a 2-HOP delegation graph. In addition, the reconstruction attack with equations fails on 2-HOP delegation graphs.
Theorem 2 A 2-HOP delegation graph can resist the reconstruction attack with equations.
Proof According to the construction rules of 2-HOP delegation graphs, for any edge $(e'_0, e'_2)$ in a 2-HOP delegation graph $G_o$, the original edges $e_0 = (u_0, u_1)$ and $e_2 = (u_2, u_3)$ are not connected directly in the original graph G, and there exists at least 1 edge between $e_0$ and $e_2$ in G. We begin with the case when there is one edge $e_1 = (u_1, u_2)$ between them. According to the construction rule, attackers build an equation $x_0 + x_1 + x_2 = w(e'_0, e'_2)$, where $x_0$ is the distance from $e'_0$ to $u_1$, $x_1$ is the weight of the edge $(u_1, u_2)$, and $x_2$ is the distance from $u_2$ to $e'_2$. Suppose there is another edge $(e'_0, \hat{e}'_2)$ in $G_o$. It is also related to 3 edges in the original graph: $e_0 = (u_0, u_1)$, the edge $\hat{e}_1 = (u_1, \hat{u}_2)$, and the edge $\hat{e}_2 = (\hat{u}_2, \hat{u}_3)$. It is easy to see that $\hat{e}_1$ is not the same as $e_1$. Otherwise, $e_2$ and $\hat{e}_2$ would share the node $u_2$ and be connected directly, so their delegation nodes could not co-exist in $G_o$. Thus, at least one new variable is introduced in the equation for $(e'_0, \hat{e}'_2)$. When there is more than 1 edge between the edges of two delegation nodes, even more variables appear, so the number of variables is larger than the number of equations. Therefore, the attackers cannot recover the exact values of all variables from these equations, and the reconstruction attack with equations fails.
We stress the differences between the 2-HOP delegation model (as well as the d-radius model) and the existing structural anonymization models, such as k-automorphism [28] and k-isomorphism [2]. Both kinds of models need structural adjustments to the original graph. However, the 2-HOP delegation model focuses on reducing the existing probability, while the structural anonymization models [2,28] attempt to lower the distinguishing probability.
2.6 Problem formulation
With the 2-HOP delegation security model, we transform an original edge-weighted graph G into a set of outsourced delegation graphs $\mathcal{G}_o = \{G_o^1, \ldots, G_o^{|\mathcal{G}_o|}\}$, which can be deployed on the cloud server, together with a link graph $G_l$ on the client side. An edge in $G_l$ takes the form $(u, e')$, which maintains the relationship between a node u in G and a delegation node $e'$ in an outsourced graph. The edge can also be expressed in the form $(u, G_o.e')$ to specify that the appearance of $e'$ is in the outsourced graph $G_o$. The weight of an edge $(u, e')$ in $G_l$, $w(u, e')$, is equal to $\delta_G(u, e')$, the shortest distance from u to $e'$ in G. We may use $e'$, $e$ and $G_o.e'$ interchangeably in the following. The shortest distance query can then be answered using the transformed graphs with Eq. 1. Suppose we have outsourced graphs $\mathcal{G}_o$ and a link graph $G_l$ whose union is a shortest distance equivalent graph to the original graph. The shortest distance query $q = (u, v)$ can be evaluated as follows. We rewrite q into multiple distance queries against the outsourced graphs $\mathcal{G}_o$. Specifically, we locate u.Edges for u's edges and v.Edges for v's edges in the link graph $G_l$. For each pair of edges $e_u = (u, e'_x) \in$ u.Edges and $e_v = (e'_y, v) \in$ v.Edges, a distance query from $e'_x$ to $e'_y$ is issued to $G_o$ if $e'_x$ and $e'_y$ are in the same outsourced graph $G_o \in \mathcal{G}_o$. We then combine the results from the outsourced server with the distance information $(w(u, e'_x), w(e'_y, v))$ in $G_l$ to find $len$, the minimal sum of distances. Since outsourced graphs only preserve shortest distances of paths containing more than 2 edges, we need a local breadth-first search to find whether u can reach v via at most 2 edges, during which the path length is recorded as $d_2(u, v)$ (if it exists). The shortest distance is then computed as the minimum of $len$ and $d_2(u, v)$.
\[
\begin{aligned}
len &= \min_{\substack{G_o = (V_o, E_o) \in \mathcal{G}_o \\ e'_x, e'_y \in V_o \\ (u, e'_x), (v, e'_y) \in E_l}} \bigl( w(u, e'_x) + \delta_{G_o}(e'_x, e'_y) + w(e'_y, v) \bigr) \\
d_2(u, v) &= \min_{p = u \leadsto v,\ edges(p) \le 2} len(p) \\
\delta_G(u, v) &= \min\{len, d_2(u, v)\}
\end{aligned}
\tag{1}
\]
There are a massive number of candidate solutions to the graph transformation problem, each of which contains 2-HOP delegation outsourced graphs along with the corresponding link graph. Among them, we want to produce a solution with a smaller link graph, which indicates
the burden on the client side in the running time can be lowered. By considering all factors, the graph transformation problem can be formulated as follows: given a graph $G = (V, E)$, the graph transformation produces outsourced graphs $\mathcal{G}_o = \{G_o^1, \ldots, G_o^{|\mathcal{G}_o|}\}$ and a local link graph $G_l$, which achieve the following objectives:
1. Each outsourced graph $G_o \in \mathcal{G}_o$ is a 2-HOP delegation graph;
2. The union of $\mathcal{G}_o$ and $G_l$ is a shortest distance equivalent graph to $G$;
3. The space cost of $G_l$ and the time cost of the shortest distance computation on the client side are minimized.
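The query evaluation with Eq. (1) can be illustrated by a minimal Python sketch. It is not the authors' implementation: the names `link`, `remote_dist`, and `shortest_distance` are hypothetical, the graph is assumed to be undirected, and the call to the cloud server is modeled as a plain function.

```python
from math import inf

def d2(adj, u, v):
    """Length of the shortest u-v path with at most 2 edges (local search only).

    adj maps each node to a dict {neighbor: edge weight}.
    """
    best = adj.get(u, {}).get(v, inf)          # one-edge path, if any
    for mid, w1 in adj.get(u, {}).items():     # two-edge paths u - mid - v
        best = min(best, w1 + adj.get(mid, {}).get(v, inf))
    return best

def shortest_distance(adj, link, remote_dist, u, v):
    """Evaluate Eq. (1) for the query (u, v).

    link[u] lists the link-graph edges of u as tuples
    (graph_id, delegation_node, weight); this layout is an assumption.
    remote_dist(graph_id, ex, ey) stands in for a distance query
    answered by the outsourced graph with that id.
    """
    best_len = inf
    for gid_x, ex, w_u in link.get(u, []):
        for gid_y, ey, w_v in link.get(v, []):
            if gid_x != gid_y:
                continue                       # ex and ey must share one outsourced graph
            best_len = min(best_len, w_u + remote_dist(gid_x, ex, ey) + w_v)
    return min(best_len, d2(adj, u, v))
```

In this sketch the client only touches its own link edges and a two-hop neighborhood, while every long sub-path is delegated to `remote_dist`.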
3 Graph transformation with exact distance answering

In this section, we begin with a naive approach to show the massive search space of the problem, and then give a greedy algorithm to transform a graph for exact shortest distance answering. Finally, we analyze our method.

3.1 A naive approach

We observe that the two optimization targets (in Objective 3) are in line with each other. For instance, if we minimize the space cost of $G_l$, the computational cost over $G_l$ tends to be minimized as well. In the following, we therefore focus on minimizing the space cost of $G_l$. Formally, our graph transformation can be converted into the problem of minimizing $G_l$ as follows:

Definition 6 (Minimizing $G_l$) Given a graph $G = (V, E)$, we seek a set of 2-HOP delegation graphs $\mathcal{G}_o$ and a link graph $G_l = (V_l, E_l)$ such that, for each pair of nodes $(u, v)$ in graph $G$,
1. there exists an outsourced graph $G_o \in \mathcal{G}_o$, where $(u, G_o.e_x) \in E_l$, $(G_o.e_y, v) \in E_l$, and $\delta_G(u, v) = \delta_G(u, e_x) + \delta_{G_o}(e_x, e_y) + \delta_G(e_y, v)$;
2. or there exists an edge $(u, v) \in E_l$ in $G_l$ with $w(u, v) = \delta_G(u, v)$.
Our objective is to minimize $G_l$, that is, the number of edges in $G_l$.

In order to solve this problem (minimizing $G_l$), we consider a straightforward brute-force approach. We first add one delegation node on each edge, and then try to enumerate all candidate solutions. Each candidate solution consists of a set of outsourced graphs and a link graph $G_l$, which together can answer all shortest distance queries in the original graph. Next, we compute the space cost of $G_l$ in each candidate solution, and select the solution with the minimal space cost of $G_l$.
Now, let us look at the number of candidate solutions in the brute-force approach. We note that the nodes in an outsourced graph are actually a subset of the edges in the original graph $G$. Although not all subsets of delegation nodes can produce valid 2-HOP delegation graphs, the total number of different outsourced graphs (2-HOP delegation graphs) can still be $O(2^m)$ in the worst case, where $m$ is the number of edges in $G$. By further considering different forms of the link graph, we face a massive search space, which makes the brute-force approach too expensive to be feasible.

3.2 Fast greedy method

The naive solution above reveals the massive search space in achieving the optimal solution to our graph transformation problem. In this part, we design a fast greedy method to produce a reasonable graph transformation solution.

We analyze the relationship between the minimizing $G_l$ problem and the set cover problem [3]. The set cover problem is described as follows: given a ground set $U$ and a candidate family $S$ consisting of subsets of $U$, the goal is to find the minimal number of candidate subsets, denoted by $C \subseteq S$, whose union is $U$. We can map an outsourced graph $G_o$ in the minimizing $G_l$ problem to a subset $S$ in the set cover problem, and map a node pair $p$ in the minimizing $G_l$ problem to an element $e$ of the ground set $U$ in the set cover problem, such that $G_o$ can be used to answer the distance for $p$ when $S$ contains the element $e$. Given this, we may be inclined to adopt a set cover approach [3] to tackle our problem. However, different from the subsets in the set cover problem, the outsourced graphs are not given in advance and must satisfy the 2-HOP delegation constraint. In addition, we expect the space cost of $G_l$ to be minimized, while the set cover problem attempts to minimize the number of subsets.

Basic idea Since the number of all possible outsourced graphs is exponential, we wish to construct the needed ones on the fly.
Recall that the greedy method in the set cover problem always selects the sub-set which contains the most of remaining elements. Here, we use the expressiveness of an outsourced graph as the measure in our greedy method. The expressiveness of an outsourced graph G o is the total number of shortest distances which can be answered by G o . We can see that although our minimizing G l problem has a slightly different optimization objective from the set cover problem, the expressiveness of outsourced graphs still works in our context. Intuitively, for an outsourced graph with more expressiveness, the edges in the link graph have higher chances to be reused in preserving more shortest distances, and thus, the space cost of the link graph can be lowered. The expressiveness of an outsourced graph should also consider the load distribution in the shortest distance
computation. As shown in Eq. (1), the shortest distance from u to v can be computed via a sub-path from ex to ey in an outsourced graph. Only when the distance between ex and ey is large, can the outsourced server reduce the shortest distance computational cost on the local side. Thus, we make a restriction on the candidate outsourced node pair for each shortest path.
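For reference, the classical greedy rule for set cover recalled above can be sketched in a few lines of Python. This is the textbook heuristic only, not the transformation method of this paper, which has to construct its "subsets" (outsourced graphs) on the fly and optimizes the link graph size instead. The function name and the dictionary layout of `family` are illustrative assumptions.

```python
def greedy_set_cover(universe, family):
    """Textbook greedy set cover.

    universe: iterable of elements (here: node pairs whose distance must be kept)
    family:   dict mapping a subset name (here: an outsourced graph) to a set of elements
    Returns the names of the chosen subsets in the order they were picked.
    """
    remaining = set(universe)
    chosen = []
    while remaining:
        # Pick the subset covering the most still-uncovered elements.
        best = max(family, key=lambda name: len(family[name] & remaining))
        gain = family[best] & remaining
        if not gain:
            raise ValueError("family does not cover the universe")
        chosen.append(best)
        remaining -= gain
    return chosen
```

The count `len(family[name] & remaining)` plays the role that expressiveness plays for an outsourced graph in the greedy method discussed here.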
Algorithm 1: Greedy Graph Transformation
Input: graph $G = (V, E)$, threshold MP on maximal paths in memory
Output: outsourced graphs $\mathcal{G}_o$ and link graph $G_l$.
 1  Initialize an empty shortest path set $P$, $\mathcal{G}_o$, and $G_l$;
 2  while there remain shortest paths to be handled do
 3      Locate a remaining shortest path $p$ from $u$ to $v$ containing at least three edges, add $p$ into $P$;
 4      For each edge $e$ in $p$, add its delegation node if there is no delegation node on $e$;
 5      Enumerate all candidate outsourced node pairs for $p$;
 6      if $|P| >$ MP then
 7          Build a delegation node sequence $L$ with the pair-based benefit function;
 8          $G_o \leftarrow$ OutGraph($G$, $L$);
 9          $\mathcal{G}_o \leftarrow \mathcal{G}_o \cup \{G_o\}$;
10          Build edges in $G_l$ from nodes in $G$ to delegation nodes in $G_o$;
11          Remove each path from $P$ which can be exactly answered by $G_o$ union $G_l$;
12  repeat
13      $beforeSize \leftarrow |G_l| + |P|$;
14      Adjust $P$ and $G_l$ by generating $G_o$ from line 7 to line 11;
15  until $beforeSize \le |G_l| + |P|$
16  For each $p \in P$, build an edge $e$ between the two ending nodes of $p$, and add $e$ into $G_l$;
17  return $\mathcal{G}_o$ and $G_l$.

Definition 7 (Candidate outsourced node pair) Given a pair of nodes $(u, v)$ in a graph $G = (V, E)$ with $(u, v) \notin E$, let $p$ be the shortest path from $u$ to $v$, and let $e_x$ and $e_y$ be two edges in $p$, with their delegation nodes denoted by the same symbols. $(e_x, e_y)$ is a candidate outsourced node pair for the pair $(u, v)$ if the following requirements are satisfied:
1. $u$ can reach the delegation node $e_x$ via at most one node in $V$, and the delegation node $e_y$ can reach $v$ via at most one node in $V$;
2. the edges $e_x$ and $e_y$ are not connected directly in $G$.

The first condition requires that $e_x$ ($e_y$) is close to $u$ ($v$). Thus, the distance between $e_x$ and $e_y$ is larger, which reflects the restriction discussed above. The second condition corresponds to the requirements of the 2-HOP delegation model. The shortest distance from $u$ to $v$ can be answered via $e_x$ and $e_y$, as $\delta_G(u, v) = \delta_G(u, e_x) + \delta_G(e_x, e_y) + \delta_G(e_y, v)$. With candidate outsourced node pairs, the expressiveness of an outsourced graph is formally defined as follows:

Definition 8 (Expressiveness of an outsourced graph) Given an outsourced graph $G_o$ and a set of node pairs $U$, the expressiveness of $G_o$ is defined as the size of $\{(e_x, e_y) \mid e_x, e_y$ are in $G_o$, and $(e_x, e_y)$ is a candidate outsourced node pair for a node pair $(u, v) \in U\}$.

With the above notations, the basic idea for the minimizing $G_l$ problem can be described as follows: we enumerate all shortest paths. For each shortest path $p$ between $u$ and $v$, we locate all its candidate outsourced node pairs and compute the benefits of these outsourced node pairs by their frequencies. These benefits guide us in greedily constructing an outsourced graph with high expressiveness. After that, we remove the shortest distances which have been answered and generate the next outsourced graph, until all shortest distances are preserved.

Greedy graph transformation We present the basic idea above in Algorithm 1. We initialize the shortest path set, the outsourced graphs, and the link graph in line 1. Then, we attempt to preserve the distances of these shortest paths with newly constructed outsourced graphs. We make two slight extensions to the basic idea in the greedy method, due to concerns about the intermediate space requirement as well as the different objectives of minimizing $G_l$ and the set cover problem. The first extension is that we pose a
restriction on the size of the intermediate nodes in the greedy method. If we directly implement the greedy method, we have to enumerate $O(m^2)$ candidate outsourced node pairs in the worst case and compute their benefits, where $m$ is the number of edges in the original graph. On a relatively large graph, the representation of these node pairs alone will easily exceed the memory limitation. Thus, we attempt to generate outsourced graphs and their link graph iteratively in a pipelined fashion. Specifically, we introduce a parameter, MP, to control the maximal number of shortest paths in memory during the graph transformation.

The iterative outsourced graph construction is done from line 2 to line 11. We locate each shortest path $p$, add the necessary delegation nodes on the edges in $p$, and enumerate all candidate outsourced node pairs for $p$. When the number of shortest paths exceeds MP in line 6, we encode the benefits of delegation nodes into a node sequence $L$, and use $L$ to guide the construction of an outsourced graph in Algorithm 2. Once the outsourced graph is constructed, the node pairs whose distances have been preserved are removed from memory in line 11. Then, we compute the remaining shortest paths for the next outsourced graph, until all distances between node pairs in $G$ have been preserved.

As for the benefit computation of delegation nodes in line 7, we have two benefit functions. The first function is based on the node frequency. That is, we simply count the occurrences of the delegation nodes $e_x$ and $e_y$ among all candidate
Algorithm 2: OutGraph($G$, $L$)
Input: graph $G = (V, E)$, a delegation node sequence $L$.
Output: outsourced graph $G_o = (V_o, E_o)$.
1  Initialize empty $G_o = (V_o, E_o)$;
2  while there exists an unmarked node in $L$ do
3      Pick the top unmarked node $e_x$ from $L$ as a cluster center, and $V_o \leftarrow V_o \cup \{e_x\}$;
4      Mark as covered the nodes in $L$ which can reach $e_x$ via at most one node in $V$;
5  For any two cluster centers $e_x$ and $e_y$, build the necessary edge between them;
6  Return $G_o$.
outsourced node pairs $\{(e_x, e_y)\}$ separately. The second one is based on the node pair frequency. We record the frequencies of all possible candidate outsourced node pairs and sort the node pairs in terms of their frequencies. Since two delegation nodes $e_x$ and $e_y$ with high separate frequencies do not imply that $(e_x, e_y)$ can be used to answer more shortest distance queries, the node pair-based benefit function can produce outsourced graphs with higher expressiveness.

The second extension lies in the termination condition of the greedy method. We can continue generating new outsourced graphs and pruning the path set $P$ until $P$ is empty. The algorithm will stop since each outsourced graph can be used to preserve at least one remaining distance. However, when the iteration from line 2 to line 11 stops, those node pairs whose distances can be "easily" answered by outsourced graphs have been removed. We may then produce a large number of outsourced graphs, which in turn increases the space cost of the link graph in preserving the relationships between the original graph and the outsourced graphs. Thus, we stop generating new outsourced graphs from line 13 to line 14 when it is better to keep the shortest paths in the link graph $G_l$ directly. Since the remaining shortest distances are stored in $G_l$, the union of $\mathcal{G}_o$ and $G_l$ is still a shortest distance equivalent graph of the original graph.

Constructing a single outsourced graph The next key problem is how to use the benefits of delegation nodes to construct an expressive outsourced graph (line 8 in Algorithm 1). We now show that a 2-HOP delegation graph corresponds to a hop aware cluster cover.

Definition 9 (Hop aware cluster cover) Let $G = (V, E)$ be an original graph, and let $V'$ be the set containing the delegation nodes on the edges in $E$. A hop aware cluster cover $\mathcal{C}$ is a set of clusters, with each cluster $C(e_x) \in \mathcal{C}$ centered at $e_x \in V'$. It also meets the following requirements:
1. For any node $u \in V$, $u$ belongs to at least one cluster $C(e_x)$, where $u$ can reach $e_x$ via at most one other node in $V$;
2. For any two clusters $C(e_x)$ and $C(e_y)$, the edge $e_x$ is not connected with the edge $e_y$ in $G$.

Let $e = (u, v)$ be an edge in the original graph, with its delegation node on it. The cluster centered at this delegation node includes the ending nodes of $e$ and the ending nodes of all edges connected with $e$.

Given a graph $G$ and a node sequence $L$ encoding the benefit values of nodes, an outsourced graph can be built by Algorithm 2. We iteratively select the top unmarked node from $L$ as the cluster center and build its cluster, since $L$ is sorted in terms of benefits and the nodes closer to the head of $L$ have higher benefits. The cluster centers are used as the outsourced nodes in $G_o = (V_o, E_o)$. The other nodes in a cluster can be discovered and marked by a local breadth-first search. We can ensure that any two cluster centers meet the requirement of the 2-HOP delegation model, since only unmarked nodes can be selected as the next cluster center. In line 5, we build edges between cluster centers. The weight of an edge $(e_x, e_y)$ equals the shortest distance between $e_x$ and $e_y$ discovered by Dijkstra's algorithm. Given three outsourced nodes $e_x$, $e_y$, and $e_z$, if $\delta_G(e_x, e_y) + \delta_G(e_y, e_z) = \delta_G(e_x, e_z)$ and the edges $(e_x, e_y)$ and $(e_y, e_z)$ are already in $E_o$, the edge between $e_x$ and $e_z$ need not be constructed.
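The center selection loop of Algorithm 2 can be sketched as follows. The sketch rests on one reading of the text that should be treated as an assumption: two delegation nodes are "connected directly" exactly when their edges share an ending node, so marking the delegation nodes that reach a center via at most one node in $V$ amounts to skipping every edge that touches an already chosen center. The weighting of edges between centers (line 5, by Dijkstra's algorithm) is left out, and the function name is hypothetical.

```python
def select_cluster_centers(ranked_edges):
    """Pick cluster centers from a benefit-sorted delegation node sequence L.

    ranked_edges: list of (u, v) edges of the original graph; each entry stands
    for the delegation node on that edge, sorted by decreasing benefit.
    Returns the edges whose delegation nodes become cluster centers.
    """
    centers = []
    covered_endpoints = set()   # ending nodes of all centers chosen so far
    for u, v in ranked_edges:
        # An edge touching a chosen center is "marked" and cannot be a center.
        if u in covered_endpoints or v in covered_endpoints:
            continue
        centers.append((u, v))
        covered_endpoints.update((u, v))
    return centers
```

Under this reading, the chosen centers never share an ending node, which mirrors requirement 2 of Definition 9.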
3.3 Analysis of graph transformation

Now, we analyze the time cost of the graph transformation in Algorithm 1, and the overhead distribution between the cloud server and the client side. We use the following symbols. The meanings of $n_o$, $n_l$, $n$, $m$ have been given in Table 1. $x$ is the total number of outsourced graphs, which is related to graph features. Let $u.Edges$ be the edges from a node $u$ in the original graph to an outsourced graph; $b$ is the maximal $|u.Edges|$ over all nodes $u$.

The time cost of the graph transformation in Algorithm 1 includes enumerating all shortest paths, sorting candidate outsourced node pairs in terms of their benefits, and generating all outsourced graphs. The enumeration of all shortest paths requires $O(n(m + n \log n))$. The number of delegation nodes is at most $m$, and hence the number of candidate outsourced pairs is $O(m^2)$ in the worst case. In the greedy method, it takes $O(m^2 \log m)$ to sort candidate outsourced node pairs by their benefits before constructing each of the $x$ outsourced graphs. In the generation of one outsourced graph in Algorithm 2, it takes $O(n_o n_l^2)$ to construct a cluster cover, since there are $O(n_o)$ clusters and each requires $O(n_l^2)$ for the local breadth-first search. In addition, we discover shortest distances $O(n_o)$ times in the original graph to build edges between cluster centers in the worst case. Thus, a single outsourced graph construction takes $O(n_o(m + n \log n) + n_o n_l^2) = O(n_o(m + n \log n))$. With all factors considered,
Table 2 Overhead distribution with graph outsourcing

              Client side             Cloud server
Space         O(m + x n_o n_l)        O(x n_o^2)
Query time    O(n_l^2 + x b^2)        O(x b^2 n_o^2)
the total time cost of graph transformation in Algorithm 1 is $O(n(m + n \log n) + x m^2 \log m + x n_o (m + n \log n))$.

We list the overhead distribution after graphs are outsourced in Table 2. As for the space cost, the client side needs $O(m + x n_o n_l)$ to store the original graph and the link graph, while the cloud server requires $O(x n_o^2)$ space to store all outsourced graphs. As for the time cost of shortest distance query answering, the client side needs $O(n_l^2)$ time for the local breadth-first search and $O(x b^2)$ time for result merging with Eq. (1). The cloud server takes $O(x b^2 n_o^2)$ time to evaluate the rewritten queries. Note that the cloud server can build indices [4,12,21] and perform parallel processing over the $x$ graphs to lower the query evaluation cost significantly. Compared with the $O(m + n \log n)$ time cost of shortest distance discovery without graph outsourcing, the client side saves much time with the aid of the cloud server. The final experimental results also show the effectiveness of graph outsourcing.
4 Graph transformation with approximate distance answering

In this section, we first propose a method to transform a graph with approximate distance answering, and then discuss heuristic methods in the outsourced graph construction.

4.1 Random graph transformation

Graph transformation with approximate distance answering on large graphs becomes an important research problem, since approximate shortest distances are often good enough in many applications [17,21,22] and the graph transformation to support the exact shortest distance computation is costly. As shown in Algorithm 1, the transformation requires enumerating all shortest paths and computing the edge weights between delegation nodes in outsourced graphs, making the method unsuitable for large graphs. In this part, we wish to transform the graph efficiently while still obtaining results of reasonable quality.

We use the average additive error to measure the quality of approximate shortest distances. Generally, the quality of the approximate distance from $u$ to $v$ can be measured by $\alpha \delta_G(u, v) + \beta$ [22], where $\alpha$ is the multiplicative error and $\beta$ is the additive error. Here, we attempt to achieve $\alpha = 1$ and a given average additive error $\beta$ for all shortest distance
Fig. 7 Approximate edge weight computation in outsourced graph
queries. For any distance query $q = (u, v) \in Q$ from $u$ to $v$, a path $p_q$ is discovered for $q$ using $\mathcal{G}_o$ and $G_l$. The average additive error $\beta$ can be defined by

\[
\beta = \frac{\sum_{q = (u, v) \in Q} \beta_q}{|Q|}, \qquad \beta_q = len(p_q) - \delta_G(u, v).
\]

The rationale of the average additive error is to get acceptable results with a limited number of outsourced graphs. The average additive error can be useful when a large number of shortest distance queries are evaluated in graph analysis.

We next improve the graph transformation performance by relaxing the exact edge weight computation, which needs Dijkstra's search $O(n_o)$ times in Algorithm 2. Here, $n_o$ is the number of nodes in an outsourced graph. In order to lower the cost, we compute approximate edge weights instead. Specifically, we select $l$ nodes from the $n_o$ outsourced nodes and build the full shortest path trees for these $l$ nodes in the original graph with Dijkstra's search. A path in such a tree is also a shortest path in the original graph. Then, we build an edge between two outsourced delegation nodes $e_x$ and $e_y$ when $e_x$ is the lowest ancestor of $e_y$ in the shortest path tree, as illustrated in Fig. 7. The overall edge building can be implemented by running Dijkstra's search $O(l)$ times. Note that such a relaxed edge building method is similar to that of the landmark index [21]. However, the landmark index only records relationships from nodes to the root of the shortest path tree, while our method records the relationship from a node to its lowest ancestor. Hence, the outsourced graph with such a relaxed edge building strategy can yield more precise results than the landmark index using the same number of shortest path trees.

We present our graph transformation method in Algorithm 3 to achieve the given average additive error $\beta$. The estimated additive error $avg$ is initialized in line 3. We iteratively construct outsourced graphs in a random way until the estimated additive error $avg$ no longer exceeds $\beta$.
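The relaxed edge building step can be illustrated with a short Python sketch for a single shortest path tree. It is a sketch under assumptions rather than the authors' code: delegation nodes are taken to be ordinary vertices of the adjacency structure `adj` (as if every edge carrying a delegation node had already been split), "lowest ancestor" is read as the nearest outsourced ancestor in the tree, and the function names are hypothetical.

```python
import heapq
from math import inf

def shortest_path_tree(adj, root):
    """Dijkstra's search from root; returns (dist, parent) dictionaries."""
    dist = {root: 0}
    parent = {root: None}
    heap = [(0, root)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, inf):
            continue                      # stale heap entry
        for y, w in adj.get(x, {}).items():
            nd = d + w
            if nd < dist.get(y, inf):
                dist[y] = nd
                parent[y] = x
                heapq.heappush(heap, (nd, y))
    return dist, parent

def relaxed_edges(adj, root, outsourced):
    """Connect each outsourced node to its lowest outsourced ancestor in the tree.

    The edge weight is the tree distance between the two nodes, which equals
    their shortest distance because tree paths are shortest paths.
    """
    dist, parent = shortest_path_tree(adj, root)
    edges = {}
    for y in outsourced:
        if y == root or y not in dist:
            continue
        x = parent[y]
        while x is not None and x not in outsourced:
            x = parent[x]                 # climb until an outsourced ancestor is found
        if x is not None:
            edges[(x, y)] = dist[y] - dist[x]
    return edges
```

Repeating `relaxed_edges` for $l$ roots and merging the results costs $l$ runs of Dijkstra's search, in line with the $O(l)$ searches stated above.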
In each iteration, we produce one outsourced graph $G_o$ with Algorithm 2 using a random node sequence $L$, put $G_o$ into the cloud, and obtain the average additive error $avg$ from the approximate shortest distance computed with Eq. (1) and the exact shortest distance discovered by Dijkstra's algorithm on the original graph. Specifically, let $u$ and $v$ be two nodes in the original graph, and let $u'$ and $v'$ be their cluster centers, respectively. The approximate shortest distance $len$ between $u$ and $v$ can be computed by $w(u, u') + \delta_{G_o}(u', v') + w(v', v)$ in Eq. (1). $len$ may not be the shortest distance, since edge weights in outsourced graphs are approximately computed, and $u'$ ($v'$) may deviate from the shortest path between $u$ and $v$. The given average additive error can finally be achieved with sufficiently many outsourced graphs, as outsourced graphs are constructed randomly and each one can preserve different shortest distances.

Algorithm 3: Average Additive Error Guided Outsourced Graph Construction
Input: graph $G$, additive error threshold $\beta$, $s$ for the number of sampling queries, $l$ for the number of full shortest path trees.
Output: outsourced graphs $\mathcal{G}_o$.
1  Initialize an $s$-sized query list $Q$ with randomly generated queries;
2  Evaluate queries in $Q$ over $G$ by Dijkstra's search;
3  $avg \leftarrow \infty$;
4  while $avg > \beta$ do
5      $G_o \leftarrow$ OutGraph($G$, $l$, $L$), where edge building in $G_o$ is based on $l$ shortest path trees, and $L$ is a random node sequence;
6      $\mathcal{G}_o \leftarrow \mathcal{G}_o \cup \{G_o\}$;
7      Compute $avg$ with exact answers by Dijkstra's search and approximate answers using $\mathcal{G}_o$;
8  Return $\mathcal{G}_o$.

The time cost in the approximate outsourced graph construction mainly comes from the evaluation of $s$ sampling shortest distance queries in the original graph, the construction of $l$ shortest path trees in each outsourced graph, and the computation of relaxed edge weights in outsourced graphs. Let $x$ be the number of outsourced graphs, and let $n$ and $m$ be the number of nodes and edges in the original graph, respectively. It takes $O(m + n \log n)$ time to build a shortest path tree originating from a delegation node. As we need to compute approximate edge weights between delegation nodes in $x$ delegation graphs and to find exact shortest distances for $s$ sampling queries in the original graph, the approximate transformation takes $O((xl + s)(m + n \log n))$ time, where $x$ is related to graph features and the additive error assigned.

4.2 Heuristic construction rules

Algorithm 3 achieves the desired additive error with outsourced graphs constructed in a random way. Another improvement is to introduce heuristic rules in the outsourced graph construction. From Eq. (1), outsourced graphs can produce more precise results when they contain more shared shortest sub-paths. Existing studies on landmark indices also show that heuristic rules work better than random ones [21].

In our paper, we design two heuristic construction rules. The weight-based construction method tends to select edges with lower weights as cluster centers. The cluster-based method favors the delegation node $e_x$ with the larger number of nodes in the cluster $C(e_x)$. In order to make
the outsourced graph construction be aware of heuristic values, we can sort the nodes by these values in the sequence $L$ before invoking Algorithm 2. Thus, the nodes with higher heuristic values have more chances to be outsourced according to the rules in Algorithm 2.

The duplication of nodes in different outsourced graphs is another concern in heuristic construction methods. In Algorithm 2, the same node sequence $L$ will produce the same outsourced graph, which cannot make new contributions in answering queries. In order to address this issue, we introduce a parameter $k$ ($0 < k < 1$) and a function $f(x)$ into Algorithm 2, where $x$ is the number of outsourced graphs, $f(1) = k$ and $f(i) < f(i + 1)$. In the first outsourced graph construction, rather than always choosing the node with the maximal benefit value among all unmarked nodes in the sequence $L$, we select a node randomly from the unmarked nodes with top-$k$ percentage benefit values. In the following $x$th outsourced graph generation, $k$ is enlarged by $f(x)$, such as $f(x + 1) = 2 f(x)$, until $k \ge 1$. After multiple rounds of outsourced graph construction, the heuristic values have been extensively exploited, and the outsourced graph construction retreats to the random method, which focuses on the distribution of the outsourced nodes in outsourced graphs.

5 Experimental results

In this section, we implement the graph transformation with exact and approximate distance answering, and conduct extensive experiments on both real and synthetic datasets.

5.1 Experimental setup

Measures We focus on the following measures related to graph transformation: the transformation time cost, the space cost $|G_l|$ of the link graph (the number of edges in $G_l$), and the average additive error achieved by outsourced graphs. In addition, in order to show the effectiveness of graph outsourcing in shortest distance computing, we define a local overhead ratio $r_l = t_l / t_f$, where $t_l$ is the time cost to discover the shortest distance with Eq.
(1) on the client side, and $t_f$ is the time cost used by Dijkstra's algorithm on the client side.

Implementation details and competitors We implement 4 approaches related to the paper. 2-HOP is the graph transformation method based on the 2-HOP delegation model in this paper. d-radius is the method based on the 1-neighborhood-d-radius model in the previous work [11]. LP is the edge anonymization method with all-pair shortest path preserving in the technical report version of [7]. Note that LP only anonymizes edge weights, and hence the transformed graph cannot survive the structural pattern attack. Just as in [7], we
use LPsolver 5.5¹ to solve the rules generated. The fourth method, SE, is an approach to protect the data in the outsourced server with encryption. For each node pair $(u, v)$, we store the form $f(u, v) \rightarrow DES(\delta_G(u, v))$ in the outsourced server, where $f(u, v)$ is an identifier uniquely determined by the identifiers of $u$ and $v$, and $DES(\delta_G(u, v))$ is the shortest distance between $u$ and $v$ encrypted with the DES encryption scheme.

We implement all of the above methods in Java with JDK 1.6. The maximal runtime memory of the JVM is set to 15 GB. All experiments are carried out on a server with 2× Intel Xeon E5620 64-bit 2.4 GHz processors and 24 GB of RAM, running Windows Server 2008 R2 Standard. The shortest distance computation in an outsourced server is simulated on the same machine. When an outsourced graph is constructed, we store it in a relational database so that we can support multiple outsourced graphs. The time cost of accessing the relational database is not included in the results reported.

Datasets We use six graph datasets to test our methods, including four real graphs named DBLP, Gnut08, NotreDame, and FLA, and two synthetic graphs named Random and Power. DBLP is extracted from a recent snapshot of the DBLP dataset.² We select a subset of records after 2004. Gnut08 and NotreDame are downloaded from Stanford's data collection.³ The former is a directed Gnutella P2P network, and the latter represents the web graph for the University of Notre Dame from 1999. The FLA dataset⁴ describes the road network in Florida. A Random graph is generated by building $m$ edges among $n$ nodes randomly. The Power graph set is generated using Barabasi Graph Generator v1.4.⁵ It can create graphs in which the distribution of degrees obeys a power law. The weights of edges in all graphs except FLA are assigned randomly in [1,100]. Each edge weight in FLA is divided by 100 so that the average edge weight in FLA (47) is similar to that of the other graphs.
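The Random dataset construction can be illustrated with a small Python sketch. The paper only states that $m$ edges are built among $n$ nodes randomly with weights in [1,100]; the sketch additionally assumes an undirected simple graph (no self-loops, no parallel edges), integer weights drawn uniformly, and an optional seed for repeatability. The function name is hypothetical.

```python
import random

def random_weighted_graph(n, m, seed=None, low=1, high=100):
    """Build m distinct random edges among n nodes with integer weights in [low, high].

    Returns a dict mapping (u, v) with u < v to the edge weight.
    """
    if m > n * (n - 1) // 2:
        raise ValueError("too many edges for a simple graph on n nodes")
    rng = random.Random(seed)
    edges = {}
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)        # two distinct endpoints
        key = (min(u, v), max(u, v))          # undirected: store each edge once
        if key not in edges:
            edges[key] = rng.randint(low, high)
    return edges
```

For instance, `random_weighted_graph(1000, 1500)` would correspond to a graph with 1k nodes and average degree 3 under the naming scheme used for the synthetic datasets.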
Some statistics about these graphs are summarized in Table 3. Synthetic graphs have the suffix xNyd, where x is the number of nodes and y is the average degree. For example, Random1mN3d represents a Random graph with 1 million nodes and an average degree of 3.

Table 3 Statistics of graph datasets

Dataset             # Nodes        # Edges
DBLP5k              5,000          20,663
Gnut08              6,301          20,777
NotreDame           325,729        1,497,134
FLA                 1,070,376      2,712,798
Randomxk(m)Nyd      1 k–1 m        y k–y m
PowerxkNyd          50 k–200 k     50y k–200y k

1 http://lpsolve.sourceforge.net/5.5/.
2 http://dblp.uni-trier.de/xml/.
3 http://snap.stanford.edu/data/p2p-Gnutella08.html.
4 http://www.dis.uniroma1.it/~challenge9/data/.
5 http://www.cs.ucr.edu/~ddreier/barabasi.html.

5.2 Graph transformation with exact answer

Below, we study the impacts of different factors on the measures of graph transformation with exact shortest distance
answering. Specifically, we are interested in the following questions: (i) What is the impact of the threshold MP in Algorithm 1 on the transformation overhead, the size of the link graph, and the local overhead ratio for the 2-HOP method? (ii) We have discussed two benefit functions. Can the node pair-based function ByPair reduce the overhead compared with the node-based function ByNode?

Transformation time cost Figure 8a, b show the transformation time cost in terms of graph sizes and MP. Obviously, a larger graph leads to a higher graph transformation cost. In addition, the setting of MP has a twofold impact on the transformation performance. On the one hand, a larger MP can result in more precise benefit values, which can produce fewer outsourced graphs to preserve shortest distances, and thus lower the graph transformation cost. On the other hand, we need more time to generate a single outsourced graph with more shortest paths in memory. From the experimental results, we see that the latter factor outweighs the former one. In addition, we observe that the d-radius method with a smaller d consumes less time than the 2-HOP method. However, with the increase of d, the d-radius method takes more time than the 2-HOP method. For example, the time cost of the 2-HOP method is about 1/2 to 2/3 of that of the d-radius method with d = 60 on Random graphs. We also find that the LP method does not scale well. The LP method produces $O(r n^2)$ rules to preserve all-pairs shortest paths, where $r$ is the average degree and $n$ is the number of nodes. For example, a Random graph with only 600 nodes and average degree 3 needs 1,071,608 rules. LPSolver cannot handle these rules and terminates. The SE method can achieve similar or even better transformation performance than our 2-HOP method. Note that the graphs used in the exact graph transformation are relatively small.
As for the space requirement on the outsourced server, the total space cost (the number of edges in all outsourced graphs) in 2-HOP method is nearly 1/3 to 1/2 of that in SE method. For example, on a Random graph with 10k nodes and average degree 3, SE method will produce 50M edges while our 2-HOP method produces 22M edges with 69 outsourced graphs, each of which contains averaged 4,250 nodes. Figure 8c compares the graph transformation time cost on Random graphs with the node-based benefit function
[Figure 8, panels (a) to (h): plots of Time Cost (ks, s), Space Cost (k, m), and Local Overhead Ratio (%) against # nodes (k) on RandomxkN3d and against Data Set (Dblp5k, Gnut08), comparing 2-HOP, d-radius, LP, and SE under different MP settings and the ByPair and ByNode benefit functions.]
Fig. 8 Experimental results on graph transformation with exact answers
ByNode and the node pair-based benefit function ByPair in Algorithm 1. As discussed before, the case that two nodes x and y have higher frequency does not indicate that the node pair (x, y) can answer more shortest distance queries. Thus, ByPair is a more reasonable function which produces fewer outsourced graphs and then yields a better transformation performance. Size of link graph Figure 8d summarizes the space cost of link graphs on Random graphs, for various graph sizes. We can see the advantage of ByPair benefit function over ByNode one in the size of the link graph. The similar results can be observed in Fig. 8e. Thus, we prefer choosing ByPair function to minimize the space cost of link graphs. The results in Fig. 8d, e also clearly show that 2-HOP method outperforms d-radius method in terms of local space cost when d is large. More to the point, it is very difficult to make a right setting on d for d-radius method, while 2-HOP delegation approach is parameter-free. Figure 8f shows the space cost of the link graph with 2-HOP on Random graphs varying MP from 1M to 4M. It also verifies our earlier claim. The increase of MP can produce more precise benefit values, which then results in fewer outsourced graphs and less space cost in the link graph generally. Local overhead ratio Figure 8g, h illustrate the local overhead ratio for various graph sizes and MP values. We randomly generate 100 shortest distance queries and compute the average local overhead ratio. As shown in both figures, the local overhead ratio is lower than 0.04 in all test cases for 2-HOP methods. In other words, the client side requires very low cost in finding shortest distances by Eq. (1). As for d-radius method, it results in a higher overhead ratio when d is larger. In such a case, more outsourced graphs have to
be constructed, and a higher result-merging cost in Eq. (1) is then incurred.

5.3 Graph transformation with approximate answer

Now, we study the impacts of different factors on the measures of graph transformation with approximate shortest distance answering. Specifically, we are interested in the following questions: (i) Can the given additive error bound be achieved with the 2-HOP method? (ii) What is the impact of different heuristic construction rules on the additive error? In all experiments, the number of full shortest path trees in each outsourced graph is set to 50 for Random and Power graphs and 100 for Bay, and we use 100 sampling queries.

Transformation time cost Figure 9a, b present the impact of different additive errors on the transformation time cost of the 2-HOP method on Random graphs and real data, respectively. We observe that a larger additive error leads to a lower transformation time cost, since fewer outsourced graphs are needed. We also measure the time cost of the d-radius method with the additive error fixed to 28. As with the transformation for exact answers, the d-radius method with a larger d (e.g., d = 60) is slower than the 2-HOP method on Random graphs. In addition, the main drawbacks of the d-radius method are its weak security and the difficulty of selecting the optimal d. As for the SE method, we ran the test for over 15 h, obtained only very few shortest paths, and then stopped it. The SE method cannot handle large graphs, as it needs to pre-compute all $O(n^2)$ shortest paths. Figure 9c compares the transformation time cost of the 2-HOP method with different heuristic rules on Power graphs. The average additive error is set to 26. We use the random method (denoted by Random), the weight-based method (denoted by WeightBased), and the cluster-based method (denoted by ClusterBased) to construct outsourced graphs. We find that the ClusterBased method works best among them, but its advantage is not significant.

[Figure 9: eight panels (a)-(h) plotting time cost (ks), space cost (m), obtained error, average additive error, and local overhead ratio (%) against # nodes (k), data set (NotreDame, Random1mN3d), additive error, and # outsourced graphs, on RandomxkN3d, PowerxkN3d, and FLA graphs, for the 2-HOP method (Random, ClusterBased, and WeightBased rules) and the d-radius method.]

Fig. 9 Experimental results on graph transformation with approximate answers

Size of link graph Figure 9d summarizes the space cost of the link graph on Random graphs for various graph sizes. As expected, a larger original graph results in a larger link graph for a given average additive error. In addition, it is no surprise that the d-radius method consumes more space than the 2-HOP method when d is larger. Figure 9e shows the impact of heuristic construction rules on the space cost of the link graph under different additive errors for the 2-HOP method. The results show that the Random method achieves the lowest local space cost. We believe that the distribution of outsourced nodes is the key factor in reducing the size of the link graph after multiple rounds of outsourced graph construction. As for the space requirement on the outsourced server, the 2-HOP method produces 16M edges in total across 11 outsourced graphs to achieve an average additive error of 10 on the FLA graph, which is roughly 1/35,000 of the edges required by the SE method, estimated as $O(n^2)$, where n is the number of nodes in the graph.

Additive error Figure 9f studies whether we can achieve the expected additive error with the 2-HOP method. After we generate outsourced graphs with Algorithm 3, we evaluate 100 shortest distance queries with both Eq. (1) and Dijkstra's algorithm, and test whether their average additive error is the same as the specified one. We observe that these two values (on the X and Y axes) are very close. In other words, our graph transformation method achieves the specified additive error well.
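The evaluation procedure described above (compare approximate answers with Dijkstra's exact answers over sampled queries and average the differences) can be sketched as follows. This is a minimal illustration under stated assumptions: the toy graph is made up, and the approximate oracle `landmark_estimate` is a simple landmark-based upper bound used as a stand-in, not the paper's Eq. (1) or Algorithm 3.

```python
import heapq

def dijkstra(graph, source):
    """Exact single-source shortest distances; graph maps node -> {neighbor: weight}."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry, a shorter path to u was already found
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def average_additive_error(graph, approx_distance, queries):
    """Mean of (approximate - exact) distance over the sampled (s, t) queries."""
    errors = [approx_distance(s, t) - dijkstra(graph, s)[t] for s, t in queries]
    return sum(errors) / len(errors)

# Toy undirected weighted graph (hypothetical data, not from the paper).
graph = {
    "a": {"b": 1, "d": 5},
    "b": {"a": 1, "c": 1},
    "c": {"b": 1, "d": 1},
    "d": {"c": 1, "a": 5},
}

# Stand-in approximate oracle: route every query through landmark "b".
# This only plays the role of "some approximate answer"; it is not Eq. (1).
from_landmark = dijkstra(graph, "b")

def landmark_estimate(s, t):
    return from_landmark[s] + from_landmark[t]

queries = [("a", "d"), ("c", "d"), ("a", "c")]
avg_error = average_additive_error(graph, landmark_estimate, queries)
print(avg_error)
```

In an experiment like the one in Fig. 9f, the value returned by `average_additive_error` would be plotted against the error bound that was requested when the outsourced graphs were built; the paper uses 100 sampled queries rather than the three fixed ones here.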
Figure 9g compares the average additive errors for the same number of outsourced graphs constructed by different heuristic rules on the FLA graph. The results show that the cluster-based method produces outsourced graphs with the lowest additive errors when the number of outsourced graphs is relatively small (e.g., fewer than 3). When more outsourced graphs are generated, the three construction methods produce similar answers, since the distribution of outsourced nodes becomes the key factor in such a case.

Local overhead ratio Figure 9h illustrates the local overhead ratio on Power graphs; the results are similar to those in Fig. 8h. In all cases, the local time cost of shortest distance answering is nearly zero. Combined with the results in Fig. 8h, this shows that the local overhead ratio scales very well with the graph size in the 2-HOP method. As shown in Table 2, the client side requires $O(n_l^2 + xb^2)$ time for shortest distance computation, and $n_l$, x, and b are nearly constant with respect to the graph size. At the same time, the graph size has a significant impact on Dijkstra's algorithm. Therefore, the local overhead ratio declines sharply as the graph size increases.

5.4 Summary

To sum up, from the experimental results, we can draw the following conclusions: (i) The graph transformation to fit the 2-HOP delegation model with exact distance answering scales significantly better than the existing LP method. It also outperforms the graph transformation method for the d-radius model when d is relatively large. The SE method is a simple but effective transformation method for exact distance answering. (ii) Our graph transformation with approximate distance answering achieves the given additive error well and can support larger graphs than the exact version. Neither the LP method nor the SE method can handle large graphs. (iii) In all test cases, the local overhead ratio is very low and even decreases as the graph size increases. Such results illustrate the effectiveness of graph outsourcing.
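The trend in conclusion (iii) can be read off the cost bounds. As a rough sketch, take the client-side cost from Table 2 and assume that Dijkstra's algorithm on the original graph with n nodes and m edges runs in $O(m + n \log n)$ time (the Fibonacci-heap bound; the text does not say which variant is used). The constants $c_1$ and $c_2$ below are hypothetical placeholders for the hidden factors.

```latex
\[
\text{local overhead ratio}
\;\approx\;
\frac{c_1\,\bigl(n_l^{2} + x\,b^{2}\bigr)}{c_2\,\bigl(m + n\log n\bigr)}
\;\longrightarrow\; 0
\quad \text{as } n \to \infty,
\qquad \text{when } n_l,\ x,\ b \text{ stay nearly constant.}
\]
```

The numerator stays roughly fixed while the denominator grows with the graph, which matches the declining ratio observed in Figs. 8h and 9h.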
6 Related work

In this part, we review the state-of-the-art techniques related to our paper.

Privacy-preserving publishing Privacy protection for graph publishing has been studied recently. Most of the existing works focus on certain structural anonymizations, such as 1-neighborhood [26], k-degree [19], k-automorphism [28], k-isomorphism [2], cluster-based node anonymity [5,6], as well as many others. These techniques typically attempt to make the fewest possible modifications to the original graph so that it satisfies the target security requirement. Unfortunately, for any pair of nodes, there is no guarantee on how well the shortest distance in the anonymized graph matches or preserves that in the original graph. In addition, most of the existing works deal with privacy on unweighted graphs and do not consider the impact of edge weights. A few recent works [7,24] recognize the importance of preserving graph-theoretic characteristics during graph publishing. Ying and Wu [24] propose a method to preserve the eigenvalue of a graph during graph transformation. Das et al. [7] propose a linear programming (LP) method to change edge weights while preserving shortest paths. However, the eigenvalue of a graph is only related to the average shortest distance [24]. The LP approach does not adjust the structure of graphs and requires too many inequality rules to preserve shortest paths. Besides the structural pattern attack, the minimality attack [23] on published data has also been studied. The minimality attack attempts to infer the original data from the clue that publishing techniques always minimize the information loss in the transformation. Our 2-HOP delegation model can counter the minimality attack, since outsourced delegation nodes are annotated randomly. Recently, differential privacy [9] has emerged as a powerful model to protect against unknown adversaries with guaranteed probabilistic accuracy. Hay et al. [14] and Li et al.
[18] perform some of the first studies to support differential privacy in analyzing networks. Specifically, they design an efficient method for releasing a provably private estimate of the degree distribution of a network. However, how to publish a graph under differential privacy is still an open problem [14], and thus, it is not clear whether the techniques developed in [14] can be applied to more complicated queries, such as the shortest distance query.

Security issues in outsourced server Sensitive data protection and query result verification in an outsourced server have attracted much attention recently [13,20]. A work closely related to this paper studies the verification issue in outsourcing graphs for shortest path discovery [25]. In their solution, the original graph data are outsourced along with verification objects, from which the client side can validate the correctness of results. However, their method does not consider the protection of sensitive information in the original graph during outsourcing.

Shortest path discovery Shortest path discovery is a fundamental problem in graph theory. Dijkstra's algorithm [8] is a well-known approach to finding single-source shortest paths. HiTi [16] can be viewed as a multi-level index for shortest path discovery. The combination of the A* algorithm with a landmark index is studied in [12]. The 2-HOP index [4] assigns each node two label sets and answers a shortest distance query between two nodes by intersecting their label sets. In addition, indices with different error bounds and various construction methods have also been studied in [17,21,22] for approximate distance answering. However, we cannot simply outsource these indices, as they would also disclose sensitive information of the graph to outsourced servers. For instance, in the 2-HOP index, each node is very likely to record the distances to its immediate neighbors. Such information often needs to be protected [26].

7 Conclusion

In this paper, we study how to utilize cloud computing to efficiently compute shortest distances in graphs without compromising their sensitive information. We define a new parameter-free security model, called the 2-HOP delegation model, to significantly lower the chances of the structural pattern attack and the reconstruction attack on outsourced graphs.
We then devise methods to transform an original graph before outsourcing, which reduce the space cost and the shortest distance computation cost on the client side while satisfying both security and utility requirements.

Acknowledgments NSFC supported Gao via 61073018 and 61272156. The Research Grants Council of the Hong Kong SAR supported Yu via 418512 and 419109. The National High Technology Research and Development Program of China supported Wang via 2012AA011002.
References

1. Backstrom, L., Dwork, C., Kleinberg, J.M.: Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In: WWW, pp. 181–190 (2007)
2. Cheng, J., Fu, A.W., Liu, J.: K-isomorphism: privacy preserving network publication against structural attacks. In: SIGMOD, pp. 459–470 (2010)
3. Chvatal, V.: A greedy heuristic for the set-covering problem. Math. Oper. Res. 4(3), 233–235 (1979)
4. Cohen, E., Halperin, E., Kaplan, H., Zwick, U.: Reachability and distance queries via 2-hop labels. In: SODA, pp. 937–946 (2002)
5. Cormode, G., Srivastava, D., Yu, T., Zhang, Q.: Anonymizing bipartite graph data using safe groupings. PVLDB 1(1), 833–844 (2008)
6. Cormode, G., Srivastava, D., Bhagat, S., Krishnamurthy, B.: Class-based graph anonymization for social network data. PVLDB 2(1), 766–777 (2009)
7. Das, S., Egecioglu, M., Abbadi, A.E.: Anonymizing weighted social network graphs. In: ICDE, pp. 904–907 (2010)
8. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numerische Mathematik 1(1), 269–271 (1959)
9. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC (2006)
10. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractable to polynomial time. PVLDB 3(1), 264–275 (2010)
11. Gao, J., Yu, J.X., Jin, R., Zhou, J., Wang, T., Yang, D.: Neighborhood-privacy protected shortest distance computing in cloud. In: SIGMOD, pp. 409–420 (2011)
12. Goldberg, A.V., Harrelson, C.: Computing the shortest path: search meets graph theory. In: SODA, pp. 156–165 (2005)
13. Hacigümüs, H., Iyer, B.R., Mehrotra, S.: Providing database as a service. In: ICDE, pp. 29–40 (2002)
14. Hay, M., Li, C., Miklau, G., Jensen, D.: Accurate estimation of the degree distribution of private networks. In: ICDM, pp. 169–178 (2009)
15. Hay, M., Miklau, G., Jensen, D., Towsley, D.F., Weis, P.: Resisting structural re-identification in anonymized social networks. PVLDB 1(1), 102–114 (2008)
16. Jung, S., Pramanik, S.: An efficient path computation model for hierarchically structured topographical road maps. TKDE 14(5), 1029–1046 (2002)
17. Kleinberg, J.M., Slivkins, A., Wexler, T.: Triangulation and embedding using small sets of beacons. J. ACM (JACM) 56(6), 1–37 (2009)
18. Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A.: Optimizing linear counting queries under differential privacy. In: PODS, pp. 123–134 (2010)
19. Liu, K., Terzi, E.: Towards identity anonymization on graphs. In: SIGMOD, pp. 93–106 (2008)
20. Nath, S., Yu, H., Chan, H.: Secure outsourced aggregation via one-way chain. In: SIGMOD, pp. 31–44 (2009)
21. Potamias, M., Bonchi, F., Castillo, C.: Fast shortest path distance estimation in large networks. In: CIKM, pp. 867–876 (2009)
22. Thorup, M., Zwick, U.: Approximate distance oracles. In: STOC, pp. 183–192 (2001)
23. Wong, R., Fu, A., Wang, K., Pei, J.: Minimality attack in privacy preserving data publishing. In: VLDB, pp. 543–554 (2007)
24. Ying, X., Wu, X.: Randomizing social networks: a spectrum preserving approach. In: SDM, pp. 739–750 (2008)
25. Yiu, M.L., Lin, Y., Mouratidis, K.: Efficient verification of shortest path search via authenticated hints. In: ICDE, pp. 237–248 (2010)
26. Zhou, B., Pei, J.: Preserving privacy in social networks against neighborhood attacks. In: ICDE, pp. 506–515 (2008)
27. Zou, L., Chen, L., Özsu, M.T.: Distance-join: pattern match query in a large graph database. PVLDB 2(1), 886–897 (2009)
28. Zou, L., Chen, L., Özsu, M.T.: K-automorphism: a general framework for privacy preserving network publication. PVLDB 2(1), 946–957 (2009)