Anonymizing Set-Valued Social Data
Shyue-Liang Wang1, Yu-Chuan Tsai2, Hung-Yu Kao2, Tzung-Pei Hong3
1 Department of Information Management, 3 Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan 81148, {slwang1, tphong3}@nuk.edu.tw
2 Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan 70101, {p78941312, hykao2}@nuk.edu.tw
Abstract: The increasing popularity of social networks has generated a tremendous amount of data that can be exploited for commercial, research, and many other valuable applications. However, the release of these data raises the concern that personal privacy may be breached. The current practice of simply removing all identifiable personal information (such as names and social security numbers) before releasing the data is insufficient, and more effective anonymization techniques are required. In this work, we propose a k-anonymization-based technique for set-valued network node data. The proposed algorithm is based on the principle of minimizing the number of addition and deletion operations needed to achieve k-anonymity. Numerical experiments on a real dataset show that it requires fewer operations than the current suppression-based approach.
Keywords: k-anonymity, privacy preserving, set-valued data, suppression
I. INTRODUCTION
Privacy preserving network publishing has attracted considerable attention in recent years because of the concern that published data may breach privacy. Social network applications such as MySpace and Facebook, together with other online communities, collaboration networks, and telecommunication networks, have become very popular for sharing information. Millions of registered users are associated with others through friendships, hobbies, professional associations, and so on. This user information and these relationships can be modeled as vertices and edges in complex graphs, and they are of significant importance in application domains such as marketing, psychology, epidemiology, and homeland security. As a result, the companies and institutions hosting the data are interested in, and expect to benefit from, releasing portions of the graphs so that research communities can analyze them. However, these social network graphs may contain sensitive information. To protect the privacy of users against different types of attacks, graphs should be anonymized before they are published.
Current practices for protecting user privacy in published data include removing all identifiable personal information such as names and social security numbers, limiting access, “fuzzing” the data, eliminating unnecessary groupings, and augmenting with additional data. However, it is still easy for an attacker to identify the target by performing different structural and non-structural queries. Consider the following examples of re-identification attacks on relational data, transaction data, and graph data.
For published relational data, given public voter registration data and private microdata such as the de-identified (name and social security number removed) patient data of Massachusetts state employees, a simple “linking” attack that joins the two datasets can re-identify the state’s governor and reveal his medical history. According to one study, approximately 87% of the population of the United States can be uniquely identified on the basis of their 5-digit zip code, sex, and date of birth [10, 11].
For published transaction data, America Online (AOL) released a large portion of its search engine query logs for research purposes in August 2006. The dataset contained 20 million queries posed by 650,000 AOL users over a 3-month period. Before releasing the data, AOL replaced each user’s name with a random identifier. However, by examining unique query terms, the New York Times [2] demonstrated that searcher No. 4417749 could be traced back to Thelma Arnold, a 62-year-old widow living in Lilburn, Georgia. Even though a query contains no address or name, a searcher may still be re-identified from a combination of query terms that is sufficiently unique to that searcher.
For published graph data, even when a network is published without any identity information, it is still possible to locate the target with high probability based on structural information around the target [4, 13]. Just as quasi-identifiers in relational or set-valued data can be used as background knowledge for re-identification, any topological structure of the network can be used to identify the target in a released network.
Four types of structural attacks are possible in this environment [4, 6, 12]: the degree attack, the subgraph attack, the 1-neighborhood attack, and the hub-fingerprint attack. For example, given a social network G in Figure 1a and the corresponding naïve anonymized network G’ in Figure 1b, obtained by removing all node identities, we can identify David as vertex 4 in G’ if we know that David has four neighbors (degree attack). An attacker can also launch a query based on non-structural information (such as a vertex label) to identify the target.
Figure 1 Anonymized Networks: (a) original graph G, with vertices 1 Alice, 2 Bob, 3 Carl, 4 David, 5 Ed, 6 Frank, 7 Gabe; (b) naïve anonymized graph G’, with the same vertices labeled 1 to 7 and the names removed
There are two basic types of sensitive information that one may want to keep private, and that may come under attack, in a social network environment: node information and link information [3]. Node information is the information attached to a vertex, for example the emails sent by an individual, personal information such as age, sex, and zip code, and transaction data such as purchased items [5]. Link information describes the relationships among individuals, which may themselves be considered sensitive. To protect link information, privacy models such as k-degree, k-automorphism, and k-isomorphism have been studied to address various types of structural attacks. To protect node information, many generalization-based and suppression-based k-anonymity techniques for relational and set-valued data have been proposed. Most privacy models against link information attacks transform the graph into k “identical” subgraphs and do not emphasize the node information within or between those subgraphs. This work therefore concentrates on anonymizing node information that contains set-valued data. We propose a k-anonymization-based technique that minimizes the number of operations on set-valued node data needed to achieve k-anonymity.
The rest of the paper is organized as follows. Section 2 gives the problem description. Section 3 describes the proposed algorithm. Section 4 reports the numerical experiments. Section 5 concludes the paper.
II. PROBLEM DESCRIPTION
Current studies in privacy preserving network publication have proposed many valuable privacy models and anonymization techniques. As in privacy preserving data publishing, the anonymization depends on what external information or background knowledge an adversary may acquire. An adversary may use an arbitrary subgraph to locate the vertex in a graph that belongs to an individual and thereby mount a re-identification attack. Several methods [1, 3, 6, 13] have been proposed to achieve k-anonymity based only on structural adversary knowledge. In this work, we assume that the adversary may possess both structural and non-structural background knowledge, but we concentrate on anonymizing set-valued node information for k-anonymity.
Let D = {T1, …, Tn} be a dataset containing n records that belong to n nodes in a network. Each record is a set of items. We define an anonymous dataset as follows [8].
Definition 1. (K-anonymity for set-valued data) We say that D is k-anonymous if every transaction Tj ∈ D has at least (k-1) other identical records in the dataset D.
Given this definition, the k-anonymization problem is to determine the minimum number of transformations that must be made to a dataset to obtain an anonymous dataset.
Definition 2. (K-anonymization problem for set-valued data) Given a dataset D, find the minimum number of items that need to be added to or deleted from the transactions T1, ..., Tn to ensure that the resulting dataset D’ is k-anonymous.
Figure 2 shows an example of 3-anonymization on set-valued data. The item e1 is added to record T2 and the item e2 is deleted from record T1. The transformed dataset consists of two 3-anonymous groups: {T1, T4, T5} and {T2, T3, T6}.

      (a) Original dataset      (b) Anonymized dataset
      e1   e2   e3              e1   e2   e3
T1    1    1    0               1    0    0
T2    0    0    1               1    0    1
T3    1    0    1               1    0    1
T4    1    0    0               1    0    0
T5    1    0    0               1    0    0
T6    1    0    1               1    0    1

Figure 2 3-anonymization
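Definitions 1 and 2 can be checked directly on the data of Figure 2. The following is a minimal Python sketch, not part of the original method: the helper names `is_k_anonymous` and `edit_cost` are hypothetical, and records are modeled as Python sets of item labels.

```python
from collections import Counter

def is_k_anonymous(records, k):
    """Definition 1: every record must occur at least k times in the dataset."""
    counts = Counter(frozenset(r) for r in records)
    return all(c >= k for c in counts.values())

def edit_cost(original, anonymized):
    """Total number of item additions and deletions between two datasets
    whose records are listed in the same order (hypothetical helper)."""
    return sum(len(set(a) ^ set(b)) for a, b in zip(original, anonymized))

# Records T1..T6 of Figure 2, written as sets of items.
original = [{"e1", "e2"}, {"e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]
anonymized = [{"e1"}, {"e1", "e3"}, {"e1", "e3"}, {"e1"}, {"e1"}, {"e1", "e3"}]

print(is_k_anonymous(original, 3))      # False
print(is_k_anonymous(anonymized, 3))    # True
print(edit_cost(original, anonymized))  # 2 (one addition, one deletion)
```

The symmetric difference counts one operation per added or deleted item, which is the quantity Definition 2 asks to minimize.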
To achieve such an anonymization, Motwani and Nabar [8] proposed a two-phase technique that combines the suppression algorithm for relational data proposed by Park and Shim [9] with flipping. The aim of the suppression algorithm is to efficiently obtain a partition of a dataset such that a minimal number of suppressions is required on each subset of the partition to achieve k-anonymity. For example, in the first phase, the suppression algorithm partitions the dataset of Figure 2 into two subsets Π = {{S1, S4, S5}, {S2, S3, S6}}, where S1 = S4 = S5 = (1, *, 0) and S2 = S3 = S6 = (*, 0, 1). Here * indicates that the item value is suppressed, 1 indicates that the corresponding item is in the record, and 0 indicates that the item is not in the record. In the second phase, for each subset of the partition, the suppressed items are flipped to 1 or 0, whichever requires the fewest flips. For example, in the subset {S1, S4, S5}, two records (S4 and S5) do not contain item e2 and one record (S1) does. Deleting item e2 from record S1 makes the records of this subset identical, so the subset becomes 3-anonymous.
To partition a dataset efficiently, Park and Shim [9] proposed using the minimum length sum to estimate the number of suppressions required for a given partition, where the size of each subset is between k and 2k-1. The partition with the minimum length sum achieves a 2(1 + ln 2k)-approximation to the optimal suppression for the k-anonymization problem. The minimum length, a(S), of a set of records is defined as follows.
Definition 3. (Minimum length of suppression for relational data) Let a(S) be the number of attributes with multiple distinct values in a table S, defined as follows:
a(S) := |{i : ∃u, v ∈ S, u[i] ≠ v[i]}|
where u[i] and v[i] are the values of the i-th attribute of the records u and v respectively.
Although the minimum length may be a better estimate than the minimum diameter [7] for suppressing relational data, for anonymizing set-valued data a direct estimate of the item additions and deletions may yield a better partition that requires fewer operations. In the next section, we propose an algorithm that partitions a dataset so that a minimal number of items needs to be added or deleted and the resulting dataset is k-anonymous.
III. PROPOSED ALGORITHM
Given a relational dataset, the k-minimum diameter sum [7] and the k-minimum length sum [9] have been proposed to determine a partition with the minimum number of suppressions for achieving k-anonymity. For set-valued data, we propose the following measure, ad(S), to estimate the number of addition and deletion operations required for a given set of records S:

\[ ad(S) = \sum_{1 \le j \le |I|} O_S(j) \]

where O_S(j) = min{number of transactions in S containing item j, number of transactions in S not containing item j}, and |I| is the total number of items in the dataset. The number of operations required for a partition can be expressed as

\[ ad(\Pi) = \sum_{S \in \Pi} ad(S) \]

where Π is a partition of the dataset.
Figure 3 shows two partitions of eight records together with their minimum length sums, based on a(S), and their minimum operation sums, based on ad(S). In Figure 3a, the minimum length sum is 3 + 3 = 6 and the operation sum ad(Π) is 4 + 4 = 8. In Figure 3b, the minimum length sum is again 3 + 3 = 6, so a(S) cannot distinguish between the two partitions. The operation sum ad(Π) for this partition is 3 + 3 = 6, which is smaller than for the previous partition and indicates that this is the better partition.
To obtain an optimal partition of a given dataset, Meyerson and Williams [7] proposed a set-cover type greedy approach. Let F be the collection of all subsets of D with cardinality in the range [k, 2k-1], and let S ∈ F be a set in the collection. The greedy approach selects the S with the minimum diameter and adds it to the cover. The process repeats until all records are included in the cover. The cover is then converted to a partition so that the subsets are pairwise disjoint.

(a) Arbitrary partition
      a1   a2   a3
T1    1    1    1
T2    1    1    0
T4    0    1    1
T5    1    0    0

T3    1    0    1
T6    0    1    0
T7    0    0    1
T8    0    0    0

(b) Optimal partition
      a1   a2   a3
T1    1    1    1
T2    1    1    0
T3    1    0    1
T4    0    1    1

T5    1    0    0
T6    0    1    0
T7    0    0    1
T8    0    0    0

Figure 3 Two partitions of eight records
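The values quoted for Figure 3 can be reproduced with a few lines of Python. This is an illustrative sketch only: records are assumed to be 0/1 tuples over the items a1, a2, a3, and the function names `a`, `ad`, and `ad_partition` are chosen here to mirror the notation of the text.

```python
def a(S):
    """Minimum length a(S): number of items on which the records of S disagree."""
    return sum(1 for column in zip(*S) if len(set(column)) > 1)

def ad(S):
    """ad(S): for each item j, the smaller of (#records containing j)
    and (#records not containing j), summed over all items."""
    return sum(min(sum(column), len(S) - sum(column)) for column in zip(*S))

def ad_partition(partition):
    """ad(Pi): sum of ad(S) over the subsets S of the partition."""
    return sum(ad(S) for S in partition)

# Records T1..T8 of Figure 3 as (a1, a2, a3) tuples.
T = {1: (1, 1, 1), 2: (1, 1, 0), 3: (1, 0, 1), 4: (0, 1, 1),
     5: (1, 0, 0), 6: (0, 1, 0), 7: (0, 0, 1), 8: (0, 0, 0)}

part_a = [[T[i] for i in (1, 2, 4, 5)], [T[i] for i in (3, 6, 7, 8)]]
part_b = [[T[i] for i in (1, 2, 3, 4)], [T[i] for i in (5, 6, 7, 8)]]

print(sum(a(S) for S in part_a), ad_partition(part_a))  # 6 8
print(sum(a(S) for S in part_b), ad_partition(part_b))  # 6 6
```

Both partitions have the same minimum length sum, while ad(Π) separates them, which is the point made in the discussion of Figure 3.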
For a set-valued dataset, we propose a similar set-cover greedy algorithm. Instead of checking all subsets of D with cardinality in the range [k, 2k-1], we build the collection F from the sets of transactions that contain a frequent itemset with support count greater than or equal to k. The greedy approach then selects the S with the minimum operation count ad(S) and adds it to the cover. The proposed algorithm is given as follows.
Algorithm (Set_Anonymize)
Input: dataset D, privacy threshold k
Output: anonymized dataset D’ that satisfies k-anonymity
1. Find all frequent itemsets vi from D, let FIL = { vi };
2. For each vi, find the subset of transactions S(vi) containing vi and let F = {S(vi)};
3. Sort S(vi) in increasing order of ad(S(vi));
4. While (D ≠ ∅ and F ≠ ∅) {
5.    Remove S(vi) with the smallest ad(S(vi)) from D and F;
6.    Anonymize S(vi) and add to D’;
7.    Update vi and S(vi); }
8. Output D’;
IV. NUMERICAL EXPERIMENT
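Before turning to the measurements, the steps of Set_Anonymize from Section III can be sketched in Python. This is a rough illustration under several assumptions that the paper does not specify: frequent itemsets are enumerated by brute force up to a hypothetical `max_size` (a real frequent-itemset miner would replace this), the empty itemset is included so that any k remaining records form a candidate group, a group is anonymized by a per-item majority vote, and any leftover records (fewer than k) are mapped to the nearest pattern already produced.

```python
from itertools import combinations

def ad(S):
    """Additions plus deletions needed to make all 0/1 records in S identical."""
    return sum(min(c, len(S) - c) for c in map(sum, zip(*S)))

def majority(S):
    """Per-item majority pattern of the records in S (ties resolve to 0)."""
    return tuple(int(2 * c > len(S)) for c in map(sum, zip(*S)))

def set_anonymize(D, k, max_size=2):
    """Greedy sketch; returns anonymized records in the same order as D."""
    if len(D) < k:
        raise ValueError("need at least k records")
    n_items = len(D[0])
    # Candidate itemsets (assumption: brute force, including the empty itemset).
    itemsets = [v for r in range(max_size + 1)
                for v in combinations(range(n_items), r)]
    remaining = set(range(len(D)))
    out = [None] * len(D)
    patterns = []
    while len(remaining) >= k:
        best = None
        for v in itemsets:
            # S(v): remaining records that contain every item of v.
            group = [i for i in remaining if all(D[i][j] for j in v)]
            if len(group) >= k:  # v is still frequent
                cost = ad([D[i] for i in group])
                if best is None or cost < best[0]:
                    best = (cost, group)
        _, group = best  # the empty itemset guarantees a candidate exists
        pattern = majority([D[i] for i in group])
        patterns.append(pattern)
        for i in group:
            out[i] = pattern
        remaining -= set(group)
    # Assumption: leftovers join the closest existing pattern (Hamming distance).
    for i in remaining:
        out[i] = min(patterns,
                     key=lambda p: sum(x != y for x, y in zip(p, D[i])))
    return out

# Figure 2 data, records T1..T6 as (e1, e2, e3) tuples.
D = [(1, 1, 0), (0, 0, 1), (1, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 1)]
print(set_anonymize(D, 3))
# [(1, 0, 0), (1, 0, 1), (1, 0, 1), (1, 0, 0), (1, 0, 0), (1, 0, 1)]
```

On the data of Figure 2 with k = 3, the sketch first selects the group of records containing e3 (cost 1) and then the remaining three records (cost 1), reproducing the anonymized dataset of Figure 2b with two operations in total.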
To evaluate the performance of the proposed algorithm, we run simulations on the BMS-WebView-1 dataset and compare the results with the suppression-based algorithm proposed by Park and Shim [9]. The BMS-WebView-1 dataset contains 59,602 transactions and 497 items, with a maximum transaction length of 267 and an average length of 2.5. There are five implementations in [9]. The fastest is OPT-LB, which is not an algorithm for the k-anonymity problem but gives a lower bound on the number of suppressed cells in the optimal solution. As a preliminary test, frequent itemsets are used instead of closed frequent itemsets. We call the resulting suppression-based implementation modified a(S) and our approach ad(S).
All experiments reported in this section were performed on a Pentium-4 2.3 GHz machine with 2 GB of main memory, running the Microsoft Windows 2000 operating system. All the methods were implemented using Microsoft SQL Server 2000. Figure 4 shows the number of addition and deletion operations under different values of the privacy threshold k. The tested dataset consists of 10,000 records selected from the BMS-WebView-1 dataset. The number of items tested is 10, namely the top-10 frequent items. The proposed algorithm requires fewer addition and deletion operations than the previously proposed suppression-based approach. Figure 5 shows that both approaches require about the same amount of running time to achieve k-anonymity.
Figure 4 Varying k for number of operations
V. CONCLUSION
In this work, we have studied the privacy preserving network publishing problem in general, and the preservation of privacy against linking attacks on set-valued node data in particular. The problem of k-anonymizing relational and set-valued data with the minimum number of suppressed cells is known to be NP-hard. Approximation algorithms based on a set-cover greedy approach have been developed for relational data. We extend this approach and propose a technique that minimizes the number of operations needed to achieve k-anonymity. Numerical comparison with the modified suppression-based algorithm shows that our technique requires fewer operations. In the future, we will consider further improving the running time of the proposed approach, establishing its complexity bound, and combining it with structural anonymization.
ACKNOWLEDGMENT
This work was supported in part by the National Science Council, Taiwan, under grant NSC-99-2221-E-390-033.
REFERENCES
[1] L. Backstrom, D. P. Huttenlocher, J. M. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD, pages 44–54, 2006.
[2] M. Barbaro and T. Zeller Jr. A face is exposed for AOL searcher no. 4417749. New York Times, Aug 2006.
[3] J. Cheng, A. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. In SIGMOD Conference, pages 459–470, 2010.
[4] M. Hay, G. Miklau, D. Jensen, D. F. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. PVLDB, 1(1):102–114, 2008.
[5] Y. He and J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. In VLDB, 2009.
[6] K. Liu and E. Terzi. Towards identity anonymization on graphs. In SIGMOD Conference, pages 93–106, 2008.
[7] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proc. of PODS, 2004.
[8] R. Motwani and S. U. Nabar. Anonymizing unstructured data. arXiv:0810.5582v2 [cs.DB], 2008.
[9] H. Park and K. Shim. Approximate algorithms for k-anonymity. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 67–78, 2007.
[10] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information. In Proc. of ACM Symposium on Principles of Database Systems, page 188, 1998.
[11] L. Sweeney. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570, 2002.
[12] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In ICDE, pages 506–515, 2008.
[13] L. Zou, L. Chen, and M. T. Ozsu. K-automorphism: A general framework for privacy preserving network publication. In VLDB, 2009.