Privacy-preserving publishing microdata with full functional dependencies

Hui Wang, Ruilin Liu
Department of Computer Science, Stevens Institute of Technology, Hoboken, NJ 07030, USA

Article info

Article history: Received 23 January 2010; received in revised form 30 October 2010; accepted 2 November 2010; available online 10 November 2010.

Keywords: Privacy-preserving data publishing; Functional dependency; Utility

Abstract

Data publishing has generated much concern over individual privacy. Recent work has shown that different kinds of background knowledge can bring various threats to the privacy of published data. In this paper, we study the privacy threat posed by full functional dependencies (FFDs) used as part of the adversary's knowledge. We show that the cross-attribute correlations captured by FFDs (e.g., Phone → Zipcode) can introduce potential vulnerabilities. Unfortunately, none of the existing anonymization principles (e.g., k-anonymity, ℓ-diversity) can effectively protect against an FFD-based privacy attack. We formalize the FFD-based privacy attack and define the privacy model (d, ℓ)-inference to combat it. We distinguish the safe FFDs that will not jeopardize privacy from the unsafe ones. We design robust algorithms that efficiently anonymize the microdata with low information loss when unsafe FFDs are present. The efficiency and effectiveness of our approach are demonstrated by an empirical study.

1. Introduction

Recent years have witnessed vast volumes of data being collected on a large scale. Driven by mutual benefits, or by regulations that require certain data to be published, there is a demand for releasing collected data to the public. Although such publishing offers significant advantages for ad-hoc analysis in a variety of domains such as public health and population studies, it may immediately violate individual privacy, especially when the data is in microdata format, i.e., it contains detailed person-specific data in its original form. How to publish microdata with preserved privacy has received considerable attention in recent years.

It has been shown that simply removing explicit identifiers (IDs), e.g., name and SSN, from the released data is insufficient to protect privacy [31]. A set of non-ID attributes, called quasi-identifiers, that can uniquely identify individuals (e.g., the combination of zipcode, gender and date of birth) can be joined with information obtained from diverse external sources (e.g., public voting registration data) to re-identify the individuals in the released data. This is called the record linkage attack.

Various privacy principles have been proposed to defend against the record linkage attack. The earliest principle, k-anonymity, requires that each record be indistinguishable from at least k − 1 other records with respect to their quasi-identifiers [29,31]. An improved principle, called ℓ-diversity, further requires that every group of indistinguishable tuples contain at least ℓ distinct sensitive values [20]. Generalization [28,31] is a popular methodology to achieve both k-anonymity and ℓ-diversity. In particular, the microdata is partitioned into anonymization groups. For the tuples in the same group, their quasi-identifier values are generalized to be identical so that they are indistinguishable from each other with regard to their quasi-identifier values. For instance, consider the microdata in Table 1, in which the quasi-identifier attributes include age, sex and zipcode. Table 2(a) shows a 3-diversity version of the microdata in Table 1. In this 3-diversity table, each tuple is included in a group that contains at least three distinct sensitive values. Therefore the attacker cannot explicitly re-identify any individual and his/her sensitive values.


Table 1
Microdata.

  Name     Age   Sex   Zip     Phone     Disease
  Alice    20    F     07921   1111111   Ovarian cancer
  Bob      30    M     07920   2222222   Bronchitis
  Calvin   30    M     07902   3333333   Diabetes
  Doris    35    F     07921   1111111   Ovarian cancer
  Eve      40    F     07902   3333333   Bronchitis
  Flora    50    F     07903   2000001   Pneumonia

(Name is the identifier; Age, Sex and Zip are quasi-identifier attributes; Disease is the sensitive attribute; Phone is the determinant of the FFD Phone → Zip used in Example 1.1.)

Besides k-anonymity and ℓ-diversity, recent years have witnessed the proposal of several other privacy models, for example, t-closeness [18], (α, k)-anonymity [36], and (c, k)-safety [23]. All these privacy models address diverse kinds of adversary knowledge. However, none of them considers the full functional dependencies (FFDs) of the original microdata as part of the adversary knowledge. Indeed, if the attacker has obtained knowledge of the FFDs and applies that knowledge to released data that meets one of the aforementioned privacy principles (e.g., k-anonymity, ℓ-diversity), the attacker may be able to breach privacy. The privacy attack is illustrated by Example 1.1 below.

Example 1.1. Assume the microdata in Table 1 contains the functional dependency F: Phone → Zip, which states that any two tuples with the same phone number must have the same zipcode. Assume that the attacker possesses the knowledge of F. Then, when applying F to the 3-diversity table in Table 2(a), since the second group contains the phone numbers 1111111 and 3333333 with zipcode "0790*", the attacker can modify the zipcode values of the first and the third tuples from "079**" to "0790*". The anonymized table after the FFD-based inference is shown in Table 2(b). By the record linkage attack, the privacy guarantee on the second tuple (Bob's tuple) is only 1-diversity. Thus the attacker can explicitly decide that Bob has bronchitis. □

Example 1.1 shows that using FFDs as adversary knowledge may lead to privacy breaches. Given that it is not difficult for the attacker to obtain these functional dependencies from either common sense or other sources, it is necessary to develop a robust privacy criterion and anonymization algorithm for microdata that contains full functional dependencies.

There is a small body of work that considers adversary knowledge including data correlations [15,23,27]. In particular, Martin et al. [23] and Rastogi et al. [27] are the first to consider correlations between tuples as adversary knowledge. They show that if tuple correlations are available, then there exists privacy leakage on any published dataset that is of "meaningful" utility. Kifer [15] shows that the attacker may induce correlations from the sanitized dataset; the inferred correlations can enable potential vulnerabilities on the sanitized dataset. Although these studies identify the privacy threat posed by data correlations, none of them provides a solution to defend against the FFD-based attack.

In this paper, we study the problem of privacy-preserving publishing of microdata with FFDs as adversarial knowledge. We make the following contributions.

First, we formally define the FFD-based privacy attack. Intuitively, FFDs enable de-generalization of the anonymized data and thus lead to privacy breaches. Based on the impact of FFDs on privacy, we distinguish "safe" FFDs that cannot enable any FFD-based attack from the "unsafe" ones that can.

Table 2
Anonymized microdata before and after FFD inference.

(a) The 3-diversity table before FFD inference

  Age        Sex   Zip     Phone     Disease
  [20, 30]   *     079**   1111111   Ovarian cancer
  [20, 30]   *     079**   2222222   Bronchitis
  [20, 30]   *     079**   3333333   Diabetes
  [35, 50]   F     0790*   1111111   Ovarian cancer
  [35, 50]   F     0790*   3333333   Bronchitis
  [35, 50]   F     0790*   2000001   Pneumonia

(b) The table after FFD inference

  Age        Sex   Zip     Phone     Disease
  [20, 30]   *     0790*   1111111   Ovarian cancer
  [20, 30]   *     079**   2222222   Bronchitis   <- Bob's tuple, now 1-diverse
  [20, 30]   *     0790*   3333333   Diabetes
  [35, 50]   F     0790*   1111111   Ovarian cancer
  [35, 50]   F     0790*   3333333   Bronchitis
  [35, 50]   F     0790*   2000001   Pneumonia

(In (b), the zip codes of the first and third tuples have been refined from "079**" to "0790*" via the FFD Phone → Zip, as described in Example 1.1.)
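To make the FFD-based inference concrete, the following minimal sketch (ours, not code from the paper) replays Example 1.1 on the published tuples of Table 2(a). The refine helper only handles wildcard zip strings and is an assumption made for illustration; the paper's general intersection operator ∩⁎ for intervals and taxonomy values is defined in Section 3.1.

```python
# Replay of Example 1.1: the adversary knows F: Phone -> Zip and refines the
# generalized zip codes of tuples that share a phone number.

def refine(z1, z2):
    """Return the more specific of two generalized zip patterns, if comparable."""
    def matches(specific, general):
        return all(g == '*' or g == s for s, g in zip(specific, general))
    if matches(z1, z2):
        return z1
    if matches(z2, z1):
        return z2
    return None  # incomparable generalizations

# Published 3-diverse tuples from Table 2(a): (age, sex, zip, phone, disease).
published = [
    ("[20,30]", "*", "079**", "1111111", "Ovarian cancer"),
    ("[20,30]", "*", "079**", "2222222", "Bronchitis"),
    ("[20,30]", "*", "079**", "3333333", "Diabetes"),
    ("[35,50]", "F", "0790*", "1111111", "Ovarian cancer"),
    ("[35,50]", "F", "0790*", "3333333", "Bronchitis"),
    ("[35,50]", "F", "0790*", "2000001", "Pneumonia"),
]

# Tuples with the same phone number must have the same zip, so their
# generalized zips can be intersected across QI-groups.
best_zip = {}
for _, _, z, phone, _ in published:
    best_zip[phone] = refine(best_zip[phone], z) if phone in best_zip else z

inferred = [(a, s, best_zip[p], p, dis) for a, s, z, p, dis in published]
for row in inferred:
    print(row)  # rows 1 and 3 now carry "0790*", isolating Bob's tuple (row 2)
```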


Second, we define the (d, ℓ)-inference model to defend against the FFD-based attack. The (d, ℓ)-inference model requires that every anonymization group contain sensitive values of similar frequency, where the similarity is controlled by d. Furthermore, it requires that any two anonymization groups have either zero or at least ℓ overlapping distinct sensitive values, as well as at least ℓ non-overlapping distinct sensitive values.

Third, we propose three novel grouping strategies to group sensitive values into smaller partitions. The key to achieving (d, ℓ)-inference is the appropriate grouping of sensitive values. For each grouping strategy, we analyze the amount of information loss by tuple suppression (i.e., tuple removal).

Fourth, we propose efficient anonymization algorithms that produce a (d, ℓ)-inference anonymization scheme with low information loss. The anonymization algorithm consists of two steps, phase-1 partition and phase-2 QI-group construction. We prove that finding the optimal partitioning scheme with minimal information loss is NP-hard. Thus we propose two heuristics, namely the top-down and bottom-up approaches, to construct partitions with low information loss.

Fifth, we study the impact of multiple unsafe FFDs on anonymization. We design efficient anonymization algorithms for multiple unsafe FFDs, and measure both the time performance and the information loss of the anonymization algorithm empirically.

Last but not least, we demonstrate the efficacy of our approach by an extensive set of experiments. Our experimental results show that our approach can efficiently anonymize the microdata with low information loss when FFDs are available.

Our previous work [33] initiated the research on privacy-preserving publishing of microdata that contains full functional dependencies. It identified the privacy attack that utilizes functional dependencies as adversary knowledge, and defined a privacy model that addresses such an attack. In this paper, we extend it significantly as follows:

• [33] only considers numerical data. We extend it to cover categorical data and re-model our privacy framework accordingly.
• The anonymization algorithm in [33] only uses intersection-grouping as its grouping strategy, which may produce a large amount of information loss. Therefore, we propose two new grouping strategies (disjoint-grouping and containment-grouping) that can dramatically reduce the information loss of anonymization, and quantify the amount of information loss of all three grouping strategies. Furthermore, we formally prove that finding an optimal anonymization scheme that yields the minimal information loss is NP-hard, and design two efficient utility-driven heuristic anonymization algorithms that greatly reduce information loss. We evaluate the performance of the two heuristic algorithms with an extensive set of experiments.
• [33] only considers a single FFD. We extend the reasoning to multiple FFDs. We observe that some FFDs F can act as representatives of the others; an anonymized dataset that satisfies (d, ℓ)-inference in the presence of F must satisfy (d, ℓ)-inference in the presence of all the FFDs that F represents. Thus increasing the number of FFDs may not impact the performance of the anonymization algorithm. Based on this observation, we design efficient anonymization algorithms for multiple unsafe FFDs, and measure both the time performance and the information loss of the anonymization algorithm empirically.
The rest of the paper is organized as follows: Section 2 introduces the preliminaries; Section 3 defines our privacy model; Section 4 proposes three grouping strategies; Section 5 presents the details of our anonymization algorithm; Section 6 extends the investigation to multiple FFDs; Section 7 presents the experimental results; Section 8 discusses the related work; and Section 9 concludes the paper.

2. Preliminaries

In this section, we introduce the preliminaries.

2.1. Functional dependency

A functional dependency (FD) is a type of integrity constraint. Given two sets of attributes X and Y, a database instance D satisfies the FD F: X → Y if for every two tuples t1, t2 ∈ D, t1.X = t2.X implies t1.Y = t2.Y. We call X the determinant attributes and Y the dependent attributes, and their values the determinant values and dependent values. A functional dependency F: X → Y is called a full FD (FFD) if it holds for all values of X and Y; otherwise it is called a conditional FD (CFD). In this paper, we only consider FFDs.

2.2. Application scenario

We consider the 2-phase data collection and publishing model of [10]. First, in the data collection phase, the data publisher collects data from record owners. In the data publishing phase, the data publisher releases the collected data to the data recipient, who can be a third-party service provider or the public, for analysis of the published data. We consider the trusted model of data publishers [12], in which the data publisher is trustworthy and record owners are willing to provide their personal information to the data publisher. We follow the assumption in [10] that the publisher has the expertise to fulfill the data anonymization task.

2.3. Anonymization framework

Let D be a microdata table that stores private information about a set of individuals. There are three types of attributes in D: (i) identifiers (ID), which form the primary key of D, (ii) quasi-identifiers (QI), whose combination can serve as a key and uniquely identify an individual,
and (iii) sensitive attributes S. Table 1 shows an example of these three types of attributes. We call the quasi-identifier attributes QI-attributes, and the values of QI-attributes QI-values.

A popular anonymization technique is suppression and generalization [28,31]. By suppression, some tuples or data values are removed from the released dataset. By generalization, numerical QI-values are recoded as an interval (e.g., age 20 is recoded as [20, 40]), while categorical QI-values are replaced with higher-level domain values in a taxonomy tree (e.g., country "US" is replaced with "North America"). The purpose of generalization is to hide each individual tuple in a group, called a QI-group. More formally, given a microdata table D, it is partitioned into a set of QI-groups that satisfy the following conditions: (1) each tuple belongs to exactly one QI-group, (2) each QI-group only contains tuples in D, (3) all QI-groups are disjoint, and (4) all tuples in the same QI-group have the same QI-values after generalization.

It is important to measure the information loss incurred by anonymization. Existing metrics that measure this information loss include the generalization height [16,28], the discernibility metric [3], and the accuracy of aggregate queries [37]. In this paper, we consider the information loss as the distance between the original QI-values and their anonymized counterparts, and use the generalized loss metric [14] and the similar normalized certainty penalty (NCP) [38] to measure it. For both metrics, the information loss is measured as a ratio. We assume that for each categorical QI-attribute, there exists a taxonomy tree. We define the information loss by suppression and generalization individually.

• Information loss by suppression: the information loss of suppressing value v is IL_v = 1.
• Information loss by generalization:
  (1) For any numerical attribute A, let v be a value of A that is generalized to an interval [L, U], and let Min and Max be the minimum and maximum values of A. Then the information loss of v is IL_v = (U − L)/(Max − Min).
  (2) For any categorical attribute A, let T be its taxonomy tree. Let v and v′ be a data value of A before and after generalization, and let N be the node of T corresponding to v′. Then the information loss of v is IL_v = (M_N − 1)/(M − 1), where M_N is the number of leaf nodes in the subtree rooted at N, and M is the total number of leaf nodes in T.

Given a tuple t, its information loss is IL_t = (Σ_{v_i ∈ t} IL_{v_i})/m, where m is the number of attributes of t. The information loss of the microdata D is defined as IL_D = (Σ_{t ∈ D} IL_t)/|D|. For instance, assume the domain range of the age attribute is [10, 70]. Then, when generalizing the age value 20 to [15, 30], its information loss equals (30 − 15)/(70 − 10) = 1/4.

3. Privacy model

In this section, we first define the privacy attack based on full functional dependencies. Second, we formally define the (d, ℓ)-inference model that combats the FFD-based attack.

3.1. FFD-based attack

To analyze the impact of FFDs on privacy, we consider a popular privacy model, ℓ-diversity [20]. It requires that each QI-group consist of at least ℓ "well-represented" distinct values that are of close frequency. We define d-closeness to address the requirement of close frequency. In particular, we say two data values s1 and s2 are d-close if |f1 − f2| ≤ d, where f1 and f2 are the count frequencies of s1 and s2.
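As a quick aside, the loss ratios of Section 2.3 can be computed as in the following sketch (our illustration; the categorical leaf counts are hypothetical, while the age example follows the text):

```python
# Sketch of the information-loss ratios of Section 2.3 (illustrative only).

def il_numeric(low, up, attr_min, attr_max):
    """Loss of generalizing a numeric value to the interval [low, up]."""
    return (up - low) / (attr_max - attr_min)

def il_categorical(leaves_under_node, leaves_total):
    """Loss of generalizing a categorical value to a taxonomy-tree node."""
    return (leaves_under_node - 1) / (leaves_total - 1)

def il_tuple(per_value_losses):
    """Per-tuple loss: average of the per-value losses over the m attributes."""
    return sum(per_value_losses) / len(per_value_losses)

# Age 20 generalized to [15, 30] over the domain [10, 70]: (30-15)/(70-10) = 0.25.
assert il_numeric(15, 30, 10, 70) == 0.25
# "US" generalized to "North America": with 3 leaves under that node out of 10
# leaves in total (hypothetical numbers), the loss is (3-1)/(10-1) = 2/9.
print(il_categorical(3, 10))
# Suppressing a value always costs 1.
IL_SUPPRESSED = 1.0
```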
We say a group G consisting of a set of distinct data values is d-close if any two values si, sj ∈ G are d-close. The definition of d-closeness is a simplified version of "well-represented" in [20]. Based on d-closeness, we give a simplified version of ℓ-diversity.

Definition 3.1 [(ℓ, d)-diversity]. Given a microdata D and its anonymized version D⁎, we say D⁎ is (ℓ, d)-diverse if for each tuple t ∈ D⁎, its QI-group G consists of at least ℓ distinct sensitive values that are d-close. □

Next, we explain how FFDs enable a privacy breach on microdata that is (ℓ, d)-diverse. The attack is possible when different QI-groups share some values (either sensitive or QI-values). In particular, assume that both QI-groups G1 and G2 include a sensitive value a, which is also a determinant value of the FFD F: A → B (e.g., phone number "1111111" in Example 1.1). Let b be the corresponding dependent value of a in the original microdata (e.g., zipcode "07921" in Example 1.1). If a is not generalized in either G1 or G2, but b is generalized to b⁎1 and b⁎2 (b⁎1 ≠ b⁎2) in G1 and G2 (e.g., the generalized zipcodes "079**" and "0790*" in Example 1.1), then the attacker can infer that b⁎1 and b⁎2 must correspond to the same original value. Thus he/she can "intersect" b⁎1 and b⁎2. We define the intersection operation (denoted ∩⁎) on generalized values below. Specifically, given two generalized values b⁎1 and b⁎2:

• If b⁎1 and b⁎2 are two intervals [l1, u1] and [l2, u2]: if these two intervals overlap, then b⁎1 ∩⁎ b⁎2 = [max(l1, l2), min(u1, u2)]; otherwise b⁎1 ∩⁎ b⁎2 = NULL.
• If b⁎1 and b⁎2 are categorical values: let N1 and N2 be the nodes corresponding to b⁎1 and b⁎2 in the taxonomy tree. If N1 and N2 are on the same path, b⁎1 ∩⁎ b⁎2 returns the lower of N1 and N2 in the taxonomy tree; otherwise b⁎1 ∩⁎ b⁎2 = NULL.

For example, given two generalized Age values b⁎1 = [20, 40] and b⁎2 = [30, 50], b⁎1 ∩⁎ b⁎2 = [30, 40]. As another example, for two generalized Gender values b⁎1 = ⁎ (i.e., the b1 value is generalized to ⁎) and b⁎2 = F, b⁎1 ∩⁎ b⁎2 = F. Note that for any b⁎1 and b⁎2 that are generalized from the same value b, b⁎1 ∩⁎ b⁎2 always returns a non-null result. The attacker can then replace both b⁎1 and b⁎2 with b⁎1 ∩⁎ b⁎2, which is more specific. The value replacement separates the tuples in G1 (resp. G2) into two sets.
We formally define these two sets below. To distinguish them from the conventional set intersection/difference operations, we use G1 ∩F G2 and G1 −F G2 to denote the set intersection/difference operations on G1 and G2 using the FFD F. Since every QI-group has identical QI-values on each QI-attribute, we use G[A] to denote the generalized QI-value of attribute A in QI-group G.

Definition 3.2 [FFD-based intersection/difference of QI-groups]. Given the FFD F: A → B of the microdata D and two QI-groups G1, G2, let GI = {t | t ∈ G1, ∃ t′ ∈ G2 s.t. t[A] = t′[A]}. Then G1 ∩F G2 is a set of tuples J s.t. for each tuple t ∈ GI, there exists at least one tuple t′ ∈ J s.t. (1) for each attribute A ∈ A, t′[A] = t[A], and (2) for each attribute B ∈ B,

  t′[B] = t[B]                 if t[B] is not generalized,
  t′[B] = G1[B] ∩⁎ G2[B]       if t[B] is generalized.

Furthermore, G1 −F G2 = G1 − GI. □



For example, for the two QI-groups G1 and G2 in Table 2(a) with the FFD F: Phone → Zip, G1 ∩F G2 = {([20,30], *, 0790*, 1111111, Ovarian cancer), ([20,30], *, 0790*, 3333333, Diabetes)}, and G1 −F G2 returns the second tuple of G1. Note that the ∩F operation is not symmetric, i.e., G1 ∩F G2 ≠ G2 ∩F G1.

After the FFD-based QI-group intersection/difference operation, the attacker may be able to distinguish the tuples in G1 ∩F G2 from those in G1 −F G2, since they have different QI-values. Indeed, the old QI-group G1 can be replaced by G1 ∩F G2 and G1 −F G2, two relatively smaller QI-groups. This may bring privacy leakage. Definition 3.3 formally defines the FFD-based privacy attack.

Definition 3.3 [FFD-based privacy attack]. Given a microdata D, let D⁎ be its generalized version that satisfies (ℓ, d)-diversity. Then the FFD F: A → B (A, B ⊆ QI ∪ S) enables the FFD-based privacy attack if there exist two QI-groups G1, G2 ∈ D⁎ such that at least one of the following is a non-empty set that does not satisfy (ℓ, d)-diversity: (1) G1 ∩F G2, (2) G2 ∩F G1, (3) G1 −F G2, and (4) G2 −F G1. Otherwise, we say D⁎ is safe in the presence of F. □

Example 1.1 shows an instance of the FFD-based privacy attack. Note that Definition 3.3 is not limited to (ℓ, d)-diversity; by replacing the (ℓ, d)-diversity requirement with other specific privacy principles (e.g., k-anonymity, ℓ-diversity), Definition 3.3 can validate the robustness of those principles against the FFD-based attack as well.

Although FFDs may threaten privacy, not all FFDs can enable the FFD-based attack. Based on this, we define safe and unsafe FFDs.

Definition 3.4 [Safe/unsafe FFDs]. An FFD F is safe if for any microdata D that satisfies F, all of its possible generalized versions D⁎ are safe in the presence of F. Otherwise, we say F is unsafe. □

Based on this definition, we next discuss how to distinguish the "safe" FFDs from the "unsafe" ones. We assume that sensitive values are kept unchanged in the anonymized datasets. This assumption holds for most of the existing work (e.g., [20,31]). We also assume that generalization of the same set of QI-values always returns the same result. For instance, for two QI-groups G1 and G2 that contain the age values {20, 30, 40}, applying generalization to them always returns the range [20, 40]. We have:

Theorem 3.1 ((Un)safe FFDs). Given the microdata D, let F: A → B (A, B ⊆ QI ∪ S) be one of its FFDs. Then F is safe iff A ⊆ QI. Otherwise, F is unsafe.

Proof. We first prove that if A ⊆ QI, then F is safe. There are two cases: the values of A are either generalized or not. If they are generalized, then since the generalized values of A in the same QI-group are homogenized, applying the FFD to these values will not infer different dependent values. Thus it cannot enable the FFD-based privacy attack. On the other hand, suppose the values of A are not generalized. Our goal is to prove that for any two QI-groups G1 and G2 that satisfy (ℓ, d)-diversity, when G1[A] ∩ G2[A] ≠ ∅, the sets G1 ∩F G2, G2 ∩F G1, G1 −F G2, and G2 −F G1 still satisfy (ℓ, d)-diversity. This can be proven as follows: when G1[A] ∩ G2[A] ≠ ∅, for each attribute A ∈ A, both G1 and G2 must have the same QI-values on A, as each QI-group only contains homogenized QI-values. Those identical determinant QI-values lead to the same dependent values in the original dataset. If these dependent values are not generalized in G1 and G2, then G1 ∩F G2 = G1, and G1 −F G2 is empty. Similar results apply to G2 ∩F G1 and G2 −F G1.
If these dependent values are generalized in G1 and G2, then according to our assumption, generalization of those identical dependent values returns the same result. Applying intersection to these identical generalized values does not produce any more specific result. Thus G1 ∩F G2 = G1, G2 ∩F G1 = G2, and both G1 −F G2 and G2 −F G1 are empty sets. Hence, in both cases, all the non-empty intersection/difference results of G1 and G2 still satisfy (ℓ, d)-diversity, i.e., G1 and G2 are still safe in the presence of F.

We then prove that if F is safe, then A ⊆ QI, by contrapositive; that is, we prove that if A ∩ S ≠ ∅, then F is not safe. The reasoning is based on the assumption that sensitive values are not generalized. We have shown that the FFD-based attack on the sensitive values, which remain ungeneralized, can lead to de-generalization of QI-values. Thus F is not safe. □

Theorem 3.1 shows that an unsafe FFD must include at least one sensitive attribute among its determinant attributes. For these unsafe FFDs, we have to define a robust privacy principle to eliminate the possible privacy attacks that they can bring. Before we present the privacy definition, we first define the largest overlapped QI-groups.


Definition 3.5 [Largest overlapped QI-groups]. Given an unsafe FFD F: A → B and a set of QI-groups G = {G1, …, Gn}, let Si be the set of distinct sensitive values of A in the QI-group Gi (1 ≤ i ≤ n). Then for any QI-group Gi ∈ G, its largest overlapped QI-groups O = {Gj, …, Gk} is a subset of G such that: (1) |Si ∩ Sj ∩ … ∩ Sk| ≠ 0, and (2) for any O′ = {Gj, …, Gk, …, Gl} such that O ⊂ O′, |Si ∩ Sj ∩ … ∩ Sk ∩ … ∩ Sl| = 0. □

The largest overlapped QI-groups define the largest set of QI-groups whose intersection of sensitive values is non-empty. Now we are ready to define our privacy model, (d, ℓ)-inference.

Definition 3.6 [(d, ℓ)-inference]. Given microdata D that contains the unsafe FFD F: A → B, let D⁎ be its anonymized version that contains the QI-groups G = {G1, …, Gn}. Let Si be the set of distinct sensitive values of A in the QI-group Gi (1 ≤ i ≤ n). Then D⁎ satisfies (d, ℓ)-inference in the presence of F if:

• d-close: for each Gi in G, all sensitive values in Gi are d-close,
• ℓ-overlapping: for each Gi in G, let O = {Gj, …, Gk} be its largest overlapped QI-groups; then |Si ∩ Sj ∩ … ∩ Sk| ≥ ℓ, i.e., there are at least ℓ shared distinct values in Si ∩ Sj ∩ … ∩ Sk, and
• ℓ-non-overlapping: for any two Gi, Gj in G, |Si − Sj| ≥ ℓ, i.e., there are at least ℓ non-overlapping distinct values in Si − Sj.
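These conditions can be checked directly on the sets of sensitive values assigned to QI-groups. The sketch below is our illustration (function names and example groups are hypothetical): it checks d-closeness per group and the overlap/difference conditions pairwise, treating an empty intersection or difference as acceptable — consistent with Definition 3.3, which only constrains non-empty results — rather than enumerating the largest overlapped QI-groups of Definition 3.5.

```python
from itertools import combinations

def d_close(group, d):
    """group: dict mapping a sensitive value to its frequency in one QI-group."""
    freqs = list(group.values())
    return max(freqs) - min(freqs) <= d

def dl_inference_pairwise(groups, d, l):
    """Necessary (pairwise) check of the (d, l)-inference conditions."""
    # Each group must hold at least l distinct, d-close sensitive values.
    if any(len(g) < l or not d_close(g, d) for g in groups):
        return False
    for g1, g2 in combinations(groups, 2):
        s1, s2 = set(g1), set(g2)
        overlap = s1 & s2
        if overlap and len(overlap) < l:        # zero or >= l shared values
            return False
        for diff in (s1 - s2, s2 - s1):         # zero or >= l non-shared values
            if diff and len(diff) < l:
                return False
    return True

# Illustrative groups (hypothetical diseases and frequencies).
groups = [
    {"flu": 5, "cold": 4, "asthma": 5, "bronchitis": 4},
    {"flu": 4, "cold": 5, "diabetes": 6, "anemia": 4},
]
print(dl_inference_pairwise(groups, d=2, l=2))   # True
```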

Intuitively, ℓ-overlapping and ℓ-non-overlapping ensure that the result of applying the ∩F and −F operations to any subset of QI-groups still satisfies (d, ℓ)-inference. To avoid considering the exponential number of all possible subsets, both ℓ-overlapping and ℓ-non-overlapping only consider the largest subsets that return a non-null result. We note that the (d, ℓ)-inference model does not reason over the generalized datasets; instead, it reasons only over the sensitive values and thus can be applied to the original microdata directly. To show the robustness of the (d, ℓ)-inference model, we have the following theorem:

Theorem 3.2 (Robustness of (d, ℓ)-inference). Given a microdata D, if for any unsafe FFD F of D, the anonymized version D⁎ of D satisfies (d, ℓ)-inference in the presence of F, then D⁎ is safe in the presence of F.

Proof. As shown by Theorem 3.1, an unsafe FFD A → B must satisfy A ∩ S ≠ ∅. There are two cases: A ⊆ S, and A ∩ QI ≠ ∅. For the case that A ⊆ S, since there are always at least ℓ overlapping and ℓ non-overlapping distinct sensitive values (of roughly the same frequency) between different QI-groups, for any two groups G1 and G2, the ℓ-overlapping requirement ensures that G1 ∩F G2 and G2 ∩F G1 satisfy (ℓ, d)-diversity. Similarly, the ℓ-non-overlapping requirement ensures that G1 −F G2 and G2 −F G1 satisfy (ℓ, d)-diversity. For the case that A ∩ QI ≠ ∅, since all QI-values in the same anonymized QI-group are identical, whether generalized or not, they cannot enable the FFD-based attack; the proof of this case is then the same as the case A ⊆ S. □

The (d, ℓ)-inference model can be considered an enhanced version of (ℓ, d)-diversity that defends against the FFD-based attack. However, it is not limited to (ℓ, d)-diversity; the ℓ-overlapping and ℓ-non-overlapping conditions can be adapted to any privacy model as an enhancement to combat the FFD-based privacy attack. For instance, they can be adapted to the k-anonymity model by changing the requirement of ℓ-overlapping and ℓ-non-overlapping to k-overlapping and k-non-overlapping.

4. Grouping strategies

The key to achieving (d, ℓ)-inference is to appropriately group the sensitive values so that all these groups meet the three conditions of (d, ℓ)-inference. Given that the sensitive values of any two QI-groups must be either disjoint or overlapping, we design three grouping strategies, namely disjoint-grouping, containment-grouping, and intersection-grouping:

(1) Disjoint-grouping (DG): partition the sensitive values into groups that do not overlap.
(2) Containment-grouping (CG): partition the sensitive values into groups that follow a strict containment relationship.
(3) Intersection-grouping (IG): partition the sensitive values into groups that intersect (but do not contain each other) in a chain.

Fig. 1 illustrates the grouping results of these three strategies. Next, we explain the details of the three grouping strategies.

Fig. 1. Illustration of three grouping strategies on sensitive values.


4.1. Disjoint-grouping (DG)

Disjoint-grouping (DG) partitions the sensitive values into groups that do not overlap. It consists of three steps. First, the distinct sensitive values are sorted by their frequencies in descending order. The reason for using descending order is to minimize the number of tuples to be removed. Second, starting from the first sensitive value in the sorted result, ℓ adjacent distinct sensitive values are put into the same group. This procedure is repeated until either all sensitive values are grouped or the number of ungrouped distinct sensitive values is less than ℓ. Third, each group G that consists of sensitive values S = {si, …, sj} (j − i ≥ ℓ) is trimmed to satisfy d-closeness. In particular, for each sensitive value si ∈ G, max(fi − fj − d, 0) tuples containing si are removed, where fj is the smallest frequency of the sensitive values in S. It is possible that at the end of partitioning, fewer than ℓ distinct sensitive values remain. There are two options for these values: they are either removed or merged into the last group, whose sensitive values have the closest count frequencies to theirs. Of these two options, we pick the one that removes fewer tuples. Example 4.1 gives more details.

Example 4.1. Given the sensitive values {s1, s2, s3, s4, s5, s6, s7, s8, s9} of frequency (50, 45, 40, 30, 26, 10, 9, 7, 1), with d = 3 and ℓ = 2. By disjoint-grouping, first, we have G1 = {s1, s2} of frequency (48, 45), with 2 tuples containing s1 removed; G2 = {s3, s4} of frequency (33, 30), with 7 tuples containing s3 removed; and G3 = {s5, s6} of frequency (13, 10), with 13 tuples containing s5 removed. There are two options to group the remaining sensitive values {s7, s8, s9}: (1) Option 1: G4 = {s7, s8, s9} of frequency (4, 4, 1), with 5 tuples of s7 and 3 tuples of s8 removed; (2) Option 2: G4 = {s7, s8} of frequency (9, 7), with 1 tuple of s9 removed. We pick Option 2 since it removes fewer tuples. In total, 2 + 7 + 13 + 1 = 23 tuples are removed. □

Intuitively, given n sensitive values, DG partitions them into ⌊n/ℓ⌋ groups that are disjoint from each other; each group contains at least ℓ sensitive values that are d-close. Thus the QI-groups constructed from these groups must satisfy the (d, ℓ)-inference requirement. Based on the details of the DG scheme, we can compute the total number of tuples to be removed by DG. To be more specific, given a set of distinct sensitive values {s1, …, sn} sorted in descending order of their frequencies, let fi (1 ≤ i ≤ n) be the frequency of si. Then the number of removed tuples that contain si is:

  r_i^DG = max(fi − fk − d, 0)    if si is included in a group {sj, …, sk} (k − j ≥ ℓ),
  r_i^DG = fi                     otherwise.

The total information loss of tuple suppression by using DG on {s1, …, sn} equals DG(1, n) = Σ_{i=1}^{n} r_i^DG. The complexity of DG is O(n), where n is the number of distinct sensitive values to be grouped.

4.2. Containment-grouping (CG)

The containment-grouping approach partitions the sensitive values into groups G1, …, Gm such that Gm ⊆ Gm−1 ⊆ … ⊆ G1. Fig. 1(b) illustrates the anonymization effect. In particular, given n distinct sensitive values sorted by their frequencies in ascending order, we start by constructing the first group that contains all n sensitive values. Then we update the frequencies of these n sensitive values, move forward along the next ℓ distinct sensitive values, and construct the second group that contains n − ℓ sensitive values. For those ℓ sensitive values that are only used in the previous group, their frequencies are adjusted to satisfy d-closeness by removing tuples. We repeat this procedure until the number of remaining sensitive values is less than ℓ. Since these residual sensitive values cannot construct a group of size at least ℓ, they are removed. We use Example 4.2 for a better illustration of the CG procedure.

Example 4.2. Given the sensitive values {s1, s2, s3, s4, s5, s6, s7, s8, s9} of frequency (1, 7, 9, 10, 26, 30, 40, 45, 50) (the same as Example 4.1 but in reverse order), with d = 3 and ℓ = 2. The grouping procedure is as follows.

(1) Construct G1 (s1, s2, s3, s4, s5, s6, s7, s8, s9) of frequency (1, 4, 4, 4, 4, 4, 4, 4, 4); 3 tuples of s2 are removed. The remaining sensitive values are (s3, s4, s5, s6, s7, s8, s9) of frequency (5, 6, 22, 26, 36, 41, 46).
(2) Construct G2 (s3, s4, s5, s6, s7, s8, s9) of frequency (5, 6, 8, 8, 8, 8, 8). The remaining sensitive values are (s5, s6, s7, s8, s9) of frequency (14, 18, 28, 33, 38).
(3) Construct G3 (s5, s6, s7, s8, s9) of frequency (14, 17, 17, 17, 17); 1 tuple of s6 is removed. The remaining sensitive values are (s7, s8, s9) of frequency (11, 16, 21).
(4) Construct G4 (s7, s8, s9) of frequency (11, 14, 14). The remaining sensitive values are (s8, s9) of frequency (2, 7).

At last, the 2 remaining tuples that contain s8 and the 7 remaining tuples that contain s9 are removed. The total number of removed tuples is 3 + 1 + 2 + 7 = 13. □

Intuitively, given n sensitive values, CG partitions them into ⌊(n − ℓ)/ℓ⌋ + 1 groups G1, …, Gt s.t. (1) each group Gi contains at least ℓ d-close distinct sensitive values, and (2) Gi ⊆ Gi−1, with |Gi−1 − Gi| ≥ ℓ. Thus the QI-groups constructed from these groups satisfy (d, ℓ)-inference. Theorem 4.1 gives the number of tuples that are removed by CG.
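Before stating the theorem, the removal counts of DG and CG can be sketched as follows (our code, not the authors' implementation; it operates directly on frequency lists and reproduces the totals of Examples 4.1 and 4.2):

```python
def dg_removed(freqs, d, l):
    """Tuples removed by disjoint-grouping; freqs are the per-value counts."""
    f = sorted(freqs, reverse=True)
    k = len(f) // l                          # number of complete groups of size l
    if k == 0:
        return sum(f)                        # fewer than l values: drop them all
    groups = [f[i * l:(i + 1) * l] for i in range(k)]
    leftover = f[k * l:]

    def trim(g):                             # removals to make group g d-close
        return sum(max(x - g[-1] - d, 0) for x in g)

    cost = sum(trim(g) for g in groups[:-1])
    last = groups[-1]
    if leftover:
        # Option 1: merge leftovers into the last group; option 2: drop them.
        cost += min(trim(last + leftover), trim(last) + sum(leftover))
    else:
        cost += trim(last)
    return cost


def cg_removed(freqs, d, l):
    """Tuples removed by containment-grouping (cf. Theorem 4.1)."""
    f = sorted(freqs)
    n = len(f)
    if n < l:
        return sum(f)                        # cannot build even one group
    last_start = ((n - l) // l) * l          # start index of the last group
    removed = 0
    for i, fi in enumerate(f):
        j = min((i // l) * l, last_start)    # start of the last group holding s_i
        removed += max(fi - f[j] - d, 0)
    return removed


assert dg_removed([50, 45, 40, 30, 26, 10, 9, 7, 1], d=3, l=2) == 23  # Ex. 4.1
assert cg_removed([1, 7, 9, 10, 26, 30, 40, 45, 50], d=3, l=2) == 13  # Ex. 4.2
```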


Theorem 4.1 (Number of removed tuples by CG). Given a set of distinct sensitive values {s1, …, sn} sorted in ascending order of their frequencies, let fi (1 ≤ i ≤ n) be the frequency of si. Let {sj, …, sn} be the last group constructed by CG that contains the sensitive value si. Then the number of removed tuples that contain si equals r_i^CG = max(fi − fj − d, 0). The total information loss of tuple suppression by using CG on {s1, …, sn} equals CG(1, n) = Σ_{i=1}^{n} r_i^CG. □

The proof of Theorem 4.1 is in Appendix A. Theorem 4.1 shows that even though the CG procedure updates the frequencies repeatedly, the number of removed tuples can be computed directly from the original microdata. For instance, in Example 4.2, the number of removed tuples that contain s9 equals f9 − f7 − d = 50 − 40 − 3 = 7. Note that although the formula for the number of removed tuples by CG is similar to that of DG, the two approaches sort the sensitive values in different orders and therefore remove different numbers of tuples. Examples 4.1 and 4.2 show a case in which the CG approach removes fewer tuples; but if we change the frequency of s4 from 10 to 25, DG wins. The complexity of CG is O(n), where n is the number of distinct sensitive values to be grouped.

4.3. Intersection-grouping (IG)

The CG approach requires strict containment relationships between groups. However, as long as there are at least ℓ overlapping and ℓ non-overlapping sensitive values between different groups, the containment relationship is not necessary. Thus we design the intersection-grouping strategy, which allows different groups to share at least ℓ distinct sensitive values without one being contained in the other. To avoid considering intersections of arbitrary groups, we do not allow the intersection of more than two groups. Furthermore, we require that all overlapping groups form a chain, i.e., given a set of groups G1, …, Gm, Gi intersects with Gi+1 and Gi−1, but not with the others. Fig. 1(c) illustrates the effect of the intersection-grouping strategy. The intersection-grouping approach consists of two steps: (1) construct order-preserving, ℓ-diverse, disjoint buckets, and (2) construct intersecting groups from the buckets.

4.3.1. Step 1: Bucket construction

First, we sort the distinct sensitive values by their frequencies in ascending order. Based on the sorted result, adjacent sensitive values that satisfy d-closeness are grouped into the same bucket. After bucketing is finished, only the buckets that contain at least ℓ distinct sensitive values are kept. For any residual sensitive value s that cannot be bucketized, there are three options: (1) it is removed, (2) it is added to an existing bucket, or (3) it is added to a new bucket with other residual sensitive values. For the second option, we pick the bucket whose maximum frequency is the largest value that is less than the frequency of s. For the third option, we pick the other ℓ − 1 residual sensitive values whose frequencies are closest to that of s. Out of these three options, we choose the one that requires removing the fewest tuples to make the buckets satisfy d-closeness. After all sensitive values are bucketized, "large" buckets are split into smaller ones. In particular, each bucket Bi is split into ⌊ni/ℓ⌋ smaller disjoint buckets, where ni is the number of distinct sensitive values in Bi.
The splitting is order-preserving, i.e., the frequency of every sensitive value in bucket Bi is smaller than that of every sensitive value in the split bucket Bi+1. After splitting, all buckets are ordered by the maximum frequency of their sensitive values in ascending order. At the end of the process, Step 1 returns a set of disjoint buckets that are order-preserving, d-close, and contain at least ℓ distinct sensitive values each.

4.3.2. Step 2: Construction of intersecting groups

To construct a chain of intersecting groups, we use the pseudocode in Algorithm 1. Intuitively, except for the first and the last buckets, the sensitive values in bucket Bi are put into two adjacent groups Gi and Gi+1. Since each bucket consists of at least ℓ distinct sensitive values, Gi and Gi+1 must satisfy ℓ-overlapping and ℓ-non-overlapping. Further, since the group construction respects d-closeness (Line 5), the constructed Gi and Gi+1 must contain at least ℓ distinct d-close values. Note that since all the buckets are already d-close, group construction from these buckets does not need to remove any tuple.

Algorithm 1. IG(d, ℓ) for (d, ℓ)-inference.
Require: a set of buckets B1, …, Bk ordered in ascending order
1: Initialize p = 1
2: repeat
3:   Assume Bp contains sensitive values si, …, sj
4:   Assume Bp+1 contains sensitive values sj+1, …, st
5:   Construct group G = {si, …, sj, sj+1, …, st}, where each value from Bp+1 is included with frequency f′ = min(fj, fi + d)
6:   for all sensitive values si in Bp+1 do
7:     fi := fi − fi′ (fi′: the frequency of si in G)
8:   p := p + 1
9: until p > k, the number of buckets
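A runnable rendering of Step 2 is sketched below (ours, not the authors' code; in particular, letting the last bucket form the final group on its own is our reading of the loop's end condition, chosen to match Example 4.3 below):

```python
def ig_groups(buckets, d):
    """Step 2 of IG: chain the buckets into intersecting groups (Algorithm 1).
    buckets: list of lists of (value, frequency), ordered ascending by frequency."""
    # Work on a mutable copy of the per-value remaining frequencies.
    remaining = [dict(b) for b in buckets]
    groups = []
    for p in range(len(buckets) - 1):
        cur, nxt = remaining[p], remaining[p + 1]
        f_lo, f_hi = min(cur.values()), max(cur.values())
        cap = min(f_hi, f_lo + d)        # frequency taken from each next-bucket value
        group = dict(cur)
        for v in nxt:
            group[v] = cap
            nxt[v] -= cap
        groups.append(group)
    groups.append(dict(remaining[-1]))   # leftover of the last bucket
    return groups

buckets = [[("s2", 7), ("s3", 9), ("s4", 10)],
           [("s5", 26), ("s6", 29)],
           [("s7", 40), ("s8", 43), ("s9", 43)]]
for g in ig_groups(buckets, d=3):
    print(g)   # reproduces G1, G2, G3 of Example 4.3
```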

Example 4.3. Given the sensitive values {s1, s2, s3, s4, s5, s6, s7, s8, s9} of frequency (1, 7, 9, 10, 26, 30, 40, 45, 50) (the same as Example 4.2), assume d = 3 and ℓ = 2. By Step 1, we construct the buckets B1 = {s2, s3, s4} of frequency (7, 9, 10), B2 = {s5, s6} of frequency (26, 29),
and B3 = {s7, s8, s9} of frequency (40, 43, 43). In this step, 1 tuple containing s1, 1 tuple containing s6, 2 tuples containing s8, and 7 tuples containing s9 are removed. In Step 2, the following groups are constructed:

• G1: (s2, s3, s4, s5, s6) of frequency (7, 9, 10, 10, 10).
• G2: (s5, s6, s7, s8, s9) of frequency (16, 19, 19, 19, 19).
• G3: (s7, s8, s9) of frequency (21, 24, 24).

In total, 1 + 1 + 2 + 7 = 11 tuples are removed. □



Intuitively, IG constructs a set of groups of sensitive values G1, …, Gm such that intersection only holds between two adjacent groups, with at least ℓ distinct values in the intersection. Further, for any Gi and Gj, |Gj − Gi| ≥ ℓ. In addition, all groups satisfy d-closeness. Thus the QI-groups constructed from these groups must satisfy the (d, ℓ)-inference requirement.

Given n distinct sensitive values {s1, …, sn}, let fi (1 ≤ i ≤ n) be the frequency of the value si. Let Bi be the bucket that includes si, and fmin(Bi) the minimum frequency of the sensitive values in Bi. Then the number of tuples containing si that will be removed by IG is r_i^IG = max(fi − fmin(Bi), 0). The total number of tuples removed by using IG on {s1, …, sn} equals IG(1, n) = Σ_{i=1}^{n} r_i^IG. We note that the number of groups produced by IG is decided by the frequency distribution of the sensitive values; it is related to neither n nor the ℓ value. The complexity of IG is O(n), where n is the number of distinct sensitive values to be grouped.

5. Anonymization algorithm

In this section, we explain the details of the algorithm that constructs the QI-groups for anonymization. Given an unsafe FFD A → B (recall that unsafe FFDs must include at least one sensitive attribute among their determinant attributes), a naive anonymization method is to apply the DG, CG and IG grouping strategies directly to all distinct values of the sensitive attributes in A. This may incur tremendous information loss by tuple suppression, especially for datasets whose sensitive values have a skewed frequency distribution. Therefore, to reduce the information loss by tuple suppression, we first split the sensitive values into smaller disjoint segments and apply one of the DG, CG and IG groupings to each segment, depending on which returns the smallest number of removed tuples. We call this the phase-1 partition. The second phase of anonymization is QI-group construction, in which we reduce the information loss by data generalization while constructing QI-groups from the phase-1 partitions. In the following subsections, we explain the details of these two phases.

5.1. Phase-1 partition

In general, the phase-1 partition segments the dataset into smaller groups, while DG/CG/IG further segments those smaller groups. Fig. 2 illustrates the effect of the partition. We formally define the partitions as follows. Given a set of distinct values S = {s1, …, sn}, a partition scheme of S is a set of segments {P1, …, Pt} such that: (1) ∀ i ≠ j, Pi ∩ Pj = ∅ and (2) ∪_{i=1}^{t} Pi = S. For each partition Pi that contains the values {sj, …, sk} (1 ≤ j, k ≤ n), the information loss of tuple suppression by applying DG, CG and IG to Pi is il_i = min(DG(j, k), CG(j, k), IG(j, k)). The total information loss by tuple suppression of {P1, …, Pt} is IL = Σ_{i=1}^{t} il_i. We say a partitioning scheme is optimal if its total information loss by tuple suppression is minimal over all possible partition schemes. Note that here we restrict the optimality definition to the DG/CG/IG grouping schemes only; there may exist other schemes that can be applied to the partitions. Next, we show that the optimality problem in this restricted DG/CG/IG grouping setup is already NP-hard.

Theorem 5.1 (NP-hardness of the optimal scheme for DG/CG/IG grouping). Given a set of distinct values S = {s1, …, sn} with their frequencies, finding the optimal partition scheme of S by using DG/CG/IG grouping is NP-hard.

Proof. The proof is by reduction from the knapsack problem [22].
The knapsack problem is: given n items I, each with a value p and a weight w, decide the subset of items I′ ⊆ I to be chosen so that (1) the sum of the values of these items is maximal, and (2) the sum of the weights of these items is below a given threshold. Given an instance of this problem, we can create a set of segments, each corresponding to an item in the knapsack problem. The assigned value of item Ij equals the minimal number of removed tuples obtained by applying DG/CG/IG to the segment Pj, while the assigned weight of item Ij equals the number of distinct values in the segment Pj. It can be seen that I′ is an optimal knapsack solution if and only if the corresponding segments return the minimal total number of removed tuples. □

Theorem 5.1 shows that finding the best grouping with the three specific grouping strategies is already computationally hard; pursuing optimality in a broader context with all possible grouping schemes must be NP-hard too. The NP-hardness is in the number of distinct values.

Fig. 2. Illustration of phase-1 partition.


Since finding the optimal solution is NP-hard, we propose two heuristic approaches, namely top-down and bottom-up, as alternatives to the optimal solution. Both heuristics are designed in a greedy fashion.

5.1.1. Top-down approach

Given an unsafe FFD F: A → B, the top-down approach starts from the set of all unique sensitive values P = {s1, …, sn} of the sensitive attributes in A. We calculate the number of removed tuples IL(P) obtained by applying DG, CG, or IG to P, i.e., IL(P) = min(DG(1, n), CG(1, n), IG(1, n)). If the size of the partition P is less than 2ℓ, i.e., it cannot be split into two sub-segments that meet (d, ℓ)-inference, we keep P as it is. Otherwise we evenly split P into two smaller partitions, P1 = {s1, …, s⌊n/2⌋} and P2 = {s⌊n/2⌋+1, …, sn}. We calculate IL(P1) = min(DG(1, i), CG(1, i), IG(1, i)) and IL(P2) = min(DG(i + 1, n), CG(i + 1, n), IG(i + 1, n)), where i = ⌊n/2⌋. We then compare IL(P) with IL(P1) + IL(P2). If IL(P1) + IL(P2) ≥ IL(P), we do not split P; otherwise, since splitting helps to reduce the information loss by suppression, we split P into P1 and P2. We repeat the above procedure on all partitions until none of them can be further split. Fig. 3(a) illustrates the top-down procedure.

5.1.2. Bottom-up approach

The bottom-up approach is the opposite of the top-down approach. Initially, the set of all unique values S = {s1, …, sn} of the sensitive attributes in A is split into multiple disjoint segments, each of ℓ unique values, except the last one, which may contain more than ℓ values. For each partition P = {si, …, sj}, we calculate the number of removed tuples IL(P) = min(DG(i, j), CG(i, j), IG(i, j)) obtained by applying DG, CG and IG to P. Then, starting from the first segment, we merge every two adjacent segments into P′i = Pi ∪ Pi+1, and compare IL(P′i) with IL(Pi) + IL(Pi+1). If IL(P′i) ≥ IL(Pi) + IL(Pi+1), Pi and Pi+1 are not merged; otherwise, we merge Pi and Pi+1 into P′i. We repeat the above procedure until no Pi can be merged. Fig. 3(b) illustrates the procedure.

We note that the number of partitions produced by either the top-down or the bottom-up approach is not pre-defined; both heuristics repeatedly partition the values until no further merge/split can reduce the information loss. Furthermore, our top-down and bottom-up approaches do not naturally meet the monotonicity property [16] (i.e., that small groups satisfying (d, ℓ)-inference guarantee that their merged result satisfies (d, ℓ)-inference, and vice versa). However, when the top-down approach splits and the bottom-up approach merges the groups, they apply the DG/CG/IG grouping, which always constructs (d, ℓ)-inference groups. Therefore, all (intermediate and final) partitions must satisfy (d, ℓ)-inference.

5.1.3. Complexity analysis

In each round of split/merge, the total complexity of computing the number of removed tuples is O(n), where n is the number of distinct values of the sensitive attributes that are determinant attributes of the unsafe FFDs. For the top-down approach, the worst case is that splitting only terminates on partitions whose sizes are less than 2ℓ, which results in log(n/(2ℓ)) rounds; thus the complexity is O(n log(n/(2ℓ))). For the bottom-up approach, the worst case is that merging only terminates after merging all segments together. With similar reasoning, the complexity is O(n log(n/ℓ)). Since ℓ ≪ n, the complexity of both approaches is O(n log n).
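The top-down heuristic can be sketched as follows (our illustration; it works on frequency lists and reuses the dg_removed/cg_removed helpers from the sketch in Section 4, with the IG cost omitted for brevity, so the segment cost is only an approximation of min(DG, CG, IG)):

```python
def top_down(freqs, d, l, cost):
    """Recursively split a frequency-sorted segment while splitting lowers the
    suppression cost; returns the resulting list of segments."""
    def split(p):
        if len(p) < 2 * l:                       # cannot form two (d, l) segments
            return [p]
        whole = cost(p, d, l)
        mid = len(p) // 2
        p1, p2 = p[:mid], p[mid:]
        if cost(p1, d, l) + cost(p2, d, l) >= whole:
            return [p]                           # splitting does not reduce loss
        return split(p1) + split(p2)
    return split(sorted(freqs))

# Cost of a segment: best of the DG and CG removal counts (IG omitted here).
segment_cost = lambda p, d, l: min(dg_removed(p, d, l), cg_removed(p, d, l))

parts = top_down([1, 7, 9, 10, 26, 30, 40, 45, 50], d=3, l=2, cost=segment_cost)
print(parts)   # segments of frequencies after the greedy top-down splitting
```

The bottom-up variant is symmetric: start from segments of ℓ values each and greedily merge adjacent segments while merging reduces the cost.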
5.1.4. Information loss threshold

To ensure that the phase-1 partitioning preserves a reasonable amount of utility, one possible solution is to specify a threshold on the information loss; partition approaches that fail to produce partitioning schemes whose information loss is below the threshold are discarded, and the microdata is kept un-anonymized. This may prevent publishing of the microdata, since publishing it would violate either the privacy or the utility requirements.

5.2. Phase-2 QI-group construction

Since the resulting phase-1 partition only considers the number of distinct sensitive values, not the number of tuples, it is possible that the partitions contain a large number of tuples with repeated sensitive values. Simply applying generalization to such partitions may incur a large amount of information loss by generalization. To reduce such information loss, we split each partition into smaller groups. In particular, given a partition P, let k be the number of distinct values in P. It is straightforward that k ≥ ℓ, the parameter of the (d, ℓ)-inference model. The tuples are then bucketized by hashing on their values, with each bucket containing the tuples that have the same value. We require that for k distinct values there are k hash buckets, so that different values are not hashed into the same bucket. After bucketization, we construct a QI-group by picking k tuples from the k buckets, one from each. We repeat the construction procedure until some bucket contains only one unpicked tuple. Then the residual tuples of all buckets are put into one QI-group. Let fmin be the minimal frequency in P.

Fig. 3. Illustration of top-down and bottom-up approaches.


Then, at the end of QI-group construction, there are fmin constructed QI-groups, of which fmin − 1 contain k (k ≥ ℓ) distinct values, each of unique occurrence, and one contains k distinct values, with each value si occurring fi − fmin + 1 times, where fi is the frequency of si. Since the partition P satisfies (d, ℓ)-inference, it is straightforward that all QI-groups constructed from P must satisfy (d, ℓ)-inference.

Example 5.1. Given the partition that contains the values (s1, s2, s3, s4) of frequency (3, 5, 6, 6) (d = 3, ℓ = 4), phase-2 grouping constructs three QI-groups G1, G2 and G3, with G1 and G2 consisting of tuples that contain (s1, s2, s3, s4) of frequency (1, 1, 1, 1), and G3 consisting of tuples that contain (s1, s2, s3, s4) of frequency (1, 3, 4, 4). □

5.3. Discussion

Our 2-phase anonymization algorithm can effectively preserve as much utility as possible. However, it may not be suitable for datasets with a skewed frequency distribution on their sensitive values, as it may incur high information loss. Indeed, such information loss is inherent, as a large number of values have to be removed to satisfy d-closeness. The other disadvantage of our algorithm is that it does not support incremental anonymization of datasets with updates; when the frequency information changes, we have to run the algorithm on the updated dataset from scratch to construct a new anonymization scheme.

6. Multiple unsafe FFDs

When multiple unsafe FFDs are present, since these FFDs may share determinant/dependent attributes, applying the anonymization algorithm to each FFD individually may produce inconsistent QI-grouping decisions on the tuples. To overcome this problem, we propose the following solution, which consists of three steps.

6.1. Step 1: Pick representative FFDs

We say FFD F: A → B is a representative of F′: A′ → B′ if A ⊆ A′. A nice property of a representative FFD F is that if an anonymized dataset D⁎ satisfies (d, ℓ)-inference in the presence of F, then D⁎ must satisfy (d, ℓ)-inference in the presence of all FFDs whose representative is F. Therefore, we select the representative FFDs of all available FFDs. There may exist multiple representative candidate FFDs with different dependent attributes; we randomly pick one as the representative.

6.2. Step 2: Construct FFD-chains

For any representative FFDs F: A → B and F′: B → C (i.e., where the determinant attributes of one equal the dependent attributes of the other), we put them in a chain, with F′ following F. For those representative FFDs that do not exhibit this property, we put each of them in its own chain. At the end of this step, there may exist multiple chains, each of minimum size 1.

6.3. Step 3: Anonymization

For each FFD-chain, we apply the 2-phase anonymization algorithm only to the last FFD in the chain; the anonymized dataset will then satisfy (d, ℓ)-inference in the presence of the other FFDs in the chain. The reason is that for any FFD F: A → B, if the grouped values on B satisfy (d, ℓ)-inference (i.e., at least ℓ distinct values that are d-close), then, since every distinct value of B must correspond to a distinct value of A, the grouped values of A must satisfy (d, ℓ)-inference.

The above procedure shows that when multiple FFDs are present, it is the number of FFD-chains, not the number of FFDs, that influences the time performance of the anonymization algorithm.

7. Experiments

We have conducted an extensive set of experiments to evaluate both the effectiveness and the efficiency of our anonymization algorithm.
Specifically, we evaluated the impact of the ℓ and d parameters, the data distribution, and the number of FFDs on both the performance and the utility of the anonymized data. We also evaluated the effectiveness of our two heuristic approaches by comparing them with an existing anonymization algorithm, Mondrian [17]. In this section, we describe our experiment design and results.

7.1. Experimental setup

7.1.1. Setup

We used a workstation equipped with a 2.4 GHz Intel core, 3 GB of RAM, and Windows XP. We implemented the algorithms in C++. We used the publicly available implementation of Mondrian [17] (http://www.cse.cuhk.edu.hk/~taoyf/paper/codes/sigmod08-dyn-mondrian.zip). The reason we chose Mondrian is that it is also a partition-based algorithm.



Fig. 4. Uniform & skewed frequency distribution of synthetic and real datasets.

7.1.2. Datasets

We used both synthetic and real datasets in the experiments. To measure the impact of the frequency distribution on anonymization, we generated two types of synthetic datasets, each containing six attributes: one whose distribution of sensitive values is close to uniform, and one whose sensitive values follow a skewed distribution. We call these two datasets the U-dis dataset and the S-dis dataset. As our real dataset, we used the CENSUS dataset (http://www.ipums.org), which contains personal information of 500,000 American adults. For the CENSUS dataset, we defined five FFDs: salary-class → occupation, (salary-class, education level) → age, marital-status → salary-class, race → marital-status, and birth place → age, and modified the dataset to make it satisfy these FFDs. The modified data values amount to 47.62% of the original dataset. In the following, we use i-FD (1 ≤ i ≤ 5) to denote that the first i of the five FFDs, in the aforementioned order, are available. The frequency distributions of parts of the datasets after modification are shown in Fig. 4(a) and (b). CENSUS has the most skewed distribution of the three datasets.

7.1.3. Setup of d values

Since the d value controls the allowed distance between frequencies, randomly picked d values are not appropriate to reflect the impact of the frequency distribution on anonymization. Thus we decided to choose d values based on the frequency distribution of the dataset. In particular, we sorted the sensitive values by their frequencies and calculated the average frequency distance as d_avg = Σ_{i=1}^{n} |fi − fi−1|/n, where n is the number of distinct sensitive values, and fi and fi−1 are the frequencies of two adjacent sensitive values si and si−1. We then picked five d values at equal distances from the range [0.5 · d_avg, 1.5 · d_avg].

7.2. Time performance

First, we compared the time performance of our two heuristics, namely the top-down and bottom-up approaches, with Mondrian on the CENSUS dataset when 1-FD (i.e., salary-class → occupation) is available. Fig. 5(a) shows the result. We observed that the two heuristic approaches are much slower than Mondrian; Mondrian needs around 40 s, while the top-down and bottom-up approaches need around 2300 s. This is the cost that we have to pay for privacy protection when FFDs are present. We also observed that, for all three approaches, increasing the ℓ value results in better performance. This is because larger ℓ values result in larger partitions and thus a smaller number of partitions, which consequently reduces the number of rounds needed in all three approaches.

Second, we focused on the top-down and bottom-up approaches and measured the impact of the d value on the performance of anonymization. Fig. 5(b) and (c) show that for both the U-dis and S-dis datasets, the performance is relatively stable with changing d value. In other words, the time performance is not sensitive to d.
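For reference, the d-value selection described in Section 7.1.3 amounts to the following small computation (a sketch with our own naming, not the authors' script):

```python
def candidate_d_values(freqs, count=5):
    """Average adjacent-frequency gap, then `count` evenly spaced d values in
    [0.5 * d_avg, 1.5 * d_avg]."""
    f = sorted(freqs)
    n = len(f)
    d_avg = sum(abs(f[i] - f[i - 1]) for i in range(1, n)) / n
    lo, hi = 0.5 * d_avg, 1.5 * d_avg
    return [lo + i * (hi - lo) / (count - 1) for i in range(count)]
```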



Fig. 5. Time performance comparison (TD: top-down, BU: bottom-up), 1-FD.

We also measured the performance of the phase-1 partition of both the top-down and bottom-up approaches. It takes around 0.016 s for both the U-dis and S-dis datasets, which is negligible compared with the total anonymization time (shown in Fig. 5 (b) and (c)); due to space limits, we omit the detailed result. Next, we compared the time performance of the IG, CG, and DG approaches. We randomly picked three samples of the CENSUS dataset, each of size 100K and with a different frequency distribution, and applied IG, CG, and DG to them. Fig. 6 (a) presents the results. It shows that the time performance of these three approaches is not related to the frequency distribution of the data to be grouped; CG always runs the fastest, while DG is always the slowest. Lastly, we measured the performance when multiple FFDs are present. The result is presented in Fig. 6 (b). The time performance for 1-FD, 2-FD, 3-FD, and 4-FD is the same, because the first four FFDs form a single FFD-chain; the anonymization algorithm in these four cases is therefore applied on the same FFD (the first FFD). When the fifth FFD is available, it forms a different FFD-chain, which increases the anonymization time.

7.3. Information loss

7.3.1. Ratio-based information loss metric
We used the ratio-based information loss metric as defined in Section 2.3. The ratio lies in the range [0, 1]; intuitively, the smaller the ratio, the better the data utility. First, we compared the total information loss of the top-down, bottom-up, and Mondrian approaches on the CENSUS dataset. The result is shown in Fig. 7 (a). We observed that our bottom-up approach yields much lower information loss than Mondrian for small ℓ values; however, Mondrian eventually wins as ℓ increases.


Fig. 6. Time performance comparison, CENSUS dataset, ℓ = 5, d = 400.

We also observed that the top-down approach always incurs the worst information loss of the three approaches.
Second, we varied the ℓ value and measured its impact on the information loss. Fig. 7 (b) and (c) show that for both the U-dis and S-dis datasets, increasing ℓ brings more information loss. This happens for two reasons: first, larger ℓ values result in partitions with more distinct sensitive values, which forces more tuples to be removed to satisfy d-closeness; second, larger ℓ values result in more general distortion of the data, which incurs worse information loss by generalization. Fig. 8 (a) shows in more detail the information loss by tuple suppression on the S-dis dataset with varying ℓ values; it follows the same observation as above.
Then we varied the d value and measured the total information loss on both the U-dis and S-dis datasets. Fig. 7 (d) shows that for the U-dis dataset the information loss is relatively stable, since the sensitive values have similar frequencies. For the S-dis dataset, however, the d value does influence the amount of information loss. In particular, Fig. 7 (e) shows that larger d values result in slightly worse information loss by tuple generalization, while Fig. 8 (b) shows that larger d values incur smaller information loss by tuple suppression, since fewer tuples need to be removed to satisfy d-closeness.
Next, we compared the information loss of the IG, CG, and DG approaches, using the same samples as in Fig. 6 (a); the result is presented in Fig. 9 (a). It shows that the amount of information loss by tuple suppression for the three approaches is highly related to the data distribution; for example, CG incurs the worst information loss for Sample 2, but the best for Sample 3.
We also measured the information loss when multiple FFDs are present. The result is presented in Fig. 9 (b). Since the first four FFDs form a single FFD-chain, the information loss for 1-FD, 2-FD, 3-FD, and 4-FD is the same. However, since the fifth FFD forms a different chain, it introduces more information loss.

7.3.2. Discernibility information loss metric
The discernibility metric was proposed in [3] and has been used as the measurement in previous works such as Mondrian [17] and ℓ-diversity [20]. In this metric, each generalized tuple is assigned a penalty of |G|, where |G| is the size of the QI-group the tuple belongs to, while a suppressed tuple is assigned a penalty of |D|, the size of the input dataset. The total information loss is IL = Σ_{i=1}^{t} |G_i|² + b · |D|, where t is the number of QI-groups and b is the number of removed tuples. We compared the discernibility-based information loss of our top-down and bottom-up approaches with Mondrian and show the result in Fig. 9 (c). We observed that Mondrian always returns less information loss than our two approaches. This reflects the trade-off between privacy and utility: for better privacy protection we have to sacrifice some utility. The experimental results show, however, that our approaches do not sacrifice too much to combat the FFD-based attack.
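As a quick illustration of the discernibility metric above, the sketch below computes IL from the QI-group sizes, the number of suppressed tuples, and the dataset size. The function name and the toy numbers are ours, chosen only for illustration.

```python
def discernibility_loss(group_sizes, num_suppressed, dataset_size):
    """Discernibility metric [3]: each generalized tuple in a QI-group of size |G|
    is penalized |G| (so each group contributes |G|^2), and each suppressed tuple
    is penalized |D|, the size of the input dataset."""
    return sum(g * g for g in group_sizes) + num_suppressed * dataset_size

# Toy usage: three QI-groups of sizes 4, 5, and 6, plus 2 suppressed tuples,
# in a dataset of 17 tuples.
print(discernibility_loss([4, 5, 6], 2, 17))  # 4^2 + 5^2 + 6^2 + 2*17 = 111
```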


Fig. 7. Total information loss by top-down, bottom-up and Mondrian, 1-FD.


Fig. 8. Information loss by tuple suppression.

8. Related work

Privacy-preserving data publishing has received considerable attention in recent years (see the recent survey [10] and the references therein). We review the related work from several aspects.

8.1. Privacy models
The k-anonymity model is the first anonymization principle in the literature [29,31]. It requires that in the published data every individual is associated with no fewer than k tuples. However, the tuples in the same group may all carry the same sensitive value. To address this defect, Machanavajjhala et al. proposed the ℓ-diversity model [20], which requires that every QI-group contains at least ℓ "well-represented" sensitive values, i.e., sensitive values that have roughly the same frequency. Other variants of k-anonymity and ℓ-diversity have been defined to address different privacy requirements, e.g., t-closeness [18], (α, k)-anonymity [36], (c, k)-safety [23], (k, e)-anonymity [40], and MultiR-anonymity [25]. Unfortunately, none of them can defend against the FFD-based privacy attack we define.

8.2. Correlation-based adversary knowledge
Chen et al. [5] proposed a skyline approach to quantifying an adversary's external knowledge; however, they did not consider functional dependencies as part of the external knowledge. Martin et al. [23] and Rastogi et al. [27] were the first to consider correlation-based adversary knowledge. Both focus on tuple correlations but do not consider correlations between attributes such as FFDs. They show that correlations between tuples can lead to privacy leakage from a published dataset of "meaningful" utility. Du et al. [7] integrated background knowledge into privacy quantification and provided a privacy definition for an attacker who uses the maximum-entropy principle. Kifer [15] shows that an attacker may infer correlations from the sanitized dataset by using exchangeability and de Finetti's theorem; the inferred correlations can enable potential vulnerabilities on the sanitized dataset. Although these works identify possible attacks based on FD correlations, none of them provides a solution to defend against the correlation-based attack.
There are a few privacy-preserving studies in which correlations take the form of association rules. The works [1,32] studied how to selectively hide some association rules from large databases with as little impact as possible on other, non-sensitive ones. The studies [34,35] formulate the problem as inverse frequent itemset mining, which reconstructs a dataset from the published frequent itemsets and uses the reconstructed dataset to analyze privacy.


Fig. 9. Information loss comparison.

Atzori et al. [2] investigated the possible privacy leakage caused by publishing frequent patterns. Li et al. [19] consider only negative association rules, which state that some combinations of quasi-identifier values cannot be associated with certain sensitive values, as the adversary's background knowledge. Much recent work has investigated the problem of anonymizing data for data mining analysis [11,21,24]. Unlike these works, in which data correlations are considered utility and should be preserved, our work treats data correlations as adversary knowledge that should be destroyed.


8.3. Anonymization approaches
There are several approaches to anonymizing a dataset to ensure privacy, including generalization [3,16,17,29], cell and tuple suppression [29], adding noise [4,8], data swapping [6], and permutation of the sensitive values [23,37]. A large body of them (e.g., Incognito [16], Mondrian [17], ℓ-diversity [13,20], Anatomy [37], (k, e)-anonymous permutation [40]) follow a partition-based principle similar to our work: the database is partitioned into multiple non-overlapping groups, each representing a QI-group. However, our partition criterion is different; whereas these algorithms place no constraint on the overlap of sensitive values across QI-groups, we require that the sensitive values of different QI-groups satisfy both the ℓ-overlapping and ℓ-non-overlapping conditions (see Definition 3.6) as the defense mechanism against the FFD-based attack.

8.4. Privacy issues by correlation-based inference in other contexts
Su et al. [30] investigated the inference problems due to functional dependencies (FDs) and multi-valued dependencies in a multilevel relational database (MDB). They showed that eliminating precise inference compromise due to functional and multi-valued dependencies is NP-complete. Yang et al. [39] studied secure publishing of XML databases in the presence of functional dependencies as adversary knowledge. They formalized the process by which users can infer data using functional dependencies, and proposed an algorithm that finds a subset of a given XML document whose publication causes no information leakage. They expressed functional dependencies as XML constraints, including both structural and value constraints. Rastogi et al. [26] investigated the problem of privacy-preserving query answering over data containing relationships in the context of social networks; they consider the case in which both positive and negative correlations are available. Farkas et al. [9] provide a survey of indirect data disclosure via inference channels. Their work is orthogonal to ours.

9. Conclusion
In this paper, we studied the problem of privacy-preserving publishing of data that contains full functional dependencies. We formally defined the privacy model, (d, ℓ)-inference, to prevent privacy leakage caused by FFDs, and developed robust and efficient algorithms that anonymize the data with low information loss. Our empirical studies using both synthetic and real datasets demonstrated the efficiency and effectiveness of our algorithms. For future work, we first plan to further improve the heuristic approaches in the phase-1 partition step. One possibility is to drive the construction of the initial partitions for the bottom-up approach by the frequency distribution; a similar idea applies to the choice of the split point for the top-down approach. Second, since (d, ℓ)-inference cannot effectively defend against privacy attacks based on conditional functional dependencies (CFDs), we will study privacy-preserving publishing of microdata that contains CFDs.

Appendix A

Proof of Theorem 4.1. We use f_i^j to denote the frequency of the sensitive value s_i at the end of round j. The construction of CG is as follows:

• Round 0: Initially there are n sensitive values {s_1, …, s_n} of frequencies f_1^0, …, f_n^0. Initialize x = 1.
• Round 1: Construct the group G_1 that contains the sensitive values {s_1, s_2, …, s_n}, each of frequency g_i^1 = min(f_1^0 + d, f_i^0) (1 ≤ i ≤ n). Further, let t be the position of the last sensitive value in the sorted set whose frequency equals f_l^0 (t ≥ l) (i.e., move forward for l distinct sensitive values). Then for every i ∈ [1, t], f_i^0 − g_i^1 tuples that contain the sensitive value s_i are removed. The frequencies of the remaining sensitive values are updated as follows:

  f_i^1 = 0 if 1 ≤ i ≤ t;  f_i^1 = f_i^0 − g_i^1 otherwise.

• Round 2: Construct the group G_2 that contains the sensitive values {s_{t+1}, …, s_n}, each of frequency g_i^2 = min(f_{t+1}^1 + d, f_i^1). Further, let t_2 be the position of the last sensitive value in the sorted set whose frequency equals f_{t+l}^1 (i.e., move forward for l distinct sensitive values). Then for every i ∈ [t + 1, t_2], f_i^1 − g_i^2 tuples that contain the sensitive value s_i are removed. The frequencies of the remaining sensitive values are updated as follows:

  f_i^2 = 0 if t + 1 ≤ i ≤ t_2;  f_i^2 = f_i^1 − g_i^2 otherwise.

• …


• Round j: Construct the group G_j that contains the sensitive values {s_{t_{j−1}+1}, …, s_n} of frequencies (f_{t_{j−1}+1}^{j−1}, s_j, …, s_j), where s_j = min(f_{t_{j−1}+1}^{j−1} + d, f_{t_{j−1}+2}^{j−1}). Further, for every i ∈ [(j − 1) ⁎ l + 1, j ⁎ l], f_i^{j−1} − s_j tuples of sensitive value s_i are removed. The frequencies of the sensitive values are updated as follows:

  f_i^j = 0 if 0 ≤ i ≤ t_{j−1};  f_i^j = f_i^{j−1} − s_j otherwise.

• …
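To make the round-based construction above easier to follow, here is a simplified simulation of it. It assumes the frequencies are pairwise distinct and already given in ascending order, so that each round consumes exactly ell sensitive values; it is only a sketch of the rounds described in this proof, not a faithful reimplementation of the CG algorithm, and the function name, parameter names, and toy values are ours.

```python
def simulate_cg_rounds(freqs, ell, d):
    """Simulate the rounds above: in each round the remaining values form a group
    capped at (smallest remaining frequency + d); the first ell values of the group
    are fully consumed, and any excess over the cap is counted as suppressed."""
    freqs = sorted(freqs)              # ascending, assumed pairwise distinct
    groups, suppressed = [], 0
    start = 0                          # index of the first value not yet consumed
    while start < len(freqs):
        cap = freqs[start] + d         # d-closeness cap for this group
        group = [min(cap, f) for f in freqs[start:]]
        groups.append(group)
        end = min(start + ell, len(freqs))
        # excess of the ell consumed values over the cap must be suppressed
        suppressed += sum(freqs[i] - group[i - start] for i in range(start, end))
        # the remaining values carry their leftover frequency into the next round
        freqs[end:] = [freqs[i] - group[i - start] for i in range(end, len(freqs))]
        start = end
    return groups, suppressed

# Toy usage (illustrative frequencies): ell = 2, d = 5.
print(simulate_cg_rounds([40, 33, 25, 18, 12, 7], 2, 5))
```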

Following the above procedure, for any sensitive value s_i, its frequency will be changed to 0 at the end of round t, where t = ⌈i/l⌉. If i mod l = 1, then s_i will be put as the first in the anonymization group constructed in round t; thus no tuple that contains s_i will be removed. Otherwise, if i < [n/l] ⁎ l, then f_i^{t−1} − g_t tuples that contain s_i will be removed, where g_t = f_{(t−1)⁎l+1}^{t−1} + d. We have:

  f_i^{t−1} − g_t = f_i^{t−1} − f_{(t−1)⁎l+1}^{t−1} − d.   (.1)

Since for every sensitive value s_i, f_i^j = f_i^{j−1} − g_j = f_i^{j−2} − g_{j−1} − g_j = … = f_i^0 − Σ_{t=1}^{j} g_t, we have:

  f_i^{t−1} − g_t = f_i^{t−1} − f_{(t−1)⁎l+1}^{t−1} − d
                 = (f_i^0 − Σ_{j=1}^{t−1} g_j) − (f_{(t−1)⁎l+1}^0 − Σ_{j=1}^{t−1} g_j) − d
                 = f_i^0 − f_{(t−1)⁎l+1}^0 − d
                 = f_i − f_{(t−1)⁎l+1} − d.   (.2)

It is straightforward that f_{(t−1)⁎l+1} equals f_j, where f_j is the frequency of s_j, the first sensitive value in the last group constructed by CG that contains the sensitive value s_i. If i ≥ [n/l] ⁎ l, the number of tuples of sensitive value s_i that will be removed is f_i^j, where j = [n/l]. Since f_i^j = f_i^{j−1} − g_j, we can use reasoning similar to Eq. (.1) to get the result. The theorem then follows. □

References

[1] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, V. Verykios, Disclosure limitation of sensitive rules, Workshop on Knowledge and Data Engineering Exchange (KDEX), 1999, pp. 45–52.
[2] M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi, K-anonymous patterns, Proceedings of the Ninth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2005, pp. 10–21.
[3] R.J. Bayardo, R. Agrawal, Data privacy through optimal k-anonymization, Proceedings of the International Conference on Data Engineering (ICDE), 2005, pp. 217–228.
[4] S. Chawla, C. Dwork, F. McSherry, A. Smith, H. Wee, Toward privacy in public databases, Second Theory of Cryptography Conference (TCC), 2005, pp. 363–385.
[5] B.-C. Chen, K. LeFevre, R. Ramakrishnan, Privacy skyline: privacy with multidimensional adversarial knowledge, Proceedings of the International Conference on Very Large Data Bases (VLDB), 2007, pp. 770–781.
[6] T. Dalenius, S.P. Reiss, Data swapping: a technique for disclosure control, Journal of Statistical Planning and Inference, 1982.
[7] W. Du, Z. Teng, Z. Zhu, Privacy-maxent: integrating background knowledge in privacy quantification, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD), 2008, pp. 459–472.
[8] A. Evfimievski, J. Gehrke, R. Srikant, Limiting privacy breaches in privacy preserving data mining, Proceedings of ACM Symposium on Principles of Database Systems (PODS), 2003, pp. 211–222.
[9] C. Farkas, S. Jajodia, The inference problem: a survey, ACM SIGKDD Explorations Newsletter 4 (2) (2002) 6–11.
[10] B.C.M. Fung, K. Wang, R. Chen, P.S. Yu, Privacy-preserving data publishing: a survey of recent developments, ACM Computing Surveys 42 (4) (June 2010) 1–53.
[11] B.C.M. Fung, K. Wang, L. Wang, P.C.K. Hung, Privacy-preserving data publishing for cluster analysis, Data & Knowledge Engineering (DKE) 68 (6) (2009) 552–575.
[12] J. Gehrke, Models and methods for privacy-preserving data analysis and publishing, Proceedings of the International Conference on Data Engineering (ICDE), 2006, p. 105.
[13] G. Ghinita, P. Karras, P. Kalnis, N. Mamoulis, Fast data anonymization with low information loss, Proceedings of the International Conference on Very Large Data Bases (VLDB), 2007, pp. 758–769.
[14] V.S. Iyengar, Transforming data to satisfy privacy constraints, Proceedings of ACM International Conference on Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 2002, pp. 279–288.
[15] D. Kifer, Attacks on privacy and de Finetti's theorem, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD), 2009, pp. 127–138.
[16] K. LeFevre, D. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain k-anonymity, Proceedings of ACM International Conference on Special Interest Group on Management of Data (SIGMOD), 2005, pp. 49–60.
[17] K. LeFevre, D. DeWitt, R. Ramakrishnan, Mondrian multidimensional k-anonymity, Proceedings of the International Conference on Data Engineering (ICDE), 2006, p. 25.
[18] N. Li, T. Li, t-Closeness: privacy beyond k-anonymity and l-diversity, Proceedings of the International Conference on Data Engineering (ICDE), 2007, pp. 106–115.
[19] T. Li, N. Li, Injector: mining background knowledge for data anonymization, Proceedings of the International Conference on Data Engineering (ICDE), 2008, pp. 446–455.
[20] A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, l-Diversity: privacy beyond k-anonymity, Proceedings of the International Conference on Data Engineering (ICDE), 2006, p. 24.


[21] E. Magkos, M. Maragoudakis, V. Chrissikopoulos, S. Gritzalis, Accurate and large-scale privacy-preserving data mining using the election paradigm, Data & Knowledge Engineering (DKE) 68 (11) (2009) 1224–1236.
[22] S. Martello, P. Toth, Knapsack Problems: Algorithms and Computer Implementations, 1990.
[23] D.J. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, J.Y. Halpern, Worst-case background knowledge for privacy-preserving data publishing, Proceedings of the International Conference on Data Engineering (ICDE), 2007, pp. 126–135.
[24] S. Mukherjee, M. Banerjee, Z. Chen, A. Gangopadhyay, A privacy preserving technique for distance-based classification with worst case privacy guarantees, Data & Knowledge Engineering 66 (2) (2008) 264–288.
[25] M.E. Nergiz, C. Clifton, A.E. Nergiz, Multirelational k-anonymity, IEEE Transactions on Knowledge and Data Engineering 21 (8) (2009) 1104–1117.
[26] V. Rastogi, M. Hay, G. Miklau, D. Suciu, Relationship privacy: output perturbation for queries with joins, Proceedings of ACM Symposium on Principles of Database Systems (PODS), 2009, pp. 107–116.
[27] V. Rastogi, D. Suciu, S. Hong, The boundary between privacy and utility in data publishing, Proceedings of the International Conference on Very Large Data Bases (VLDB), 2007, pp. 531–542.
[28] P. Samarati, Protecting respondents' identities in microdata release, IEEE Transactions on Knowledge and Data Engineering (TKDE) 13 (6) (2001) 1010–1027.
[29] P. Samarati, L. Sweeney, Generalizing data to provide anonymity when disclosing information, Proceedings of ACM Symposium on Principles of Database Systems (PODS), 1998, p. 188.
[30] T.-A. Su, G. Ozsoyoglu, Controlling fd and mvd inferences in multilevel relational database systems, IEEE Transactions on Knowledge and Data Engineering (TKDE) 3 (4) (1991) 474–485.
[31] L. Sweeney, K-anonymity: a model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002, pp. 557–570.
[32] V. Verykios, A.K. Elmagarmid, E. Bertino, Y. Saygin, E. Dasseni, Association rule hiding, IEEE Transactions on Knowledge and Data Engineering (TKDE) 16 (4) (2004) 434–447.
[33] H. Wang, R. Liu, Privacy-preserving publishing data with full functional dependencies, International Conference on Database Systems for Advanced Applications (DASFAA), 2010, pp. 176–183.
[34] Y. Wang, X. Wu, Approximate inverse frequent itemset mining: privacy, complexity, and approximation, IEEE International Conference on Data Mining (ICDM), 2005, pp. 482–489.
[35] Z. Wang, W. Wang, B. Shi, Blocking inference channels in frequent pattern sharing, Proceedings of the International Conference on Data Engineering (ICDE), 2007, pp. 1425–1429.
[36] R.C.-W. Wong, J. Li, A.W.-C. Fu, K. Wang, (α, k)-anonymity: an enhanced k-anonymity model for privacy-preserving data publishing, Proceedings of ACM International Conference on Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 2006, pp. 754–759.
[37] X. Xiao, Y. Tao, Anatomy: simple and effective privacy preservation, Proceedings of the International Conference on Very Large Data Bases (VLDB), 2006, pp. 139–150.
[38] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, A.W.-C. Fu, Utility-based anonymization using local recoding, Proceedings of ACM International Conference on Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), 2006, pp. 21–308, 8(2).
[39] X. Yang, C. Li, Secure xml publishing without information leakage in the presence of data inference, Proceedings of the International Conference on Very Large Data Bases (VLDB), 2004, pp. 96–107.
[40] Q. Zhang, N. Koudas, D. Srivastava, T. Yu, Aggregate query answering on anonymized tables, Proceedings of the International Conference on Data Engineering (ICDE), 2007, pp. 116–125.

Dr. Hui Wang received the BS degree in computer science from Wuhan University in 1998, the MS degree in computer science from the University of British Columbia in 2002, and the PhD degree in computer science from the University of British Columbia in 2007. She has been an assistant professor in the Computer Science Department, Stevens Institute of Technology, since 2008. Her research interests include database security, data privacy, and semi-structured databases.

Ruilin Liu is a PhD student in the Department of Computer Science at Stevens Institute of Technology. He received his bachelor's degree from Southwest University of Science and Technology, China, in 2007. His research interests include privacy-preserving data publishing, XML databases, network access control, and social networks.