A New Framework of Privacy Preserving Data Sharing

Xia Chen1,2, Maria Orlowska2, and Xue Li2
1 School of Electronic and Information Engineering, Tianjin University, Tianjin, China
2 School of Information Technology and Electrical Engineering, University of Queensland, Brisbane, Australia
{chenxia, maria, xueli}@itee.uq.edu.au

Abstract

We introduce a dataset-reconstruction-based framework for privacy-preserving data sharing. The proposed framework uses a constraint-based inverse itemset-lattice mining technique to automatically generate a sample dataset to be released for sharing. In this framework, data owners can control the potentially mine-able knowledge (frequent itemsets in our context) of the released dataset. Before the sample dataset is generated, the potentially mine-able knowledge set is checked in two respects. One check is for compliance with the user-specified security constraints and the trade-off principle, so that sensitive patterns are well protected while the side-effects remain tolerable. The other check verifies the consistency of the itemset supports in the lattice, so that inverse dataset reconstruction is feasible. This mechanism gives the data owner total control over the knowledge that is potentially discoverable from the publicly accessible dataset while the released data still matches the main features of the original dataset for sharable knowledge; thus the user's privacy is preserved.

1. Introduction

Association rule mining is a common task in data mining. By discovering rules hidden in datasets, a company can identify underlying patterns useful in strategic decision-making. The technique has been successfully applied in many application domains, such as market-basket analysis and web-site linkage analysis. To provide such decision-support tools, market consultants or research institutions must obtain accurate, real datasets on which to perform data-mining analysis. It is well understood that data sharing can benefit cooperating companies; on the other hand, growing privacy concerns make some partners hesitant to disclose sensitive datasets, lest they lose their competitive edge. There is therefore a need for techniques that give data owners some control over data-manipulation procedures in response to these privacy considerations. This paper addresses this class of problems, limiting its scope to controlling the selective disclosure of frequent itemsets.

1.1. Problem Description

We begin by introducing the notation used in this paper. Let I = {i1, i2, ..., im} be a set of items. Any X ⊆ I is called an itemset; an itemset consisting of k elements is called a k-itemset. Let D = {t1, t2, ..., tn} be a set of transactions, where each transaction ti is an itemset. Given an itemset X, we say that a transaction ti contains X if X ⊆ ti. Clearly, all itemsets that can be mined from D constitute the power set of I: P(I) = {X | X ⊆ I}. The support of an itemset X ⊆ I is the cardinality of the set of all transactions containing X:

Support(X, D) = |T(X)|                      (1)
T(X) = {ti | X ⊆ ti, ti ∈ D}                (2)

Itemsets that meet a minimum support threshold σ are called frequent itemsets, and the collection of frequent itemsets is denoted as:

F(σ, D) = {X | Support(X, D) ≥ σ, X ⊆ I}.   (3)
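To make the definitions concrete, the following is a small illustrative Python sketch of Support(X, D) and F(σ, D), computed by brute force over P(I); the function names are ours, not the paper's.

```python
from itertools import combinations

def support(X, D):
    # Support(X, D) = |T(X)|, the number of transactions containing X (Formulas 1-2)
    return sum(1 for t in D if set(X) <= set(t))

def frequent_itemsets(D, sigma):
    # F(sigma, D) = {X | Support(X, D) >= sigma, X subset of I} (Formula 3),
    # enumerated by brute force over the power set of the items
    items = sorted(set().union(*(set(t) for t in D)))
    return {frozenset(c): support(c, D)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(c, D) >= sigma}
```

Run over the sample dataset D1 introduced in Section 2 with σ = 4, this enumeration yields the twelve frequent itemsets listed there.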

The problem discussed in this paper can be formulated as follows. Let D be the source database, R a set of significant rules (frequent itemsets in our context) mined from D, and Rh a set of sensitive rules to be hidden from R. We look for a transformation of database D into a sharable database D', such that all rules in (R − Rh) can still be mined from D'. This formulation indicates that if a user wants to prevent some sensitive frequent itemsets from being disclosed in the released dataset, the new dataset must be created so that the selected frequent itemsets cannot be mined from the new transaction set with the same data-mining strategy.

Since privacy-preserving data sharing is one of the primary problems in privacy-preserving data mining, it has already attracted considerable attention. Most research efforts adopt data-modification techniques [1]. In general, there are two commonly used approaches. One is to obscure the data directly: private data is made available, but with enough noise added that exact values cannot be determined [2,3]. The other first finds the sensitive rules by applying a mining algorithm, then goes back to the dataset to identify the sensitive transactions from which the sensitive patterns can be mined, and alters them by removing some items or inserting new ones [5,6,7,8]. We observe that the first method may compromise the accuracy of data-mining results, while under the second method non-sensitive patterns can be unnecessarily hidden, or non-frequent patterns can accidentally become frequent [1], and some of these "side-effects" may be unacceptable to the user. Such "data modification" methods provide no mechanism for tracking the data-modification process so as to minimize the occurrence of unexpected patterns; in other words, they offer no way of fine-tuning the generation of the released dataset.

In this paper, we propose a novel framework that can be regarded as a "knowledge sanitization" approach. We define the partially ordered collection of all itemsets generated from the given transactions to be a knowledge base from which association rules can be derived. To hide restricted patterns, a recursive sanitization procedure is applied over this knowledge base. It allows users to refine the potentially mine-able frequent itemset collections until the user-specified security constraints are met (sensitive rules are hidden) and the unexpected patterns introduced by the sanitization process are acceptable under trade-off principles. A sample released dataset can then be constructed automatically from the sanitized knowledge base, so that only insensitive knowledge is inducible. In this way, one can control the availability of

knowledge that can be mined from the sharable dataset, and releasing a dataset based on user-specified restriction rules becomes manageable.

1.2. Motivation and Methodology

Our work is inspired by the inverse frequent itemset mining problem, an emerging topic in privacy-preserving association rule mining. In this approach, one seeks a dataset for a given collection of frequent itemsets and supports, such that the new dataset precisely agrees with the given supports while the supports of all other itemsets stay below the pre-determined threshold [9]. Since frequent itemsets and their supports can be regarded as a summary of the original dataset from which the association rules are derived, techniques that attack data privacy by applying inverse frequent itemset mining to recover the original dataset have received attention. Mielikainen [9] first identified this problem and analyzed its computational complexity, proving that finding a binary dataset compatible with a given collection of frequent itemsets, or even deciding whether such a dataset exists, is NP-hard. The high complexity of this simplified problem indicates how difficult it is to build a controllable privacy-preserving method; we illustrate why this problem can become NP-hard in Subsection 3.1. Moreover, we show that in some special cases an adversary could reconstruct the original dataset from the given frequent itemsets and supports by shrewdly estimating the supports of the infrequent itemsets with heuristic inference methods. The complexity of this process depends on how much knowledge is available to the inverse process; in the extreme case where the whole itemset collection is given, reconstructing the binary transactions is straightforward.

It is well known that the power set of the items together with a partial-order relation (the subset relation in our case) forms a lattice structure in the itemset space P(Ĩ) [12]. Each itemset in the lattice is associated with its subsets as well as its supersets, and the supports of all itemsets form a frequency set of the lattice. We can prove that there exists a one-to-one mapping between a set of transactions and its itemset lattice with given supports (we discuss this fact in detail in the next section). Therefore, the itemsets and their support values can be regarded as a tightly coupled constraint over the transactions in the dataset. Moreover, a consistency relationship holds among the itemsets, because their support counts depend on the subset relationship; this consistency holds for any outcome obtained from a real dataset. Different transaction sets over the same set of items clearly share the same itemset lattice space, but are marked with different sets of supports (see the examples in Figures 2 and 3). The itemset lattice and its support set can therefore be viewed as another representation, at an aggregation level, of the knowledge in the transaction dataset, and there is a simple transformation from one to the other. Consequently, we observe that if a set of supports over the itemset lattice (the mine-able knowledge patterns) is given, then as long as we can verify its consistency constraint, we can generate the corresponding dataset; the complete algorithm to compute such a dataset is given in the next section. Our proposed framework (depicted in Figure 1) exploits this special relationship between transactions and the itemset lattice with supports to address the data-reconstruction problem as a mechanism for privacy preservation.

Figure 1: Proposed Framework

The framework consists of the following procedure:
Step 1: Use an Apriori-like algorithm to generate the itemset lattice with supports over the original dataset D;
Step 2: Find all frequent itemsets in the lattice and generate the corresponding association rules;
Step 3: Identify the sensitive frequent itemsets and select the hiding strategy;
Step 4: Perform the sanitization algorithm over the lattice;
Step 5: Repeat Steps 2-4 until no sensitive pattern appears in the lattice and the supports of the other, non-sensitive itemsets are either under the threshold or practically acceptable to the user;
Step 6: Generate dataset D' from the sanitized lattice.
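As a rough illustration only, and not the paper's actual Step 4 algorithm, the loop of Steps 2-5 can be sketched over a lattice stored as a map from itemset to support. The `sanitize_lattice` function below is a hypothetical stand-in that merely caps the support of each sensitive itemset, and of its supersets, below the threshold; this preserves the monotonicity principle but not necessarily the full consistency constraint discussed in Section 2.

```python
def sanitize_lattice(lattice, sensitive, sigma):
    # Hypothetical stand-in for the sanitization step: cap the support of each
    # sensitive itemset (and, by the monotonicity principle, of every superset)
    # below the threshold sigma so it can no longer be mined as frequent.
    out = dict(lattice)
    for s in sensitive:
        for X in out:
            if s <= X:
                out[X] = min(out[X], sigma - 1)
    return out

def hide_sensitive(lattice, sensitive, sigma):
    # Steps 2-5 of the framework: repeat until no sensitive itemset is frequent.
    while any(lattice[s] >= sigma for s in sensitive):
        lattice = sanitize_lattice(lattice, sensitive, sigma)
    return lattice
```

Applied to the lattice of the sample dataset D1 (Section 2) with sensitive itemset AB and σ = 4, this leaves all non-superset supports untouched while AB and its supersets fall below the threshold.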

Ideally, such an advanced mining technology would provide an interface for the data owner to view the mine-able rules of the original dataset, followed by the ability to view the rules that represent the new dataset obtained by applying the privacy-preserving strategies. Only when the potentially mine-able patterns satisfy the owner's security restrictions, and the owner is aware of the trade-off impacts, is the new dataset generated and released.

1.3. Related Work

The problem of protecting sensitive information in a database while allowing statistical analysis has been studied in statistical databases for many years [14]. One of the main proposed techniques is data perturbation, such as adding noise to the values in the database, or replacing the original database by a sample drawn from it with similar characteristics and data distribution. However, it has been shown that this technique cannot satisfy the conflicting objectives of preserving data utility while protecting restricted individual information [4]. In the area of data mining, current work focuses on heuristic-based data-sanitization techniques that aim to selectively hide sensitive transactions with little impact on the others. The sensitive transactions are those that fully or partially support the generation of sensitive patterns from a dataset; some prominent works are presented in [6], [7] and [8]. The main idea can be summarized as follows: first, find a set of patterns in the transactional dataset and identify the sensitive ones; second, use a transaction-retrieval engine (an inverted-file index mapping each item to the IDs of the transactions that contain it) to find all sensitive transactions in the original database and alter them, while copying all non-sensitive transactions directly, to produce a new, sanitized database. Verykios et al. [5] introduced several algorithms for hiding sensitive rules, for instance decreasing a rule's confidence either by increasing the support of its antecedent through sanitizing transactions that partially support the rule, or by decreasing the support of its consequent in transactions that support both sides; alternatively, the rule's support can be decreased by lowering the support of either its antecedent or its consequent through altering transactions that fully support the rule.
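For intuition, the confidence manipulation just described rests on conf(A ⇒ B) = supp(A ∪ B) / supp(A). The numbers below are hypothetical, purely to illustrate the arithmetic of the antecedent-increasing strategy.

```python
def confidence(supp_union, supp_antecedent):
    # conf(A => B) = supp(A U B) / supp(A)
    return supp_union / supp_antecedent

# Hypothetical rule A => B: supp(A) = 10, supp(A U B) = 8, so conf = 0.8.
before = confidence(8, 10)
# Sanitizing two transactions that partially support the rule (they gain A but
# not B) raises supp(A) to 12 while supp(A U B) stays 8, so conf drops to 2/3.
after = confidence(8, 12)
```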
Oliveira et al. [10] discussed protecting sensitive rules before they are shared. They proposed DSA (Downright Sanitizing Algorithm), which sanitizes sensitive rules while blocking inference channels by selectively sanitizing their supersets and subsets at the same time. Calders [11] pointed out that the relationships among the frequencies of itemsets can be seen as a consistency constraint spanning a dataset: only configurations of frequencies that satisfy this relationship represent a valid outcome of frequent itemset mining. Deciding the validity of such configurations, however, is equivalent to probabilistic satisfiability and hence NP-complete.

This paper is organized as follows. Section 2 introduces a one-to-one mapping algorithm between frequent itemsets and datasets, with a proof of this general observation. Section 3 discusses the application of the one-to-one mapping to the privacy-preserving problem and the consistency-checking strategies used in the process. Section 4 concludes the paper.

2. Mapping between Dataset and Itemset Lattice with Supports

In this section, we first discuss the relationship between a dataset and its itemset lattice with supports. We then define the consistency problem of supports among itemsets. Finally, we illustrate the mapping algorithm from lattice to dataset with some simple examples.

2.1. Itemset Lattice and Supports

Let us reiterate a well-established fact: the power set of the items in a database, together with the subset relation, forms a lattice over the item space P(Ĩ) that contains all possible subsets of items. The supports of all itemsets according to database D form a frequency set S(P(Ĩ), D) associated with the itemset lattice, which can be obtained by applying an Apriori-like algorithm [13].

Mining frequent itemsets is the primary task in generating association rules. Apriori-like algorithms perform a bottom-up, breadth-first search of the itemset lattice, level by level; at the kth level, the supports of the itemsets of cardinality k are counted. There are normally two ways to scan the dataset in this process. If the number of itemsets in the lattice is much larger than the number of transactions, the more efficient approach is to scan the dataset once, examining one transaction at a time, finding all itemsets in the lattice that occur in it, and incrementing their counts by 1. Otherwise, the dataset can be scanned once for each itemset in the lattice. Frequent itemset mining then finds, in the itemset lattice, all itemsets whose support exceeds the pre-defined threshold; once all frequent itemsets are found, association rules can easily be derived. There are two important principles in frequent itemset mining:

Monotonicity Principle: If J ⊆ I are two itemsets, the support of I is at most as high as the support of J.
Apriori Property: All nonempty subsets of a frequent itemset must also be frequent.
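The level-wise search and both principles can be sketched as follows. This is an illustrative brute-force rendition, not the authors' implementation: one dataset scan per level, with Apriori-property pruning during candidate generation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise (breadth-first) frequent itemset mining: at level k, count
    # candidate k-itemsets with one scan of the dataset, keep those meeting
    # min_support, and build (k+1)-candidates only from frequent k-itemsets.
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    frequent = {}
    level = [frozenset([i]) for i in items]  # level-1 candidates
    k = 1
    while level:
        counts = dict.fromkeys(level, 0)
        for t in transactions:               # one scan per level
            for c in level:
                if c <= t:
                    counts[c] += 1
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Apriori property: a (k+1)-candidate is viable only if all of its
        # k-subsets are frequent ("infrequent" branches are pruned here).
        keys = list(survivors)
        level = list({a | b for i, a in enumerate(keys) for b in keys[i + 1:]
                      if len(a | b) == k + 1
                      and all(frozenset(s) in survivors
                              for s in combinations(a | b, k))})
        k += 1
    return frequent

def maximal(frequent):
    # Maximal frequent itemsets: those with no frequent proper superset.
    return {X for X in frequent if not any(X < Y for Y in frequent)}
```

Run on the sample dataset D1 of Table 1 with min_support = 4, this reproduces the twelve frequent itemsets and the maximal set {CD, ABC, ABD} discussed below.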

These properties have been widely used to improve the efficiency of algorithms by pruning "infrequent" branches early, thus narrowing the search space. Let us consider an example. Tables 1 and 2 show two sample transaction datasets over the same set of items I = {A, B, C, D}, with D1 = {T1, ..., T15} and D2 = {T1, ..., T10} respectively.

Table 1. Sample Dataset D1

Transaction   Items
T1            ABD
T2            BD
T3            BC
T4            ABD
T5            AC
T6            BC
T7            ABCD
T8            ABC
T9            AB
T10           AD
T11           BCD
T12           ABCD
T13           ABC
T14           ACD
T15           C

Table 2. Sample Dataset D2

Transaction   Items
T1            ACD
T2            C
T3            ACD
T4            AB
T5            CD
T6            CD
T7            BC
T8            AD
T9            BCD
T10           A

The corresponding lattices (for D1 and D2) are depicted in Figure 2 and Figure 3 respectively.

Figure 2: Lattice built according to D1

Figure 3: Lattice built according to D2

If min_support = 4 for D1, then {A(10), B(11), C(10), D(8), AB(7), AC(6), AD(6), BC(7), BD(6), CD(4), ABC(4), ABD(4)} are marked as frequent itemsets, and {CD(4), ABC(4), ABD(4)} is the set of all maximal frequent itemsets. If min_support = 2 for D2, we get the frequent itemsets {A(5), B(3), C(7), D(6), AC(2), AD(3), BC(2), CD(5), ACD(2)}, of which ACD(2) and BC(2) are maximal.

As indicated in Figures 2 and 3, the two different datasets have the same itemset lattice but are associated with different sets of supports. Therefore, the itemset lattice with supports reflects all potentially mine-able knowledge implied by any transactional dataset; next, we will prove that there exists a one-to-one mapping between the two sides. Obviously, the final result of applying an Apriori-like algorithm corresponds to only a small part of this knowledge base, namely the significant elements. Information disclosure therefore depends on the pre-defined minimal support threshold: the higher the threshold, the fewer data facts can be revealed. It can be seen that the itemset lattice with associated supports represents the so-called mine-able knowledge of a dataset.

2.2. One-to-One Mapping

The one-to-one mapping between a transaction set and its itemset space provides a mechanism that allows us to perform privacy-preserving operations on the lattice structure and then translate the modified lattice back into a new dataset. Given a set of transactions D = {t1, t2, ..., tn} over items Ĩ = {i1, i2, ..., im}, where each transaction ti is a subset of Ĩ, let P(D) denote the power set of D and P(Ĩ) the power set of the items in Ĩ.

Lemma 1: There exists a one-to-one mapping between the transaction space P(D) of a given database D and the itemset space P(Ĩ).

Proof: Let T(X) ∈ P(D), where T(X) = {ti | X ⊆ ti, ti ∈ D} for some X ⊆ Ĩ, and support(X) = |T(X)|. Since X ∈ P(Ĩ), we have P(D) ⊆ P(Ĩ). Conversely, for each itemset X ∈ P(Ĩ) we can obtain the set of transactions containing X, T(X) = {ti | X ⊆ ti}, such that |T(X)| = support(X); since T(X) ∈ P(D), we have P(Ĩ) ⊆ P(D).

Using an Apriori-like algorithm, we can obtain every itemset's support from the transactions, and by Lemma 1 we know that if the itemset lattice with supports of a real dataset is given, there exists exactly one compatible dataset. The question then is: how can we translate these itemsets and supports back into transactions? Each X in P(Ĩ) may occur as an entire transaction or as a proper subset of a transaction. If we denote the number of transactions that consist of exactly X by f(X), called the cardinality of X, then:

Formula 1: For each itemset X in the lattice,

f(X) = Support(X) − Σ_{X ⊂ I′, I′ ∈ P(Ĩ)} f(I′)

Since support(X) = |T(X)| = |{tx | X ⊆ tx, tx ∈ D}| = |{tx | tx = X or X ⊂ tx}|, there are |{tx | tx = X}| = support(X) − |{tx | X ⊂ tx}| transactions in the database that consist of exactly X. If X has no superset in P(Ĩ), its support is simply the number of times it appears directly in the original dataset. Otherwise, |{tx | X ⊂ tx}| must itself be expanded over the supersets of X in the same way, and so forth: we obtain the cardinality of an itemset X only once we have the cardinalities of its direct supersets in P(Ĩ). This is a recursive process whose base case is an itemset Y that has no superset, for which f(Y) = support(Y).

Let us illustrate this function with the example in Figure 2:

f(ABCD) = support(ABCD) − 0 = 2
f(ABC) = support(ABC) − f(ABCD) = 4 − 2 = 2
f(AB) = support(AB) − (f(ABC) + f(ABD) + f(ABCD)) = 7 − (2 + 2 + 2) = 1
f(A) = support(A) − (f(AB) + f(AC) + f(AD) + f(ABC) + f(ABD) + f(ACD) + f(ABCD)) = 10 − (1 + 1 + 1 + 2 + 2 + 1 + 2) = 0

The above calculations indicate that the original dataset contains two transactions consisting of "ABCD", two consisting of "ABC", one consisting of "AB", and no transaction that is solely composed of "A". Therefore, according to Formula 1, we can precisely predict the occurrence of transactions in a dataset. Assuming the returned value of f(X) is an integer n, it can be seen that:
• If n > 0, there are n transactions in the dataset consisting of exactly this itemset.
• If n = 0, no transaction consists of exactly this itemset; in this case the itemset occurs only as a subset of some transactions.
• If n < 0, the given supports are inconsistent: no compatible dataset exists.
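Formula 1 translates directly into a reconstruction routine. The sketch below (our naming, and assuming the full lattice with supports is available) computes f(X) top-down from the largest itemsets and emits f(X) copies of each itemset as transactions; a negative f(X) signals inconsistent supports.

```python
from itertools import combinations

def reconstruct(items, support):
    # Formula 1: f(X) = Support(X) - sum of f(I') over strict supersets I' of X.
    # Processing itemsets from largest to smallest guarantees every strict
    # superset of X already has its f value when X is reached.
    lattice = [frozenset(c) for k in range(len(items), 0, -1)
               for c in combinations(items, k)]
    f = {}
    for X in lattice:
        f[X] = support.get(X, 0) - sum(n for Y, n in f.items() if X < Y)
        if f[X] < 0:
            raise ValueError("inconsistent supports: no compatible dataset")
    # Emit f(X) transactions consisting of exactly X.
    return [X for X, n in f.items() for _ in range(n)]
```

Applied to the full support set of the sample dataset D1, this yields a transaction multiset identical to Table 1: for example, two transactions equal to ABCD, one equal to AB, and none equal to A alone.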