Hiding Sensitive Itemsets by Inserting Dummy Transactions

Tzung-Pei Hong1,3, Chun-Wei Lin1*, Chia-Ching Chang1 and Shyue-Liang Wang2
1 Department of Computer Science and Information Engineering
2 Department of Information Management
National University of Kaohsiung
3 Department of Computer Science and Engineering
National Sun Yat-sen University
[email protected], [email protected], [email protected], [email protected]

Abstract

Privacy-preserving data mining (PPDM) has become an important issue in recent years. In this paper, a greedy-based approach for hiding sensitive itemsets by inserting dummy transactions is proposed. It computes the maximal number of transactions to be inserted into the original database for completely hiding the sensitive itemsets. Experiments are also performed to evaluate the performance of the proposed approach.

1. Introduction

As data mining technology progresses rapidly, it becomes easier to extract private information from user data. Private information includes confidential data such as income, medical history, addresses, credit card numbers, phone numbers and customer purchasing behaviors, among others. For business purposes, some information shared among companies may be extracted and analyzed by other partners, which may not only decrease the benefits of the companies but also threaten sensitive data. In the past, many approaches have been proposed for privacy-preserving data mining [4, 6, 8-10, 14-15], but most of them partially delete transactions, or items within transactions, to hide sensitive itemsets. In real-world applications, however, deleting transactions from the original database may cause serious problems in decision making. In this paper, a greedy-based approach that inserts new transactions into the original database is thus proposed. The empirical rule of the standard normal distribution is applied to determine the number of newly inserted transactions, the lengths of the inserted transactions, and the itemsets to be added into the inserted transactions. The proposed approach can thus easily make good trade-offs between privacy preservation and execution time. Several experiments are made to evaluate the execution time, the number of inserted transactions for hiding sensitive information, and the side effects of the proposed algorithm.

*corresponding author

2. Related work

In this section, related work on data mining and data sanitization is reviewed.

2.1. Data mining

Data mining is most commonly used to induce association rules from transaction data, such that the presence of certain items in a transaction implies the presence of some other items. For this purpose, the Apriori algorithm [3] and the FP-growth algorithm [7] are the classical efficient approaches for deriving frequent itemsets in association-rule mining. The former is a level-wise approach that generates and tests candidates; the latter uses a tree structure to keep the frequent itemsets without candidate generation, thus reducing the computational cost of rescanning the database. In the Apriori algorithm, the database is recursively scanned to find the large (frequent) itemsets. All possible association combinations for each large itemset are formed, and those with confidence values larger than the minimum confidence are output as the desired association rules.
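To make the level-wise generate-and-test idea concrete, the following is a minimal Apriori-style sketch (an illustration written for this discussion, not the implementation from [3]): each level counts candidates by one database scan, keeps those meeting the minimum count, and joins surviving k-itemsets into (k+1)-candidates whose k-subsets are all frequent.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise search for all itemsets with count >= min_count."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]  # level-1 candidates
    frequent = {}
    k = 1
    while level:
        # One database scan counts all candidates of the current level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        # Join step: (k+1)-candidates whose k-subsets are all frequent.
        level = []
        for a, b in combinations(list(current), 2):
            cand = a | b
            if len(cand) == k + 1 and cand not in level and all(
                frozenset(s) in current for s in combinations(cand, k)
            ):
                level.append(cand)
        k += 1
    return frequent
```

The subset check in the join step is the Apriori pruning property: no superset of an infrequent itemset can be frequent.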

2.2. Data sanitization

Years of effort in data mining have produced a variety of efficient techniques, which have also raised security problems and privacy threats [11]. Privacy-preserving data mining (PPDM) techniques have thus become a critical research issue for hiding confidential or secure information. In the past, Atallah et al. proposed a protection algorithm for data sanitization to avoid the inference of association rules [2]. Dasseni et al. then proposed a hiding algorithm based on the Hamming-distance approach to reduce the confidence or support values of association rules [5]. Oliveira and Zaïane [12] introduced a multiple-rule hiding approach to efficiently hide sensitive itemsets. Amiri proposed three heuristic algorithms to hide multiple sensitive rules [1]. Pontikakis et al. [13] then proposed two heuristic approaches based on data distortion. Hong et al. proposed two approaches that partially delete the items within transactions, or whole transactions, from the original database to hide sensitive itemsets [8-9].

3. A greedy-based algorithm

In this section, the proposed greedy-based algorithm for data sanitization is stated. The algorithm ensures that the originally frequent itemsets remain frequent after a number of transactions are inserted to hide the sensitive itemsets. The algorithm is described below.

3.1. The proposed greedy-based algorithm

INPUT: A transaction dataset D = {T1, T2, …, Tm}, a set of k frequent itemsets FI = {fi1, fi2, …, fik}, a set of l infrequent (small) itemsets I = {i1, i2, …, il}, a user-specified minimum support threshold α, and a set of user-specified sensitive itemsets S = {si1, si2, …, sij, …, sip}.

OUTPUT: A sanitized dataset.

STEP 1: Calculate the maximum safety bound (MSB), which gives the number n of newly inserted transactions, as:

n = max_{1 ≤ i ≤ p} (SB_i), where SB_i = ⌊count(si_i)/α⌋ − m + 1,

SB_i is the safety bound of the sensitive itemset si_i, m is the number of original transactions in D, and count(si_i) is the count of si_i.

STEP 2: Calculate the length p_n of each inserted transaction in d = {d1, d2, …, dn} according to the empirical rule of the standard normal distribution, where n is the number of inserted transactions obtained in STEP 1.

STEP 3: Choose the itemsets to be inserted into each inserted transaction d_n by the following substeps.
Substep 3-1: Calculate the count difference (CD) of each frequent itemset that may be inserted into the new transactions as:

CD_{fi_k} = ⌊count(fi_k) − (m + n)α⌋,

where count(fi_k) is the count (frequency) of the itemset fi_k, m is the number of transactions in D, and n is the number of transactions in d.
Substep 3-2: Put the frequent itemsets fi_k with negative CD_{fi_k} into the set Insert_Items.
Substep 3-3: Sort the fi_k in Insert_Items in descending order of their lengths.
Substep 3-4: Sort the results obtained in Substep 3-3 in descending order of their |CD_{fi_k}| values.

STEP 4: Process the inserted transactions d_n one by one, adding the fi_k in Insert_Items according to the sorted order obtained in Substep 3-4. Note that the length of an inserted itemset fi_k may not exceed p_n in d_n, and the itemsets inserted into a transaction cannot form a superset of any sensitive itemset in S.
STEP 5: Update (decrease) the remaining insertion count of the processed itemset fi_k, and of its corresponding sub-itemsets, by 1.
STEP 6: Repeat STEPs 4 to 5 until Insert_Items is empty or no more itemsets can be inserted into d_n under the constraints in STEP 4.
STEP 7: Add the small items in I into the transactions d_n that still have remaining positions, according to the empirical rule of the standard normal distribution.
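The quantitative core of the algorithm, STEP 1 and Substeps 3-1 to 3-4, can be sketched as follows. This is a hedged illustration of the formulas above, not the authors' implementation; the function names are chosen for this sketch.

```python
import math

def max_safety_bound(sensitive_counts, m, alpha):
    """STEP 1: n = max_i( floor(count(si_i)/alpha) - m + 1 )."""
    return max(math.floor(c / alpha) - m + 1 for c in sensitive_counts)

def count_difference(fi_count, m, n, alpha):
    """Substep 3-1: CD = floor(count(fi_k) - (m + n) * alpha)."""
    return math.floor(fi_count - (m + n) * alpha)

def build_insert_items(count_diffs):
    """Substeps 3-2 to 3-4: keep itemsets with negative CD, then sort by
    length and |CD|, both descending. count_diffs maps itemset -> CD."""
    negative = {k: v for k, v in count_diffs.items() if v < 0}
    # Ascending order on a negative CD equals descending order on |CD|.
    return sorted(negative, key=lambda k: (-len(k), negative[k]))
```

An itemset with a negative CD would fall below the raised minimum count (m + n)α after insertion, so |CD| is the number of times it must be placed into the dummy transactions to stay frequent.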

3.2. An example

In this section, an example is used to illustrate the proposed algorithm step by step. Assume the database shown in Table 1 is used as an example. It consists of 8 transactions over 7 items, denoted a to g. Assume the user-specified sensitive itemsets in S are {c:6, be:5, abc:4}, and the minimum support threshold is set at 50%. The minimum count is thus calculated as 0.5 × 8 (= 4). An Apriori-like approach is then executed to find all frequent itemsets, and the results are shown in Table 2. Note that the sensitive itemsets are marked in Table 2.

Table 1. A database with 8 transactions

TID  Items
1    a, b, c, d, e
2    a, b, c, e
3    c, e
4    a, b, c, e
5    b, g
6    b, d, e, f
7    a, b, c, d
8    b, c, e, f

Table 2. All large itemsets

All items:        a:4, b:7, c:6, d:3, e:6, f:2, g:1
Large 1-itemsets: a:4, b:7, c:6 (sensitive), e:6
Large 2-itemsets: ab:4, ac:4, bc:5, be:5 (sensitive), ce:5
Large 3-itemsets: abc:4 (sensitive), bce:4
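The counts in Table 2 can be reproduced by brute-force enumeration over the 8 transactions (a verification sketch for this example, not part of the proposed algorithm):

```python
from itertools import combinations

# The database of Table 1; minimum count = 0.5 * 8 = 4.
db = [set("abcde"), set("abce"), set("ce"), set("abce"),
      set("bg"), set("bdef"), set("abcd"), set("bcef")]
min_count = 4

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in db if set(itemset) <= t)

items = sorted(set().union(*db))
large = {}
for k in range(1, len(items) + 1):
    found = False
    for cand in combinations(items, k):
        if (s := support(cand)) >= min_count:
            large["".join(cand)] = s
            found = True
    if not found:  # no frequent k-itemset, so no larger one exists
        break
```

Running this yields exactly the eleven large itemsets of Table 2, including the three sensitive ones c:6, be:5 and abc:4.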

STEP 1: The number of inserted transactions is first calculated. In this example, there are 3 sensitive itemsets to be hidden. The maximum safety bound (MSB) among the three sensitive itemsets is calculated as 5, which is the number of newly inserted transactions.
STEP 2: For the five inserted transactions obtained in STEP 1, the length of each transaction is computed according to the empirical rule of the standard normal distribution. That is, the probabilities of lengths 2, 3, 4 and 5 are calculated as (13.5%, 34%, 34%, 13.5%). The lengths of the five inserted transactions, TIDs 9 to 13, are then respectively set at {4, 3, 4, 2, 3}.
STEP 3: The count difference of each frequent itemset in Table 2 is then calculated. The results are shown in Table 3.

Table 3. The count differences of all frequent itemsets

Large 1-itemsets: a: -3, b: 0, e: 1
Large 2-itemsets: ab: -3, ac: -3, bc: -2, ce: -2
Large 3-itemsets: bce: -3
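The numbers of this worked example can be reproduced directly from the formulas in Section 3.1 (a sketch with illustrative variable names; the non-sensitive itemsets with negative CD are the ones listed in Insert_Items below):

```python
import math

m, alpha = 8, 0.5

# STEP 1: safety bounds for the sensitive itemsets c:6, be:5, abc:4.
safety_bounds = [math.floor(c / alpha) - m + 1 for c in (6, 5, 4)]
n = max(safety_bounds)  # number of inserted transactions

# Substep 3-1: count differences CD = floor(count - (m + n) * alpha)
# for the non-sensitive large itemsets considered for insertion.
counts = {"a": 4, "b": 7, "ab": 4, "ac": 4, "bc": 5, "ce": 5, "bce": 4}
cd = {k: math.floor(v - (m + n) * alpha) for k, v in counts.items()}

# Substeps 3-2 to 3-4: keep negative CDs; sort by length then |CD|,
# both descending (ascending on a negative CD is descending on |CD|).
insert_items = sorted((k for k, v in cd.items() if v < 0),
                      key=lambda k: (-len(k), cd[k]))
```

This gives n = 5 inserted transactions and the ordered set Insert_Items = {bce:3, ab:3, ac:3, bc:2, ce:2, a:3}, matching the example.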

In Table 3, only the itemsets with negative CD are considered for insertion; they are sorted according to their lengths and |CD| values. The sorted results are then put into the set Insert_Items = {bce:3, ab:3, ac:3, bc:2, ce:2, a:3}.
STEPs 4, 5 & 6: The itemsets are added into transactions 9 to 13 according to the sorted order in Insert_Items. The corresponding sub-itemsets are also updated (decreased). The greedy procedure terminates when there is no more space for inserting the frequent itemsets.
STEP 7: The small items are then alternately selected according to the empirical rule of the standard normal distribution and inserted into the transactions with remaining space. The updated database of this example is shown in Table 4.

Table 4. The updated database

TID  Items
1    a, b, c, d, e
2    a, b, c, e
3    c, e
4    a, b, c, e
5    b, g
6    b, d, e, f
7    a, b, c, d
8    b, c, e, f
9    b, c, e, g
10   b, c, e
11   b, c, e, f
12   a, b
13   a, b, g

4. Experimental results

Experiments were made to evaluate the performance of the proposed approach. Three databases, BMS-POS, WebView-1 and WebView-2 [16], are used for comparison. The numbers of sensitive itemsets are defined as percentages of the frequent itemsets in the databases, which gives a more flexible view of the performance of the proposed algorithm. For the proposed greedy-based algorithm, the relationships among the number of inserted transactions, the execution time, and the side effects are compared on the three databases. The numbers of newly inserted transactions computed for the three databases are shown in Figure 1.

Figure 1. Numbers of added transactions

In Figure 1, it can be observed that the BMS-POS database requires more transactions to be inserted, since it contains more information (transactions) than the others. This is because more sensitive itemsets have to be hidden, as the number of sensitive itemsets is defined as a percentage of the frequent itemsets. The side effects of the proposed greedy-based approach are also evaluated, including the hiding failure (the number of unhidden sensitive itemsets), the number of missing non-sensitive itemsets, and the number of artificial itemsets. The results are shown in Figure 2. From Figure 2, it can be observed that the proposed greedy-based approach can completely hide the sensitive itemsets without any side effects.

Figure 2. The evaluation of three side effects in WebView-1

5. Conclusion

In this paper, a greedy-based algorithm that inserts new transactions into the original database to efficiently hide sensitive itemsets is proposed. The empirical rule of the standard normal distribution is used to determine the number of newly inserted transactions and the lengths of those transactions. In the proposed algorithm, the large itemsets in the original database are added into the inserted transactions to reduce the side effect of missing rules. The small items are also added into the inserted transactions to fill the remaining spaces. Experimental results show the performance of the proposed greedy-based approach for inserting new transactions.

6. References

[1] A. Amiri, "Dare to share: Protecting sensitive knowledge with data sanitization," Decision Support Systems, pp. 181-191, 2007.
[2] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. S. Verykios, "Disclosure limitation of sensitive rules," Knowledge and Data Engineering Exchange Workshop, pp. 45-52, 1999.
[3] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," The ACM SIGMOD Conference on Management of Data, pp. 207-216, 1993.
[4] R. Agrawal and R. Srikant, "Privacy-preserving data mining," The ACM SIGMOD Conference, pp. 439-450, 2000.
[5] E. Dasseni, V. S. Verykios, A. K. Elmagarmid, and E. Bertino, "Hiding association rules by using confidence and support," The International Workshop on Information Hiding, pp. 369-383, 2001.
[6] M. N. Dehkordi, K. Badie, and A. K. Zadeh, "A novel method for privacy preserving in association rule mining based on genetic algorithms," Journal of Software, vol. 4, no. 6, pp. 555-562, 2009.
[7] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," The ACM SIGMOD International Conference on Management of Data, pp. 1-12, 2000.
[8] T. P. Hong, C. W. Lin, K. T. Yang, and S. L. Wang, "A heuristic data-sanitization approach based on tf-idf," The International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, pp. 156-164, 2011.
[9] T. P. Hong, K. T. Yang, C. W. Lin, and S. L. Wang, "Evolutionary privacy-preserving data mining," World Automation Congress, pp. 1-7, 2010.
[10] D. E. O'Leary, "Knowledge discovery as a threat to database security," Knowledge Discovery in Databases, pp. 507-516, 1991.
[11] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996.
[12] S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving frequent itemset mining," IEEE International Conference on Privacy, Security and Data Mining, pp. 43-54, 2002.
[13] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," IFIP Workshop on Database Security, pp. 325-339, 2004.
[14] E. Sanchez, T. Shibata, and L. A. Zadeh, "Genetic algorithms and fuzzy logic systems," Soft Computing Perspectives, pp. 1-17, 1997.
[15] V. S. Verykios, A. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni, "Association rule hiding," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, pp. 434-447, 2004.
[16] Z. Zheng, R. Kohavi, and L. Mason, "Real world performance of association rule algorithms," The ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 401-406, 2001.
