
Hiding Sensitive Association Rules Efficiently By Introducing New Variable Hiding Counter

Jitendra Varshney, I.I.I.T.M. Gwalior (M.P.) India [email protected]

Ramesh Chandra Belwal, I.I.I.T.M. Gwalior (M.P.) India [email protected]

Sohel Ahmed Khan, I.I.I.T.M. Gwalior (M.P.) India [email protected]

Anand Sharma, I.I.I.T.M. Gwalior (M.P.) India [email protected]

Mahua Bhattacharya, Indian Institute of Information Technology and Management, Gwalior (M.P.) India [email protected]

Abstract-Large databases contain certain information that must be protected against unauthorized access. An important task in data mining is discovering association rules from a database of transactions, where each transaction consists of a set of items. In this paper we discuss confidentiality issues of a broad category of association rules. Two important measures, support and confidence, are associated with each association rule. A rule is called sensitive if its disclosure risk is above a certain privacy threshold. Sometimes we do not want to disclose sensitive rules to the public, for reasons of confidentiality. There are many approaches to hiding association rules that take support and confidence as the basis of their algorithms ([1, 2, 6] and many more). Our approach is a modification of ISL (increase support of LHS) and DSR (decrease support of RHS), changed so that it hides any desired association rule, which previous work sometimes cannot. Like that work, ours is based on reducing the support and confidence of sensitive rules, but we do not edit the given database of transactions directly (as is generally done in previous works); rather, we perform the same task indirectly by modifying some newly introduced terms associated with the database transactions and association rules. These new terms are Mconfidence (modified confidence), Msupport (modified support) and the hiding counter. Our algorithm uses modified definitions of support and confidence so that it hides any desired sensitive association rule without any side effect. We obtain association rules with the same method as before, but with the modified definitions of support and confidence.

Keywords-Association rule hiding, hiding counter, modified confidence, modified support.

978-1-4244-2013-1/08/$25.00 ©2008 IEEE

Authorized licensed use limited to: INDIAN INST OF INFO TECH AND MANAGEMENT. Downloaded on November 30, 2008 at 00:54 from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Securing information against unauthorized access is an important goal of the database security and privacy communities. Data mining is the process of discovering useful and hidden information from large databases. Privacy becomes a concern in this task whenever some of that information is sensitive and we do not want to disclose it to the public. Privacy preserving data mining is thus the process of protecting personal information from data mining algorithms. Many approaches for hiding association, classification and clustering rules have been proposed for the case where the specific rules to be hidden are given. However, to specify the rules to be hidden, the entire data mining process needs to be executed. For some applications, we are only interested in hiding certain sensitive predictive rules that contain given items. In our work, we assume that only the sensitive items are given, and we propose an algorithm that modifies the data (by introducing some additional terms) so that sensitive predictive rules containing sensitive items on the left hand side of the rule cannot be inferred through association rule mining. Our approach is based on modifying the database in a way that reduces the confidence of any association rule that contains a sensitive item. Once the confidence of a sensitive rule falls below a specified threshold, the rule is hidden, i.e. it will not be disclosed. It is shown that our approach requires fewer database scans and is comparatively simpler than other approaches.

The main point of our approach is that we introduce slightly modified definitions of support and confidence, which we explain in a later section. In our discussion, the term database modification refers to this concept: we do not directly disturb or edit the given database of transactions; rather, we introduce some new terms with whose help we are able to hide the association rules that contain sensitive elements on the left hand side. Two strategies have been used so far to hide association rules:

1: Increase the support of the item which is on the left hand side of the rule.
2: Decrease the support of the item which is on the right hand side of the rule.

Our method is based on the first of these strategies, i.e. the ISL method (increase the support of the item which is on the left hand side of the rule). The rest of the paper is organized as follows. Section 2 presents the background and related work. Section 3 presents the problem statement. Section 4 presents our algorithm and an example. Section 5 presents the analysis and conclusion.

II. BACKGROUND AND RELATED WORK

There is a large amount of work related to association rule hiding. Most researchers have worked on the basis of reducing the support and confidence of sensitive association rules ([1, 2, 6]). ISL and DSR are the common approaches used to hide sensitive rules. For the case where the specific rules to be hidden are given, many approaches for hiding association, classification and clustering rules have been proposed. Some researchers have used data perturbation techniques ([5]) to modify the confidential data values in such a way that approximate data mining results can still be obtained from the modified version of the database. Some researchers also recognize the necessity of analyzing the various data mining algorithms in order to increase the efficiency of any adopted strategy that deals with disclosure limitation of sensitive data and knowledge. Disclosure limitation of sensitive knowledge by data mining algorithms, based on the retrieval of association rules, has also been investigated recently. Our work is likewise based on reducing the support and confidence of sensitive rules, but our method uses some modified terms and a new variable to do the job. Our work also claims that we can hide any given association rule, which some of the previous work cannot.
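The effect of the two strategies, ISL and DSR, can be illustrated with a minimal Python sketch. It uses the five transactions and the rule D → A from the example of Section 4, and the usual definitions of confidence (transactions containing both sides divided by transactions containing the left hand side) and support. The function and variable names (`count`, `support`, `confidence`, `isl`, `dsr`) are illustrative assumptions, not names from the paper.

```python
def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(lhs, rhs, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    return count(lhs + rhs, transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Transactions containing both sides, divided by those containing the LHS."""
    return count(lhs + rhs, transactions) / count(lhs, transactions)


# T1..T5 of the example: ABD, B, ACD, AB, ABD
original = [set("ABD"), set("B"), set("ACD"), set("AB"), set("ABD")]

# ISL: raise the support of the LHS item D by adding it to T2 (B -> BD).
isl = [set(t) for t in original]
isl[1].add("D")

# DSR: lower the support of the RHS item A by removing it from T1 (ABD -> BD).
dsr = [set(t) for t in original]
dsr[0].discard("A")

for name, db in (("original", original), ("ISL", isl), ("DSR", dsr)):
    print(
        f"{name}: conf(D->A)={confidence('D', 'A', db):.2f} "
        f"supp(D->A)={support('D', 'A', db):.2f} "
        f"supp(A->D)={support('A', 'D', db):.2f}"
    )
```

Under these assumptions, ISL only lowers the confidence of D → A to 0.75, while DSR lowers it to about 0.67 but also lowers the support of the non-sensitive rule A → D, because both strategies edit the transactions themselves.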

Let $I = \{I_1, I_2, I_3, \ldots, I_m\}$ be a set of literals, called items. Given a set of transactions $D$, where each transaction $T$ is a set of items such that $T \subseteq I$, an association rule is an expression $X \rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$. The confidence and support of a rule are defined as

$$\text{Confidence} = \frac{|X \cup Y|}{|X|} \tag{1}$$

$$\text{Support} = \frac{|X \cup Y|}{N} \tag{2}$$

where $N$ is the number of transactions and $|X|$ denotes the number of transactions in $D$ that contain $X$. In other words, the confidence of a rule measures the degree of the correlation between item sets, while the support of a rule measures the significance of the correlation between item sets. The problem of mining association rules is to find all rules that have support and confidence greater than the user-specified minimum support and minimum confidence.

In our work we also introduce some new terms. The first is $R_{hc}$ (the set of hiding counters), the second is Mconfidence (modified confidence) and the third is Msupport (modified support). Let $R_{hc} = (R_{1hc}, R_{2hc}, \ldots, R_{mhc})$, where $R_{hc}$ is the set of hiding counters for all rules. We slightly modify the definitions of support and confidence by associating the hiding counter with each of them. The modified terms are given below.

The new modified confidence for the rule $X \rightarrow Y$ is

$$\text{Mconfidence}(X \rightarrow Y) = \frac{|X \cup Y|}{|X| + \text{hiding counter of rule } X \rightarrow Y} \tag{3}$$

The new modified support for the rule $X \rightarrow Y$ is

$$\text{Msupport}(X \rightarrow Y) = \frac{|X \cup Y|}{N + \text{hiding counter of rule } X \rightarrow Y} \tag{4}$$
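Because the hiding counter only enlarges the denominators of (3) and (4), the number of increments needed to push Mconfidence below a minimum confidence threshold can be worked out directly from (3). The short derivation below is a sketch under assumed shorthand that does not appear in the paper: $s = |X \cup Y|$, $h$ for the hiding counter of the rule, and $\tau$ for the minimum confidence threshold.

```latex
\[
\frac{s}{|X| + h} < \tau
\;\Longleftrightarrow\;
h > \frac{s}{\tau} - |X|
\quad\Longrightarrow\quad
h_{\min} = \max\!\left(0,\ \left\lfloor \frac{s}{\tau} - |X| \right\rfloor + 1\right).
\]
% Worked numbers: s = 3, |X| = 3, N = 5, tau = 0.70
\[
\frac{3}{0.70} - 3 \approx 1.29
\;\Longrightarrow\; h_{\min} = 2,
\qquad
\text{Mconfidence} = \frac{3}{3 + 2} = 60\%,
\qquad
\text{Msupport} = \frac{3}{5 + 2} \approx 43\%.
\]
```

The worked numbers correspond to the rule D → A of the example in Section 4, where $|X \cup Y| = 3$, $|X| = 3$, $N = 5$ and the threshold is 70%.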

The problem of mining association rules is to find all rules that have support and confidence greater than the user-specified minimum support threshold (MST) and minimum confidence threshold (MCT). As an example [1], for the database given in Table 1, a minimum support of 33% and a minimum confidence of 70%, nine association rules can be found as follows: B=>A (66%, 100%), C=>A (66%, 100%), B=>C (50%, 75%), C=>B (50%, 75%), AB=>C (50%, 75%), AC=>B (50%, 75%), BC=>A (50%, 100%), C=>AB (50%, 75%), B=>AC (50%, 75%).
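The nine rules can be checked with a small brute-force enumeration. The following Python sketch assumes the six transactions of Table 1 (ABC, ABC, ABC, AB, A, AC) and treats the thresholds as inclusive (`>=`); the names `mine_rules`, `count` and `TABLE_1` are illustrative and not from the paper, and the exhaustive search stands in for whatever mining algorithm is actually used.

```python
from itertools import combinations

# Transactions T1..T6 of Table 1, written as sets of items.
TABLE_1 = [set("ABC"), set("ABC"), set("ABC"), set("AB"), set("A"), set("AC")]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def mine_rules(transactions, mst, mct):
    """Enumerate all rules X=>Y with support >= mst and confidence >= mct."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    rules = {}
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            both = count(itemset, transactions)
            if both == 0:
                continue  # itemset never occurs, so no rule can be formed
            # Split the itemset into every non-empty LHS / RHS pair.
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    support = both / n
                    confidence = both / count(lhs, transactions)
                    if support >= mst and confidence >= mct:
                        name = "".join(lhs) + "=>" + "".join(rhs)
                        rules[name] = (support, confidence)
    return rules


rules = mine_rules(TABLE_1, mst=0.33, mct=0.70)
for name, (s, c) in sorted(rules.items()):
    print(f"{name}: support={s:.0%}, confidence={c:.0%}")
```

Rules such as A=>B are rejected because their confidence is 4/6, below the 70% threshold. Note that the sketch rounds 4/6 to 67%, whereas the text truncates it to 66%.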

III. PROBLEM STATEMENT

Table: 1

TID   Items
T1    ABC
T2    ABC
T3    ABC
T4    AB
T5    A
T6    AC

The objective of privacy preserving data mining is to hide certain sensitive information so that it cannot be discovered through data mining techniques. Given a transaction database, a minimum support threshold, a minimum confidence threshold and a set of sensitive items X, the objective is to modify the database in such a way that no predictive association rule containing X on the left hand side will be discovered. So if in the above example element A is sensitive, then the rules AB=>C (50%, 75%) and AC=>B (50%, 75%) should not be discovered by a data mining algorithm.

IV. PROPOSED ALGORITHM

To hide any specified association rule X → Y, our algorithm works on the basis of Mconfidence(X → Y) and Msupport(X → Y). To hide the rule X → Y (containing sensitive element X on the LHS), our algorithm repeatedly increases the hiding counter of the rule X → Y until Mconfidence(X → Y) goes below the minimum confidence threshold (MCT). Once Mconfidence(X → Y) is below MCT, the rule X → Y is hidden, i.e. it will not be discovered through a data mining algorithm.

$$\text{Mconfidence}(X \rightarrow Y) = \frac{|X \cup Y|}{|X| + \text{hiding counter of rule } X \rightarrow Y}$$

$$\text{Msupport}(X \rightarrow Y) = \frac{|X \cup Y|}{N + \text{hiding counter of rule } X \rightarrow Y}$$

Algorithm

Input:
1: A source database D.
2: MST (Minimum Support Threshold).
3: MCT (Minimum Confidence Threshold).
4: A set of sensitive items X.
5: A set of hiding counters for all rules (which are initially set to zero).
6: New modified terms Mconfidence(X → Y) and Msupport(X → Y).

Output: A transformed database D' with modified confidence where rules containing X on LHS will be hidden.

Procedure:
    Initially set the hiding counters of all the rules equal to 0.
    // Check for all sensitive elements.
    for each x in X
    {
        // Now check all the rules containing sensitive element x.
        for each rule R which contains x on LHS
        {
            // Check whether Mconfidence of the rule goes below MCT or not.
            while (Mconfidence(R) >= MCT)
            {
                // Increase the hiding counter of rule R by 1.
                Hiding_counter(R) = Hiding_counter(R) + 1
            }
        }
    }
End of procedure.

Output the rules which do not contain sensitive elements on the left hand side.

Example

Suppose we are given a database of transactions [7] as below.

Table: 2

TID   Items
T1    ABD
T2    B
T3    ACD
T4    AB
T5    ABD

We are also given an MST of 60% and an MCT of 70%. Four association rules can be found: A → B (60%, 75%), B → A (60%, 75%), A → D (60%, 75%) and D → A (60%, 100%). Now we have to hide D and B.

By previous methods: With the simple ISL algorithm, if we want to hide D and B, we modify transaction T2 from B to BD (i.e. from 0100 to 0101), and we find that we cannot hide the rule D → A.

TID   Items   Before   After
T1    ABD     1101     1101
T2    B       0100     0101
T3    ACD     1011     1011
T4    AB      1100     1100
T5    ABD     1101     1101

(Hiding D → A by ISL approach)

So we can see that rule D → A cannot be hidden by the ISL approach, because after modifying T2 from B to BD (i.e. from 0100 to 0101) rule D → A has support and confidence 60% and 75% respectively. Now we check it by the DSR approach.

TID   Items   Before   After
T1    ABD     1101     0101
T2    B       0100     0100
T3    ACD     1011     1011
T4    AB      1100     1100
T5    ABD     1101     1101

(Hiding D → A by DSR approach)

We see that by the DSR approach rule D → A is hidden, as its support and confidence are now 40% and 66% respectively, but as a side effect the rule A → D is also hidden. Similarly we can check that the same holds for B → A.

Our Approach: The transactions T1 to T5 (ABD, B, ACD, AB, ABD) are left unchanged. Initially all hiding counters are 0.

Rule     Msupport   Mconfidence   Hiding Counter
A → B    60%        75%           0
B → A    60%        75%           0
A → D    60%        75%           0
D → A    60%        100%          0

First we hide B. With a hiding counter of 1, Mconfidence(B → A) = 3/(4+1) = 60% and Msupport(B → A) = 3/(5+1) = 50%.

Rule     Msupport   Mconfidence   Hiding Counter
A → B    60%        75%           0
B → A    50%        60%           1    ← rule is hidden
A → D    60%        75%           0
D → A    60%        100%          0

Now we hide D. With a hiding counter of 2, Mconfidence(D → A) = 3/(3+2) = 60% and Msupport(D → A) = 3/(5+2) ≈ 43%.

Rule     Msupport   Mconfidence   Hiding Counter
A → B    60%        75%           0
B → A    50%        60%           1
A → D    60%        75%           0
D → A    43%        60%           2    ← rule is hidden

So we clearly see that our approach hides all the given sensitive rules successfully without any side effect.
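The procedure of Section IV can be sketched in Python and run on the transactions of Table 2. This is a minimal sketch under stated assumptions rather than the authors' implementation: the four mined rules are passed in by hand, rules are represented as hypothetical `(lhs, rhs)` string pairs, and the names `hide_rules`, `mconfidence`, `msupport` and `TABLE_2` are illustrative.

```python
# Transactions T1..T5 of Table 2; they are never modified.
TABLE_2 = [set("ABD"), set("B"), set("ACD"), set("AB"), set("ABD")]


def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def mconfidence(lhs, rhs, counter, transactions):
    """Modified confidence: |X u Y| / (|X| + hiding counter)."""
    return count(lhs + rhs, transactions) / (count(lhs, transactions) + counter)


def msupport(lhs, rhs, counter, transactions):
    """Modified support: |X u Y| / (N + hiding counter)."""
    return count(lhs + rhs, transactions) / (len(transactions) + counter)


def hide_rules(rules, sensitive_items, mct, transactions):
    """Increment each sensitive rule's counter until Mconfidence < mct."""
    counters = {rule: 0 for rule in rules}
    for item in sensitive_items:
        for lhs, rhs in rules:
            if item not in lhs:
                continue  # only rules with the sensitive item on the LHS
            while mconfidence(lhs, rhs, counters[(lhs, rhs)], transactions) >= mct:
                counters[(lhs, rhs)] += 1
    return counters


# The four rules found with MST = 60% and MCT = 70%.
rules = [("A", "B"), ("B", "A"), ("A", "D"), ("D", "A")]
counters = hide_rules(rules, sensitive_items=["B", "D"], mct=0.70, transactions=TABLE_2)

for (lhs, rhs), h in counters.items():
    ms = msupport(lhs, rhs, h, TABLE_2)
    mc = mconfidence(lhs, rhs, h, TABLE_2)
    print(f"{lhs}->{rhs}: Msupport={ms:.0%}, Mconfidence={mc:.0%}, counter={h}")
```

The loop terminates for any positive MCT because each increment strictly lowers Mconfidence. The counters of A → B and A → D stay at 0, which is the sense in which the example reports no side effect on the non-sensitive rules.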

V. ANALYSIS AND CONCLUSION

As our example shows, our approach is better in that it hides rules which cannot be hidden by some of the previous works. We see in the example that the proposed method hides the given association rules (with sensitive items on the left hand side of the rule) without any side effect. Our algorithm is also simpler in the sense that we have to do only one step of modification: we only increment the hiding counter each time (to decrease the confidence of the sensitive rule), rather than checking all transactions again and again and ordering them in increasing or decreasing order, as we had to do in some of the previous works (which work on the basis of reducing the support and confidence of the sensitive association rules).

ACKNOWLEDGMENT

We express our profound gratitude to all faculty members of ABV-IIITM, Gwalior, for their readiness at all times to help us; the value of their critical suggestions, discussions and guidance in bringing this work to its logical conclusion cannot be expressed in words. Again with a profound sense of gratitude, we record our indebtedness to all our colleagues. The nurturing and blossoming of the present work was mainly due to their valuable guidance, astute judgment, constructive criticism and eye for perfection. Without their overwhelming interest, the present work would not have seen the light of day.

REFERENCES

[1] Shyue-Liang Wang, Yu-Huei Lee, Steven Billis, Ayat Jafari, "Hiding Sensitive Items in Privacy Preserving Association Rule Mining", 2004 IEEE International Conference on Systems, Man and Cybernetics.

[2] Vassilios S. Verykios, Ahmed K. Elmagarmid, Elisa Bertino, Yucel Saygin and Elena Dasseni, "Association Rule Hiding", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 4, April 2004.

[3] Yucel Saygin, Vassilios S. Verykios, Chris Clifton, "Using Unknowns to Prevent Discovery of Association Rule", SIGMOD Record, Vol. 30, No. 4, December 2001.

[4] Chris Clifton, Don Marks, "Security and Privacy Implications of Data Mining", In Proceedings of the 1996 ACM SIGMOD Workshop on Data Mining and Knowledge Discovery.

[5] R. Agrawal and R. Srikant, "Privacy preserving data mining", In ACM SIGMOD Conference on Management of Data, pages 439-450, Dallas, Texas, May 2000.

[6] Yi-Hung Wu, Chia-Ming Chiang, and Arbee L.P. Chen, "Hiding Sensitive Association Rules with Limited Side Effects", IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 1, January 2007.

[7] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", In Proceedings of ACM SIGMOD International Conference on Management of Data, Washington DC, May 1993.

[8] S. Oliveira, O. Zaiane, "Algorithms for Balancing Privacy and Knowledge Discovery in Association Rule Mining", Proceedings of 7th International Database Engineering and Applications Symposium (IDEAS03), Hong Kong, July 2003.

[9] Wu, Y.H., Chiang, C.M., and Chen, A.L.P. Hiding sensitive association rules with limited side effects. IEEE Transactions on Knowledge and Data Engineering, 2007, 19(1):29-42.

[10] Fienberg, S. and Slavkovic, A. Preserving the confidentiality of categorical statistical data bases when releasing information for association rules. Data Mining and Knowledge Discovery, 11(2):155-180, 2005.

[11] Vassilios S. Verykios, Elisa Bertino, Igor Nai Fovino, Loredana Parasiliti Provenza, Yucel Saygin, Yannis Theodoridis, "State-of-the-art in Privacy Preserving Data Mining", SIGMOD Record, Vol. 33, No. 1, March 2004, Pages: 50-57.

[12] Text Book - Data Mining Concepts & Techniques - Jiawei Han, Micheline Kamber - Morgan Kaufmann Publishers.
