International Journal of Applied Engineering Research, ISSN 0973-4562 Vol. 10 No.76 (2015) © Research India Publications; http://www.ripublication.com/ijaer.htm
An Access Control Framework for Micro Data Anonymization under Privacy and Accuracy Constraints

J. Sudha, Department of Computer Science and Engineering, A.V.C College of Engineering, Mayiladuthurai, Tamil Nadu, India. [email protected]

K. Yasmin Begum, PG Student, Department of Computer Science and Engineering, A.V.C College of Engineering, Mayiladuthurai, Tamil Nadu, India. [email protected]
Abstract - Access control mechanisms protect sensitive information from unauthorized users. Sensitive information, however, even after the removal of identifying attributes, is still susceptible to linking attacks by authorized users. This problem has been studied extensively in the area of micro data publishing, and privacy definitions such as k-anonymity, l-diversity, and variance diversity have been proposed. Anonymization algorithms use suppression and generalization of records to satisfy privacy requirements with minimal distortion of the micro data. In this work, a privacy-preserving mechanism is combined with an access control framework, and permissions are granted to users through role-based access control. First, the accuracy and privacy constraints are formulated as the problem of k-anonymous Partitioning with Imprecision Bounds (k-PIB) and hardness results are given. Second, the concept of accuracy-constrained privacy-preserving access control for relational data is introduced. Third, heuristics are proposed to approximate the solution of the k-PIB problem, and an empirical evaluation is conducted. The heuristics proposed for accuracy-constrained privacy-preserving access control are also relevant in the context of workload-aware anonymization.

Keywords: Access control mechanisms, imprecision bounds, k-anonymity, micro data, privacy preserving.
I. INTRODUCTION

Organizations collect and analyse consumer data to improve their services. Access Control Mechanisms (ACM) [1], [9] are used to ensure that only authorized information is available to users. However, sensitive information can still be misused by authorized users to compromise the privacy of consumers. The concept of privacy preservation [3] for sensitive data can require the enforcement of privacy policies or protection against identity disclosure by satisfying some privacy requirement. Anonymity techniques [7] can be combined with an access control mechanism to ensure both the security and the privacy of sensitive information. Privacy is achieved at the cost of accuracy, and imprecision [4] is introduced into the authorized information under an access control policy. Data privacy is an important area for future research. One of the defining principles of data privacy, limited disclosure, is based on the premise that data subjects have control over who is allowed to see their personal information and for what purpose. Increasingly, organizations want the ability to define a privacy policy [1] that describes such agreements with data subjects. The privacy protection mechanism ensures that the privacy and accuracy goals are met before the sensitive data is made available to the access control mechanism. The permissions in the access control policy are based on selection predicates on the quasi-identifier (QI) attributes. The policy administrator defines the permissions along with the imprecision bound for each permission/query, the user-to-role assignments [2], and the role-to-permission assignments. K-anonymity has been proposed to reduce the risk of such linking attacks. The primary goal of k-anonymization is to protect the privacy of the individuals to whom the data pertains. Numerous recoding models have been proposed in the literature for k-anonymization [6], and the "quality" of the published data is often dictated by the model that is used. The main contributions are a new multidimensional recoding model and a greedy algorithm for k-anonymization, an approach with several important advantages.

II. PROBLEM DEFINITION

Given a relation T = {A1, A2, ..., An}, where Ai is an attribute, T* denotes the anonymized version of the relation T. It is assumed that T is a static relational table. The attributes can be of the following types:

Identifier: Attributes, e.g., ID and social security number, that uniquely identify an individual. These attributes are completely removed from the anonymized relation.

Quasi-identifier: Attributes, e.g., gender, zip code, and birth date, that can potentially identify an individual based on other information available to an adversary. Quasi-identifier attributes are generalized to satisfy the anonymity requirements.

Sensitive: Attributes, e.g., disease or salary, that, if associated with a specific individual, cause a privacy breach.
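To make this attribute typology concrete, the following minimal Python sketch (illustrative only; the schema and attribute names are assumptions, not part of the original formulation) shows how a single record of T would be treated: identifiers are dropped, quasi-identifiers are retained for later generalization, and sensitive attributes are kept as they are.

```python
# Minimal sketch of the attribute treatment described above.
# The schema and the concrete attribute names are illustrative assumptions.

IDENTIFIERS = {"name", "ssn"}                 # removed entirely
QUASI_IDENTIFIERS = {"gender", "zip", "age"}  # generalized later
SENSITIVE = {"disease"}                       # kept as-is

def prepare_record(record: dict) -> dict:
    """Drop identifier attributes of one tuple; keep QI and sensitive attributes."""
    return {attr: value for attr, value in record.items()
            if attr not in IDENTIFIERS}

raw = {"name": "Alice", "ssn": "123-45-6789",
       "gender": "F", "zip": "14850", "age": 34, "disease": "Flu"}
print(prepare_record(raw))
# {'gender': 'F', 'zip': '14850', 'age': 34, 'disease': 'Flu'}
```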
A. K-Anonymity

K-anonymity is a property possessed by certain anonymized data. The concept of k-anonymity [7] was first formulated by Latanya Sweeney as an attempt to solve the following problem: given person-specific, field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful. A release of data is said to have the k-anonymity property [5] if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release. Two operations are used to achieve this property:

Suppression: Certain values of the attributes are replaced by an asterisk '*'. All or some values of a column may be replaced by '*'.

Generalization: Individual values of attributes are replaced by a broader category. For example, the value '19' of the attribute 'Age' may be replaced by 'Age ≤ 20', and the value '23' by '20 < Age ≤ 30'.
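As a concrete illustration of these two operations, the short Python sketch below generalizes an age into the interval notation used above and suppresses the trailing digit of a zip code. The exact bucket boundaries and the one-digit zip suppression are assumptions made for the example.

```python
def generalize_age(age: int) -> str:
    """Map an exact age to a coarser interval, e.g., 23 -> '20 < Age <= 30'."""
    if age <= 20:
        return "Age <= 20"
    lower = (age - 1) // 10 * 10          # e.g., 23 -> 20
    return f"{lower} < Age <= {lower + 10}"

def suppress_zip(zip_code: str, keep: int = 4) -> str:
    """Replace the suppressed digits of a zip code with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

print(generalize_age(19))      # Age <= 20
print(generalize_age(23))      # 20 < Age <= 30
print(suppress_zip("14853"))   # 1485*
```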
B. Attacks on K-Anonymity

The homogeneity attack exploits the case where all the values of the sensitive attribute within a set of k records are identical. In such cases, even though the data has been k-anonymized, the sensitive value for that set of k records can be predicted exactly. The background knowledge attack exploits an association between one or more quasi-identifier attributes [8] and the sensitive attribute to reduce the set of possible values for the sensitive attribute.

C. L-Diversity

L-diversity [8] is a form of group-based anonymization that preserves privacy in data sets by reducing the granularity of the data representation. This reduction is a trade-off that results in some loss of effectiveness of data management or mining algorithms in order to gain privacy. The l-diversity model is an extension of the k-anonymity model, which reduces the granularity of data representation using techniques such as generalization and suppression [6] so that any given record maps onto at least k-1 other records in the data. The l-diversity model addresses a weakness of the k-anonymity model [8]: protecting identities to the level of k individuals is not equivalent to protecting the corresponding sensitive values that were generalized or suppressed, especially when the sensitive values within a group exhibit homogeneity. The l-diversity [7] model therefore adds the promotion of intra-group diversity for sensitive values to the anonymization mechanism.
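Both definitions can be checked mechanically on a released table. The Python sketch below (a minimal illustration; the toy table and the use of distinct l-diversity are assumptions) groups records by their quasi-identifier values and verifies that every group has at least k members and at least l distinct sensitive values.

```python
from collections import defaultdict

def satisfies_k_anonymity_and_l_diversity(rows, qi_attrs, sensitive_attr, k, l):
    """True if every QI group has >= k rows and >= l distinct sensitive values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in qi_attrs)   # equivalence class key
        groups[key].append(row[sensitive_attr])
    return all(len(vals) >= k and len(set(vals)) >= l
               for vals in groups.values())

table = [
    {"zip": "1485*", "age": "3*", "disease": "Flu"},
    {"zip": "1485*", "age": "3*", "disease": "Cancer"},
    {"zip": "1485*", "age": "3*", "disease": "Flu"},
    {"zip": "1485*", "age": "3*", "disease": "Hepatitis"},
]
print(satisfies_k_anonymity_and_l_diversity(
    table, qi_attrs=("zip", "age"), sensitive_attr="disease", k=4, l=3))  # True
```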
III. PRIVACY-PRESERVING ACCESS CONTROL FRAMEWORK

A privacy-preserving access control framework is shown in Fig. 1. The privacy protection mechanism ensures that the privacy and accuracy goals are met before the sensitive data is made available to the access control mechanism. The access control policies define permissions for roles based on selection predicates. The Privacy Protection Mechanism (PPM) [3] uses suppression and generalization to anonymize the data and satisfy the privacy requirements. The attainment of the privacy goals comes at the cost of the precision of the data available to the authorized users. The access control mechanism therefore specifies the level of imprecision that can be tolerated for each permission. This specification of the imprecision bound ensures that the authorized information has the desired level of accuracy. The imprecision bound information is not shared with the users, because knowing the imprecision bound [3] can result in violating the privacy requirement. The privacy protection mechanism is then required to meet the privacy requirement along with the imprecision bound for each permission.

Fig. 1. Access control framework

First, the accuracy and privacy constraints are formulated as the problem of k-anonymous Partitioning with Imprecision Bounds (k-PIB) [7] and hardness results are given. Second, the concept of accuracy-constrained privacy-preserving access control for relational data is introduced. Third, the proposed heuristics are used to approximate the solution of the k-PIB problem, and an empirical evaluation is conducted.

The focus is on relaxed enforcement [3]: if the imprecision bound is violated for a permission, that permission is not assigned to any role. However, the proposed anonymization methods are also valid for strict enforcement, because the proposed heuristics reduce the overlap between partitions and queries.

Three algorithms based on greedy heuristics are proposed. All three algorithms are based on kd-tree [4] construction. Starting with the whole tuple space, the nodes of the kd-tree are recursively divided until the partition size is between k and 2k. The leaf nodes of the kd-tree are the output partitions, which are mapped to equivalence classes in the given table.
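The role of the imprecision bound can be made concrete with a small sketch. Assuming one-dimensional partitions represented as intervals and a permission represented as a query interval (simplifications made only for this example; the paper works with multi-dimensional ranges), the imprecision of a permission is the number of tuples in overlapping partitions minus the number of tuples that actually satisfy the predicate; under relaxed enforcement, a permission is dropped when this exceeds its bound.

```python
# Minimal 1-D sketch of query imprecision and relaxed enforcement.
# A partition is an interval with its tuples; a permission is (interval, bound).
# Intervals are closed on the left and open on the right.

def overlaps(p, q):
    return p[0] < q[1] and q[0] < p[1]

def imprecision(partitions, tuples, query):
    lo, hi = query
    # Tuples returned: every tuple of every partition that overlaps the query.
    returned = sum(len(ts) for box, ts in zip(partitions, tuples)
                   if overlaps(box, (lo, hi)))
    # Tuples that actually satisfy the predicate.
    exact = sum(1 for ts in tuples for v in ts if lo <= v < hi)
    return returned - exact

def relaxed_enforcement(partitions, tuples, permissions):
    """Keep only the permissions whose imprecision stays within their bound."""
    return [(query, bound) for query, bound in permissions
            if imprecision(partitions, tuples, query) <= bound]

partitions = [(0, 50), (50, 100)]                 # age intervals of the partitions
tuples = [[5, 20, 30, 45], [55, 60, 80, 95]]      # ages per partition
permissions = [((0, 40), 2), ((0, 60), 1)]        # (predicate interval, bound)
print(relaxed_enforcement(partitions, tuples, permissions))  # [((0, 40), 2)]
```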
IV. PARTITIONING APPROACHES

A. Top-Down Heuristics

In the TDSM algorithm, the partitions [4] are split along the median. Consider a partition that overlaps a query. If the median also falls inside the query, then even after splitting the partition the imprecision for that query does not change, because both of the new partitions still overlap the query.

B. Median Cut

Median cut [4] is an algorithm that sorts data of an arbitrary number of dimensions into a series of sets by recursively cutting each set at the median point along its longest dimension. Median cut is typically used for colour quantization; for example, to reduce a 64k-colour image to 256 colours, median cut finds 256 colours that match the original data well. When the tuples are partitioned according to the median cut, however, even after dividing the tuple space into four partitions there may be no reduction in imprecision for a query.

C. Query Cut

A query cut is defined as the splitting of a partition along the query interval values. For a query cut using query Qi, both the start of the query interval (aQij) and the end of the query interval (bQij) are considered for splitting a partition along the jth dimension. With a query cut, the imprecision is reduced to zero, because the resulting partitions [5] are either non-overlapping with or fully enclosed inside the query region. The query cut can also be used to split partitions using bottom-up (R+-tree) [6] techniques.

It is proposed to split the partition along the query cut and then to choose the dimension along which the imprecision is minimum over all queries. If multiple queries overlap a partition, then the query to be used for the cut needs to be selected. The queries having imprecision greater than zero for the partition are sorted based on their imprecision bounds, and the query with the smallest bound is selected. The resulting top-down heuristic is summarized below.

Input: the table T, the anonymity parameter k, the query set Q, and the bounds BQj
Output: the set of partitions P
Initialize the set of candidate partitions CP with the whole tuple space of T
for each CPi in CP do
    Find the set QO of queries that overlap CPi and have imprecision greater than zero for CPi
    Select the query in QO with the smallest bound BQj
    Create query cuts in each dimension
    Reject cuts that produce skewed partitions
    Select the dimension and cut having the least overall imprecision for all queries in Q
    if a feasible cut is found then
        create the new partitions and add them to CP
    else
        split CPi recursively along the median until the anonymity requirement is satisfied,
        compact the new partitions, and add them to P
    Update BQj according to the imprecision of CPi for each query Qj in Q
return P

In the Top-Down Heuristic 2 algorithm (TDH2, for short) [9], the query bounds are updated as the partitions are added to the output. This update is carried out by subtracting, for each partition Pi that is added to the output, the imprecision it contributes from the imprecision bound BQj of each query. During the query bound update, if the imprecision bound of any query is violated, then that query is put on low priority by replacing its query bound with the query size.

The time complexity of the TDH2 algorithm is O(|T|²) [9], which is not scalable for large data sets (greater than 10 million tuples). The Top-Down Heuristic 3 algorithm (TDH3, for short) modifies TDH2 so that a time complexity of O(|T| lg |T|) can be achieved at the cost of reduced precision in the query results. If a query cut results in one partition having a size greater than one hundred times that of the other, then that cut is ignored. If a feasible query cut is not found, then the partition is split along the median.
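The following Python sketch illustrates, for a single one-dimensional partition, the core step shared by the heuristics above: among the overlapping queries with non-zero imprecision, the query with the smallest remaining bound proposes candidate cuts at its interval endpoints, the cut with the least total imprecision is chosen, and a median split is used as a fallback. The one-dimensional data layout and the simple feasibility test are assumptions made for illustration; the paper works with multi-dimensional partitions.

```python
# 1-D sketch of the query-cut selection step of the top-down heuristic.
# A partition is a list of attribute values; a query is (lo, hi, bound).

def imprecision(parts, interval):
    """Tuples of overlapping partitions minus tuples actually matching the interval."""
    lo, hi = interval
    returned = sum(len(p) for p in parts if p and p[0] < hi and lo <= p[-1])
    exact = sum(1 for p in parts for v in p if lo <= v < hi)
    return returned - exact

def split_partition(part, queries, k):
    """Split using the best feasible query cut, or fall back to a median cut."""
    part = sorted(part)
    # Queries with non-zero imprecision for this partition, smallest bound first.
    candidates = sorted((q for q in queries if imprecision([part], q[:2]) > 0),
                        key=lambda q: q[2])
    best = None
    for lo, hi, _ in candidates:
        for cut in (lo, hi):                        # cuts at the query endpoints
            left = [v for v in part if v < cut]
            right = [v for v in part if v >= cut]
            if len(left) < k or len(right) < k:     # reject infeasible or skewed cuts
                continue
            total = sum(imprecision([left, right], q[:2]) for q in queries)
            if best is None or total < best[0]:
                best = (total, left, right)
        if best:                                    # only the highest-priority query is used
            break
    if best:
        return best[1], best[2]
    mid = len(part) // 2                            # fall back to a median cut
    return part[:mid], part[mid:]

partition = [18, 22, 25, 28, 31, 35, 41, 47]
queries = [(20, 30, 2), (40, 60, 1)]                # (lo, hi, imprecision bound)
print(split_partition(partition, queries, k=2))
```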
V. ANONYMIZED TABLE EVALUATION

A. Micro Data Creation

Micro data is a valuable source of information for the allocation of public funds, medical research, and trend analysis. However, if individuals can be uniquely identified in
the micro data, then their private information [1] (such as their medical condition) would be disclosed, and this is unacceptable. To avoid the identification of records in micro data, uniquely identifying information such as IDs and social security numbers is removed from the table. In this module, patient micro data is collected from a hospital or from the internet and imported into the application. The data is arranged into identifiers, quasi-identifiers, and sensitive information; at this stage it is a normal table, e.g., a voter table or a patient table.

Fig. 2. Micro data creation

B. Anonymization Creation

The attributes are divided into two groups: the sensitive attributes (consisting only of the medical condition) and the non-sensitive attributes (zip code, age, and nationality). The ID attribute is removed in the anonymized table and is shown only for the identification of tuples. An equivalence class is a set of tuples having the same QI attribute values. A 4-anonymous [7] table is derived from the original table (here "*" denotes a suppressed value, so, for example, "zip code = 1485*" means that the zip code is in the range [14850-14859], and "age = 3*" means that the age is in the range [30-39]). Note that in the 4-anonymous table, each tuple has the same values for the quasi-identifier as at least three other tuples in the table.

Fig. 3. Anonymization creation

C. Data Partition

The whole tuple space is taken as one partition, and the partitions are then recursively divided until the new partitions meet the privacy requirement. To divide a partition, two decisions need to be made: i) choosing a split value along each dimension, and ii) choosing a dimension along which to split. In the TDSM algorithm, the split value is chosen along the median, and the dimension is then selected along which the sum of imprecision over all queries is minimum. The complexity expression is derived by multiplying the height of the tree by the work done at each level: the median cut [4] generates a balanced tree of height lg |T|, and the work done at each level is proportional to the number of tuples. The partitions created by TDSM have dimensions along the median of the parent partition. The partitioning can also be based on both the median cut and the query cut; in the proposed heuristics, query intervals are used to split the partitions, and such splits are defined as query cuts.

Fig. 4. Data partition
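A minimal sketch of the TDSM split decision described above is given below: for each dimension the partition is cut at the median, the total imprecision over all queries is evaluated, and the dimension with the smallest total is chosen. The two-dimensional toy data and the rectangular query boxes are assumptions made for the example.

```python
# Sketch of the TDSM split decision: median split value, with the split
# dimension chosen to minimize the summed query imprecision.
# Tuples are (age, zip) pairs; queries are boxes ((alo, ahi), (zlo, zhi)).

def bbox_overlaps(part, query):
    """Does the bounding box of the partition intersect the query box?"""
    for dim, (lo, hi) in enumerate(query):
        vals = [t[dim] for t in part]
        if max(vals) < lo or min(vals) >= hi:
            return False
    return True

def imprecision(parts, query):
    """Tuples of overlapping partitions minus tuples actually inside the query box."""
    total = 0
    for part in parts:
        if part and bbox_overlaps(part, query):
            inside = sum(1 for t in part
                         if all(lo <= t[d] < hi for d, (lo, hi) in enumerate(query)))
            total += len(part) - inside
    return total

def tdsm_split(part, queries):
    """Split at the median of the dimension with the least total imprecision."""
    best = None
    for dim in range(len(part[0])):
        order = sorted(part, key=lambda t: t[dim])
        mid = len(order) // 2
        left, right = order[:mid], order[mid:]
        total = sum(imprecision([left, right], q) for q in queries)
        if best is None or total < best[0]:
            best = (total, left, right)
    return best[1], best[2]

tuples = [(23, 14850), (27, 14853), (34, 14821), (38, 14808),
          (45, 14901), (52, 14930), (58, 14877), (61, 14866)]
queries = [((20, 40), (14800, 14860)), ((50, 70), (14850, 14950))]
left, right = tdsm_split(tuples, queries)
print(left, right, sep="\n")
```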
D. Data Access Control Creation

The surveillance data is anonymized and shared with the departments of health of each county. An access control policy is defined in this module that allows the roles to access the tuples under their authorized predicates; e.g., role CE1 can access tuples under permission P1. The epidemiologists at the state and county level suggest community containment measures, e.g., isolation or quarantine, according to the number of persons infected in case of a flu outbreak.

Fig. 5. Data access control creation
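The role-to-permission structure of this module can be sketched as follows. Role CE1 and permission P1 come from the text above; the county attribute, the predicate representation, and the table contents are hypothetical and are used only to show how a role is checked against its assigned permissions before a predicate is evaluated over the anonymized tuples.

```python
# Minimal sketch of role-based access to anonymized tuples.
# Role CE1 and permission P1 come from the text; everything else is illustrative.

PERMISSIONS = {
    # permission -> selection predicate over anonymized tuples
    "P1": lambda t: t["county"] == "CountyA" and t["disease"] == "Flu",
}
ROLE_PERMISSIONS = {
    "CE1": {"P1"},                     # county epidemiologist role
}

def query(role, permission, table):
    """Return the tuples a role may see under one of its assigned permissions."""
    if permission not in ROLE_PERMISSIONS.get(role, set()):
        raise PermissionError(f"role {role} is not assigned {permission}")
    predicate = PERMISSIONS[permission]
    return [t for t in table if predicate(t)]

anonymized = [
    {"county": "CountyA", "age": "3*", "zip": "1485*", "disease": "Flu"},
    {"county": "CountyB", "age": "4*", "zip": "1490*", "disease": "Flu"},
]
print(query("CE1", "P1", anonymized))  # only the CountyA tuple is returned
```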
E. Data Publishing

After the table has been successfully converted to an anonymous table and the role-based access mechanism has been assigned, the internet application publishes the anonymous table to other research centres around the world. These data are fully secured before being published to the internet and are accessed mainly by research institutes. A partition chart can be created from the anonymized table, and a precision level can be set for each partition. For query evaluation, the data of one partition does not overlap with the other partitions.

Fig. 6. Data publishing

F. Data Accessing

The research centres access the anonymous data through access keys or their SSN numbers over the government server. These users have already been assigned roles in the data access control module. For the registration of a new user, a particular access code is sent to the individual's email. The query result is available only in encrypted form, and the user can decrypt it only by using the particular access code. The access code is changed dynamically.

Fig. 7. Data accessing

VI. RESULTS AND DISCUSSION

Experiments have been carried out on patient data sets for the evaluation of the proposed heuristic algorithms. The partitioning can be based on both the query cut and the median cut algorithms. The anonymized table is published for research purposes. Fig. 8 shows the normal table: its attributes comprise identifier, quasi-identifier, and sensitive attributes. The identifiers are first removed, and the generalization and suppression techniques are then applied to partition the given data set; the resulting anonymized table is shown in Fig. 9.

Fig. 8. Normal table

Fig. 9. Anonymized table
After a new user completes registration, the corresponding access key and decryption key are sent to the given email address. Only with that key value can the particular user obtain results from the corresponding partition of the anonymized table. The decryption key is dynamic in nature: for each search a different value is generated. Figs. 10-13 show the partition graphs obtained from the anonymized table.

Fig. 10. Partition 1 graph

Fig. 11. Partition 2 graph

Fig. 12. Partition 3 graph

Fig. 13. Partition 4 graph

VII. CONCLUSION

An accuracy-constrained privacy-preserving access control framework for relational data has been proposed. The framework is a combination of access control and privacy protection mechanisms. The access control mechanism allows only authorized query predicates on sensitive data, while the privacy-preserving module anonymizes the data to meet the privacy requirements and the imprecision constraints on the predicates set by the access control mechanism. This interaction is formulated as the problem of k-anonymous Partitioning with Imprecision Bounds (k-PIB). Hardness results are given for the k-PIB problem, and heuristics are presented for partitioning the data so as to satisfy the privacy constraints and the imprecision bounds. As future work, incremental databases and cell-level access control are to be considered.

REFERENCES

[1] Bertino, E., and Sandhu, R. (2005). Database security: concepts, approaches, and challenges. IEEE Transactions on Dependable and Secure Computing, 2(1), 2-19.
[2] Chaudhuri, S., Dutta, T., and Sudarshan, S. (2007, April). Fine grained authorization through predicated grants. In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE 2007) (pp. 1174-1183). IEEE.
[3] Fung, B., Wang, K., Chen, R., and Yu, P. S. (2010). Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (CSUR), 42(4), 14.
[4] Iwuchukwu, T., and Naughton, J. F. (2007). K-anonymization as spatial indexing: Toward scalable and incremental anonymization. In Proceedings of the 33rd International Conference on Very Large Data Bases (pp. 746-757). VLDB Endowment.
[5] LeFevre, K., Agrawal, R., Ercegovac, V., Ramakrishnan, R., Xu, Y., and DeWitt, D. (2004, August). Limiting disclosure in Hippocratic databases. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30 (pp. 108-119). VLDB Endowment.
[6] LeFevre, K., DeWitt, D. J., and Ramakrishnan, R. (2006, April). Mondrian multidimensional k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06) (pp. 25-25). IEEE.
[7] Li, N., Qardaji, W. H., and Su, D. (2011). Provably private data anonymization: Or, k-anonymity meets differential privacy. CoRR, abs/1101.2604.
[8] Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 3.
[9] Pervaiz, Z., Aref, W. G., Ghafoor, A., and Prabhu, N. (2014). Accuracy-constrained privacy-preserving access control mechanism for relational data. IEEE Transactions on Knowledge and Data Engineering, 26(4), 795-807.