Multi-Dimensional Sensitivity-Based Anonymization Method for Big Data

M. Al-Zobbi, S. Shahrestani, C. Ruan
School of Computing, Engineering & Mathematics, Western Sydney University, NSW, Australia

Abstract

Data analytics and its utilization in big data environments have witnessed rapid growth in the past few years. Several undesirable side effects have appeared, related to data disclosure and privacy violation risks. This trend calls for privacy methods that can scale up to cope with big data growth. Data anonymization is one of the pioneering privacy solutions that can minimize such risks. However, current anonymization solutions suffer from poor performance and a high loss of gained information in big data environments. In this paper, we propose a novel privacy method, named Multi-Dimensional Sensitivity-Based Anonymization. The method resolves the performance and anonymization-loss concerns, and provides multi-level access control. Many privacy methods have been proposed to anonymize data before exposing sensitive information on the cloud, but contemporary anonymization methods do not take big data processes into consideration. In this paper, we compare our proposed method with one of the recently proposed methods, known as Multi-Dimensional Top-Down Specialization. The comparison shows the limitations of the top-down specialization method and the contamination it causes to the big data structure; in contrast, our proposed method adapts to the parallel distributed structure during the anonymization operation. Our method provides a gradual access framework for analysts who wish to participate in data analytics in big data. The framework is integrated with Role-Based Access Control, which maps the authorization roles between service providers and federation services. Sensitivity-Based Anonymization discriminates data refinement by providing multi-scalable levels of user access.
1. Introduction

Big Data implies the utilization of enormous volumes of data in large processing operations. Big Data analytics is where advanced analytic techniques operate on big datasets [1]. Hence, analytics is the main concern in big data, and it may be exploited by data miners to breach privacy [2]. In the past few years, several methods that address data leakage concerns have been proposed for conventional data [3-5]. These methods provide remedies for various types of attacks against the data analytics process. The side attack is considered one of the most critical attacks [6]. This attack is prevalent in medical data, where the attacker owns partial information about a patient and aims to uncover the hidden sensitive information by logically linking his/her own data with the attacked data. A side attack can be conducted either by manipulating the query, known as a state attack, or by running malicious code that can transfer other users' output through the network, known as a privacy attack. Moreover, a variety of further attacks can be triggered by an adversary to interrupt the analytics process by mounting malicious code, which may cause infinite-loop operations or may eavesdrop on other users' operations [7]. The most popular anonymity method is k-anonymity [8], which was proposed for conventional data. K-anonymity is a one-dimensional method, which highly disturbs the anonymized data and reduces the information gained after anonymization [9]. To resolve this matter, variant anonymization methods were proposed, such as ℓ-diversity [1] and (X, Y)-Anonymity [2]. These extended methods, however, do not resolve the one-dimensional concern where data is structured in a multi-dimensional pattern. The multi-dimensional LKC-privacy method was proposed to overcome the one-dimensional distortion [10]. Later, a more refined method was proposed, known as Multi-Dimensional Top-Down Anonymization.
This method first generalises data to the top-most level, and then specializes data based on the best score results [11]. The previously mentioned methods can operate efficiently on conventional data. However, big data specifications and operational concepts are different. Big data operates in a parallel distributed environment, known as MapReduce, where performance and scalability are the major concerns [12]. Researchers have amended the previously mentioned anonymization methods so that they fit the new distributed environment. One of these methods is the Two-Phase Multi-Dimensional Top-Down Specialization method (Two-Phase MDTDS), which splits the large volume of data into small chunks. This technique negatively affects the anonymized information, resulting in increased information loss. Moreover, each chunk of data requires n iterations to find the best score in the specialization rounds [11]. The iterations create n rounds between map and reduce nodes. Map and reduce nodes may be connected through the network on separate machines, which results in an unknown number of iteration times; hence, this rigid solution may cause high delays [13]. Also, the iteration locks both nodes until the end of the process, which disturbs the parallel computing principle. Another big data privacy method alternates between top-down specialization (TDS) and bottom-up generalization (BUG) in a hybrid fashion [14, 15]. The method calculates the value of K, where K is defined as the workload balancing point if it satisfies the condition that the amount of computation required by MapReduce TDS (MRTDS) equals that required by MapReduce BUG (MRBUG). The K value is calculated separately for each group of the data set, so the TDS operation is triggered when the anonymity parameter k > K, while the BUG operation is triggered when k < K. However, the iteration technique of this method is quite similar to that of MDTDS. Also, finding the value of K for each group of records consumes even more time.
In this work, we propose a multi-dimensional anonymization method that supports data privacy in MapReduce environments. Our Multi-Dimensional Sensitivity-Based Anonymization method (MDSBA) is proposed and compared with MDTDS. MDSBA provides a bottom-up anonymization method with a distributed technique for MapReduce parallel processes. Our method enables data owners to provide multiple levels of anonymization for multiple access levels of users. The anonymization is based on grouping and splitting data records, so that each group set, or domain, is anonymized individually. Finally, the anonymized domains are merged together into one domain. Our method also embeds the Role-Based Access Control model (RBAC), which enforces security policy on users' access and running processes. RBAC is chosen over Mandatory Access Control (MAC) for its flexible and scalable user roles and access permissions, and for its popularity on the cloud. However, RBAC and the mapping process will be discussed in a separate paper. This paper is structured as follows. The next section introduces some previous k-anonymity methods adapted for use with conventional data. Section 3 describes our proposed anonymization approach, MDSBA. Section 4 delves into the MDSBA grouping and masking processes in detail. The experimental evaluation is described in Section 5. The last section gives our concluding remarks and suggested future work.
2. Related Work

The concept behind privacy methods relies on hiding sensitive information from data miners. The hiding principle implies distortion techniques that promote a trade-off between privacy and beneficial data utility. The specialization technique relies on the Quasi-Identifier (Q-ID), a group of attributes that can identify other tuples in the database [16]. These identifiers may not reveal 100% of the data; however, the risk of predicting some data remains high. For instance, knowing a patient's age, gender, and postcode may lead to uniquely identifying that patient with a probability of 87% [16]. Many algorithms were developed to overcome this security breach. K-anonymity was proposed by Sweeney [17], who suggested data generalization and suppression for quasi-identifiers (Q-ID). K-anonymity guarantees privacy on releasing any record by adhering each record to at least k individuals, and this holds even if the released records are connected to external information. A table is called k-anonymous if, for each record, at least k - 1 other tuples are Q-ID-equivalent; this means the size of each equivalence group on Q-ID is at least k [18]. K-anonymity has gained some popularity, but it has some subtle and severe privacy problems, such as homogeneity and background knowledge attacks [5]. The homogeneity attack leverages the case where all sensitive values in a k-anonymous group are identical. ℓ-Diversity was introduced to resolve this privacy breach in the k-anonymity method [19]. This algorithm aims to reduce attribute linkage. It is developed from the fact that some sensitive attribute values [S] are more frequent than others in a group. The ℓ-diversity is therefore measured using entropy, by grouping records on the Q-ID and then calculating the entropy of each group E using the following formula [5]:

Entropy(E) = − Σ_{s∈S} p(E, s) · log p(E, s) ≥ log ℓ    (1)

where p(E, s) is the fraction of records in group E whose sensitive value equals s, and S denotes the set of sensitive attribute values.
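The entropy test in Equation (1) can be sketched for a single Q-ID equivalence group as follows. This is an illustrative sketch only; the function name and the sample sensitive values are our own and not part of any of the cited methods.

```python
import math
from collections import Counter

def is_entropy_l_diverse(sensitive_values, l):
    """Check entropy l-diversity (Equation 1) for one Q-ID equivalence group.

    The group satisfies entropy l-diversity when
    -sum_s p(E, s) * log p(E, s) >= log(l), where p(E, s) is the fraction
    of records in the group whose sensitive attribute equals s.
    """
    n = len(sensitive_values)
    counts = Counter(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

# A diverse group passes; a homogeneous group has zero entropy and fails.
print(is_entropy_l_diverse(["flu", "hiv", "asthma"], 2))  # True
print(is_entropy_l_diverse(["flu", "flu", "flu"], 2))     # False
```

As the second call shows, a group whose sensitive values are all identical fails for any ℓ > 1, which is exactly the homogeneity case that k-anonymity alone does not prevent.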
The previously mentioned methods anonymize data based on a one-dimensional concept. This concept reduces the information gained after anonymization, and may result in false statistics. Generally, data analytics implements search methods to find a group of records, which can be part of one-dimensional, two-dimensional, or multi-dimensional data [20]. A popular multi-dimensional method, known as the LKC-privacy method, was proposed to overcome the one-dimensional k-anonymity method. LKC-privacy can be applied to multi-dimensional data, such as patients' information. The general intuition of LKC-privacy is to ensure that every combination of Q-ID values with a length of at most L is shared by at least K records in the data object T, and that the confidence of inferring any sensitive value in S within such a group is not greater than C [3]. To enhance the LKC-privacy performance, a more refined method was proposed, known as the Top-Down Specialization (TDS) method [11]. This method reverses the LKC-privacy process by generalizing records to the top-most level, and then specializing the attributes with the highest scores. TDS, LKC-privacy, ℓ-diversity, and all other k-anonymity methods involve a trade-off between information gained and anonymization loss, presented as

Score(v) = InfoGain(v) / (AnonyLoss(v) + 1)    (2)

[22]. This equation alone does not capture the classification quality; therefore, Shannon's information theory is used to measure the information gain [23]. Let T[v] denote the set of records masked to the value v, and let T[c] denote the set of records masked to a child value c in child(v) after refining v. Thus, InfoGain(v) is defined as

InfoGain(v) = I(T[v]) − Σ_{c ∈ child(v)} (|T[c]| / |T[v]|) · I(T[c])

where I(·) denotes the entropy of the class attribute over the given set of records.

3. Multi-Dimensional Sensitivity-Based Anonymization (MDSBA)

3.1 The Sensitivity Factor (ω)

The value of ω can be calculated by finding the minimum and maximum probabilities among the m Q-ID attributes. Hence, the maximum sensitivity factor is defined as the highest probable value among the Q-IDs:

ω_max = max(p(q_1), p(q_2), …, p(q_m))    (3)

And the minimum sensitivity factor is defined as the product of all Q-ID probabilities:

ω_min = ∏_{i=1}^{m} p(q_i)    (4)

Based on equations (3) and (4), the value of ω can be found between ω_min and ω_max, as in equation (5):

ω = ω_min + (k − k̄) · (ω_max − ω_min) / k    (5)

where k denotes the k-anonymity value, and k̄ denotes the user's ownership level.
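Equations (3)-(5) can be combined into a short sketch that derives ω from the Q-ID value probabilities. The function name and the example probabilities below are illustrative assumptions, not values from the paper's data set.

```python
def sensitivity_factor(qid_probs, k, k_bar):
    """Sensitivity factor ω from Equations (3)-(5).

    qid_probs: probability of each of the m Q-ID attribute values,
    k: the k-anonymity value, k_bar: the user's ownership level.
    """
    w_max = max(qid_probs)       # Equation (3): highest Q-ID probability
    w_min = 1.0
    for p in qid_probs:          # Equation (4): product of all probabilities
        w_min *= p
    # Equation (5): interpolate between w_min and w_max by ownership level;
    # a higher ownership level k_bar pushes ω down toward w_min.
    return w_min + (k - k_bar) * (w_max - w_min) / k

# Example: three Q-ID probabilities, k = 5, ownership level 2.
print(round(sensitivity_factor([0.5, 0.2, 0.1], 5, 2), 4))  # 0.304
```

With k̄ = k the factor collapses to ω_min (the least sensitive, most permissive case), while k̄ = 0 would yield ω_max, matching the bounds in equations (3) and (4).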
Equation (6) collates the two terms, the sensitivity factor ω and the aging factor τ, to form the sensitivity equation ψ. An object's sensitivity level degrades with the data age; the aging factor affects the sensitivity inversely, so older objects are less sensitive than newer ones. Hence, two factors determine the sensitivity level, the sensitivity factor ω and the aging factor τ:

ψ = |ω + τ|    (6)

Equation (6) is used to anonymize all grouping domains, as explained in the next section. The masking process tends to find the closest match to the sensitivity value. For instance, if ψ = 0.5, then any value that falls between 0 and 0.5 is accepted, as described in Table 1; however, the closer a probable value is to the sensitivity value, the better.
Table 1. Example of accepted probable values for k = 5

Ownership Level (k̄)    Sensitive Value (ψ)    Accepted sensitive values
1                       0                      0
2                       0.5                    0 - 0.5
3                       0.05                   0 - 0.05
4                       0.005                  0 - 0.005
5                       0.0005                 0 - 0.0005
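The acceptance rule behind Table 1 can be sketched as a simple range check. The helper name and the threshold dictionary are our own illustrative framing; the threshold values are copied from Table 1 for k = 5.

```python
def is_accepted(value, psi):
    """A probable value is accepted when it falls within [0, psi];
    among accepted values, those closer to psi are preferred (Table 1)."""
    return 0.0 <= value <= psi

# Sensitivity thresholds per ownership level for k = 5, as listed in Table 1.
thresholds = {1: 0.0, 2: 0.5, 3: 0.05, 4: 0.005, 5: 0.0005}

print(is_accepted(0.3, thresholds[2]))  # True:  0.3 lies within 0 - 0.5
print(is_accepted(0.3, thresholds[3]))  # False: 0.3 exceeds 0.05
```

The check illustrates how a higher ownership level narrows the accepted range, so users with lower privileges receive more heavily masked data.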
3.2 The Object Aging Sensitivity (τ)

The aging factor is affected by four different factors: the object obsolescence value Ø, the aging participation ρ, the object age y, and the sensitivity factor ω. The obsolescence value is defined as the critical age before the object's sensitivity starts degrading. Thereby, the object aging value is constant if the age y is smaller than the obsolescence value (y < Ø), while it decreases if the age y is greater than or equal to the obsolescence value (y ≥ Ø). Thus, two separate terms express the aging factor, one for y < Ø and one for y ≥ Ø, as described in Equation (7):

τ = ρ · ω,            if y < Ø
τ = ρ · ω · (Ø / y),  if y ≥ Ø    (7)

The aging participation percentage ρ is pre-determined by the data owners.
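Assuming the aging term stays at ρ·ω while the object is younger than the obsolescence value and decays in proportion to Ø/y afterwards (one plausible reading of Equation (7); the exact decay form, the function name, and the example values are our assumptions), a minimal sketch is:

```python
def aging_factor(omega, rho, y, phi):
    """Aging factor τ, sketched from Equation (7).

    omega: sensitivity factor, rho: aging participation percentage,
    y: object age, phi: obsolescence value (Ø). Below phi the term is
    constant; at or beyond phi it decays, so older objects are less sensitive.
    """
    if y < phi:
        return rho * omega           # sensitivity not yet degraded
    return rho * omega * (phi / y)   # degrades as the object ages past phi

# A 4-year-old record with obsolescence value 10 keeps the full aging term;
# a 20-year-old record retains only half of it.
print(aging_factor(0.3, 0.5, 4, 10))   # 0.15
print(aging_factor(0.3, 0.5, 20, 10))  # 0.075
```

Whatever decay form is used, the two-branch structure mirrors the prose above: constant below Ø, monotonically decreasing at or beyond Ø.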
Table 2. Sample Q-ID groups (age, gender) with the sensitive salary attribute and the number of records in each group

Age      Gender    Sensitive Salary    No. of Rec.
30-35    Male      >50K                313
35-40    Male      >50K                422
65-70    Male      >50K                46
70-75    Male      >50K                14
75-80    Male      >50K                12
80-90    Male      >50K                7