Privacy Preserving via Interval Covering Based Subclass Division and Manifold Learning Based Bi-directional Obfuscation for Effort Estimation

Fumin Qi1, Xiao-Yuan Jing1,2,*, Xiaoke Zhu1,3, Fei Wu1,2, Li Cheng1
1State Key Laboratory of Software Engineering, School of Computer, Wuhan University, China
2School of Automation, Nanjing University of Posts and Telecommunications, China
3School of Computer and Information Engineering, Henan University, China
*Corresponding author: [email protected]
ABSTRACT
When a company lacks local data, its engineers can build a model for estimating the effort of a new project by utilizing training data shared by other companies. However, one of the most important obstacles to such data sharing is the privacy concern of software development organizations. In software engineering, most existing privacy-preserving work focuses on defect prediction or on debugging and testing, while the privacy-preserving data sharing problem has not been well studied in effort estimation. In this paper, we aim to provide data owners with an effective approach for privatizing their data before release. We first design an Interval Covering based Subclass Division (ICSD) strategy. ICSD divides the target data into several subclasses by digging a new attribute (i.e., a class label) out of the effort data, and the obtained class label helps maintain the distribution of the target data after obfuscation. We then propose a manifold learning based bi-directional data obfuscation (MLBDO) algorithm, which uses two nearest neighbors, selected from the previous and next subclasses by a manifold learning based nearest neighbor selector, as the disturbances to obfuscate the target sample. We call the entire approach ICSD&MLBDO. Experimental results on seven public effort datasets show that: 1) ICSD&MLBDO can guarantee the privacy and maintain the utility of the obfuscated data; and 2) ICSD&MLBDO achieves better privacy and utility than the compared privacy-preserving methods.

1. INTRODUCTION
Engineers usually try to build a model for estimating/predicting a new project by seeking training samples from other companies when their own company lacks enough local data. However, apart from the limited publicly available datasets, e.g., Promise1, it is very hard to obtain useful data from other companies. The main reason is that the data to be shared usually contains sensitive attributes, whose disclosure may harm the data owners (e.g., by leaking commercial secrets), so the owners are unwilling to share their data. Privacy preserving has therefore become one of the most important research topics in software engineering, and it has attracted much attention from both the academic and industrial communities [1-19]. To avoid privacy disclosure, data owners usually remove the sensitive attributes and obfuscate the data with privacy-preserving methods before publishing their dataset. The commonly used general privacy-preserving methods include generalization and suppression based methods [1-8], clustering-based methods [9-14] and swapping-based methods [15-17, 20]. Recently, a few privacy-preserving data sharing (PPDS) methods have been presented for software engineering [21-26]. Most of these methods focus on applications such as defect prediction or debugging and testing. Specifically, Peters et al. [21-23] studied PPDS for defect prediction, while Clause [24], Taneja et al. [25] and Lo [26] investigated the privacy-preserving problem in debugging and testing.
Privacy preserving is also needed in effort estimation. In practice, many indicators (i.e., attributes) contained in effort data are related to the properties of the product or the image of a team, e.g., function points count (FP), lines of code (LOC) and size (number of function points), and data holders do not want these attributes to be made public. In addition, Peters et al. [21] mentioned that: "In a personal communication, Barry Boehm stated that he was able to publish less than 200 cost estimation records even after 30 years of COCOMO effort". All of this means that privacy threats have hindered people from sharing their effort data. Therefore, it is urgent to investigate how to preserve data privacy in effort data sharing.

CCS Concepts
• Software and its engineering → Empirical software validation. • Security and privacy → Privacy-preserving protocols.

Keywords
Effort estimation; Privacy preserving; Locality preserving projection; Subclass division.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASE'16, September 3–7, 2016, Singapore, Singapore
© 2016 ACM. 978-1-4503-3845-5/16/09...$15.00
DOI: http://dx.doi.org/10.1145/2970276.2970302

1.1 Background
To facilitate the understanding of privacy preserving in effort estimation and of the protected attributes studied in this paper, we offer the following definitions.

1 http://openscience.us/repo/effort/
Let S = {(s_1, y_1), ..., (s_i, y_i), ..., (s_n, y_n)} denote a data set with n samples, where (s_i, y_i) is the i-th sample of S, s_i represents the independent attributes, and y_i represents the dependent attributes. All the attributes in S can be classified into one or more of the following categories [1-17, 28]:

Sensitive attributes (SA): attributes we do not want adversaries to associate with a target. Explicit identifiers: attributes that clearly identify individuals. Quasi-identifiers (QIDs): attributes whose values, when taken together, can potentially identify an individual.

Privacy threat refers to the unwanted disclosure of SA, explicit identifiers and QIDs. Privacy threats can be categorized into three types [23]: 1) sensitive attribute value disclosure, 2) identity disclosure or re-identification, and 3) membership disclosure. Sensitive attribute disclosure occurs when a target is associated with information about its sensitive attributes. Re-identification occurs when an attacker with external information can re-identify an individual from data from which personally identifiable information has been removed. Membership disclosure is another privacy threat that focuses on protecting a person's microdata: adversaries can identify the relationship between a sample and target classes according to the public data. As in [23], we mainly evaluate ICSD&MLBDO against the first privacy threat, i.e., sensitive attribute disclosure, in this paper. The evaluation of ICSD&MLBDO against the other two types of privacy threats will be our future work.

1.2 Motivation
The sensitive attribute disclosure problem is one of the main obstacles to effort data sharing. The QID values are the principal components for building an effort model and are usually contained in the published data. If the data is published without any anonymization processing, adversaries can deduce the values of related sensitive attributes from the QID values and obtain relative advantages in bidding. For example, LOC is closely linked with the QIDs in effort data and can be obtained from the QID values with specific background knowledge; adversaries can then get ranges of hourly productivity by dividing the LOC by the effort. To the best of our knowledge, the problem of sensitive attribute value disclosure in effort data sharing has not been well studied. Although there exist a number of general privacy-preserving methods [1-14], these methods cannot be directly employed to solve the privacy-preserving problem in effort data sharing effectively. Specifically, generalization and suppression based anonymization methods [1-8] use generalization or suppression with some rules to replace QID values. However, the utility of the data obfuscated by these methods declines when most of the values tend to become consistent. Clustering-based methods [9-14] use the mean/center values of some attributes in a cluster to replace the value of the corresponding attribute in a target sample from the same cluster, so these methods are severely influenced by the clustering parameter k. Swapping-based methods [15-17, 20] replace the value of an attribute with other values from the value list of this attribute according to certain percentages and rules. These methods are influenced by the swapping percentages, and thus their performance is unstable.

In recent years, a few privacy-preserving works have been presented in software engineering [21-26]. However, these methods cannot be utilized to tackle the privacy-preserving problem in effort data sharing. Specifically, the methods of [21-23] utilize the nearest unlike neighbors to obfuscate the target data and retain the utility and privacy of the obfuscated data. They are designed to solve the sensitive attribute value disclosure problem for defect data and cannot be directly used for effort data, since effort data differs from defect data (defect data has class labels while effort data does not). The methods of [24-26] are designed for software debugging and testing. They cannot be directly used for effort data either, for the following reason: debugging and testing data are assumed to carry detailed knowledge of the connections between parts of a system, and this assumption does not hold for effort data.

Research in [28-29] shows that the performance of estimators and classifiers is influenced by the data distribution. If the distribution of the original data can be maintained in the obfuscated data, the capability of the estimator or classifier will be kept. Research in [21-23] indicates that the class labels of data help maintain the boundaries of subclasses after obfuscation, and this boundary information is beneficial for preserving the distribution of the data after obfuscation. However, there are no class labels in effort data. Intuitively, there may exist some relationship between samples having similar effort. Following this intuition, we conducted an experiment: we first categorize the samples with similar effort into the same subclass, and then investigate the data distribution of each subclass. We divide the observed samples into three equal subclasses according to the effort values, perform principal component analysis (PCA) [31] on these samples to obtain the principal components, and then use the two major PCA features of the samples to illustrate the distributions. Figure 1 illustrates the distribution of each subclass in the Kitchenham and Coc81 datasets [32-33]. We can see that the samples from the same subclass have roughly similar distributions. This inspires us to design an efficient method to dig class labels out of the effort values (i.e., to divide effort data into several subclasses according to the effort values). Furthermore, the effort value ranges of different subclasses are ordered and different subclasses have clear boundary information, which motivates us to use the subclass order to protect the boundaries of the obfuscated data and maintain the distribution inherited from the original data.
In this paper, we aim to answer the following research questions:
RQ 1: In privacy preserving of effort data, how can we maintain the utility of the privatized data?
RQ 2: How can we effectively preserve the privacy contained in the original effort data?

Figure 1. An example of sub-class division for effort data. The two panels plot the first principal component against the second principal component. Left (Kitchenham): subclass 1 effort range [121-1148], subclass 2 effort range [1160-2350], subclass 3 effort range [2436-14226]. Right (Coc81): subclass 1 effort range [5.90-50], subclass 2 effort range [55-237], subclass 3 effort range [240-11400].
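The experiment behind Figure 1 (divide the samples into three equal-sized subclasses by effort value, then look at the two major PCA features of each subclass) can be sketched as follows. This is a minimal sketch: the function name `pca_subclass_view` and its return convention are our own illustration, not the authors' code.

```python
import numpy as np

def pca_subclass_view(X, efforts, n_sub=3, n_comp=2):
    """Split samples into n_sub equal-sized subclasses by effort value and
    return the first n_comp principal-component coordinates per subclass."""
    order = np.argsort(efforts)                  # sort sample indices by effort
    groups = np.array_split(order, n_sub)        # equal subclasses by effort value
    Xc = X - X.mean(axis=0)                      # center the data before PCA
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:n_comp].T                    # the two major PCA features
    return [proj[g] for g in groups]
```

Plotting each returned group with a different marker reproduces the kind of per-subclass scatter shown in Figure 1.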
1.3 Contribution
The contributions of our study are summarized as the following three points:
1. We are among the first to investigate the problem of privacy preserving in software effort estimation data sharing, and we provide an effective solution for this problem.
2. We design an Interval Covering based Subclass Division (ICSD) strategy for effort data. ICSD digs a new attribute (i.e., a class label) out of the effort data, which helps maintain the distribution of the original data after obfuscation. In addition, for other tasks in which the data has no class labels, ICSD can also be applied to dig class labels out of the data, such that a quantitative problem can be transformed into a qualitative one.
3. We propose a manifold learning based bi-directional obfuscation algorithm (MLBDO) for effort data obfuscation. MLBDO uses two nearest neighbors, selected from the previous and the next subclass of the target sample by a manifold learning based selector, as the disturbances to obfuscate the target sample.

We conduct experiments on the public datasets from the Promise repositories, including NASA93 [34], Maxwell [35], Kitchenham [32], Kemerer [36], Coc81 [33], China [20, 34], and Albrecht [37]. The experimental results demonstrate that our approach, ICSD&MLBDO, is an effective privacy-preserving approach for effort data: it can protect the privacy of the original data and maintain the utility of the obfuscated data simultaneously.

2. RELATED WORK
2.1 General Privacy-Preserving Methods
To solve the privacy-preserving problem, a number of general privacy-preserving methods have been presented [38], including generalization and suppression based methods [1-8], clustering-based methods [9-14] and swapping-based methods [15-17], etc.

Generalization and suppression based methods replace the target value with a less specific but more general value that is faithful to the original value. In the process of generalization, proper suppression strategies are usually incorporated to avoid all of the values being generalized into the maximal element (e.g., 8533 and 9633 being generalized to the same value '****'). For example, k-anonymity [3] requires that each sample be indistinguishable from at least k − 1 other samples with respect to the QIDs. However, k-anonymity cannot ensure privacy if the attacker has background knowledge of the domain [5, 7-8]. l-diversity [7] aims to overcome the disadvantages of k-anonymity by requiring that the distribution of a sensitive attribute in each equivalence class have at least l "well-represented" values. However, research in [6, 8, 22] demonstrates that l-diversity is insufficient to prevent attribute disclosure. t-closeness [8] aims to keep the distance between the distribution of a sensitive attribute in a QID group and that in the whole table no more than a threshold t apart. Yet t-closeness has the following shortcomings: it limits the relationship between quasi-identifiers and sensitive attributes, and it lacks computational procedures that reach its goal with minimum loss of data utility [22, 38-40].

Clustering-based methods first divide the original data into several clusters, where the samples within the same cluster are related and those from different clusters are unrelated, and then replace the QID values of a target sample with the mean/center QID values of its cluster. For instance, Aggarwal et al. [14] partitioned the records into several clusters, each containing at least k data points (i.e., records), and then published the final cluster centers along with some cluster size and radius information. The method in [10] uses the clustering idea to implement k-anonymity and seeks a set of clusters (i.e., equivalence classes), each of which contains at least k records. Clustering-based methods are sensitive to the setting of the parameter k, and an improper selection of k will influence the privacy and utility of the obfuscated data.

Swapping-based anonymization methods select a part of the QID values to replace another part of the QID values; they belong to the permutation approaches, which dissociate the relationship between an insensitive QID and a numerical sensitive attribute [22]. Estivill-Castro and Brankovic [15] proposed a method that randomly swaps the class labels to privatize data. In the Census Bureau's version [42], records are swapped between census blocks for individuals or households that have been matched on a predetermined set of k variables. These methods are affected by the percentage of swapping, and their performance is unstable.

2.2 Privacy-Preserving Methods in Software Engineering
In recent years, a few privacy-preserving methods have been presented in software engineering [21-26]. These works mainly focus on defect prediction, testing and debugging. Representative privacy-preserving methods for defect prediction include LACE1 [21-22] and LACE2 with LeaF [23]. LACE1 is a collective name for two methods, MORPH [21] and CLIFF&MORPH [22]. MORPH uses the nearest unlike neighbors to obfuscate the target sample and achieves promising results. Considering that uninformative samples may affect the privacy result and the processing speed, CLIFF&MORPH was presented, which employs an instance pruner to delete uninformative samples. Recently, LACE2 [23] was presented to tackle the privacy-preserving data sharing problem in which data owners incrementally share their data into the same data pool.

Due to limited resources, companies usually outsource part of their software to other companies. When the outsourced software enters the testing phase, the subcontractors need the relevant data to test the software. However, the data to be provided to subcontractors may contain sensitive attribute values, which makes the data owners reluctant to provide real samples for testing. To solve this issue, Taneja et al. [25] proposed the PRIvacy Equalizer for Software Testing (PRIEST), which combines a new data privacy framework with program analysis, enabling business analysts to determine the output testing data. Budi et al. [26] proposed the kb-anonymity method, which creates testing coverage data based on the concept of "subpath equivalence". Clause and Orso [24] designed the Camouflage method, which introduces path condition relaxation and breakable input conditions to generate several anonymized versions of an original failure-inducing input sample.

2.3 Manifold Learning
Manifold learning is an efficient dimension reduction technique, which can reduce the dimension while preserving the non-linear structure of the data [43]. Manifold learning has been applied to many practical problems, such as human action recognition [45], voice recognition [46] and face recognition [47].

Locality Preserving Projections (LPP) [48] is a representative manifold learning method. LPP first uses the information of the data points to establish an adjacency graph, then computes a transformation matrix that maps the data points into a subspace, and finally obtains the representation of the data in a lower-dimensional space. In this low-dimensional space, the intrinsic dimensionality of the data can be obtained and the distribution of the original data can be maintained.
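As a concrete illustration of the k-anonymity property described in subsection 2.1 (each sample indistinguishable from at least k − 1 other samples with respect to the QIDs), a minimal membership check can be sketched as follows. The function name and row layout are our own illustration, not part of any cited method.

```python
from collections import Counter

def is_k_anonymous(rows, qid_indices, k):
    """Return True if every combination of QID values occurs in at least k
    rows, i.e., each sample is indistinguishable from >= k - 1 others."""
    groups = Counter(tuple(row[i] for i in qid_indices) for row in rows)
    return all(count >= k for count in groups.values())
```

For example, a table in which one QID combination appears only once fails the check for k = 2, while duplicating that row makes it pass.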
3. OUR APPROACH
3.1 Overview of Our Approach
Privacy and utility are two important aspects that need to be considered when designing a privacy-preserving algorithm. The basic idea of our approach is as follows: we first divide the effort data into several subclasses with the designed ICSD strategy; then, the target samples in each subclass are obfuscated with the proposed MLBDO algorithm. Figure 2 illustrates an overview of our approach to the privacy-preserving problem in effort data sharing. More technical details are given in subsections 3.2 and 3.3.

Figure 2. Illustration of the framework of our privacy-preserving approach. A target sample x_i^j is obfuscated with two nearest unlike neighbors, h_{i-1}^l from the (i-1)-th subclass and h_{i+1}^k from the (i+1)-th subclass: x_i^{j*} = x_i^j + (x_i^j − h_{i-1}^l)·α + (x_i^j − h_{i+1}^k)·β.

3.2 Interval Covering Based Subclass Division Algorithm
Research in [21-23] indicates that the class labels of data help maintain the boundaries of subclasses after obfuscation. Considering that there are no class labels in effort data, it is necessary to dig a new attribute, i.e., a class label, out of the effort data. Figure 1 indicates that samples with similar effort have roughly similar distributions, which motivates us to design a subclass division strategy based on the effort values. In practice, since the final effort of a new project may be affected by uncertain factors, e.g., changes in funds or requirements, estimation error is inevitable in effort estimation. Denoting the estimation error by ε, if the actual value is 500, the estimated value should fall in the interval [500·(1 − ε), 500·(1 + ε)]. It is therefore reasonable to take the estimation error into account when designing the subclass division strategy.

Based on the above analysis, we design the following basic dividing criterion: given two samples whose efforts are y_1 and y_2, respectively, if y_1·(1 − ε) ≤ y_2 ≤ y_1·(1 + ε), these two samples can be classified into the same subclass. Obviously, subclass division based on this criterion is an interval covering problem [49]. Therefore, we formulate our subclass division problem as the following interval covering problem:

Given an effort dataset X = {(x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n)} with n samples, where (x_i, y_i) denotes the i-th sample, x_i represents the independent attributes (QIDs) of the i-th sample, and y_i represents the effort of the i-th sample, the effort labels of X can be denoted by Y = (y_1, y_2, ..., y_n). We use I = {[s_j, f_j] | j = 1, ..., m} to denote m intervals, where s_j ≤ f_j always holds. We then aim to obtain the following output: (1) a number y_i in Y that is not covered by any interval in I, or (2) a minimum-cardinality subset C of the intervals I that collectively covers all points in Y.

The division process of ICSD is as follows:

Step 1: Calculate the tolerance error range of each y_i, i.e., the upper and lower boundaries of y_i, according to Formula (1). The obtained ranges of all samples are represented by YR = {[y_1^l, y_1^u], ..., [y_i^l, y_i^u], ..., [y_n^l, y_n^u]}:

y_i^l = y_i − ε·y_i,  y_i^u = y_i + ε·y_i,  (1)

where y_i^u is the upper bound and y_i^l is the lower bound.

Step 2: Calculate the coverage number of each y_i, i.e., how many ranges y_i is covered by, according to Formula (2). Denote by C = {C_1, ..., C_i, ..., C_n} the coverage numbers of all samples, where C_i is the coverage number of y_i:

C_i = Σ_{j=1}^{n} C_ij,  (2)

where C_ij = 1 if y_i ∈ [y_j^l, y_j^u], and C_ij = 0 otherwise.

Step 3: Label the samples in ascending order of C. For the i-th sample, if it has not yet been categorized into any subclass, we classify all the samples covered by the tolerance error range of y_i into a new subclass, provided these samples have not been labeled. The reason for using the ascending order is that a sample with a higher coverage number is more likely to cover many samples with its own range; labeling samples in descending order of C would therefore generate subclasses suffering from the class imbalance problem [50].

Step 4: For a sample whose tolerance error range covers only itself, we offer two ways of processing: (1) discard it; in our opinion, this kind of sample can be regarded as a noise sample that may impair the effort estimation of a new project; or (2) classify it into the nearest subclass.

Research in [51-56] shows that an error within 25% between the estimated effort and the actual effort is acceptable. In this paper, we therefore set the error tolerance ε = 0.25 (in practice, researchers can modify the tolerance according to their own needs). To help understand ICSD, we provide the following example. We randomly select a part of the samples from the NASA93 dataset [34]. The effort values of the selected samples are shown in Figure 3 (a); the tolerance error used is ε = 0.25.
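Steps 1-4 above can be sketched in Python as follows. This is a minimal sketch: the function name `icsd`, the dict return convention, and the stable ascending sort are our own choices, and singleton handling follows option (1) of Step 4.

```python
import numpy as np

def icsd(efforts, eps=0.25):
    """Interval Covering based Subclass Division (ICSD) sketch.
    Returns {subclass label (an effort value): member sample indices}."""
    y = np.asarray(efforts, dtype=float)
    lo, up = y * (1 - eps), y * (1 + eps)           # Formula (1): tolerance ranges
    # Formula (2): coverage number of each effort value
    cover = np.array([np.sum((lo <= yi) & (yi <= up)) for yi in y])
    labels, assigned = {}, set()
    # Step 3: label samples in ascending order of coverage number
    for i in np.argsort(cover, kind="stable"):
        if i in assigned:
            continue
        members = [j for j in range(len(y))
                   if lo[i] <= y[j] <= up[i] and j not in assigned]
        if len(members) == 1:
            assigned.add(i)                          # Step 4, option (1): discard noise
            continue
        labels[y[i]] = members                       # new subclass labeled by y_i
        assigned.update(members)
    return labels
```

On the NASA93 example of Figure 3, this reproduces the subclasses {72, 72, 90}, {352.8, 444} and {750, 973}, with 24, 48, 2400 and 8211 discarded.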
First, we calculate the tolerance error range of each effort value according to Formula (1); the results are stored in Figure 3 (b). Then we calculate, according to Formula (2), how many ranges each effort value is covered by. For example, the effort 352.8 is covered by the tolerance error ranges of 444 and of itself (i.e., 352.8 ∈ [264.6, 441] and 352.8 ∈ [333, 555]), so its coverage number is 2 (i.e., C_1 = 2). The effort 72 is covered by the tolerance error ranges of 72, 90 and itself, so its coverage number is 3 (i.e., C_2 = 3). Repeating this process for each effort value gives the results reported in Figure 3 (c). Finally, we label each sample in ascending order of the coverage numbers. For example, from Figure 3 (c) we can see that the samples with the lowest coverage number (i.e., C_i = 1) are 24, 48, 444, 2400, 973 and 8211, and from Figure 3 (b) we can see that the tolerance error ranges of 24, 48, 2400 and 8211 cover only themselves, so we discard these samples. The tolerance error range of 444 covers the effort 352.8 and itself (i.e., 352.8 ∈ [333, 555]), so we put the efforts 352.8 and 444 into a new subclass whose class label is set to 444. In a similar way, the efforts 973 and 750 are classified into another subclass with the class label 973. Next, we label the samples with C_i = 2. We first check whether these samples have already been classified into a subclass, find that 352.8 and 750 belong to subclasses 444 and 973, respectively, and skip them. This process is repeated until all samples have been labeled or discarded. The division results of ICSD on the example data are reported in Figure 3 (d).

(a) effort labels of partial samples in the NASA93 dataset:

effort:      352.8  72  72  24  90   48  444  2400  973   750  8211

(b) tolerance error sub-ranges of the effort labels in (a):

upper bound: 441    90  90  30  113  60  555  3000  1216  938  10264
effort:      352.8  72  72  24  90   48  444  2400  973   750  8211
lower bound: 264.6  54  54  18  68   36  333  1800  730   563  6158.3

(c) coverage number of each effort label in (b):

effort:      352.8  72  72  24  90   48  444  2400  973   750  8211
C:           2      3   3   1   3    1   1    1     1     2    1

(d) the results of ICSD on (a):

sub-class label:  72          444          973
effort values:    72, 72, 90  444, 352.8   973, 750

Figure 3. An example of effort data subclass division via ICSD

3.3 Manifold Learning-Based Bi-directional Data Obfuscation Algorithm
By using the division strategy ICSD, the dataset is divided into several sub-classes, with each sample in the dataset being classified into one subclass or discarded. In this subsection, we describe how we protect the privacy of the samples in each subclass while keeping their data utility.

Nearest unlike neighbor (NUN) based obfuscation algorithms have achieved interesting results in software engineering applications [21-23]. However, when these algorithms are used to obfuscate the effort samples in each subclass, they select NUN samples as disturbances from the whole dataset except the subclass to which the target sample belongs, so the selected disturbance sample may be a noise sample (a sample that is close to the target sample but has a significantly different effort value). As a result, the target sample stays close to the noise sample after obfuscation, which is harmful to the utility of the privatized data. To solve this problem, we design a bi-directional obfuscation algorithm. Specifically, it utilizes the order of the subclass labels and selects the NUN samples (disturbance samples) from the previous and next subclasses, such that the influence of noise samples is avoided. The definition of NUN is given in Definition 1. Assume that after ICSD the original dataset is divided into m sub-classes, m > 1. The obfuscation strategy is shown in Formula (3):

x_i^{j*} = x_i^j + (x_i^j − h_{i+1}^k)·β,                             if i = 1
x_i^{j*} = x_i^j + (x_i^j − h_{i−1}^l)·α,                             if i = m    (3)
x_i^{j*} = x_i^j + (x_i^j − h_{i−1}^l)·α + (x_i^j − h_{i+1}^k)·β,     if 1 < i < m

where x_i^j is the j-th sample of the i-th subclass in the original data, x_i^{j*} denotes the privatized version of x_i^j, and h_{i−1}^l and h_{i+1}^k are the two disturbance samples for obfuscating x_i^j, i.e., the nearest neighbor samples of x_i^j from the (i−1)-th and (i+1)-th subclasses of the original data, respectively. α and β are random values that control the obfuscation degree of the target sample; their values range from 0.05 to 0.20.

Definition 1: Nearest unlike neighbor (NUN). Given a dataset S = {S_1, ..., S_i, ..., S_m} with m sub-classes, where S_i denotes the i-th subclass of S and s_i^j ∈ S_i denotes the j-th sample in S_i, we call the sample s_j^l ∈ S_j, i ≠ j, the nearest unlike neighbor (NUN) of s_i^j if s_j^l is the nearest neighbor of s_i^j in S_j.

In the designed obfuscation strategy, NUN samples are used as disturbances; therefore, we should select a proper NUN selector for our obfuscation algorithm. The Euclidean distance based nearest neighbor selector [57] is one of the most popular nearest neighbor selectors in both the academic and industrial communities. However, one weakness of the basic Euclidean distance selector is that if one of the input attributes has a relatively large range, it can overpower the other attributes, so the "true" NUN sample may be missed [58]. Hence, we should filter out the attributes that may overpower the others when selecting the NUN samples.

As described in subsection 2.3, LPP is a representative manifold learning method, and it is similar to some spectral graph theory [59] techniques (e.g., spectral clustering) used in software engineering. LPP can obtain the intrinsic dimensionality of the data and preserve its structure. This property is beneficial for selecting a more precise sample. Therefore, we design a manifold learning based nearest unlike neighbor selector. Specifically, we use LPP as the basic method to map the original data into an "intrinsic dimensionality" space, and then select the NUN samples in the projection space.

Given a dataset X = [x_1, ..., x_i, ..., x_n] that has gone through the ICSD
process, LPP aims to find a transformation matrix A and an embedded representation Z = [z_1, ..., z_i, ..., z_n], where z_i = A^T x_i. The procedure of LPP [48] is formally stated as follows:

1) Constructing the adjacency graph: Let G denote a graph with m nodes. Nodes i and j are connected by an edge if x_i and x_j are "close". "Close" is measured by one of the following two methods:
- ε-neighborhoods (ε ∈ R): x_i and x_j are "close" if ||x_i − x_j||² < ε;
- k-nearest neighbors (k ∈ N): x_i and x_j are "close" if x_i ∈ nearest(x_j, k) or x_j ∈ nearest(x_i, k), where nearest(x*, k) denotes the k nearest neighbors of x*.

2) Choosing the weights: the weight w_ij between x_i and x_j has two forms:
- Heat kernel: w_ij = e^{−||x_i − x_j||²/t}, t ∈ R;
- Simple-minded: w_ij = 1 if and only if vertices i and j are connected by an edge.
The justification for this choice of weights can be traced back to [44].

3) Eigenmaps: Compute the eigenvectors and eigenvalues of the generalized eigenvector problem:

XLX^T a = λ XDX^T a,  (4)

where D is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, D_ii = Σ_j W_ji, and L = D − W is the Laplacian matrix. The i-th column of the matrix X is x_i.

Let the column vectors a_0, ..., a_{l−1} be the solutions of Equation (4), ordered according to their eigenvalues λ_0 ≤ ... ≤ λ_{l−1}. The embedding is then given by x_i → z_i = A^T x_i, A = (a_0, ..., a_{l−1}), where z_i is an l-dimensional vector and A is an n × l matrix.

After obtaining the embedded representation Z, we can obfuscate each sample x_i^j of the i-th sub-class X_i with it. We first get the indexes of the two NUN samples of x_i^j from the (i−1)-th and (i+1)-th subclasses of Z, respectively, and then select the two NUN samples h_{i−1}^l and h_{i+1}^k from S according to these two indexes.

For the privatized data, we first divide the range of possible values of each attribute into n bins/sub-ranges by employing the Equal Frequency Binning (EFB) method [21]; here, n is set to 4. Figure 4 (b) shows the equal-frequency-binned version of Figure 4 (a), and Figure 4 (e) shows the equal-frequency-binned version of Figure 4 (d). Assume that the adversary sends the same query cplx = [1.15-1.3] to Figure 4 (b) and Figure 4 (e), where cplx = [1.15-1.3] means "please return the sensitive attribute values of the samples whose cplx attribute value lies in [1.15-1.3]". Figure 4 (b) returns the sensitive values of samples 10# and 11#, i.e., (66.6-233] and (66.6-233], while Figure 4 (e) returns the sensitive value of sample 2#, i.e., (6-10]. The returned results of Figure 4 (b) and (e) differ, which means that ICSD&MLBDO successfully protects the privacy of the data for the query cplx = [1.15-1.3] with query size 1.
cplx
acap
Pcap
kloc
effort
1.15 1.15 1.15 1.15 1 1 1.15 1.15 0.85 1.3 1.3
1 0.86 0.86 0.86 1 1 1 1 0.86 0.86 0.86
1 0.86 0.7 0.86 0.86 0.86 0.86 0.86 1 1 0.86
66.6 7.5 20 6 15 10 90 302 284.7 101 233
352.8 72 72 24 90 48 444 2400 973 750 8211
(a) randomly selected partial samples from NASA93
#
cplx
acap
pcap
kloc
effort
1 2 3 4 5 6 7 8 9 10 11
(1-1.15] (1-1.15] (1-1.15] (1-1.15] [0.85-1] [0.85-1] (1-1.15] (1-1.15] [0.85-1] (1.15-1.3] (1.15-1.3]
(0.86-1] [0.86-0.86] [0.86-0.86] [0.86-0.86] (0.86-1] (0.86-1] (0.86-1] (0.86-1] [0.86-0.86] [0.86-0.86] [0.86-0.86]
(0.86-1] [0.7-0.86] [0.7-0.86] [0.7-0.86] [0.7-0.86] [0.7-0.86] [0.7-0.86] [0.7-0.86] (0.86-1] (0.86-1] [0.7-0.86]
(10-66.6] [6-10] (10-66.6] [6-10] (10-66.6] [6-10] (66.6-233] (233-302] (233-302] (66.6-233] (66.6-233]
352.8 72 72 24 90 48 444 2400 973 750 8211
(b) EFB version of (a)
j
Finally, we can obfuscate the xi according to Equation (3). We call the bidirectional obfuscation algorithm with the manifold learning based nearest unlike neighbor selector as Manifold learning based Bidirectional Data Obfuscation (MLBDO) algorithm.
3.4 An Example of ICSD&MLBDO In this subsection, we provide a complete example to explain our ICSD&MLBDO approach. We randomly select a part of samples from NASA93 [34], and set the tolerance error as 0.25 . The selected samples are shown in Figure 4 (a). We firstly use ICSD to divide the data into several subclasses (the detail process of subclasses division is shown in Figure 3). The original data is divided into 3 sub-classes and the division result is shown in Figure 4 (c). Next, we use the MLBDO to obfuscate the samples in Figure 4(c), and the result of obfuscation is reported in Figure 4 (d).
#
cplx
2 3 5 7 1 9 10
1.15 1.15 1 1.15 1.15 0.85 1.3
acap
pcap
kloc
effort
0.86 0.86 7.5 0.86 0.7 20 1 0.86 15 1 0.86 90 1 1 66.6 0.86 1 284.7 0.86 1 101 (a) the remained samples of (a) after ICSD
subclasses
72 72 90 444 352.8 973 750
72 444 973
#
cp lx
acap
p cap
Kloc
effort
2 3 5 7 1 9 10
1.16 0.93 1.46 1.58 0.92 1.06 1.15
0.87 0.64 1.35 1.29 0.77 1.07 0.71
0.87 0.51 1.24 1.19 0.80 1.18 0.87
7.5 20.0 15.0 90.0 66.6 284.7 101.0
72 72 90 444 352.8 973 750
(d) the obfuscated data after ICSD&MLBDO
#
cp lx
acap
pcap
kloc
effort
2 3 5 7 1 9 10
(1.15-1.3] [0.85-1] (1.3-*) (1.3-*) [0.85-1] (1-1.15] (1-1.15]
(0.86-1] (-*-0.86] (1-*) (1-*) (-*-0.86] (1-*) (-*-0.86]
(0.86-1] (-*-0.7] (1-*) (1-*) [0.7-0.86] (1-*) (0.86-1]
[6-10] (10-66.6] (10-66.6] (66.6-233] (10-66.6] (233-302] (66.6-233]
72 72 90 444 352.8 973 750
(e) EFB version of (d)
Figure 4. An example of ICSD&MLBDO
Next, we test whether the privacy of the effort data has been protected successfully. We suppose that the KLOC attribute has been removed after obfuscation, and the adversary has the related background knowledge about NASA dataset. The test strategy is as follows: The adversary sends a query to original dataset and privatized dataset respectively, then both datasets will return a group of answers for this query; If the result returned from the original dataset is equal to that returned from the privatized dataset, the protection for this sample for this attack is regarded as a failure, and vice versa. In the testing process, for the original and
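For illustration only, the three LPP steps used by the NUN selector can be sketched in numpy with the k-NN "close" criterion and heat-kernel weights; the small ridge term added to the right-hand-side matrix is our addition for numerical stability and is not part of the original procedure:

```python
import numpy as np

def lpp(X, k=3, t=1.0, l=2):
    """Locality Preserving Projections sketch.
    X: (n_samples, n_features). Returns the (n_features, l) transform A."""
    n = X.shape[0]
    # squared Euclidean distances between all pairs of samples
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):                          # k-NN adjacency graph
        nbrs = np.argsort(D2[i])[1:k + 1]       # skip the sample itself
        W[i, nbrs] = np.exp(-D2[i, nbrs] / t)   # heat-kernel weights
    W = np.maximum(W, W.T)                      # the "or" rule symmetrizes G
    D = np.diag(W.sum(axis=0))                  # degree matrix, D_ii = sum_j W_ji
    L = D - W                                   # graph Laplacian
    Xc = X.T                                    # columns are samples, as in Eq. (4)
    A_mat = Xc @ L @ Xc.T
    B_mat = Xc @ D @ Xc.T + 1e-9 * np.eye(X.shape[1])
    # generalized eigenproblem X L X^T a = lambda X D X^T a
    evals, evecs = np.linalg.eig(np.linalg.solve(B_mat, A_mat))
    order = np.argsort(evals.real)              # smallest eigenvalues first
    return evecs[:, order[:l]].real

X = np.random.default_rng(1).normal(size=(20, 4))
A = lpp(X)
Z = X @ A   # embedded samples, z_i = A^T x_i
```

Nearest unlike neighbors are then searched in the embedded space Z rather than in the raw attribute space.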
3.5 Answers for Two Research Questions

Answers for RQ1: In the privacy preserving of effort data, how can the utility of the obfuscated data be maintained?

To achieve this purpose, we design ICSD to classify the samples with similar effort into the same class, and the obtained ordered class labels provide a guideline for obfuscating the target sample. In the process of obfuscation, instead of selecting NUN samples from all of the other subclasses, we select the NUN samples from the previous and the next subclasses according to the class labels, and use the obtained NUN samples as disturbances to obfuscate the target sample. In this way, the obfuscated target sample will not invade the boundaries of the other subclasses that have significantly different effort values, which means that the data distribution can be maintained after obfuscation.

Answers for RQ2: How can the privacy contained in the original effort data be effectively preserved?

To achieve this purpose, we design the MLBDO algorithm to obfuscate the target sample. MLBDO uses two NUN samples, selected by the manifold learning based nearest unlike neighbor selector, as the disturbance samples to obfuscate the target sample bi-directionally. The bi-directional strategy can not only avoid the influence of noise samples but also increase the privacy of the obfuscation.

3.6 Comparison with Related Works
In this subsection, we discuss the differences between our privacy-preserving method and related privacy-preserving methods [3, 19-24].

Comparison with general privacy-preserving methods: The main differences between our approach and general privacy-preserving methods [3, 20] are two-fold: (1) these methods are not designed for tasks in software engineering, while our method is designed for the PPDS of effort data; (2) these methods do not consider the influence of outlier samples, while our approach can cut these samples automatically, such that the remaining samples are more suitable for building a precise estimation/prediction model.

Comparison with privacy-preserving methods in software engineering: The main differences between our approach and these methods [21-26] are two-fold: (1) different from these methods, which use one NUN sample as the disturbance to obfuscate the target sample, our method uses two NUN samples as the disturbance to obfuscate the target sample bi-directionally. With the bi-directional strategy, our method can avoid using a noise sample, whose effort value is significantly different from that of the target sample, as the disturbance; using such noise samples would hinder the maintenance of the data distribution. (2) When maintaining the utility of the obfuscated data, we consider not only the influence of the obfuscation ranges but also the maintenance of the data distribution after obfuscation.

4. EXPERIMENTS
4.1 Data Sets
To evaluate the performance of our ICSD&MLBDO approach, we conduct extensive experiments on seven effort datasets, including NASA93 [34], Kitchenham [32], Kemerer [36], Coc81 [33], China [34, 60], Albrecht [37] and Maxwell [35]. The metrics of these datasets are based on COCOMO [54, 60-61] or Function Points [32, 36, 60, 62-63]. The brief properties of these datasets are shown in Table 1, and their metrics are shown in Figure 5.

Table 1. Brief properties of the used data sets
Dataset      Number of samples  Number of attributes  Minimum effort value  Maximum effort value
Nasa93       93                 18                    8.4                   8211
Maxwell      62                 26                    583                   63694
Kitchenham   145                5                     219                   113930
Kemerer      15                 15                    23.2                  1107.31
Coc81        63                 17                    5.9                   11400
China        499                18                    26                    54620
Albrecht     24                 8                     0.5                   105.2

[Figure 5 is not reproduced here. It lists the metrics of each dataset, e.g., COCOMO attributes such as rely, data, time, stor, virt, tool, aexp, acap, pcap, vexp, lexp, modp, cplx and LOC; Maxwell attributes such as T01-T15, Nlan, har, dba, ifc, source, telonuse, duration and size; and Function Points attributes such as input, output, enquiry, file, interface, changed, deleted and AFP.]
Figure 5. The metrics of the data sets used in this work. The red rectangle box outlines the sensitive attributes used in this paper.

4.2 Evaluation Measures
Privacy-preserving methods should ensure that the privatized data has both favorable privacy and utility. To evaluate the privacy and utility of the privatized data, we employ the following measures.

4.2.1 Measure of Privacy
We use the Increased Privacy Ratio (IPR) [23] to evaluate the privacy ability of a method. Let Q = {q_1, q_2, ..., q_N} denote N queries. Informally, the IPR can be defined as follows:

  IPR(T) = 100 × (1 - (1/|Q|) Σ_{i=1}^{|Q|} K_i),  (5)

where T represents the dataset to be evaluated, and

  K_i = 1, if S_max(RT_i) = S_max(RT'_i); K_i = 0, otherwise.

Here, S_max(RT_i*) is the highest-frequency value of RT_i*, and RT_i* is the result of the i-th query, i.e., the group of values returned from a dataset matching the i-th query. For example, RT'_i = {(1-2], (1-2], [3-6]} denotes the result of the i-th query returned from the privatized data, and then S_max(RT'_i) = (1-2]. The higher the IPR, the better the privacy-preserving method. The query generator used in this paper is detailed in subsection 4.3.
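As an illustration of the Increased Privacy Ratio defined in Section 4.2.1 (Equation (5)), here is a toy computation; the helper names and the example query answers are ours:

```python
from collections import Counter

def s_max(results):
    """Most frequent value among a query's returned bins."""
    return Counter(results).most_common(1)[0][0] if results else None

def ipr(original_answers, privatized_answers):
    """IPR(T) = 100 * (1 - (1/|Q|) * sum K_i), where K_i = 1 when the
    most frequent answer is unchanged after privatization (a failure)."""
    assert len(original_answers) == len(privatized_answers)
    k = sum(1 for o, p in zip(original_answers, privatized_answers)
            if s_max(o) == s_max(p))
    return 100.0 * (1 - k / len(original_answers))

# two queries: the first's top answer changes (protected), the second's does not
orig = [["(1-2]", "(1-2]", "[3-6]"], ["[3-6]"]]
priv = [["[3-6]", "[3-6]"],          ["[3-6]"]]
print(ipr(orig, priv))   # 50.0
```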
4.2.2 Measure of Utility
Median Magnitude of Relative Error (MdMRE) and Pred(25) [51-53] are two commonly used measures for evaluating the effort estimation accuracy of estimators. In the experiments, we employ both measures to evaluate the utility of the privatized data. The definitions of MdMRE and Pred(25) are as follows.

Given a sample x_i with actual effort y_i and predicted effort ŷ_i, the Magnitude of Relative Error (MRE) of x_i can be calculated by

  MRE_i = |y_i - ŷ_i| / y_i.

The MdMRE of N samples is then

  MdMRE = median(MRE_1, ..., MRE_i, ..., MRE_N).

Pred(25) is defined as the percentage of estimated values falling within 25 percent of the actual values:

  Pred(25) = (100/N) Σ_{i=1}^{N} I(MRE_i ≤ 0.25),

where I(·) is 1 if its argument holds and 0 otherwise. For these two measures, a lower MdMRE represents a better estimator, and a higher Pred(25) represents a more precise estimator.

4.3 Query Generator
In this subsection, we introduce the query generator used in this paper. Assume the query size is 1; the generator then creates a query as follows:

(1) Randomly select an attribute from the QIDs attributes (excluding the sensitive attributes). For example, randomly select an attribute from Figure 4 (b); here, assume that the cplx attribute has been selected. The generator then finds the distinct ranges of cplx, i.e., cplx_ranges = {[0.85-1], (1-1.15], (1.15-1.3]}.

(2) Randomly select a range from cplx_ranges as a query. For example, if the generator randomly selects [0.85-1], then the newly created query is cplx=[0.85-1], which means "Please return the sensitive attribute values of the samples whose cplx attribute value lies in [0.85-1]". Table 2 shows examples of generated queries with different query sizes (1, 2) on the data of Figure 4 (b).

Table 2. Examples of generated queries
Query size   QIDs                              Return
1            cplx=[0.85-1]                     (10-66.6], [6-10], (233-302]
2            cplx=[0.85-1], acap=(0.86-1]      (10-66.6], [6-10]

4.4 Experimental Settings
Assume that a dataset has m_i QIDs attributes, and that the j-th attribute has n_j distinct sub-ranges. If the query size is p, we can get C(m_i, p)·(n_j)^p queries. It is unrealistic and unnecessary to enumerate all possible queries of all possible query sizes, especially when the number of QIDs attributes is large. For example, the dataset NASA93 has 16 QIDs attributes; if each QIDs attribute has 10 distinct sub-ranges and the query size is 1, 2, and 4 respectively, then the generator would have to create C(16,1)·10^1 + C(16,2)·10^2 + C(16,4)·10^4 = 18,212,160 queries. Thus, in our experiments, the selected query sizes include 1, 2, and 4, and up to 1000 queries are generated for each query size.

The number of sensitive attributes and the size of EFB used in this paper are listed in Table 3. The sensitive attributes used in this paper include LOC (KLOC), AFP and size, because these attributes are very important influence factors for bidding, and the data owners may not want these data to be obtained by adversaries. In addition, KLOC, AFP and size can be converted to each other [37][64]. We choose the k-nearest neighbor criterion as the "close" measure and the heat kernel as the weight measure in LPP.

Table 3. Experimental settings used in this work
Dataset      Sensitive attributes  Number of QIDs  EFB size
Nasa93       KLOC                  16              10
Maxwell      SIZE                  24              10
Kitchenham   AFP                   3               10
Kemerer      KSLOC                 13              10
Coc81        LOC                   15              10
China        AFP                   16              10
Albrecht     AFP                   6               10

In the utility experiments, we randomly select 70% of the modules in each effort dataset for training, and the remaining modules are used for testing. The random selection of training and testing data may be biased and may affect the evaluation performance. Therefore, we repeat the random selection 20 times and report the average estimation results.

Compared methods. To benchmark our method, we compare our approach with four methods: k-anonymity [3], swapping [22], clustering [30] combined with MLBDO (Clustering&MLBDO), and ICSD combined with MORPH [21] (ICSD&MORPH). We implement k-anonymity following the Datafly algorithm [3], and create two versions, namely 2-anonymity and 4-anonymity. In swapping [22], for each QIDs attribute, a certain percentage of values are replaced by other distinguishable values of that attribute; the percentages used in the experiments are 10%, 20% and 40%. Clustering&MLBDO is used to evaluate the performance of the designed ICSD strategy; K-Means [29] is employed as the clustering algorithm, with statistical significance level α = 0.0001. ICSD&MORPH is used to evaluate the performance of the proposed bi-directional obfuscation algorithm.

4.5 Evaluation of Privacy and Utility for ICSD&MLBDO
In this part, we perform experiments to evaluate the privacy and utility of our approach, and the applicability of our privacy-preserving approach to different effort estimators.

Results of privacy and utility testing. In the experiments, the classic classification and regression trees (CART) algorithm [63] is employed as the baseline estimator. We run our approach and the compared methods on the 7 datasets, and then compute the IPR, Pred(25) and MdMRE of each method. Due to limited space, Figure 6 only provides the experimental results with query size=1. From Figure 6, we can see that the privacy and utility of ICSD&MLBDO are better than those of the competing methods. Compared with k-anonymity, our approach achieves significantly better utility; the reason is that k-anonymity uses the "generalization" strategy to protect the sensitive values, so less information remains for modeling. Compared with swapping [22], our approach achieves more stable performance; the main reason is that the swapping method replaces the values of the sensitive attributes randomly. Compared with ICSD&MORPH, ICSD&MLBDO achieves higher privacy. The main reasons, in our opinion, are two-fold: (1) MORPH uses a global NUN sample selected from all other subclasses as the disturbance to obfuscate the target sample, while MLBDO selects the NUN samples from the previous and next subclasses, and thus the disturbance ability of the NUN samples selected by MORPH may be lower than that of the NUN samples selected by MLBDO; (2) the designed bi-directional obfuscation algorithm of MLBDO provides a more powerful obfuscating ability. Compared with Clustering&MLBDO, ICSD&MLBDO achieves higher utility; the reason is that ICSD considers the estimation error in the process of subclass division and removes outlier samples automatically. ICSD&MLBDO combines the advantages of ICSD and MLBDO, and therefore achieves better performance in both privacy and utility.

The boxplots in Figure 6 show the variability of the utility measures of the privatized data obtained with the different privacy methods; the variability of ICSD&MLBDO is better than that of the compared methods. Figure 7 illustrates the experimental results under different query sizes (1, 2 and 4) on the different datasets. Our approach obtains higher privacy than the compared methods under all three query sizes.

[Figure 6 is not reproduced here. For each of the seven datasets it plots the Pred(25) (%) and MdMRE values of each method against its IPR (%), together with boxplots of the utility measures.]
Figure 6. Comparison of privacy-preserving methods using the CART estimator on the MdMRE and Pred(25) measures with query size=1. Legend: original data (normal), ICSD&MLBDO (our), swapping 10%/20%/40% (s1/s2/s4), 2-anonymity (k2), 4-anonymity (k4), ICSD&MORPH (im), Clustering&MLBDO (cm). The horizontal dashed thin lines represent the Pred(25) measures, the dashed thick lines represent the MdMRE measures, and the vertical line denotes the privacy baseline with IPR=80%.

[Figure 7 is not reproduced here. It shows bar charts of the IPR (%) of ICSD&MLBDO, 2-anonymity, 4-anonymity, swapping-10%/20%/40%, ICSD&MORPH and Clustering&MLBDO on the seven datasets.]
Figure 7. Privacy comparison of all methods on 7 datasets versus different query sizes.

Results of applicability testing. To investigate whether the data privatized with our approach can be applied to multiple estimators, we design another experiment. We select another two estimators: the automatically transformed linear model (ATLM) [54] and radial basis function networks (RBFN) [51, 64]. Tables 4 and 5 report the average effort estimation results of ICSD&MLBDO and the competing methods with these two estimators on the seven datasets for the MdMRE and Pred(25) measures, respectively. In Tables 4 and 5, we also report the estimation results obtained using the original data, denoted "normal". From both tables, we can see that the performance of our approach is still better than that of the competing methods, which indicates that the data privatized with our approach applies well to different estimators. To statistically analyze the results given in Tables 4 and 5, we conduct a statistical test, i.e., the Wilcoxon test [55-56], with 95 percent confidence, to obtain the so-called win-tie-loss (w/t/l) results. "Win" means the result of our approach is significantly different from that of the compared method, "tie" means "equal", and otherwise "loss". We can see that the proposed approach makes a significant difference in comparison with the other compared methods.

5. THREATS TO VALIDITY
The following are several potential threats to the validity of our experiments:

(1) Bias of estimators. One bias in this study is the estimators we used for effort estimation. In the experiments, we select three commonly used estimators to evaluate our approach. For other estimators, further experiments might need to be done to evaluate our approach.

(2) Bias of evaluation measures. Another bias is the MdMRE and Pred(25) measures used to report the imputation or estimation performance. Other measures, such as the Mean Balanced Relative Error (MBRE) [34], the Mean Inverted Balanced Relative Error (MIBRE) [34], CLUSTER [55] and Standardized Accuracy [68], are not used. In this work, we employ the widely used MdMRE and Pred(25) measures for the empirical evaluation of software effort estimation.
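The utility measures of Section 4.2.2, which underlie Tables 4 and 5, can be sketched as follows (the effort values in the example are illustrative):

```python
import statistics

def mre(actual, predicted):
    """Magnitude of Relative Error of a single sample."""
    return abs(actual - predicted) / actual

def mdmre(actuals, predictions):
    """Median MRE over a set of samples; lower is better."""
    return statistics.median(mre(a, p) for a, p in zip(actuals, predictions))

def pred25(actuals, predictions):
    """Percentage of estimates within 25% of the actual value; higher is better."""
    hits = sum(1 for a, p in zip(actuals, predictions) if mre(a, p) <= 0.25)
    return 100.0 * hits / len(actuals)

y  = [100, 200, 400, 800]    # actual efforts
yp = [110, 150, 390, 1000]   # predicted efforts
# MREs: 0.10, 0.25, 0.025, 0.25 -> MdMRE = 0.175, Pred(25) = 100.0
```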
Table 4. Comparison of utility results using two estimators with seven data sets on MdMRE

Estimator: ATLM
method              NASA93       Maxwell      Kitchenham   Kemerer      Coc81        China        Albrecht
k-anonymity (k=2)   35.1±0.89    34.4±5.08    33.3±3.6     10.5±6.72    41±10.81     43.8±5.47    13.8±11.22
k-anonymity (k=4)   39.7±1.76    38.4±2.74    33.9±1.94    11.3±5.95    52.3±12.56   45.3±4.21    18.8±14.61
Swapping (N=10%)    35.7±0.95    34.3±2.62    33.2±2.06    10.3±5.23    42.1±6.1     43.2±1.49    13.3±9.91
Swapping (N=20%)    37.7±1.16    38.2±3.48    34.8±2.16    13.1±6.33    45.7±6.57    44.1±1.52    14.5±12.57
Swapping (N=40%)    39±1.28      39.4±3.02    39.2±2.09    12.1±5       37.1±13.34   49.5±1.88    16.7±8.69
ICSD&MORPH          35.6±1       34.9±2.38    33.2±1.86    12±6.22      42.2±11.49   45.2±6.45    12.4±6.82
Clustering&MLBDO    37±0.92      36.5±2.82    34.1±1.95    11±4.22      42±14.3      47.4±4.92    13.6±7.52
ICSD&MLBDO          33.8±0.97    33.1±3.55    32.3±2.06    9.7±5.53     41.6±9.24    42.6±6.71    13.3±9.76
Normal              31.6±1.16    29.7±2.76    30±1.88      8.8±5.46     39.3±6.49    40±1.47      11.3±10.01
w/t/l (stat. test)  7/0/1        7/0/1        7/0/1        7/0/1        4/2/2        7/0/1        5/2/1

Estimator: RBFN
method              NASA93       Maxwell      Kitchenham   Kemerer      Coc81        China        Albrecht
k-anonymity (k=2)   40.2±3.73    33.6±1.35    25.6±5.41    15.5±16.9    50.7±25.65   22.2±4.15    18.1±1.47
k-anonymity (k=4)   45.7±2.25    39.6±1.32    35.1±4.97    19.8±30.2    58.8±19.72   27.3±5.05    23.6±6.09
Swapping (N=10%)    42.3±2.52    31.2±1.02    27.4±4.46    14.5±10.8    50.4±22.77   21.4±8.24    20.6±1.91
Swapping (N=20%)    48.1±4.25    37±1.1       35.3±4.5     16.1±8.73    55.6±16.5    24.6±10.1    25.5±1.82
Swapping (N=40%)    53±3.41      43.2±1.52    41.3±4.43    17.7±10.3    61.4±10.88   33.6±9.25    31.1±1.56
ICSD&MORPH          42.2±13.1    33.3±3.68    26.3±5.25    12.7±7.36    49.2±19.44   23.1±13.6    14.4±1.7
Clustering&MLBDO    46.7±12.9    35.4±3.61    28.3±5.29    13.7±7.31    51.9±15.48   25.9±13.4    16.9±1.4
ICSD&MLBDO          38.6±6.1     29.3±3.99    23.6±5.36    12.3±8.55    48.2±22.09   19.1±10.7    15.5±1.99
Normal              35±3.17      26.7±1.7     20.4±4.46    11.1±9.56    46.8±21.65   17.1±10.3    10.5±1.7
w/t/l (stat. test)  7/0/1        7/0/1        7/0/1        5/0/3        7/0/1        7/0/1        6/0/2
Table 5. Comparison of utility results using two estimators with seven data sets on Pred(25)

Estimator: ATLM
method              NASA93       Maxwell      Kitchenham   Kemerer      Coc81        China        Albrecht
k-anonymity (k=2)   56.7±0.08    38.4±0.09    29.8±0.07    28±0.21      49.9±0.07    70.3±0.03    32.5±0.16
k-anonymity (k=4)   48.5±0.06    36.3±0.11    27.6±0.08    16±0.12      37.4±0.06    64.4±0.03    18.5±0.12
Swapping (N=10%)    55.8±0.08    40.1±0.08    29±0.09      27±0.25      49.1±0.07    69.8±0.04    33.1±0.13
Swapping (N=20%)    49.9±0.07    35.5±0.09    27.4±0.09    22±0.13      42.6±0.05    65.5±0.03    21.5±0.17
Swapping (N=40%)    50±0.06      28±0.08      27.6±0.09    21±0.17      39.9±0.07    56±0.04      14.9±0.13
ICSD&MORPH          55.1±0.08    37.4±0.11    29.5±0.1     27±0.21      47.2±0.03    69.2±0.03    31.8±0.14
Clustering&MLBDO    51±0.09      38.3±0.12    29±0.12      25±0.21      45.4±0.04    67.3±0.02    28±0.18
ICSD&MLBDO          57.1±0.08    40.1±0.07    30±0.09      30±0.2       50.2±0.04    72.2±0.04    30.6±0.17
Normal              58.1±0.07    42.3±0.08    30.6±0.09    32±0.24      52.3±0.05    73.4±0.04    33.1±0.16
w/t/l (stat. test)  5/0/3        7/0/1        6/1/1        4/0/4        6/1/1        7/0/1        8/0/0

Estimator: RBFN
method              NASA93       Maxwell      Kitchenham   Kemerer      Coc81        China        Albrecht
k-anonymity (k=2)   20.9±0.12    15.9±0.1     27±0.06      33.9±0.15    16.9±0.13    20.9±0.02    37.8±0.11
k-anonymity (k=4)   19.5±0.03    11.2±0.14    19±0.05      28.4±0.13    13.3±0.11    17.5±0.02    36.2±0.16
Swapping (N=10%)    20.1±0.14    14.6±0.11    26.4±0.05    33.9±0.12    16.6±0.12    20.4±0.02    38±0.15
Swapping (N=20%)    18.6±0.14    9.8±0.13     21.8±0.05    27.3±0.14    15.9±0.13    16±0.03      30.4±0.15
Swapping (N=40%)    14±0.14      8.5±0.11     17.4±0.05    24.1±0.12    10.8±0.14    14.7±0.02    26.5±0.12
ICSD&MORPH          28.8±0.16    16.4±0.14    24.9±0.06    30.5±0.14    17.4±0.12    20.2±0.09    27.5±0.13
Clustering&MLBDO    27.3±0.13    13.1±0.08    25.6±0.06    29.3±0.12    16.6±0.15    19.7±0.01    39±0.09
ICSD&MLBDO          22±0.17      16.6±0.13    26.3±0.05    34.8±0.14    17.6±0.09    21.6±0.02    39.9±0.12
Normal              23.2±0.08    18±0.15      28±0.05      35.3±0.12    18.2±0.15    24.7±0.02    40.5±0.13
w/t/l (stat. test)  6/1/1        6/0/2        6/0/2        7/0/1        8/0/0        7/0/1        7/0/1
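The win/tie/loss rows in Tables 4 and 5 come from a Wilcoxon signed-rank test at the 95 percent confidence level. As a self-contained illustration, the sketch below substitutes an exact two-sided sign test (a simpler, weaker relative of Wilcoxon's test); the example data are the ATLM MdMRE means of the "Normal" and "2-anonymity" rows of Table 4:

```python
from math import comb
from statistics import median

def sign_test_p(x, y):
    """Two-sided exact sign test p-value over paired samples;
    a simple stand-in for the Wilcoxon signed-rank test."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n, k = len(diffs), sum(d > 0 for d in diffs)
    tail = min(k, n - k)
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(p, 1.0)

def win_tie_loss(ours, theirs, alpha=0.05, lower_is_better=True):
    """One win/tie/loss verdict: 'tie' when no significant difference,
    otherwise the medians decide which side wins."""
    if sign_test_p(ours, theirs) >= alpha:
        return "tie"
    ours_better = (median(ours) < median(theirs)) == lower_is_better
    return "win" if ours_better else "loss"

ours   = [31.6, 29.7, 30.0, 8.8, 39.3, 40.0, 11.3]   # MdMRE, lower is better
theirs = [35.1, 34.4, 33.3, 10.5, 41.0, 43.8, 13.8]
print(win_tie_loss(ours, theirs))   # win
```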
(3) Bias of compared methods. In the family of anonymization, there exist many anonymization methods, and it is hard to compare our ICSD&MLBDO with all of the other anonymization methods. In this paper, we compare our method with four different privacy methods from both the software engineering domain and the general field. For other privacy methods, further experiments might need to be done to compare them with our approach.

6. CONCLUSION AND FUTURE WORK
In this paper, we study the privacy-preserving problem on effort data. When a company lacks historical data for building a model to estimate the effort of a new project, its engineers may try to request relevant data from other companies. However, considering the risk of disclosing sensitive attribute values, most companies would refuse such requests. To solve this problem, we propose the ICSD&MLBDO approach. ICSD&MLBDO uses the designed ICSD strategy to divide the original data into several subclasses, whose labels are ordered, and then utilizes the designed MLBDO algorithm to obfuscate each target sample. Experimental results on seven benchmark effort datasets demonstrate that our approach can give the effort data favorable privacy while simultaneously keeping its utility. In future work, we would like to utilize more effort datasets to validate the effectiveness of our approach.

7. ACKNOWLEDGEMENTS
The authors want to thank the anonymous reviewers for their constructive comments and suggestions. The work described in this paper was supported by the National Nature Science Foundation of China under Projects No. 61272273, No. 61572375, No. 61233011, No. 91418202 and No. 61472178, and by the Chinese 973 Program under Project No. 2014CB340702.
8. REFERENCES
[1] K. LeFevre, D. J. DeWitt, R. Ramakrishnan. Mondrian multidimensional k-anonymity. In IEEE International Conference on Data Engineering (ICDE), pages 25-25, 2006.
[2] K. Wang, P. S. Yu, S. Chakraborty. Bottom-up generalization: A data mining solution to privacy protection. In IEEE International Conference on Data Mining (ICDM), pages 249-256, 2004.
[3] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05): 571-588, 2002.
[4] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05): 557-570, 2002.
[5] B. Fung, K. Wang, P. S. Yu. Top-down specialization for information and privacy preservation. In IEEE International Conference on Data Engineering (ICDE), pages 205-216, 2005.
[6] R. Chen, B. C. M. Fung, N. Mohammed, et al. Privacy-preserving trajectory data publishing by local suppression. Information Sciences, 231: 83-97, 2013.
[7] A. Machanavajjhala, D. Kifer, J. Gehrke, et al. l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1): 1-52, 2007.
[8] N. Li, T. Li, S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In IEEE International Conference on Data Engineering (ICDE), pages 106-115, 2007.
[9] K. Honda, A. Kawano, A. Notsu, et al. A fuzzy variant of k-member clustering for collaborative filtering with data anonymization. In IEEE International Conference on Fuzzy Systems (FUZZ), pages 1-6, 2012.
[10] J. W. Byun, A. Kamra, E. Bertino, et al. Efficient k-anonymization using clustering techniques. Springer Berlin Heidelberg, 2007.
[11] H. Kasugai, A. Kawano, K. Honda, et al. A study on applicability of fuzzy k-member clustering to privacy-preserving pattern recognition. In IEEE International Conference on Fuzzy Systems (FUZZ), pages 1-6, 2013.
[12] J. Casas-Roma, J. Herrera-Joancomartí, V. Torra. Anonymizing graphs: measuring quality for clustering. Knowledge and Information Systems, 44(3): 507-528, 2015.
[13] J. Vaidya, C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), pages 206-215, 2003.
[14] G. Aggarwal, R. Panigrahy, T. Feder, et al. Achieving anonymity via clustering. ACM Transactions on Algorithms, 6(3): 49, 2010.
[15] X. Xiao, Y. Tao. Anatomy: Simple and effective privacy preservation. In International Conference on Very Large Data Bases (VLDB), pages 139-150, 2006.
[16] R. C. W. Wong, J. Li, A. W. C. Fu, et al. (α, k)-anonymity: an enhanced k-anonymity model for privacy-preserving data publishing. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), pages 754-759, 2006.
[17] V. S. Verykios, E. Bertino, I. N. Fovino, et al. State-of-the-art in privacy-preserving data mining. ACM SIGMOD Record, 33(1): 50-57, 2004.
[18] M. Grechanik, C. Csallner, C. Fu, et al. Is data privacy always good for software testing? In IEEE International Symposium on Software Reliability Engineering (ISSRE), pages 368-377, 2010.
[19] T. Li, N. Li, J. Zhang, et al. Slicing: A new approach for privacy-preserving data publishing. IEEE Transactions on Knowledge and Data Engineering, 24(3): 561-574, 2012.
[20] B. Fung, K. Wang, R. Chen, et al. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, 42(4): 14, 2010.
[21] F. Peters, T. Menzies. Privacy and utility for defect prediction: Experiments with MORPH. In ACM International Conference on Software Engineering (ICSE), pages 189-199, 2012.
[22] F. Peters, T. Menzies, L. Gong, H. Zhang. Balancing privacy and utility in cross-company defect prediction. IEEE Transactions on Software Engineering, 39(8): 1054-1068, 2013.
[23] F. Peters, T. Menzies, L. Layman. LACE2: better privacy-preserving data sharing for cross project defect prediction. In ACM International Conference on Software Engineering (ICSE), pages 801-811, 2015.
[24] J. Clause, A. Orso. Camouflage: automated anonymization of field data. In ACM International Conference on Software Engineering (ICSE), pages 21-30, 2011.
[25] K. Taneja, M. Grechanik, R. Ghani, et al. Testing software in age of data privacy: a balancing act. In ACM SIGSOFT Symposium and European Conference on Foundations of Software Engineering (ESEC/FSE), pages 201-211, 2011.
[26] D. Lo, L. Jiang, A. Budi. kbe-anonymity: test data anonymization for evolving programs. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 262-265, 2012.
[27] A. Budi, D. Lo, L. Jiang. kb-anonymity: a model for anonymized behavior-preserving test and debugging data. ACM SIGPLAN Notices, 46(6): 447-457, 2011.
[28] J. Brickell, V. Shmatikov. The cost of privacy: destruction of data-mining utility in anonymized data publishing. In ACM International Conference on Knowledge Discovery and Data Mining (KDD), pages 70-78, 2008.
[29] S. J. Pan, Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10): 1345-1359, 2010.
[30] G. Hamerly, C. Elkan. Learning the k in k-means. Technical Report CS2002-0716, University of California San Diego, 2002.
[31] I. Jolliffe. Principal component analysis. John Wiley & Sons, 2002.
[32] B. Kitchenham, S. L. Pfleeger, B. McColl, S. Eagan. An empirical study of maintenance and development estimation accuracy. Journal of Systems and Software, 64(1): 57-77, 2002.
[29] S. J. Pan, Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10): 1345-1359, 2010.
[30] G. Hamerly, C. Elkan. Learning the k in k-means. Technical Report CS2002-0716, University of California San Diego, 2002.
[31] I. Jolliffe. Principal component analysis. John Wiley & Sons, 2002.
[32] B. Kitchenham, S. L. Pfleeger, B. McColl, S. Eagan. An empirical study of maintenance and development estimation accuracy. Journal of Systems and Software, 64(1): 57-77, 2002.
[33] A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1): 1-38, 1977.
[34] E. Kocaguneli, T. Menzies, J. W. Keung. On the value of ensemble effort estimation. IEEE Transactions on Software Engineering, 38(6): 1403-1416, 2012.
[35] G. Boetticher, T. Menzies, T. Ostrand. PROMISE repository of empirical software engineering data. West Virginia University, Department of Computer Science, 2007.
[36] C. F. Kemerer. An empirical validation of software cost estimation models. Communications of the ACM, 30(5): 416-429, 1987.
[37] J. E. Matson, B. E. Barrett, J. M. Mellichamp. Software development cost estimation using function points. IEEE Transactions on Software Engineering, 20(4): 275-287, 1994.
[38] C. Dwork. Differential privacy: A survey of results. Springer Berlin Heidelberg, 2008.
[39] J. Li, G. Ruhe. Decision support analysis for software effort estimation by analogy. In IEEE International Workshop on Predictor Models in Software Engineering (PROMISE), pages 6-6, 2007.
[40] D. Rebollo-Monedero, J. Forne, J. Domingo-Ferrer. From t-closeness-like privacy to postrandomization via information theory. IEEE Transactions on Knowledge and Data Engineering, 22(11): 1623-1636, 2010.
[41] J. Li, Y. Tao, X. Xiao. Preservation of proximity privacy in publishing numerical sensitive data. In ACM International Conference on Management of Data (SIGMOD), pages 473-486, 2008.
[42] S. L. Parker, T. Tong, S. Bolden, et al. Cancer statistics, 1996. CA: A Cancer Journal for Clinicians, 46(1): 5-27, 1996.
[43] J. B. Tenenbaum, V. de Silva, J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500): 2319-2323, 2000.
[44] M. Belkin, P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in Neural Information Processing Systems, 14: 585-591, 2001.
[45] J. Cheng, H. Liu, F. Wang, et al. Silhouette analysis for human action recognition based on supervised temporal t-SNE and incremental learning. IEEE Transactions on Image Processing, 24(10): 3203-3217, 2015.
[46] Y. Tang, R. Rose. A study of using locality preserving projections for feature extraction in speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1569-1572, 2008.
[47] J. Gui, Z. Sun, W. Jia, R. Hu, Y. Lei, S. Ji. Discriminant sparse neighborhood preserving embedding for face recognition. Pattern Recognition, 45(8): 2884-2893, 2012.
[48] X. He, P. Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems (NIPS), MIT Press, 2004.
[49] M. Dubinko, R. Kumar, J. Magnani, J. Novak, P. Raghavan, A. Tomkins. Visualizing tags over time. In ACM International Conference on World Wide Web (WWW), pages 193-202, 2006.
[50] H. He, E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9): 1263-1284, 2009.
[51] E. Kocaguneli, T. Menzies, A. B. Bener, J. W. Keung. Exploiting the essential assumptions of analogy-based effort estimation. IEEE Transactions on Software Engineering, 38(2): 425-438, 2012.
[52] K. Dejaeger, W. Verbeke, D. Martens, B. Baesens. Data mining techniques for software effort estimation: a comparative study. IEEE Transactions on Software Engineering, 38(2): 375-397, 2012.
[53] K. Liu, L. Xu, J. Zhao. Co-extracting opinion targets and opinion words from online reviews based on the word alignment model. IEEE Transactions on Knowledge and Data Engineering, 27(3): 636-650, 2015.
[54] T. Menzies, D. Port, Z. Chen, J. Hihn, S. Stukes. Validation methods for calibrating software effort models. In ACM International Conference on Software Engineering (ICSE), pages 587-595, 2005.
[55] X. Jing, F. Qi, F. Wu, B. Xu. Missing data imputation based on low-rank recovery and semi-supervised regression for software effort estimation. In ACM International Conference on Software Engineering (ICSE), pages 607-618, 2016.
[56] J. Keung, E. Kocaguneli, T. Menzies. Finding conclusion stability for selecting the best effort predictor in software effort estimation. Automated Software Engineering, 20(4): 543-567, 2013.
[57] P. E. Danielsson. Euclidean distance mapping. Computer Graphics and Image Processing, 14(3): 227-248, 1980.
[58] D. R. Wilson, T. R. Martinez. Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6: 1-34, 1997.
[59] F. Zhang, Q. Zheng, Y. Zou, A. E. Hassan. Cross-project defect prediction using a connectivity-based unsupervised classifier. In ACM International Conference on Software Engineering (ICSE), pages 309-320, 2016.
[60] T. Menzies, A. Butcher, A. Marcus, T. Zimmermann, D. Cok. Local vs. global models for effort estimation and defect prediction. In IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 343-351, 2011.
[61] B. W. Boehm, R. Madachy, B. Steece. Software Cost Estimation with COCOMO II. Prentice Hall, 2000.
[62] B. Twala, M. Cartwright. Ensemble missing data techniques for software effort prediction. Intelligent Data Analysis, 14(3): 299-331, 2010.
[63] T. Menzies, A. Butcher, D. Cok, A. Marcus, L. Layman, F. Shull, B. Turhan, T. Zimmermann. Local versus global lessons for defect prediction and effort estimation. IEEE Transactions on Software Engineering, 39(6): 822-834, 2013.
[64] A. J. Albrecht, J. E. Gaffney Jr. Software function, source lines of code, and development effort prediction: a software science validation. IEEE Transactions on Software Engineering, SE-9(6): 639-648, 1983.
[65] A. Heiat. Comparison of artificial neural network and regression models for estimating software development effort. Information and Software Technology, 44(15): 911-922, 2002.
[66] F. Sarro, A. Petrozziello, M. Harman. Multi-objective software effort estimation. In IEEE International Conference on Software Engineering (ICSE), pages 619-630, 2016.