Knowl Inf Syst (2009) 20:157–185 DOI 10.1007/s10115-008-0167-x REGULAR PAPER
A distributed approach to enabling privacy-preserving model-based classifier training Hangzai Luo · Jianping Fan · Xiaodong Lin · Aoying Zhou · Elisa Bertino
Received: 28 August 2007 / Revised: 17 May 2008 / Accepted: 4 August 2008 / Published online: 26 September 2008 © Springer-Verlag London Limited 2008
Abstract This paper proposes a novel approach for privacy-preserving distributed model-based classifier training. Our approach is an important step towards supporting customizable privacy modeling and protection. It consists of three major steps. First, each data site independently learns a weak concept model (i.e., local classifier) for a given data pattern or concept by using its own training samples. An adaptive EM algorithm is proposed to select the model structure and estimate the model parameters simultaneously. The second step deals with combined classifier training by integrating the weak concept models that are shared by multiple data sites. To reduce the data transmission costs and the potential privacy breaches, only the weak concept models are sent to the central site, and synthetic samples are directly generated from these shared weak concept models at the central site. Both the shared weak concept models and the synthetic samples are then incorporated to learn a reliable and complete global concept model. A computational approach is developed to automatically achieve a good trade-off between the privacy disclosure risk, the sharing benefit and the data utility. The third step deals with validating the combined classifier by distributing the global concept model to all the data sites in the collaboration network while at the same time limiting the potential privacy breaches. Our approach has been validated through extensive experiments carried out on four UCI machine learning data sets and two image data sets.

Keywords Privacy-preserving classifier training · Synthetic samples · Adaptive EM algorithm

This project is supported by National Science Foundation under 0208539-IIS and 0601542-IIS, grants from AO Foundation and CERIAS, Shanghai Pujiang Program under 08PJ1404600, National Natural Science Foundation of China under 60496325 and National Hi-tech R&D Program of China under 2006AA010111.

H. Luo · A. Zhou
Shanghai Key Lab of Trustworthy Computing, East China Normal University, Shanghai, China

J. Fan (B)
Department of Computer Science, University of North Carolina, Charlotte, NC 28223, USA
e-mail: [email protected]

X. Lin
Department of Mathematical Sciences, University of Cincinnati, Cincinnati, OH 45221, USA

E. Bertino
Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
1 Introduction

The rapid growth of the Internet has enabled many knowledge-based collaborative applications. Through the Internet, large volumes of data can easily be acquired, combined and analyzed to extract relevant knowledge. Organizations in a large number of domains are today interested in the possibility of combining their data records so that more reliable and complete knowledge can be extracted accurately. Such an unprecedented opportunity is however hindered by privacy and confidentiality concerns [1–8]. Organizations are usually interested in collaborating in order to derive complete knowledge that is of interest to them; however, in most cases, organizations cannot directly make their data records available to other parties, either for confidentiality reasons, for example when the data records disclose important business information of the organization, or for privacy reasons, for example when the data records refer to individuals. Such concerns have motivated intense research in the area of privacy-preserving data mining techniques [9–32].

Most existing techniques for privacy-preserving data mining can be categorized into three broad approaches: (a) The first approach is based on the use of geometric or statistical data transformation [9–16,25–32]. Statistical data transformation via data perturbation [9–16], microaggregation [27], and multiple imputation [28–30] has been widely used for data privacy protection. Recently, a geometric rotation transformation has been proposed that transforms the confidential data records onto another geometric space [25]. One basic requirement for geometric and statistical data transformation is that the particular geometric or statistical properties of the original data records should be preserved effectively and efficiently. For data perturbation, the original data records are altered by adding randomized noise before delivering them to the central site or to the data miner [9–16]. Perturbation can protect individual data records while data mining algorithms are still able to recover aggregate information or to build data mining models from the perturbed data records. However, data perturbation may result in information loss as well as in privacy breaches due to the disclosure of large amounts of perturbed data records [33–35]. Since this transformation-based approach requires sharing large amounts of the transformed data records (i.e., data records after performing the geometric or statistical transformation), it may be too expensive for distributed data mining. (b) The second approach is based on the use of secure multi-party computation (SMC) [17–24]. This approach assumes that the local sites storing the data records cooperate to learn the global data mining results without revealing their original data records. The SMC-based approach can provide perfect data privacy protection, but it is very expensive, since it requires the exchange of large amounts of encrypted data records. (c) The third approach is based on distributed computation for each query [36]. When a participant needs to compute a query, it first asks each participant to compute a local solution for the query and then computes the global solution via an iterative protocol. This approach has two problems. First, the query is disclosed to all participants, so private information may be leaked through the queries. Second, even though the high-cost training step
[Fig. 1 components: weak concept models at data sites 1, …, i, …, M; benefit/risk analysis at each site determining the number of mixture components to be shared; combined classifier training at the central site; feedback to the data sites.]
Fig. 1 Our approach for privacy-preserving distributed model-based classifier training
is removed, every query must still be processed by all the participants, which significantly increases the overall cost.

1.1 Our approach

In this paper, we propose a novel approach for enabling privacy-preserving model-based classifier training that can overcome the drawbacks of the previous approaches. Our approach deals specifically with the problem of the distributed training of a model-based classifier when the data records are stored by multiple data sites, each of which requires its local data records to be kept private. In addition, the statistics on the available training samples at each individual data site may not be representative of the principal statistical properties of the given data pattern or concept.

In its essence, our approach consists of three major phases (see Fig. 1). In the first phase, each data site independently builds its own weak concept model (i.e., local classifier) for the given data pattern or concept by using its local training samples. In the second phase, each site sends only its weak concept model to the central site. No local training samples are sent to the central site, which greatly reduces both the data transmission costs and the potential privacy breaches. At the central site, synthetic samples are directly generated from the collected weak concept models by using the Markov Chain Monte Carlo sampling technique [9,10]. Because the synthetic samples have the same statistical properties as the original data records, they can be used to replace the original data records for training the combined classifier (i.e., global concept model). Finally, the third phase validates the combined classifier by distributing the global concept model to all the data sites in the collaboration network. In order to limit the privacy breaches, we have also developed a computational approach to achieving a good trade-off between the sharing benefit, the privacy disclosure risk, and the data utility.

The paper is organized as follows. Section 2 gives a brief overview of the related works and a comparison with our proposed approach; Sect. 3 introduces our framework for customizable privacy modeling to achieve more flexible privacy protection. In order to develop solid foundations for privacy-preserving distributed classifier training, Sect. 4 and Sect. 5 introduce our adaptive EM algorithm for model-based classifier training and classifier combination. Section 6 presents our distributed approach for privacy-preserving model-based classifier training. Section 7 presents our experimental results for algorithm evaluation. Section 8 concludes the paper.

2 Related works and comparison

Many techniques have been proposed for privacy-preserving data mining in the past, but page limitations do not allow us to survey all of these interesting works. Instead we try to
emphasize the works that are most relevant to our proposed work. The references below should be taken as examples of these related works, not as a complete list of the works in the relevant research areas.

By using perturbed data records for classifier training, Agrawal et al. used the randomized data distortion technique to learn decision trees [11]. A novel technique has also been proposed that applies an expectation maximization (EM) algorithm to the perturbed data records to learn a model-based classifier [12]. Recently, Evfimievski et al. have developed an approach to estimating the privacy breaches incurred when data perturbation is used [13,14]. Ma et al. have incorporated perturbed data records to enable privacy-preserving learning of Bayesian network structure and parameters [16]. Multiplicative data perturbation has also been developed for more effective privacy protection [37,38]. There are two conflicting requirements for data perturbation: (a) strong noise is expected to protect the data privacy effectively; (b) strong noise induces higher information loss. Thus data perturbation may result in information loss as well as in privacy breaches [33–35]. For distributed classifier training, it often requires transmitting large amounts of perturbed data records to achieve reliable classifier training, and thus the data transmission cost may be very high.

The SMC problem was first introduced in [17]. Lindell et al. then applied this technique to the problem of privacy-preserving learning of decision trees [18]. Du et al. have proposed several new SMC protocols for privacy-sensitive statistical data analysis [20,21]. Vaidya et al. have developed several SMC methods for privacy-preserving data mining applications [22,23]. Recently, Wright et al. have incorporated the SMC approach to enable privacy-preserving learning of Bayesian network structure and parameters [24]. As we have already mentioned, such SMC-based approaches are very inefficient and are thus not adequate when large data sets are involved.

Recently, data transformation has been used for data privacy protection. By using a geometric rotation transformation, Chen et al. have proposed a novel approach to enabling privacy-preserving training of rotation-invariant classifiers [25]. Without releasing the confidential original data records for classifier training, automatic synthetic data generation may provide an attractive solution to data privacy protection [27–30]. By incorporating synthetic samples for data privacy protection, Merugu et al. have proposed a new approach to achieving privacy-preserving ensemble classifier training via model averaging [31]. However, most existing techniques for ensemble classifier training, such as voting [39], meta-learning [32], and stacked generalization [39], rely on three basic assumptions: (a) the training samples at each data site are representative enough to reliably learn the local concept model with an acceptable accuracy rate in a certain input space; (b) all the local concept models are learned from the same training data set or from resampled versions of the same training data set and have significant variations in their overall performance (i.e., error diversity); (c) a large-scale validation data set is available for generating the meta-level training data set or for voting on the final decision. Unfortunately, all three of these basic assumptions may not hold when dealing with privacy-preserving distributed classifier training on heterogeneous data sets.
Thus, there is an urgent need to develop new approaches for privacy-preserving distributed model-based classifier training. It is also important to note that the critical information is actually task-specific and different subjects have different privacy preferences, and thus customizable privacy modeling and protection are strongly needed [25,33]. Based on these observations, we have proposed a customizable approach to enabling privacy-preserving distributed model-based classifier training. Our proposed approach has several major differences and advantages with respect to the existing approaches: (a) Our approach does not require sharing large amounts of perturbed data records or encrypted data records for distributed classifier training, and thus the data
transmission costs are reduced significantly. (b) Our approach only requires the weak concept models to be shared with the central site for combined classifier training; the potential privacy breaches are thus reduced drastically. (c) Our approach can support customizable privacy modeling, and thus more flexible privacy protection is effectively achieved.

The major differences between our approach and the technique proposed in [31] are: (1) Our approach integrates both the weak concept models and the synthetic samples to enable more accurate learning of the global concept model. (2) An adaptive EM algorithm is proposed for learning both the weak concept models (i.e., local classifiers) and the global concept model (i.e., combined classifier) accurately by updating the mixture components automatically according to the real class distribution of the training samples. (3) A computational approach is developed to automatically achieve a good trade-off between the sharing benefit, the privacy disclosure risk, and the data utility. Thus our approach can learn a more reliable and complete global concept model (i.e., combined classifier) and results in higher classification accuracy. In addition, the privacy disclosure risk at the data record level is effectively controlled by performing customizable privacy protection.
3 Customizable privacy modeling

All existing techniques for privacy-preserving data mining provide the same level of privacy for all subjects without catering to their personalized needs [25,33]. Thus they may offer insufficient privacy protection for some subjects while applying excessive privacy control to others. According to Alan Westin, “privacy is the claim of individuals, groups, or institutions to determine for themselves when, how and to what extent information is communicated to others” [1]. This definition of privacy emphasizes the fact that different subjects may have different privacy preferences and may want to share different information with different data seekers. We thus need techniques able to support customizable privacy modeling, so that each data holder (i.e., a certain data site in the collaboration network) can specify its individual privacy and the degree of privacy protection under a given collaboration task and environment. In addition, a computational model is needed to quantify the customizable privacy precisely.

As shown in Fig. 2, customizable privacy largely depends on six inter-related factors [40]: (1) the data seeker and the data holder (i.e., two data sites in the collaboration network),
[Fig. 2 components: data seeker, data holder, data application task, privacy judgement, privacy disclosure risk, data utility, benefit analysis, and benefit/risk optimization & mixture components sharing.]
Fig. 2 Our proposed approach to enabling customizable privacy modeling
because the privacy concerns depend on the data sites involved in the collaboration; (2) the data application task, because the critical information is actually task-specific and heavily depends on the underlying application environment; (3) the trust between the data holder and the data seeker; (4) the benefit of data sharing; (5) the privacy disclosure risk; and (6) the data holder's individual perception/judgment about the privacy of the data being shared, because privacy may often have different meanings for different subjects. In collaborative environments, data privacy is also related to a balance between the data holder's concerns about privacy disclosure risks and the benefit of releasing its data. Therefore, the conventional interpretations of privacy as confidentiality or unavailability of individual pieces of information are inadequate in collaborative environments. Motivated by this observation, we propose to integrate all these inter-related factors to model customizable privacy in collaborative environments. In this paper, the data seeker is referred to as the central site and the data holder is a certain data site in the collaboration network.

To achieve customizable privacy modeling, we need to address two key problems: (a) How can the individual data holder's privacy concerns be taken into account? (b) How can the privacy disclosure risk, the data utility, and the sharing benefit be quantified, and a good balance among them be achieved? To address both of these problems, we propose a computational approach to customizable privacy modeling for the specific task of privacy-preserving distributed model-based classifier training. By incorporating synthetic data generation for privacy protection, our approach is able to automatically achieve an acceptable trade-off between the data utility, the sharing benefit, and the privacy disclosure risk. Compared with the existing approaches for privacy modeling, our approach has the following advantages: (a) it can quantify privacy under different contexts; (b) it can adapt to the benefit/risk optimization between the data seeker and the data holder and thus it can support more flexible privacy protection; (c) it can take the personal perception/judgment of privacy into account and thus it is customizable; and (d) it is well suited to collaborative environments.
4 Model-based classifier training

The original data records Ω_{c_j} = {X_l, C_j(X_l) | l = 1, ..., N} that are available at each data site are labeled for classifier training: positive samples that are relevant to the given data pattern or concept C_j and negative samples that are irrelevant to C_j. Each labeled training sample is a pair (X_l, C_j(X_l)) that consists of a set of attributes X_l and the semantic label C_j(X_l) for the corresponding training sample. To exploit the contextual relationships between the given data pattern or concept C_j and the available training samples, we use the finite mixture model to approximate the underlying class distribution of the training samples that are relevant to C_j [41]:

P(X, C_j, Θ_{c_j}) = \sum_{i=1}^{κ_j} P(X | C_j, θ_i) ω_i,    \sum_{i=1}^{κ_j} ω_i = 1        (1)
In the above expression, P(X | C_j, θ_i) is the ith mixture component, which interprets one relevant class of the training samples that are used to characterize the statistical properties of the given data pattern or concept C_j. Θ_{c_j} = {κ_j, θ_{c_j}, ω_{c_j}} is the parameter tuple that includes the model structure, model parameters and weights, where: κ_j is the model structure (i.e., the optimal number of mixture components), θ_{c_j} = {θ_i = (μ_i, σ_i) | i = 1, ..., κ_j} is the set of model parameters (mean μ_i and covariance σ_i) for these κ_j mixture components,
ω_{c_j} = {ω_i | i = 1, ..., κ_j} are the relative weights among these κ_j mixture components. Finally, X is the m-dimensional attribute vector that is used to characterize the relevant training samples.

The maximum likelihood (ML) criterion can be used to determine the underlying model parameters in Eq. (1), but it prefers complex models with more free parameters [41]; thus a penalty term is added to determine the optimal model structure. The optimal parameter set of model structure, weights, and model parameters Θ*_{c_j} = (κ_j, ω_{c_j}, θ_{c_j}) for the given data pattern or concept C_j is then determined by:

Θ*_{c_j} = \arg\max_{Θ_{c_j}} L(C_j, Θ_{c_j})        (2)

where L(C_j, Θ_{c_j}) = -\sum_{X_i ∈ Ω_{c_j}} \log P(X_i, C_j, Θ_{c_j}) + \log p(Θ_{c_j}) is the objective function, -\sum_{X_i ∈ Ω_{c_j}} \log P(X_i, C_j, Θ_{c_j}) is the likelihood term, and

\log p(Θ_{c_j}) = -\frac{m + κ_j + 3}{2} \sum_{l=1}^{κ_j} \log\frac{N ω_l}{12} - \frac{κ_j}{2} \log\frac{N}{12} - \frac{κ_j (N + 1)}{2}

is the minimum description length (MDL) term that penalizes complex models [41]; the MDL term does not depend on any prior assumption about the data distribution, N is the total number of training samples, and m is the dimensionality of the attributes for the training samples X_i ∈ Ω_{c_j}.

The maximum likelihood estimation described in Eq. (2) is usually obtained by using the EM algorithm with a pre-defined model structure κ_j [41–43]. However, different data patterns or concepts should have different model structures because they may relate to different numbers and types of data classes. Without organizing the distribution of mixture components according to the underlying class distribution of the training samples, a mismatch may arise when there are too many mixture components in one sample area and too few in another. In order to address this mismatch problem effectively, we propose an adaptive EM algorithm that achieves more accurate model selection and parameter estimation simultaneously. Our adaptive EM algorithm performs automatic merging, splitting, and elimination to re-organize the distribution of the mixture components and modify the optimal number of mixture components according to the real class distribution of the available training samples. To exploit the most suitable data classes for accurately interpreting the principal statistical properties of the given data pattern or concept C_j, our adaptive EM algorithm starts from a large value of κ_j and takes the major steps shown in Algorithm 1. With the given κ_j, the k-means clustering technique (i.e., k = κ_j) is used to select robust initial values for the model parameters (i.e., the mean and covariance of each cluster). To avoid the local minimum problem, each cluster starts from multiple randomly chosen centers.
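Once a candidate mixture has been fitted, the MDL-penalized objective of Eq. (2) can be scored directly. The following sketch is only an illustration (not the authors' implementation); it assumes full-covariance Gaussian components and reads the criterion as a penalized negative log-likelihood to be minimized:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mdl_objective(X, means, covs, weights):
    """Negative log-likelihood of the mixture plus the MDL penalty of Eq. (2);
    lower values indicate a better candidate model structure."""
    N, m = X.shape
    weights = np.asarray(weights, dtype=float)
    kappa = len(weights)
    # mixture density  sum_i  w_i * N(X | mu_i, sigma_i)
    density = np.zeros(N)
    for mu, cov, w in zip(means, covs, weights):
        density += w * multivariate_normal.pdf(X, mean=mu, cov=cov)
    neg_loglik = -np.sum(np.log(density + 1e-300))
    # MDL penalty corresponding to the -log p(Theta) term
    penalty = (m + kappa + 3) / 2.0 * np.sum(np.log(N * weights / 12.0)) \
              + kappa / 2.0 * np.log(N / 12.0) + kappa * (N + 1) / 2.0
    return neg_loglik + penalty
```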
Algorithm 1: Adaptive EM Algorithm
Inputs: training samples Ω_{c_j}, κ_j = κ_max
Outputs: model structure and model parameters Θ̂_{c_j}
Initialization is done by k-means clustering for the given κ_max;
repeat
  1. calculate the probabilities for the three operations: merging, splitting, and elimination;
  2. select one of these three operations as the optimal operation;
  3. perform the EM algorithm for simultaneous model selection and parameter estimation;
until the algorithm converges.
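A compact skeleton of the loop in Algorithm 1 is shown below; run_em, propose_operations, and accept are hypothetical placeholders for the EM pass, the merge/split/eliminate proposals of Eqs. (4)-(5), and the acceptance test of Eq. (6), so this is a sketch of the control flow rather than the authors' code:

```python
def adaptive_em(X, kappa_max, run_em, propose_operations, accept, max_iter=50):
    """Skeleton of Algorithm 1: start from kappa_max components and alternate
    EM re-estimation with merge/split/eliminate proposals until convergence."""
    model = run_em(X, n_components=kappa_max)        # k-means init + EM inside
    for _ in range(max_iter):
        proposals = propose_operations(model, X)     # probabilities from Eqs. (4)-(5)
        operation = max(proposals, key=lambda op: op.probability)
        candidate = run_em(X, init=operation.apply(model))
        if accept(model.objective, candidate.objective):  # acceptance test of Eq. (6)
            model = candidate                        # keep the re-organized mixture
        else:
            break                                    # no accepted operation: converged
    return model
```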
To determine the underlying optimal model structure, we use two criteria to perform automatic splitting, merging, and elimination of mixture components: (a) fitness between one specific mixture component and the local distribution of the relevant training samples;
(b) overlapping between the mixture components from the same concept model or from different concept models.

Our adaptive EM algorithm uses the symmetric Jensen–Shannon (JS) divergence (i.e., the intra-concept JS divergence) JS(C_j, θ_l, θ_k) to measure the divergence between two mixture components P(X | C_j, θ_l) and P(X | C_j, θ_k) from the same concept model P(X, C_j, Θ_{c_j}):

JS(C_j, θ_l, θ_k) = H(π_1 P(X | C_j, θ_l) + π_2 P(X | C_j, θ_k)) - π_1 H(P(X | C_j, θ_l)) - π_2 H(P(X | C_j, θ_k))        (3)
where H(P(·)) = -\sum P(·) \log P(·) is the well-known Shannon entropy, and π_1 and π_2 are weights defining the relative importance of the two mixture components. In our current experiments, we set π_1 = N_l/(N_l + N_k) and π_2 = N_k/(N_l + N_k), where N_l and N_k are the numbers of training samples for the corresponding mixture components P(X | C_j, θ_l) and P(X | C_j, θ_k). If the intra-concept JS divergence JS(C_j, θ_l, θ_k) is small, these two mixture components are strongly overlapped and may overpopulate the real distribution of the relevant training samples (i.e., mismatch with the relevant training samples); thus they are merged into a single mixture component P(X | C_j, θ_{lk}). In addition, the local JS divergence JS(C_j, θ_{lk}) is used to measure the divergence between the merged mixture component P(X | C_j, θ_{lk}) and the local density of the training samples, where the local sample density is modeled as the empirical distribution weighted by the posterior probability. Our adaptive EM algorithm tests κ_j(κ_j - 1)/2 pairs of mixture components that could be merged, and the pair with the minimum value of the local JS divergence is selected as the best candidate for merging.

Two types of mixture components may be split: (a) the elongated mixture components which underpopulate the relevant training samples (i.e., characterized by the local JS divergence); (b) the tailed mixture components which overlap with the mixture components that are used to approximate the class distributions of the negative samples for interpreting other data patterns or concepts (i.e., characterized by the inter-concept JS divergence). To select the mixture component for splitting, two criteria are combined: (1) the local JS divergence JS(C_j, S_i, θ_i), which characterizes the divergence between the ith mixture component P(X | C_j, θ_i) and the local density of the training samples; (2) the inter-concept JS divergence JS(C_j, C_h, θ_i, θ_m), which characterizes the overlapping between the mixture components P(X | C_j, θ_i) and P(X | C_h, θ_m) from different data patterns or concepts C_j and C_h.

If one specific mixture component is supported by only a few training samples, it may be removed from the concept model. To determine the unrepresentative mixture component for elimination, our adaptive EM algorithm uses the local JS divergence JS(C_j, θ_i) to characterize how well the mixture component P(X | C_j, θ_i) represents the relevant training samples. The mixture component with the maximum value of the local JS divergence is selected as the candidate for elimination. Eliminating the unrepresentative mixture components located at the boundaries of the concept models maximizes the margins among the concept models for different data patterns or concepts and results in higher classification accuracy.

To jointly optimize the operations of merging, splitting and elimination, their probabilities are defined as:

J_m(i, k, θ_{ik}) = JS(C_j, θ_{ik}) + ϕ · JS(C_j, θ_i, θ_k)
J_s(i, m, θ_i) = ϕ · \frac{JS(C_j, C_h, θ_i, θ_m)}{JS(C_j, θ_i)}
J_e(i, θ_i) = \frac{ϕ}{JS(C_j, θ_i)}        (4)
where ϕ is a normalization factor determined by:

\sum_{i=1}^{κ_j} J_e(i, θ_i) + \sum_{i=1}^{κ_j} \sum_{k=i+1}^{κ_j} J_m(i, k, θ_{ik}) + \sum_{i=1}^{κ_j} \sum_{m=1}^{κ_h} J_s(i, m, θ_i) = 1        (5)
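The JS divergences used above have no closed form for Gaussian components, so in practice they can be estimated by sampling. The sketch below is an illustrative Monte-Carlo approximation of the intra-concept divergence of Eq. (3) (assuming multi-dimensional attributes); it is not the authors' implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def intra_concept_js(mu_l, cov_l, n_l, mu_k, cov_k, n_k, n_samples=5000, seed=0):
    """Monte-Carlo estimate of the weighted JS divergence of Eq. (3)
    between two Gaussian mixture components."""
    rng = np.random.default_rng(seed)
    pi_l, pi_k = n_l / (n_l + n_k), n_k / (n_l + n_k)
    p_l = multivariate_normal(mu_l, cov_l)
    p_k = multivariate_normal(mu_k, cov_k)
    mix_pdf = lambda x: pi_l * p_l.pdf(x) + pi_k * p_k.pdf(x)

    def entropy(pdf, samples):
        # H(p) = -E_p[log p], estimated on samples drawn from p
        return -np.mean(np.log(pdf(samples) + 1e-300))

    x_l = p_l.rvs(size=n_samples, random_state=rng)
    x_k = p_k.rvs(size=n_samples, random_state=rng)
    # draw from the weighted mixture by picking a component per sample
    pick_l = rng.random(n_samples) < pi_l
    x_mix = np.where(pick_l[:, None], x_l, x_k)
    return entropy(mix_pdf, x_mix) - pi_l * entropy(p_l.pdf, x_l) - pi_k * entropy(p_k.pdf, x_k)
```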
The acceptance probability, which prevents the merging, splitting, and elimination operations from producing poor results, is defined by:

P_{accept} = \min\{\exp(-\frac{|L(C_j, Θ_1) - L(C_j, Θ_2)|}{τ}), 1\}        (6)

where L(C_j, Θ_1) and L(C_j, Θ_2) are the objective functions for the models Θ_1 and Θ_2 (i.e., before and after performing the merging, splitting or elimination operation) as described in Eq. (2), and τ is a constant that is determined experimentally. τ is set to τ = 9.8 and is uniform over all the data sets used in our current experiments.

By optimizing these three operations jointly, our adaptive EM algorithm achieves the following advantages: (a) It does not require a careful initialization of the model structure and model parameters. By starting from a reasonably large number of mixture components, our adaptive EM algorithm is able to automatically select the optimal model structure and capture the essential structure of the data classes by performing automatic merging, splitting and elimination of mixture components. Thus, it is able to achieve a better approximation of the real class distribution for the given data pattern or concept by running the local search from many different starting points. (b) By integrating the negative samples to maximize the margins among the model-based classifiers for different data patterns or concepts, it improves the prediction power and generalization ability of the model-based classifiers through discriminative learning of finite mixture models. (c) It addresses the mismatch problem effectively by re-organizing the distribution of mixture components and modifying the optimal number of mixture components according to the underlying class distribution of the available training samples. (d) By performing automatic merging, splitting, and elimination of mixture components, it supports a more effective solution for distributed model-based classifier training (see Sect. 5).
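Returning to the acceptance test of Eq. (6), it is straightforward to apply directly; a minimal sketch, using τ = 9.8 as reported above, follows:

```python
import numpy as np

def accept_operation(L_before, L_after, tau=9.8, rng=None):
    """Accept the result of a merge/split/eliminate operation with the
    probability P_accept of Eq. (6)."""
    rng = rng or np.random.default_rng()
    p_accept = min(np.exp(-abs(L_before - L_after) / tau), 1.0)
    return rng.random() < p_accept
```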
5 Distributed model-based classifier training

For a given data pattern or concept C_j, each data site in the collaboration network can learn a weak concept model (i.e., local classifier) independently by using its own training samples (see Fig. 1). Our technique for model-based classifier training described in Sect. 4 is first used to select the optimal model structures and estimate accurate model parameters for these M weak concept models. Because the training samples distributed at the M data sites may be heterogeneous and incomplete, it is impossible for any one of the M data sites to individually learn a global concept model that accurately interprets the given data pattern or concept C_j. To achieve distributed classifier training, large amounts of training samples with diverse statistical properties would have to be collected from the M data sites for combined classifier training at the central site. However, sending large amounts of training samples to the central site is too expensive and may also result in privacy breaches. To reduce the data transmission costs and the potential privacy breaches, each data site shares only its weak concept model P(X, C_j, Θ_{c_j}) and the value N_i, the total number of training samples that were used to learn the corresponding weak concept model.
To enable distributed model-based classifier training, synthetic samples are directly generated from the shared weak concept models at the central site by using the Markov Chain Monte Carlo sampling technique [9,10]. The number of synthetic samples generated from the ith weak concept model is controlled by using the shared value N_i. We refer to the training samples that are directly generated from the M weak concept models as synthetic samples because they are not obtained from the original data records. The synthetic samples have the same statistical properties as the original data records because both originate from the same mixture density function (i.e., the same weak concept model), and thus such synthetic samples can be used to replace the original data records for combined classifier training [27–30]. In addition, the synthetic samples are sufficiently different from the original data records, and thus they are able to protect the privacy of the original data records. Generating the synthetic samples from the M weak concept models at the central site can therefore significantly reduce the privacy breaches and drastically reduce the data transmission costs.

Because the training samples distributed at the M data sites are heterogeneous and incomplete, each of the M weak concept models is only able to characterize different useful aspects (i.e., different statistical properties in different parts of the input space) of the given data pattern or concept C_j. Thus, the individual prediction outputs of these M weak concept models are too uncertain to be useful for learning or boosting a reliable combined classifier, and most existing techniques for classifier combination may fail, as pointed out in [39]. In addition, the training samples that are distributed at the M data sites may be redundant; thus some of their mixture components may overlap, that is, common data classes may appear in multiple weak concept models because multiple data sites may have some common data records; for example, different supermarkets may have common customers. For some interesting data classes (i.e., some interesting statistical properties of the given data pattern or concept C_j), a single data site may not have enough training samples to learn the corresponding mixture components accurately. Thus, integrating the weak concept models shared by the M data sites may improve the estimation of such interesting data classes (i.e., mixture components) and achieve a more accurate interpretation of the given data pattern or concept C_j. Instead of using their unreliable prediction outputs for combined classifier training, we directly integrate these M weak concept models to achieve a better approximation of the principal statistical properties for interpreting the given data pattern or concept C_j. In addition, each data site also shares the total number of training samples (i.e., N_i) that are used to learn the corresponding weak concept model, and N_i is further used to determine the relative weights of the mixture components from the ith data site in the global concept model.

Based on these observations, our approach for combined classifier training is organized according to the following steps: (a) The weak concept models of all M data sites are combined to obtain a "pseudo-complete" global concept model for interpreting the given data pattern or concept C_j accurately and completely.
(b) The synthetic samples are automatically generated by using these shared weak concept models and the number of synthetic samples generated from the ith weak concept model is controlled by the value of Ni (i.e., the total number of training samples that are used to learn the ith weak concept model). These synthetic samples generated from different weak concept models (i.e., combined synthetic samples) are integrated for combined classifier training. (c) Based on the available weak concept models shared by these M data sites, our adaptive EM algorithm is used to select the optimal model structure and estimate the accurate model parameters simultaneously by performing automatic merging, splitting, and elimination of mixture components. The mixture components with less prediction power on the combined synthetic samples are eliminated from the global concept model. The overlapped mixture components are merged as a single
mixture component. The elongated mixture components that underpopulate the combined synthetic samples are split into multiple representative mixture components.

By integrating all these M weak concept models shared from the M data sites, the global concept model for interpreting the given data pattern or concept C_j is defined as:

P(X, C_j, Θ̂_{c_j}) = \sum_{j=1}^{M} \frac{N_j}{\sum_{j=1}^{M} N_j} \sum_{l=1}^{κ_j} P(X | C_j, θ_l) ω_l        (7)
where κ_j is the total number of mixture components for the jth weak concept model from the jth data site, and N_j is the total number of training samples that are available at the jth data site for learning the jth weak concept model. By integrating the normalization factors N_j / \sum_{j=1}^{M} N_j with the original weights ω_l, the global concept model can be refined as:

P(X, C_j, Θ̂_{c_j}) = \sum_{l=1}^{κ} P(X | C_j, θ_l) β_l,    \sum_{l=1}^{κ} β_l = 1        (8)

where Θ̂_{c_j} = {κ, θ_l, β_l | l = 1, ..., κ} is the parameter tuple that includes the model structure, model parameters and weights, κ = \sum_{j=1}^{M} κ_j is the total number of mixture components shared from these M data sites, κ_j is the total number of mixture components for the jth weak concept model, and β_l = N_j ω_l / \sum_{j=1}^{M} N_j is the re-normalized weight.
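The combined synthetic samples used to re-estimate this global mixture are drawn from the shared weak concept models, with the number of draws per site controlled by the shared N_i values. The sketch below uses plain ancestral sampling from each Gaussian mixture rather than a full Markov Chain Monte Carlo sampler; it is a simplified illustration, not the authors' code:

```python
import numpy as np

def synthesize_from_weak_models(weak_models, rng=None):
    """weak_models: list of dicts, one per data site, each with keys
    'N' (shared sample count), 'weights', 'means', 'covs' describing the
    shared Gaussian mixture components. Returns pooled synthetic samples."""
    rng = rng or np.random.default_rng()
    synthetic = []
    for model in weak_models:
        weights = np.asarray(model['weights'], dtype=float)
        # number of draws per mixture component, proportional to its weight
        counts = rng.multinomial(model['N'], weights / weights.sum())
        for count, mu, cov in zip(counts, model['means'], model['covs']):
            if count > 0:
                synthetic.append(rng.multivariate_normal(mu, cov, size=count))
    return np.vstack(synthetic)
```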
If one mixture component P(X | C_j, θ_m) is eliminated, the global concept model for accurately interpreting the given data pattern or concept C_j is then refined as:

P(X, C_j, Θ̂_{c_j}) = \frac{1}{1 - β_m} \sum_{l=1, l ≠ m}^{κ-1} P(X | C_j, θ_l) β_l        (9)
If two mixture components P(X | C_j, θ_m) and P(X | C_j, θ_l) are merged into a single mixture component P(X | C_j, θ_{ml}), the relevant model parameters are updated as follows:

κ = κ - 1,    β_{ml} = β_m + β_l,    μ_{ml} = \frac{β_m μ_m + β_l μ_l}{β_{ml}}        (10)

σ_{ml} = \frac{β_m σ_m + β_l σ_l}{β_{ml}} + \frac{β_m β_l}{β_{ml}^2} (μ_m - μ_l)(μ_m - μ_l)^T        (11)
where θ_{ml} = {μ_{ml}, σ_{ml}}, and μ_{ml} and σ_{ml} are the mean and covariance of the merged mixture component P(X | C_j, θ_{ml}). Thus, the global concept model for accurately interpreting the given data pattern or concept C_j is refined as:

P(X, C_j, Θ̂_{c_j}) = \sum_{h=1}^{κ-2} P(X | C_j, θ_h) β_h + P(X | C_j, θ_{ml}) β_{ml}        (12)
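The merge update of Eqs. (10)-(12) is a moment-matching step that can be applied directly to the component parameters; a minimal sketch (assuming numpy arrays for the means and covariances) is shown below:

```python
import numpy as np

def merge_components(beta_m, mu_m, cov_m, beta_l, mu_l, cov_l):
    """Moment-matching merge of two mixture components, Eqs. (10)-(11)."""
    beta_ml = beta_m + beta_l
    mu_ml = (beta_m * mu_m + beta_l * mu_l) / beta_ml
    diff = (mu_m - mu_l).reshape(-1, 1)
    cov_ml = (beta_m * cov_m + beta_l * cov_l) / beta_ml \
             + (beta_m * beta_l / beta_ml**2) * (diff @ diff.T)
    return beta_ml, mu_ml, cov_ml
```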
If one mixture component P(X | C_j, θ_h) is split into two new mixture components P(X | C_j, θ_r) and P(X | C_j, θ_t), the relevant model parameters are updated as follows:

κ = κ + 1,    β_t = β_r = \frac{1}{2} β_h,    μ_r = μ_h + \frac{1}{2} α        (13)

μ_t = μ_h - \frac{1}{2} α,    σ_r = σ_t = σ_h - \frac{1}{4} α α^T        (14)
where α is a pre-defined m-dimensional perturbation vector. Thus, the global concept model for accurately interpreting the given data pattern or concept C_j is refined as:

P(X, C_j, Θ̂_{c_j}) = \sum_{h=1}^{κ-1} P(X | C_j, θ_h) β_h + P(X | C_j, θ_r) β_r + P(X | C_j, θ_t) β_t        (15)
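The split update of Eqs. (13)-(15) is equally mechanical once the perturbation vector α has been chosen (the choice of α is discussed next); the sketch below simply takes α as an input and is an illustration rather than the authors' code:

```python
import numpy as np

def split_component(beta_h, mu_h, cov_h, alpha):
    """Split one component into two along the perturbation vector alpha,
    following Eqs. (13)-(14)."""
    alpha = np.asarray(alpha, dtype=float)
    beta_r = beta_t = 0.5 * beta_h
    mu_r = mu_h + 0.5 * alpha
    mu_t = mu_h - 0.5 * alpha
    a = alpha.reshape(-1, 1)
    # note: alpha must be small enough that the result stays positive definite
    cov_r = cov_t = cov_h - 0.25 * (a @ a.T)
    return (beta_r, mu_r, cov_r), (beta_t, mu_t, cov_t)
```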
One approach to computing α is to use the principal axis of σ_h, because it has the largest deviation. However, the principal axis may not be a suitable split direction, as discussed in [44]. Therefore, to find a better split direction, we compute the projection pursuit index value proposed in [45] along all eigenvectors ev_j of σ_h, and the direction with the smallest index value is used to compute α: α = λ_j ev_j, where ev_j is the eigenvector with the smallest index value and λ_j is the corresponding eigenvalue.

By updating the mixture components automatically according to the real class distribution of the combined synthetic samples, our algorithm for combined classifier training is expected to derive a more reliable and complete global concept model that is able to interpret the principal statistical properties of the given data pattern or concept C_j effectively. To validate the combined classifier, the global concept model is distributed to all the data sites in the collaboration network. Obviously, distributing the global concept model may also induce privacy breaches, and a model privacy protection technique is needed to enable privacy-preserving distributed model-based classifier training.

6 Privacy-preserving distributed classifier training

By treating these M data sites as M horizontally partitioned data sources with potential data overlapping, we propose a distributed approach to privacy-preserving model-based classifier training that automatically achieves a good trade-off between the data utility, the sharing benefit, and the risk of privacy breaches.

6.1 Model quality

To support more effective training of the model-based classifier, it is very important to develop new frameworks for evaluating the quality of the global concept model. The quality of the global concept model D(Θ̂_{c_j}, Θ̄_{c_j}) is defined as:

D(Θ̂_{c_j}, Θ̄_{c_j}) = D(P(X, C_j, Θ̂_{c_j}), P(X, C_j, Θ̄_{c_j}))        (16)
where P(X, C_j, Θ̂_{c_j}) is the global concept model that is learned by using our approach, P(X, C_j, Θ̄_{c_j}) is the mean model [31], and D(·, ·) is the JS divergence. The mean model is defined as:

P(X, C_j, Θ̄_{c_j}) = \sum_{j=1}^{M} \frac{N_j}{\sum_{j=1}^{M} N_j} P(X, C_j, Θ_{c_j})        (17)
where P(X, C_j, Θ_{c_j}) is the weak concept model that is shared from the jth data site, and N_j is the total number of training samples that are used to learn the jth weak concept model P(X, C_j, Θ_{c_j}). One can observe that the relative importance (i.e., the weights) of the different data classes in the mean model is simply determined by the sizes of the training sets at the corresponding data sites. On the other hand, the relative importance (i.e., the weights) of the different data classes in our global concept model is automatically determined by our adaptive EM algorithm according to the real class distribution of the training samples (i.e., the shared
mixture components are updated automatically according to the real class distribution of the combined synthetic samples). Thus our approach is able to learn a more reliable and complete global concept model, which results in higher prediction power.

To enable privacy-preserving distributed classifier training, it is also very important to enhance an individual data site's ability to control other data sites' usage of its weak concept model, because sharing the weak concept models (i.e., data mining results) may also result in privacy breaches [8]. Thus, a good trade-off between the privacy disclosure risk and the quality of the global concept model (i.e., the accuracy of the global concept model for interpreting the principal statistical properties of the given data pattern or concept) can be achieved by controlling the number of mixture components to be shared with the central site. To obtain maximum privacy, a strategy would be to share only part of the mixture components with the central site; however, this strategy decreases the quality of the global concept model, and the performance of the combined classifier may also decrease. To enhance the quality of the global concept model and improve the classifier's performance, a strategy would be to share all the mixture components, which however results in a higher risk of privacy breaches. Thus, there is the need to achieve a good balance between the privacy disclosure risk and the quality of the global concept model.

6.2 Automatic benefit/risk optimization

In our proposed approach for distributed model-based classifier training, there are two steps that may induce privacy breaches: (a) sharing the weak concept models with the central site for combined classifier training; (b) distributing the global concept model to all the individual data sites in the collaboration network for classifier validation. To enable privacy-preserving distributed model-based classifier training, new techniques are needed to automatically estimate and control the potential privacy breaches when the weak concept model and the global concept model are shared among the data sites in the collaboration network.

6.2.1 Benefit/risk optimization for combined classifier training

To enable customizable benefit/risk optimization for sharing the weak concept model in the combined classifier training procedure, we define two data sets for the jth data site in the collaboration network: (1) the set Ω of original data records that are available at the jth data site; and (2) the set Ψ of synthetic samples that can be generated from the weak concept model of the jth data site when that weak concept model is shared with the central site. To customize and quantify the privacy disclosure risk R(Ω, Ψ) for the jth data site, we have developed a computational approach to quantifying the following four types of privacy disclosure risk: (1) the re-identification disclosure risk δ_r(Ω, Ψ): the risk of disclosing the one-to-one relationship between the synthetic samples (i.e., generated from the jth weak concept model) and the original data records; (2) the linkage disclosure risk ρ(Ω, Ψ): the risk of re-identifying the values of the original data records by linking the synthetic samples with other publicly available data sets or with the data records that are available at other collaborating data sites; (3) the confidentiality-interval inference risk φ(Ω, Ψ): the risk of disclosing tight bounds on the interval values of the original data records; (4) the statistical inference risk δ_s(Ω, Ψ): the risk of disclosing confidential data statistics.

The re-identification disclosure risk δ_r(Ω, Ψ) is defined as:

δ_r(Ω, Ψ) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N_s} δ_{ij}}{N · N_s},    δ_{ij} = \begin{cases} ρ_r, & X_i = Z_j \\ 0, & \text{otherwise} \end{cases}        (18)
where N and N_s are the total numbers of original data records and synthetic samples, and ρ_r is the privacy disclosure risk for re-identification of the ith original data record X_i by using the jth synthetic sample Z_j. The confidentiality-interval inference risk φ(Ω, Ψ) is defined as:

φ(Ω, Ψ) = \frac{1}{N · N_s} \sum_{i=1}^{N} \sum_{j=1}^{N_s} \frac{1}{\log(λ_c |X_i - Z_j|)}        (19)
where |X_i − Z_j| is the approximation accuracy obtained by using the jth synthetic sample Z_j to approximate the ith original data record X_i; the optimal value of this approximation accuracy is set to 0.001, and thus λ_c is set to 1,000 in our current experiments to avoid division by zero. The linkage disclosure risk ρ(Ω, Ψ) is defined as:

ρ(Ω, Ψ) = \frac{1}{\log(λ_c Δ(Ω, Ψ))}        (20)
where λ_c is set to 1,000 in our current experiments to avoid division by zero, and Δ(Ω, Ψ) is the information loss incurred by using the synthetic samples to approximate the original data records:

Δ(Ω, Ψ) = \frac{1}{N · N_s} \sum_{i=1}^{N} \sum_{j=1}^{N_s} \frac{|X_i - Z_j|}{0.5 |X_i + Z_j|}        (21)
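The information loss of Eq. (21) and the risks of Eqs. (19)-(20) are simple pairwise statistics between the original records and the synthetic samples. A direct sketch, assuming Euclidean norms for |·| and small guards against log(0), follows:

```python
import numpy as np

def pairwise_stats(originals, synthetics, lambda_c=1000.0):
    """Information loss (Eq. 21), linkage risk (Eq. 20) and
    confidentiality-interval risk (Eq. 19) between the original records
    (N x m) and the synthetic samples (Ns x m)."""
    # pairwise distances and scaled sums (memory grows with N * Ns)
    diffs = np.linalg.norm(originals[:, None, :] - synthetics[None, :, :], axis=2)
    sums = 0.5 * np.linalg.norm(originals[:, None, :] + synthetics[None, :, :], axis=2)
    info_loss = np.mean(diffs / (sums + 1e-12))                      # Eq. (21)
    linkage_risk = 1.0 / np.log(lambda_c * info_loss)                # Eq. (20)
    interval_risk = np.mean(1.0 / np.log(lambda_c * diffs + 1e-12))  # Eq. (19)
    return info_loss, linkage_risk, interval_risk
```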
The statistical inference risk δ_s(Ω, Ψ) is defined as:

δ_s(Ω, Ψ) = \begin{cases} ρ_c, & r_c(Ω) = r_c(Ψ) \\ 0, & \text{otherwise} \end{cases}        (22)

where r_c(Ω) and r_c(Ψ) are the confidential data statistics that can be extracted from the original data set Ω and the synthetic sample set Ψ. From Eqs. (20) and (21), one can observe that a weak concept model with higher information loss will decrease the linkage disclosure risk, because its synthetic samples differ more from the relevant original data records.

The privacy disclosure risk R(Ω, Ψ) is defined as:

R(Ω, Ψ) = α_1 δ_r(Ω, Ψ) + α_2 ρ(Ω, Ψ) + α_3 φ(Ω, Ψ) + α_4 δ_s(Ω, Ψ),    α_1 + α_2 + α_3 + α_4 = 1        (23)
where α_1, α_2, α_3, and α_4 are weighting factors denoting the relative importance of δ_r(Ω, Ψ), ρ(Ω, Ψ), φ(Ω, Ψ), and δ_s(Ω, Ψ). To enable customizable privacy modeling, each data site can select different values for these weighting factors according to its individual concerns about the various types of privacy disclosure risk. Because the re-identification disclosure is more critical than the others, we set α_1 = 0.4 and α_2 = α_3 = α_4 = 0.2 in our current experiments. Obviously, a weak concept model with higher quality may result in higher utility of its synthetic samples and lower information loss; thus the utility U(Ω, Ψ) of the synthetic sample set Ψ is defined as the inverse of the information loss [46,47]:

U(Ω, Ψ) = \frac{1}{1 + Δ(Ω, Ψ)}        (24)
By incorporating the original data set Ω or the synthetic sample set Ψ for classifier training, the performance of the combined classifier may differ because of the information loss. This performance difference is used to quantify the sharing benefit Υ(Ω, Ψ):

Υ(Ω, Ψ) = E(|L(Ω) - L(Ψ)|^2)        (25)

where |L(Ω) − L(Ψ)|² is the performance difference between incorporating the original data set Ω and the synthetic sample set Ψ for classifier training, and E(·) is the expectation of |L(Ω) − L(Ψ)|² [46,47]. L(Ψ) is defined as the classification accuracy obtained by incorporating the synthetic sample set Ψ for classifier training, and L(Ω) is the classification accuracy obtained by incorporating the original data set Ω for classifier training.

It is important to note that the quality of the synthetic sample set Ψ largely depends on the number of mixture components to be shared with the central site; that is, if more mixture components are shared, then more principal statistical properties of the original data records can be characterized precisely. On the other hand, the privacy disclosure risk R(Ω, Ψ), the sharing benefit Υ(Ω, Ψ), and the data utility U(Ω, Ψ) also depend on the quality of the synthetic sample set Ψ. Thus the privacy disclosure risk R(Ω, Ψ), the sharing benefit Υ(Ω, Ψ), and the data utility U(Ω, Ψ) for the jth data site explicitly depend on the number of mixture components to be shared with the central site. For the jth data site in the collaboration network, it is very important to determine the optimal number N_optimal of mixture components to be shared with the central site; N_optimal is automatically determined by achieving a good trade-off between the privacy disclosure risk R(Ω, Ψ), the sharing benefit Υ(Ω, Ψ), and the data utility U(Ω, Ψ):

N_{optimal} = \arg\max_{N ∈ [1, ..., κ_j]} \{Υ(Ω, Ψ) + λ U(Ω, Ψ)\}    subject to: R(Ω, Ψ) < δ        (26)

where 0 < λ ≤ 1 is a weighting factor and δ is the upper bound on the privacy disclosure risk accepted by the jth data site when its weak concept model is shared. By determining the optimal number of mixture components to be shared with the central site, our approach is able to perform privacy-preserving distributed model-based classifier training while limiting the potential privacy disclosure risk effectively. It is important to note that our approach is able to achieve customizable privacy modeling and protection, and each data site in the collaboration network has full control over the optimal number of mixture components to be shared for combined classifier training. Thus each data site has full control over its privacy disclosure risk by achieving a good trade-off between the data utility, the sharing benefit, and the privacy disclosure risk.
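The selection of N_optimal in Eq. (26) is then a small constrained search over the number of mixture components to share. The sketch below assumes hypothetical helpers risk(n), benefit(n), and utility(n) that evaluate R(Ω, Ψ), Υ(Ω, Ψ), and U(Ω, Ψ) for the synthetic samples generated from the first n shared components:

```python
def choose_shared_components(kappa_j, risk, benefit, utility, lam=0.5, delta=0.1):
    """Pick the number of mixture components to share (Eq. 26):
    maximize benefit + lam * utility subject to risk < delta."""
    best_n, best_score = None, float('-inf')
    for n in range(1, kappa_j + 1):
        if risk(n) >= delta:           # privacy constraint violated
            continue
        score = benefit(n) + lam * utility(n)
        if score > best_score:
            best_n, best_score = n, score
    return best_n
```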
6.2.2 Privacy-preserving classifier validation

When the global concept model (i.e., the combined classifier) is available, it is distributed to all the data sites in the collaboration network for classifier validation. Each data site can thus obtain the mixture components that are used to characterize the principal statistical properties of the other data sites in the collaboration network. These mixture components in the global concept model can be categorized into two groups: (a) the common mixture components that also exist in the weak concept model of the jth data site; (b) the new mixture components that do not exist in the weak concept model of the jth data site. Because these shared mixture components are used to characterize the principal statistical properties of the other data sites in the collaboration network, dishonest data sites may use the global concept model to infer other data sites' private data. By knowing the common mixture components from the other M − 1 data sites, a dishonest data site may use its own data records to infer the private information of the other M − 1 sites via linkage analysis, i.e., by identifying the privacy-sensitive data records from the synthetic samples that can be generated from the common mixture components. Even if the dishonest data sites do not know the exact correspondences between these common mixture components and the relevant data sites (i.e., the common mixture components may come from one of the M − 1 data sites or from all of them), supporting such linkage analysis may still leak some private information. To avoid the misuse of the common mixture components, model perturbation is used by adding noise to the global concept model before it is distributed to the individual data sites. In our current implementation, Gaussian functions are used to represent the mixture components. Each mixture component is interpreted by:

P(X | C_j, θ_i) = \frac{1}{(2π)^{m/2} |σ_i|^{1/2}} e^{-\frac{1}{2}(X - μ_i)^T σ_i^{-1} (X - μ_i)}        (27)
where θ_i = {μ_i, σ_i} contains the mean and covariance of the mixture component P(X | C_j, θ_i). By adding an additional multivariate noise function with zero mean and covariance dσ_i (shown in Fig. 3), the perturbed mixture component is then defined as:

P(X | C_j, θ̃_i) = \frac{1}{(2π)^{m/2} |σ̃_i|^{1/2}} e^{-\frac{1}{2}(X - μ_i)^T σ̃_i^{-1} (X - μ_i)}        (28)
where θ̃_i = {μ_i, σ̃_i} contains the mean and covariance of the perturbed mixture component P(X | C_j, θ̃_i), σ̃_i = (1 + d)σ_i, and the constant d is the parameter that adjusts the strength of the additional multivariate noise. By adding the additional multivariate noise to the mixture components, the synthetic samples generated from the corresponding mixture components
Fig. 3 The original Gaussian mixture component with μ = 0 and σ = 1, and the perturbed mixture components with different noise strengths: d = 0.5 and d = 1
are randomized, and thus our model perturbation technique is able to prevent accurate linkage analysis with acceptable effectiveness. In addition, adding the multivariate noise at the model level (i.e., at the level of mixture components) is much easier and more effective than perturbing the data records directly.

Model perturbation provides privacy protection but also results in a loss of the model's approximation accuracy (i.e., model quality). The loss of model quality is defined as:

D(Θ̂_{c_j}, Θ̌_{c_j}) = D(P(X, C_j, Θ̂_{c_j}), P(X, C_j, Θ̌_{c_j}))        (29)

where D(·, ·) is the JS divergence, P(X, C_j, Θ̂_{c_j}) is the original global concept model, and P(X, C_j, Θ̌_{c_j}) is the perturbed global concept model; they are defined as:

P(X, C_j, Θ̂_{c_j}) = \sum_{l=1}^{κ} P(X | C_j, θ_l) β_l,    \sum_{l=1}^{κ} β_l = 1        (30)

P(X, C_j, Θ̌_{c_j}) = \sum_{l=1}^{κ} P(X | C_j, θ̃_l) β_l,    \sum_{l=1}^{κ} β_l = 1        (31)
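The covariance inflation of Eq. (28) is the only change applied to the global concept model before it is redistributed; a minimal sketch of that perturbation step (a simplification of the multivariate-noise description above, not the authors' code) follows:

```python
import numpy as np

def perturb_global_model(components, d=0.5):
    """Inflate each component covariance by (1 + d), following Eq. (28), so that
    samples generated from the distributed model are randomized."""
    perturbed = []
    for beta, mu, cov in components:
        perturbed.append((beta, np.array(mu, copy=True), (1.0 + d) * np.asarray(cov)))
    return perturbed
```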
By determining the relationship between the loss of model quality D(Θ̂_{c_j}, Θ̌_{c_j}) and the strength d of the additional multivariate noise, we can track the effectiveness of our model perturbation technique for privacy protection.

Because each data site has a reasonable number of original data records, the dishonest data sites can first use their own data records to learn a prediction model for inferring other data sites' private information. However, each data site may not have enough data records to learn such a prediction model with acceptable prediction power, and thus a dishonest data site may not obtain reliable inferences about other data sites' private data by using only its own data records. Distributing the new mixture components (i.e., the principal statistical properties of other data sites in the collaboration network) may therefore pose a new challenge for privacy protection. By treating the prediction model that is learned from its own data records as a weak concept model, a dishonest data site can generate large amounts of synthetic samples from the shared new mixture components to improve its prediction model (i.e., incorporating unlabeled samples (synthetic samples) for model-based classifier training [40,48,49]). Because the dishonest data sites may not know the exact correspondences between the new mixture components and the relevant data sites, they can treat the synthetic samples generated from the shared new mixture components as unlabeled samples. It is widely accepted that unlabeled samples (i.e., synthetic samples generated from the shared new mixture components) can improve classifier training significantly [40,48,49]; thus distributing the new mixture components (i.e., the global concept model) may potentially induce probabilistic privacy breaches. In this paper, we argue that this widely accepted conclusion does not hold in at least two cases: (a) the dishonest data sites may not have representative data records that can be used to initiate the learning of such a privacy predictor, i.e., they do not have suitable data records for interpreting the principal statistical properties of other data sites and predicting their private data; (b) the dishonest data sites may have a limited number of such data records that can be used to learn the privacy predictor, but its predictions are not reliable. When either of these two cases arises, the dishonest data sites cannot confidently use the new mixture components to infer other data sites' private data. Our experimental results confirm these arguments, and thus our approach can protect the privacy of the global concept model effectively.
7 Algorithm evaluation

Our experiments have been conducted on four real-world data sets from the UCI machine learning repository [50] and two image data sets: (1) letter image recognition data, which we call "letter"; (2) pen-based handwritten digits ("pen"); (3) Landsat multi-spectral scanner image data ("satimage"); (4) the SPAM e-mail database ("spam"); (5) blood regions in our medical education videos ("blood"); (6) face regions in our medical education videos ("face"). The test environments for these data sets are given in Table 1. Our algorithm evaluation focuses on: (a) evaluating the performance differences between our adaptive EM algorithm and other existing techniques for model-based classifier training; (b) comparing the performance differences between our approach and other existing approaches for classifier combination; (c) evaluating the effectiveness of our privacy protection model when different numbers of unlabeled samples (i.e., synthetic samples generated from the shared global concept model) are used for inferring private data; (d) evaluating the performance of our approach for data privacy protection; (e) comparing the data transmission costs of our approach against other existing approaches for privacy-preserving distributed classifier training. The benchmark metrics for classifier evaluation are the precision ρ and the recall ℛ, defined as:

    ρ = ϑ / (ϑ + γ),      ℛ = ϑ / (ϑ + ν)      (32)
where ϑ is the set of true positives, i.e., samples that are related to the given data pattern or concept and are classified correctly; γ is the set of false positives, i.e., samples that are irrelevant to the given data pattern or concept but are classified as relevant; and ν is the set of false negatives, i.e., samples that are related to the given data pattern or concept but are misclassified.

In our adaptive EM algorithm, multiple operations, such as merging, splitting, and elimination, are integrated to re-organize the distribution of the mixture components, select the optimal model structure and construct more flexible decision boundaries among different data patterns or concepts according to their real class distributions. Thus our adaptive EM algorithm is expected to outperform the traditional EM algorithm and its recent variants [42,43]. To assess the actual benefits of our adaptive EM algorithm for classifier training, we have evaluated the performance differences between our adaptive EM algorithm and other existing techniques for model-based classifier training. As shown in Figs. 4, 5 and 6, we have compared model-based classifiers trained with different techniques for model selection and parameter estimation.

Table 1 The data sets used in experiments and their parameters

Name       Dimensions of attributes   Number of concepts   Training set size   Test set size
Letter     16                         26                   15,000              5,000
Pen        16                         10                   7,494               3,498
Satimage   8                          6                    4,435               2,000
Spam       10                         2                    2,299               2,302
Blood      7                          2                    3,311               3,314
Face       11                         2                    12,181              12,184
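As a concrete reading of Eq. (32), the following minimal sketch computes the precision and recall for one data pattern or concept from true and predicted labels; the function name and label encoding are illustrative only.

    import numpy as np

    def precision_recall(y_true, y_pred, positive=1):
        """Precision rho and recall (Eq. 32) for one data pattern or concept."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        tp = np.sum((y_pred == positive) & (y_true == positive))  # relevant samples classified correctly
        fp = np.sum((y_pred == positive) & (y_true != positive))  # irrelevant samples labelled as relevant
        fn = np.sum((y_pred != positive) & (y_true == positive))  # relevant samples that were missed
        rho = tp / (tp + fp) if tp + fp > 0 else 0.0
        rec = tp / (tp + fn) if tp + fn > 0 else 0.0
        return rho, rec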
Fig. 4 The classifier performance (i.e., precision ρ) by using different classifier training techniques for different data sets: a Spam; b Face
Fig. 5 The classifier performance (i.e., precision ρ) by using different classifier training techniques for different data sets: a Letter; b Pen
Fig. 6 The classifier performance (i.e., precision ρ) by using different classifier training techniques for different data sets: a Satimage; b Blood
The techniques compared are: the traditional EM algorithm starting with a small value of the model structure κ (i.e., using only splitting for model selection); the traditional EM algorithm starting with a large value of κ (i.e., using only merging for model selection); the SMEM algorithm integrating deterministic annealing and MDL regularization for model selection (i.e., using both merging and splitting, SM) [43]; and our adaptive EM algorithm integrating negative samples and new criteria for model selection and parameter estimation (i.e., combining splitting, merging, and elimination, SM + Neg.). From these experimental results, one can observe that our adaptive EM algorithm improves the classifier's performance significantly. The reasons are: (a) the negative samples are incorporated to maximize the margins among the concept models for different data patterns or concepts; (b) the optimal model structure and the model parameters are estimated simultaneously by a single algorithm; (c) the optimal number of mixture components and the distribution of the mixture components are automatically adapted to the real class distribution of the training samples.
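As a rough illustration of the elimination operation (a simplified sketch: the criteria actually used in our algorithm also involve the negative samples and interact with the splitting and merging tests), components whose weights become negligible after an EM pass can be discarded and the remaining weights renormalised:

    import numpy as np

    def eliminate_weak_components(weights, means, covs, min_weight=1e-2):
        """Drop mixture components with negligible weight and renormalise the rest."""
        keep = np.asarray(weights) >= min_weight
        w = np.asarray(weights)[keep]
        return (w / w.sum(),
                [m for m, k in zip(means, keep) if k],
                [c for c, k in zip(covs, keep) if k])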
Table 2 The comparison results by using different algorithms for combined classifier training (precision ρ and recall ℛ)

Names                        Letter (%)   Pen (%)   Satimage (%)   Spam (%)   Blood (%)   Face (%)
Our approach             ρ   92.5         94.6      87.4           92.6       92.8        84.9
                         ℛ   90.3         92.8      89.3           93.1       91.5        84.3
Stacked generalization   ρ   89.2         87.4      80.2           86.2       86.2        78.6
                         ℛ   92.1         82.4      83.5           84.8       84.6        79.2
Model averaging          ρ   84.8         88.3      80.7           84.5       84.8        77.9
                         ℛ   89.6         84.7      84.2           82.6       82.6        80.4
To evaluate our approach for combined classifier training, we have also compared multiple approaches for classifier combination: our approach, stacked generalization (i.e., using the prediction outputs of the weak concept models for classifier combination) [39], and the model averaging scheme used in [31]. The comparison results of their average performances on multiple data sets are given in Table 2. One can observe that using our adaptive EM algorithm for combined classifier training obtains higher classification accuracy. By automatically merging, splitting and eliminating the mixture components shared from different data sites, our approach is able to learn a more reliable and complete global concept model (i.e., combined classifier) and thus achieves higher classification accuracy. From the results shown in Figs. 4, 5, 6 and Table 2, one can conclude that our algorithms provide a solid foundation for an effective solution to privacy-preserving distributed model-based classifier training (i.e., they are able to learn more accurate weak concept models and a more accurate global concept model).

When the global concept model is distributed for classifier validation, the dishonest data sites may generate large amounts of synthetic samples from the shared global concept model to infer another site's private data. Because the global concept model characterizes the principal statistical properties of the other data sites in the collaboration network, such synthetic samples can be used for this inference. The dishonest data sites may not know the exact correspondences between the shared mixture components (i.e., the mixture components of the global concept model) and the relevant data sites; they can simply treat these synthetic samples as unlabeled samples to improve their predictors for private data inference [40]. To construct such a predictor, a dishonest data site can first use its own data records (i.e., labeled samples) to learn a weak concept model, and then incorporate the synthetic samples generated from the shared global concept model to improve the predictor's performance. In our experiments, we have measured this performance improvement by incorporating different numbers of synthetic samples for predictor training, i.e., with different size ratios λ = N_u/N_L between the unlabeled samples (synthetic samples) N_u and the labeled samples (original data records) N_L. The prediction performance improvements for the different data sets are given in Figs. 7, 8 and 9. From these experimental results, one can find that incorporating the synthetic samples (unlabeled samples) for privacy inference can improve the predictor's performance at a reasonable rate at the beginning [40,48,49]. However, such a prediction can only tell the dishonest data site that some other data site in the collaboration network also has data samples with statistical properties similar to those of the labeled samples (i.e., the original data records the dishonest data site already holds).
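The λ = N_u/N_L experiment can be organised as in the following sketch; global_model.sample() and the semi-supervised trainer fit() are illustrative placeholders rather than part of our implementation:

    def privacy_predictor_experiment(X_L, y_L, global_model, lam, fit):
        """Train the privacy predictor with a given unlabeled-to-labeled ratio lambda = N_u / N_L.

        X_L, y_L are the dishonest site's own labeled records; global_model.sample(n)
        is assumed to draw n synthetic samples from the shared global concept model;
        fit(X_L, y_L, X_U) is any semi-supervised trainer (e.g., EM with unlabeled data).
        """
        n_u = int(lam * len(X_L))          # number of synthetic (unlabeled) samples to use
        X_U = global_model.sample(n_u)     # synthetic samples from the shared mixture components
        return fit(X_L, y_L, X_U)          # predictor trained on labeled plus unlabeled data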
Fig. 7 The empirical relationship between the privacy predictor performance (i.e., precision ρ) and the ratio λ = N_u/N_L for different data sets: a Spam; b Satimage
Fig. 8 The empirical relationship between the privacy predictor performance (i.e., precision ρ) and the ratio λ = N_u/N_L for different data sets: a Blood; b Face
Fig. 9 The empirical relationship between the privacy predictor performance (i.e., precision ρ) and the ratio λ = N_u/N_L for different data sets: a Letter; b Pen
The dishonest data site can further infer how many data sites in the collaboration network have such data records according to the values of the weights of the corresponding mixture components in the shared global concept model: larger weight values mean that more data sites have such data records, and smaller weight values mean that fewer data sites have them. Obviously, the dishonest data site still does not know which particular data site has such data records.
Fig. 10 The empirical relationship between the privacy disclosure risk and the number of mixture components to be shared
In principle, the dishonest data sites could generate an unlimited number of synthetic samples (i.e., unlabeled samples) from the shared global concept model to infer another site's private data. However, a data site in the collaboration network does not have enough data records to interpret the principal statistical properties of the other data sites (this is also the major reason why the sites collaborate to extract more reliable and complete knowledge), and thus it cannot learn a reliable predictor with an acceptable prediction accuracy. When only a limited number of such data records are available for predictor training, we have also observed a significant decrease in the predictor's performance once large amounts of synthetic samples are incorporated, because the synthetic samples (i.e., unlabeled samples) may then dominate the statistical properties and mislead the predictor. This empirical observation (i.e., the decrease of prediction accuracy) provides convincing evidence for the effectiveness of our solution for protecting the privacy of the global concept model: it is very hard for the dishonest data sites to obtain reliable prediction results from large amounts of synthetic samples when they have only a limited number of data records with which to initiate the learning of the privacy predictor.

To evaluate our approach to privacy-preserving distributed classifier training, we partitioned each test data set into three groups of different sizes and performed model-based classifier training on these three groups independently. Each group has full control over the balance between the privacy disclosure risk, the sharing benefit and the data utility. All three of these quantities largely depend on the number of mixture components shared with the central site. Based on this understanding, we have obtained the empirical relationship between the privacy disclosure risk and the number of mixture components to be shared. As shown in Fig. 10, sharing more mixture components may induce a higher risk of privacy breaches. On the other hand, sharing more mixture components can also enhance both the data utility and the sharing benefit significantly, as shown in Fig. 11. To address this conflict, we have proposed a computational approach for achieving a good balance between the privacy disclosure risk, the data utility, and the sharing benefit; a simple selection rule of this kind is sketched below. One major advantage of our distributed approach for privacy-preserving model-based classifier training is that it can control the privacy disclosure risk effectively by sharing an optimal number of mixture components, and this optimal number can be customized for each data site in the collaboration network and determined automatically.

When the global concept model is distributed for classifier validation, the dishonest data sites may be able to identify the common mixture components accurately and generate large amounts of synthetic samples from these common mixture components to infer another site's private data.
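The balance between the privacy disclosure risk, the data utility and the sharing benefit can be formalised, for instance, as selecting the number of shared components that maximises the utility plus the λ-weighted benefit while staying within a risk budget; this is one plausible reading of the computational approach, not its exact form:

    def select_shared_components(risk, utility, benefit, lam=0.5, max_risk=None):
        """Choose the number of mixture components to share.

        risk, utility and benefit are sequences indexed by the candidate number of
        shared components (as in Figs. 10 and 11); lam plays the role of the weight
        used in Fig. 11. The objective below is illustrative.
        """
        best_m, best_score = None, float("-inf")
        for m in range(len(risk)):
            if max_risk is not None and risk[m] > max_risk:
                continue                                  # exceeds the per-site risk budget
            score = utility[m] + lam * benefit[m] - risk[m]
            if score > best_score:
                best_m, best_score = m, score
        return best_m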
Fig. 11 The empirical relationship between the data utility and the sharing benefit, ϒ(·,·) + λ·(·,·), and the number of mixture components to be shared (λ = 0.5)
Fig. 12 The empirical relationship between the privacy disclosure risk and the number of synthetic samples generated from the shared common mixture components
Based on this understanding, we have also obtained the empirical relationship between the privacy disclosure risk and the number of synthetic samples generated from the shared common mixture components. As shown in Fig. 12, the privacy disclosure risk decreases as more synthetic samples are generated from the shared common mixture components: when more synthetic samples are generated, it becomes more difficult to identify one-to-one relationships between the original data records and the synthetic samples. However, the privacy disclosure risk consists of four individual parts: the re-identification risk, the linkage disclosure risk, the confidentiality-interval inference risk, and the statistical inference risk. Generating more synthetic samples may also increase the adversary's ability to predict the value intervals of the original data records (i.e., the confidentiality-interval inference risk), induce a higher linkage disclosure risk, and result in a higher statistical inference risk; thus we also observe a slight increase in the privacy disclosure risk when more synthetic samples are generated from the shared common mixture components.

As shown in Fig. 13, a dishonest data site may incorporate the shared common mixture components to re-identify the original common data records at other data sites. Because the common mixture components are perturbed before they are shared (i.e., model perturbation), it is very hard for the dishonest data site to achieve accurate and reliable linkage analysis, as shown in Fig. 13. Thus our model perturbation technique can prevent such linkage analysis effectively. Obviously, its effectiveness largely depends on the strength of the additional multivariate noise. To trace the effectiveness of adding multivariate noise (i.e., model perturbation) for model privacy protection, we have also obtained the empirical relationship between the loss of model quality D(ĉ_j, č_j) and the noise strength d. As shown in Figs. 14 and 15, one can observe that strong additional noise may protect the model privacy effectively but also results in a significant loss of model quality.
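A minimal sketch of the model perturbation step and of a possible quality-loss measure is given below; the KL divergence between the original and the perturbed Gaussian component is used here only as a stand-in for D(ĉ_j, č_j), and the noise model is an assumption rather than the exact scheme used in our system:

    import numpy as np

    def perturb_component(mean, cov, d, seed=None):
        """Perturb one common mixture component with multivariate noise of strength d."""
        rng = np.random.default_rng(seed)
        mean = np.asarray(mean, dtype=float)
        noisy_mean = mean + rng.normal(scale=d, size=mean.shape)
        noisy_cov = np.asarray(cov, dtype=float) + d * np.eye(len(mean))  # inflate the covariance
        return noisy_mean, noisy_cov

    def quality_loss(mean, cov, noisy_mean, noisy_cov):
        """KL divergence from the original to the perturbed Gaussian component."""
        dim = len(mean)
        inv = np.linalg.inv(noisy_cov)
        diff = np.asarray(noisy_mean) - np.asarray(mean)
        return 0.5 * (np.trace(inv @ np.asarray(cov)) + diff @ inv @ diff - dim
                      + np.log(np.linalg.det(noisy_cov) / np.linalg.det(cov)))

Sweeping d and plotting quality_loss against the corresponding disclosure risk reproduces, in spirit, the trade-off shown in Figs. 14 and 15.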
Fig. 13 Incorporating linkage analysis and original data records to re-identify the synthetic samples generated from the shared common mixture components
Fig. 14 The empirical relationship between the loss of model quality D(ĉ_j, č_j) and the strength of additional multivariate noises d
By tracing the joint impacts of the additional multivariate noise on the privacy protection and on the loss of model quality, we can determine an optimal strength value d of the additional noise for model privacy protection. We have also compared the data transmission costs of our approach with those of the perturbation approach [12] and the SMC approach [24]. The data transmission cost is defined as the ratio between the number of shared samples and the total number of samples needed to achieve accurate classifier training. As shown in Fig. 16, one can conclude that our proposed approach reduces the data transmission costs significantly because only the weak concept models need to be shared for combined classifier training.
Fig. 15 The empirical relationship between the privacy disclosure risk and the strength of additional multivariate noises d
Fig. 16 The comparison of transmission costs between our proposed approach and the perturbation and SMC approaches for privacy-preserving distributed model-based classifier training
8 Conclusions and future works

To support customizable privacy modeling and protection, we have proposed a novel approach for privacy-preserving distributed model-based classifier training. By sharing only the weak concept models and generating the synthetic samples at the central site, the proposed approach reduces both the data transmission costs and the potential privacy breaches significantly. A computational approach is proposed to achieve a good trade-off between the privacy disclosure risk, the sharing benefit and the data utility by automatically determining the optimal number of mixture components to be shared. Our experimental results on four UCI machine learning data sets and two image data sets are convincing and show that our approach is highly effective. Our future work will focus on: (a) incorporating benefit/risk negotiation into our distributed approach for privacy-preserving model-based classifier training, so that each data site can share different numbers of mixture components with other data sites, or with the same data site in different negotiation rounds; (b) a theoretical analysis of the upper bound of the privacy disclosure risk when the global concept model or the weak concept models are shared.

Acknowledgments The authors thank the reviewers for their useful comments and suggestions, which made this paper more readable and pointed us to some recent references. One of the authors, Jianping Fan, thanks Dr. Wenliang (Kevin) Du and Dr. Ting Yu for useful discussions on trust negotiation and negotiable privacy-preserving data mining.
References

1. Westin AF (1967) Privacy and freedom. Atheneum, New York
2. Rosenthal A, Winslett M (2004) Security of shared data in large systems: state of the art and research directions. In: ACM SIGMOD
3. Thuraisingham BM (2002) Data mining, national security, privacy and civil liberties. SIGKDD Explor Newsl 4(2):1–5
4. Aggarwal G, Bawa M, Ganesan P, Garcia-Molina H, Kenthapadi K, Mishra N, Motwani R, Srivastava U, Thomas D, Widom J, Xu Y (2004) Vision paper: enabling privacy for the paranoids. In: VLDB, pp 708–719
5. Hore B, Mehrotra S, Tsudik G (2004) A privacy-preserving index for range queries. In: VLDB, pp 720–731
6. Deutsch A, Papakonstantinou Y (2005) Privacy in database publishing. In: ICDT, pp 230–245
7. Sweeney L (2002) Achieving k-anonymity privacy protection using generalization and suppression. Int J Uncertainty 10(5):571–588
8. Kantarcioglu M, Jin J, Clifton C (2004) When do data mining results violate privacy. In: ACM SIGKDD
9. Liew CK, Coi UJ, Liew CJ (1985) A data distortion by probability distribution. ACM Trans Database Syst 10(3):395–411
10. Muralidhar K, Sarathy R (1999) Security of random data perturbation methods. ACM Trans Database Syst 24(4):487–493
11. Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: ACM SIGMOD, pp 439–450
12. Agrawal D, Aggarwal C (2001) On the design and quantification of privacy preserving data mining algorithms. In: ACM PODS
13. Evfimievski A, Srikant R, Agrawal R, Gehrke J (2002) Privacy preserving mining of association rules. In: ACM SIGKDD
14. Evfimievski A, Gehrke J, Srikant R (2003) Limiting privacy breaches in privacy preserving data mining. In: ACM PODS
15. Wang K, Yu PS, Chakraborty S (2004) Bottom-up generalization: a data mining solution to privacy protection. In: IEEE ICDM
16. Ma D, Sivakumar K, Kargupta H (2004) Privacy sensitive Bayesian network parameter learning. In: IEEE ICDM
17. Yao A (1986) How to generate and exchange secrets. In: IEEE Symposium on Foundations of Computer Science, pp 162–167
18. Lindell Y, Pinkas B (2000) Privacy preserving data mining. In: CRYPTO, pp 36–54
19. Goldreich O, Micali S, Wigderson A (1987) How to play any mental game - a completeness theorem for protocols with honest majority. In: STOC
20. Du W, Atallah MJ (2001) Privacy-preserving cooperative statistical analysis. In: 17th Annual Computer Security Applications Conference, pp 103–110
21. Du W, Han Y, Chen S (2004) Privacy-preserving multivariate statistical analysis: linear regression and classification. In: SIAM Conference on Data Mining
22. Vaidya J, Clifton C (2002) Privacy preserving association rule mining in vertically partitioned data. In: ACM SIGKDD
23. Vaidya J, Clifton C (2003) Privacy-preserving k-means clustering over vertically partitioned data. In: ACM SIGKDD
24. Wright R, Yang Z (2004) Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: ACM SIGKDD
25. Chen K, Liu L (2005) Privacy preserving data classification with rotation perturbation. In: IEEE ICDM, pp 589–592
26. Oliveira S, Zaiane OR (2003) Privacy preserving clustering by data transformation. In: SBBD
27. Domingo-Ferrer J, Mateo-Sanz JM (2001) Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans Knowl Data Eng 14(1):189–201
28. Fienberg SE, Makov UE, Steele RJ (1998) Disclosure limitation using perturbation and related methods for categorical data. J Official Stat 14(4):485–502
29. Raghunathan TJ, Reiter JP, Rubin D (2003) Multiple imputation for statistical disclosure limitation. J Official Stat 19(1):1–16
30. Crises G (2004) Synthetic microdata generation for database privacy protection. Technical report, CRISES Research Group, CRIREP-04-009
31. Merugu S, Ghosh J (2003) Privacy-preserving distributed clustering using generative models. In: IEEE ICDM
32. Chan P, Stolfo S, Wolpert D (eds) (1996) Working notes of the AAAI workshop on integrating multiple learned models for improving and scaling machine learning algorithms, vol 36. AAAI/MIT Press, Cambridge
33. Kargupta H, Datta S, Wang Q, Sivakumar K (2003) On the privacy preserving properties of random data perturbation techniques. In: IEEE ICDM
34. Huang Z, Du W, Chen B (2005) Deriving private information from randomized data. In: ACM SIGMOD
35. Zhu Y, Liu L (2004) Optimal randomization for privacy preserving data mining. In: ACM SIGKDD, pp 761–766
36. Xiong L, Chitti S, Liu L (2007) Mining multiple private databases using a kNN classifier. In: SAC
37. Kim J, Winkler WE (2003) Multiplicative noise for masking continuous data. Technical report, US Bureau of Census, Statistics Research Division, Statistics 2003-01
38. Liu K, Kargupta H, Ryan J (2006) Random projection-based multiplicative perturbation for privacy preserving distributed data mining. IEEE Trans Knowl Data Eng 18(1):92–106
39. Ting K, Witten I (1999) Issues in stacked generalization. J Artif Intell Res 10:271–289
40. Fan J, Luo H, Hacid M-S, Bertino E (2005) A novel approach for privacy-preserving video sharing. In: ACM CIKM, pp 609–616
41. Figueiredo M, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24:381–396
42. McLachlan G, Krishnan T (2000) The EM algorithm and extensions. Wiley, New York
43. Ueda N, Nakano R, Ghahramani Z, Hinton GE (2002) SMEM algorithm for mixture models. Neural Comput 12(9):2109–2128
44. Luo H (2007) Concept-based large-scale video database browsing and retrieval via visualization. Ph.D. thesis, The University of North Carolina at Charlotte, pp 58–60. http://hdl.handle.net/2029/87
45. Hyvarinen A (1998) New approximations of differential entropy for independent component analysis and projection pursuit. In: Annual Conference on Neural Information Processing Systems, vol 10, pp 273–279
46. Gomantam S, Karr AF, Sanil AP (2005) Data swapping as a decision problem. J Official Stat 13(4):635–655
47. Lambert D (1993) Measures of disclosure risk and harm. J Official Stat 9:313–331
48. Nigam K, McCallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2–3):103–134
49. Joachims T (1999) Transductive inference for text classification using support vector machines. In: ICML
50. Hettich S, Blake C, Merz C (1998) UCI repository of machine learning databases. Technical report. http://www.ics.uci.edu/~mlearn/
Author Biographies

Hangzai Luo received the BS degree in computer science from Fudan University, Shanghai, China, in 1998. In the same year, he joined Fudan University as a Lecturer. In 2002, he joined the University of North Carolina at Charlotte to pursue his Ph.D. degree in Information Technology. He received his Ph.D. degree in 2006 and joined East China Normal University as an Associate Professor in 2007. His research interests include computer vision, video retrieval, and statistical machine learning. In 2007 he received a second-place award from the Department of Homeland Security for his work on video analysis and visualization for homeland security applications.
Jianping Fan received his MS degree in theoretical physics from Northwestern University, Xi'an, China, in 1994 and his Ph.D. degree in optical storage and computer science from the Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai, China, in 1997. He was a Postdoctoral Researcher at Fudan University, Shanghai, China, during 1998. From 1998 to 1999, he was a Researcher with the Japan Society for the Promotion of Science (JSPS), Department of Information System Engineering, Osaka University, Osaka, Japan. From September 1999 to 2001, he was a Postdoctoral Researcher in the Department of Computer Science, Purdue University, West Lafayette, IN. In 2001, he joined the Department of Computer Science, University of North Carolina at Charlotte as an Assistant Professor and later became an Associate Professor. His research interests include image/video analysis, semantic image/video classification, personalized image/video recommendation, surveillance videos, and statistical machine learning.
Xiaodong Lin is an associate professor of mathematics at the University of Cincinnati. He was on academic leave at the Statistics and Applied Mathematics Science Institute during 2003–2004. He holds a Ph.D. and an M.S. from Purdue University and a bachelor's degree from the University of Science and Technology of China. His research interests include data mining, statistical learning, machine learning and privacy-preserving data mining.
Aoying Zhou is currently a professor in Computer Science at East China Normal University, Shanghai, where he is also chairing the Institute of Massive Computing. He received his Ph.D. degree from Fudan University in 1993, and his master's and bachelor's degrees from Sichuan University, Chengdu, in 1988 and 1985, respectively. He is now serving as the vice-director of ACM SIGMOD China and of the Database Technology Committee of the China Computer Federation. He has been invited to join the editorial boards of prestigious academic journals, such as the VLDB Journal and the Journal of Computer Science and Technology (JCST). He has served or is serving as a PC member of ACM SIGMOD07/08, WWW07/08, SIGIR07/08, EDBT06, VLDB05, ICDCS05, etc. He was the conference co-chair of ER'2004 and the PC co-chair of WAIM'2000.
Elisa Bertino is a professor of computer science in the Department of Computer Sciences, Purdue University, and the Research Director of the Center for Education and Research in Information Assurance and Security (CERIAS). From 2001 to 2007, she was a co-editor in chief of the Very Large Database Systems (VLDB) Journal. Her main research interests include security, privacy, digital identity management systems, database systems, distributed systems, and multimedia systems. She is a fellow of the IEEE and the ACM and a Golden Core member of the IEEE Computer Society. She received the 2002 IEEE Computer Society Technical Achievement Award for her "outstanding contributions to database systems and database security and advanced data management systems" and the 2005 IEEE Computer Society Tsutomu Kanai Award for "pioneering and innovative research contributions to secure distributed systems."