A Privacy-preserving Collaborative Filtering Scheme with Two-way Communication

Sheng Zhang, James Ford, Fillia Makedon

Department of Computer Science, Dartmouth College
{clap, jford, makedon}@cs.dartmouth.edu

ABSTRACT

An important security concern with traditional recommendation systems is that users disclose information that may compromise their individual privacy when providing ratings. A randomization approach has been proposed to disguise user ratings while still producing accurate recommendations. However, recent research has suggested that a significant amount of original private information can be derived from perturbed data in a randomization scheme. We suggest that a main limitation of the existing randomization approach is that perturbation is item-invariant: each item has the same perturbation variance. Based on this observation, we introduce a two-way communication privacy-preserving scheme in which users perturb their ratings for each item based on the server's guidance instead of using an item-invariant perturbation. Compared to the existing randomization approach, our new scheme helps users disclose much less private information at the same recommendation accuracy level.

Categories and Subject Descriptors H.3.5 [Information Storage and Retrieval]: Online Information Services—Commercial services; K.4.4 [Computers and Society]: Electronic Commerce—Security

General Terms Security

Keywords Collaborative filtering, Privacy, Randomization, Two-way communication

1. INTRODUCTION

Recommendation systems have become popular in the past decade as an effective way to help people cope with information overload by recommending products or services

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EC’06, June 11–15, 2006, Ann Arbor, Michigan, USA. Copyright 2006 ACM 1-59593-236-4/06/0006 ...$5.00.

from a large number of candidates. Collaborative Filtering (CF) is one widely used recommendation technique: it recommends to a user the items that people with similar tastes and preferences liked in the past. In traditional CF systems, a server first collects (explicit or implicit) ratings from users and then executes CF algorithms to make recommendations. Because the data collected from users cover personal information about places and things they do, watch, and purchase, there is a serious threat to individual privacy. User data are valuable and have been sold by companies that have suffered bankruptcy. Moreover, a significant number of people are unwilling to provide their personal information even when they would benefit in return, as indicated by a recent survey [7]. This motivates the goal of preserving user privacy in CF systems.

There are two approaches to privacy-preserving collaborative filtering in the literature: a secure multi-party computation approach [5, 6] and a randomization approach [16, 17]. For the first, Canny proposed a privacy-preserving scheme for a Singular Value Decomposition (SVD) based CF algorithm [5, 6]. The idea is to reduce the SVD computation to iterative vector additions and to use homomorphic encryption so that sums of encrypted vectors can be computed and decrypted securely. There are two concerns regarding this scheme: one is the high computation and communication costs incurred by the frequent cryptographic operations; the other is that the number of users must be known in advance, and all users need to stay synchronized so that they work on the same data throughout the process. For the second approach, Polat and Du proposed random perturbation CF schemes for a Pearson correlation-based algorithm [16] and an SVD-based algorithm [17]. The idea of these two schemes is to disguise user data by adding random noise while ensuring that enough of the "signal" in the data is preserved for accurate recommendations to still be made.

Recent research [14, 15] has pointed out that randomization techniques might not preserve privacy as well as had been believed. Our previous study [22] introduced two data reconstruction methods (a k-means clustering based method and an SVD-based method) for reconstructing original rating data from a randomly perturbed rating matrix; our experiments indicated that both methods could recover a considerable amount of the original information.

In this paper, we suggest that the main limitation of the current CF perturbation scheme is that perturbation is item-invariant: the same perturbation level is applied to

all items. The use of an item-invariant approach has two drawbacks. First, applying the same perturbation level to the items that are the most important for learning user patterns impairs recommendation accuracy (in other words, the perturbation is too large). Second, for items that are the least critical for recommendations, a fixed level of perturbation unnecessarily compromises privacy (in other words, the perturbation is too small). We observe that the reason behind this limitation is the lack of communication from the server (data analyzer) to users (data providers): because a data provider does not know the importance of items or the correlations among them, she has no choice but to perturb her data in an item-invariant way.

Motivated by this observation, we propose a two-way communication perturbation scheme for collaborative filtering. In this new scheme, a server first sends perturbation guidance to a data provider; the data provider uses the received guidance to perturb her rating vector and then sends the perturbed data back to the server. Generally speaking, the perturbation guidance helps the data provider send the minimum information that the server needs to make accurate recommendations. The idea of two-way communication privacy-preserving schemes was previously introduced in association rule mining [20] and data classification [21]; we extend it to collaborative filtering in this work. We give theoretical bounds on both recommendation accuracy and privacy disclosure for the new scheme. Experiments conducted on a real rating data set demonstrate that our scheme preserves more private information than the current perturbation scheme at the same accuracy level. Moreover, our scheme allows data providers to choose different levels of privacy preservation, making it suitable for data providers with different accuracy and privacy needs.

The rest of the paper is organized as follows. Section 2 summarizes related work. Section 3 gives an overview of the current randomization scheme in collaborative filtering and identifies its key limitation. Section 4 presents our new perturbation scheme and analyzes its performance (accuracy, privacy, and runtime efficiency). Experimental results are presented in Section 5. Finally, we conclude in Section 6.

2. RELATED WORK

The random perturbation technique was first introduced into privacy-preserving data mining by Agrawal and Srikant [2] and was extended by Agrawal and Aggarwal [1]. In [1], the authors introduced a measure of privacy that is a function of the mutual information between the original data distribution and the randomly perturbed data distribution. Evfimievski et al. presented another measure, called "privacy breaches," to quantify the preservation of privacy [10, 11]. Intuitively, a privacy breach level is the probability that a property of the original data record is revealed given the randomized data record. They showed that privacy breaches can occur even when the mutual information is small.

More recently, two studies have focused on reconstructing the original data from randomized records. Kargupta et al. pointed out that arbitrary randomization is not safe, because it is easy to breach the privacy protection it offers [15]. They proposed a random matrix-based spectral filtering technique to recover the original data from perturbed data. Their experimental results demonstrated that in many cases random data distortion preserves very little data privacy. Huang et al. [14] proposed two data reconstruction methods (principal component analysis and Bayesian estimation). They found that the original data can be reconstructed more accurately when the data correlations are higher. The authors also introduced a modified randomization scheme that forces the correlation of the random noise to be "similar" to that of the original data. This scheme implicitly requires data providers to know statistical information about the overall data in advance; it is therefore similar in spirit to the two-way communication schemes proposed in [20, 21] and in this work.

3. A RANDOMIZED CF SCHEME

In this section, we review the existing randomized CF scheme proposed in [16, 17] and analyze the limitations associated with it.

3.1 Scheme Overview

In this scheme, there are n users connected to one server; each user has a rating profile consisting of ratings given to the same set of m items. If a user has not given a rating to an item, the corresponding entry in her rating profile is missing. To ensure privacy, each user sends the server disguised data instead of her original ratings. The server collects the disguised information from all users to make predictions and recommendations. The data disguise proposed in [16, 17] proceeds in the following steps:

1. The server and users decide on the standard deviation (denoted σ) of a Gaussian distribution used to generate random noise.

2. Each user computes her rating average and standard deviation, and then calculates z-scores (distances from the average divided by the standard deviation) for the items she has rated.

3. Each user draws random values from the Gaussian distribution with zero mean and standard deviation σ, adds these random values to her z-scores to generate disguised z-scores, and sends the disguised z-scores to the server.

4. After the server obtains the disguised z-scores from all users, it can apply the approach in [16] to make predictions using the Pearson correlation-based CF algorithm proposed in [12], or the approach in [17] to make predictions using the SVD-based CF algorithm proposed in [19].

Note that a uniform distribution can also be used in this randomization scheme; the Gaussian case is discussed here only for simplicity. It is also important to note that randomization is applied only to rated entries in this scheme; there is also a variant in which randomization is applied to all entries.
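As a rough illustration, steps 2 and 3 of the disguise procedure can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation; the function name is ours, and we assume `np.nan` marks unrated entries:

```python
import numpy as np

rng = np.random.default_rng(0)

def disguise(ratings, sigma):
    """Steps 2-3 of the randomized scheme: z-score the rated entries,
    then add Gaussian noise N(0, sigma^2) before sending to the server.
    `ratings` is a 1-D float array with np.nan for unrated items."""
    rated = ~np.isnan(ratings)
    r = ratings[rated]
    z = (r - r.mean()) / r.std()                 # per-user z-scores
    disguised = np.full_like(ratings, np.nan)
    disguised[rated] = z + rng.normal(0.0, sigma, size=z.shape)
    return disguised

profile = np.array([5, 3, np.nan, 1, 4], dtype=float)
sent = disguise(profile, sigma=0.5)   # only disguised z-scores leave the user
```

The server never sees the raw ratings; it receives only the noisy z-scores, which is what makes the reconstruction attacks discussed next relevant.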

3.2 Limitations

Our previous analysis [22] showed that a considerable number of ratings can be derived from the above randomization scheme. For the sake of completeness, we briefly describe how our k-means based reconstruction method works; this helps to illustrate the limitations of the existing scheme.

Given a vector of disguised z-scores corresponding to the rated items from a user, suppose a malicious server wants to derive the original ratings. Assume that there are k possible ratings (a_1, ..., a_k) used in the recommendation system, so that the number of possible z-scores in the user's profile is also k; denote these k z-scores as z_1, ..., z_k. Denote the disguised z-scores for the rated items from this user as d_1, ..., d_h if she has given ratings to h items. In order to derive the original rating corresponding to d_i, the server seeks an a* from a_1 to a_k that maximizes the posterior probability Pr(a | d_i). Since Pr(a_j | d_i) = Pr(z_j | d_i) for each j, the server can first compute a z* from z_1 to z_k that maximizes Pr(z | d_i) and then derive a* from z*. If the server knows the values of z_1, ..., z_k, the z* maximizing Pr(z | d_i) is simply the one nearest to d_i. The server can therefore estimate z_1, ..., z_k from d_1, ..., d_h by maximizing the following log-likelihood:

  log Pr(d_1, ..., d_h | z_1, ..., z_k) = Σ_i log Pr(d_i | z_1, ..., z_k).

The above equation holds because each d_i is conditionally independent of the others. Recall that every d_i is the sum of a possible (undisguised) z-score and a random value generated from N(0, σ²). Therefore, every d_i can be thought of as a value generated by a mixture model in which the number of components is k and the jth component is N(z_j, σ²). The Expectation-Maximization (EM) algorithm [8] is a general approach to computing the parameters of mixture models; here we use a simplified and constrained version, the k-means clustering algorithm. Define

  Pr(d_i | z_1, ..., z_k) = max_{z ∈ {z_1,...,z_k}} Pr(d_i | z),

and we then have

  Σ_i log Pr(d_i | z_1, ..., z_k)
    = Σ_i log max_{z ∈ {z_1,...,z_k}} Pr(d_i | z)
    = Σ_i max_{z ∈ {z_1,...,z_k}} [ −(d_i − z)² / (2σ²) ] + C
    = −(1 / (2σ²)) Σ_i min_{z ∈ {z_1,...,z_k}} (d_i − z)² + C.

In the above equation, C denotes a constant. Therefore, we obtain

  argmax_{z_1,...,z_k} log Pr(d_1, ..., d_h | z_1, ..., z_k)
    = argmin_{z_1,...,z_k} Σ_i min_{z ∈ {z_1,...,z_k}} (d_i − z)².

The objective function on the right side of the above equation is exactly the objective function of k-means clustering with a squared Euclidean distance measure. Therefore, the server can estimate z_1, ..., z_k as the k centroids (sorted in order) obtained when k-means is applied to the disguised z-scores d_1, ..., d_h. Subsequently, an entry whose disguised z-score is assigned to the jth cluster is reconstructed as the jth possible rating a_j. This rating reconstruction method can also be extended to continuous-valued ratings; related details can be found in [22].
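The estimation above amounts to running one-dimensional k-means on the disguised z-scores. A minimal sketch (our own illustrative code, with hypothetical names; the demo data are synthetic, not from the paper's experiments):

```python
import numpy as np

def reconstruct(d, possible_ratings, iters=100):
    """Estimate ratings from disguised z-scores d: run 1-D k-means with
    k = |possible_ratings| centroids, then map the cluster with the j-th
    smallest centroid to the j-th smallest rating."""
    a = np.sort(np.asarray(possible_ratings, dtype=float))
    k = len(a)
    centroids = np.linspace(d.min(), d.max(), k)  # evenly spaced init
    for _ in range(iters):
        # assignment step: nearest centroid under squared distance
        labels = np.argmin(np.abs(d[:, None] - centroids[None, :]), axis=1)
        # update step: recompute each centroid (skip empty clusters)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = d[labels == j].mean()
    rank = np.empty(k, dtype=int)
    rank[np.argsort(centroids)] = np.arange(k)    # centroid order -> rating order
    return a[rank[labels]]

# demo: 300 synthetic ratings in {1..5}, z-scored, then noise with sigma = 0.05
rng = np.random.default_rng(1)
true = rng.integers(1, 6, 300).astype(float)
z = (true - true.mean()) / true.std()
rec = reconstruct(z + rng.normal(0, 0.05, 300), [1, 2, 3, 4, 5])
```

When the noise is small relative to the gaps between the k true z-scores, most ratings are recovered, mirroring the kind of reconstruction reported in [22].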

For the variant version in which randomization is applied to all (rated and unrated) entries, there is a naive method to identify a subset of the rated items. Since the original z-score for each unrated item is 0, the corresponding disguised z-score is generated by N(0, σ²). Therefore, the server can mark the entries whose disguised z-scores lie in the range [−cσ, cσ] as unrated and mark all other entries as rated. The server can choose a relatively high value of c (e.g., c = 3) in order to ensure precision (the ratio of the number of items correctly marked rated to all items marked rated), and then perform k-means to derive the ratings for the items marked rated.

The reconstruction accuracy of this method is clearly affected by the perturbation variance σ² that is used: the reconstruction error shrinks as the variance decreases, and vice versa. In the current perturbation CF scheme, the randomization is item-invariant. Each item, no matter how useful it is to collaborative filtering, has the same perturbation level (noise variance). This item-invariant scheme reduces both accuracy and privacy preservation. For an item that plays an important role in collaborative filtering, a small increase in its perturbation level may lead to a much larger accuracy error than the same increase in another item's perturbation level would. Conversely, for an item that is not so useful for collaborative filtering, a uniform perturbation level sacrifices privacy for only a tiny accuracy gain. Note that applying a low perturbation level to an item not only reduces its own privacy but also reduces privacy for other items when the above reconstruction method is used.

Item-invariant perturbation is inherent in the current one-way communication scheme: since each user does not know the features of the overall data, she has no option but to use the same perturbation variance on all items. This observation motivates us to apply the two-way communication scheme introduced in [20, 21]. Another possible way to help users learn more about the overall data would be to allow them to communicate with each other; however, that solution requires synchronizing all users and building trust relationships among them.
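The naive marking step for the all-entries variant is just a threshold test; a short sketch (our own illustrative code):

```python
import numpy as np

def mark_rated(disguised, sigma, c=3.0):
    """Mark an entry as rated iff its disguised z-score lies outside
    [-c*sigma, c*sigma]; unrated entries carry pure noise N(0, sigma^2),
    so with c = 3 almost none of them exceed the threshold."""
    return np.abs(disguised) > c * sigma

d = np.array([0.05, -1.40, 0.10, 1.25, -0.02])
flags = mark_rated(d, sigma=0.2)   # threshold c*sigma = 0.6 here
```

A high c trades recall for precision: fewer rated items are found, but those that are found are almost certainly rated, which is what the subsequent k-means step needs.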

4. A NEW SCHEME

This section proposes our two-way communication perturbation scheme. We first discuss the communication protocol, how the server generates perturbation guidance, and how each user perturbs her ratings using the received perturbation guidance; then a performance analysis of accuracy and privacy is given.

4.1 Communication Protocol

Table 1 lists the notation used in our discussion, and Figure 1 shows the communication protocol of our scheme. We assume that users sequentially connect with a server to provide their rating data. When a user intends to provide her normalized rating profile p, she first sends a request to the server. The server responds with the current disclosure level k; this level guarantees that the server will obtain the information from users necessary for accurate collaborative filtering. If the user agrees to this disclosure level, the server sends her the current perturbation guidance Vk. After verifying the validity of the received perturbation guidance, the user computes the perturbed rating vector R(p) and transmits it to the server, which then updates its collection of received perturbed data. If the user does not accept the current disclosure level, she can repeat the request later, because the disclosure level decreases rapidly as more perturbed rating profiles are received.

Table 1: Notation
  m      Number of items
  n      Number of users
  A      Whole original rating matrix (n-by-m)
  Â      Whole perturbed rating matrix
  Ac     Current perturbed rating matrix
  k      Current disclosure level
  Vk     Current perturbation guidance (m-by-k)
  p      A user's normalized rating profile
  R(p)   A user's perturbed data

Figure 1: The communication protocol of the two-way communication privacy-preserving collaborative filtering scheme. The user and the server exchange five messages: (1) request, (2) disclosure level, (3) agreement (the user waits if she disagrees), (4) perturbation guidance, (5) perturbed data.

After obtaining the perturbed data from all users, the server runs a CF algorithm to generate recommendations. Although our scheme can be extended to other CF algorithms with appropriate modifications, we focus on one in particular, the item-based algorithm introduced in [18], for ease of discussion. In this CF algorithm, the cosine similarity between items i and j is defined as

  sim(i, j) = Σ_u A_{u,i} A_{u,j} / ( sqrt(Σ_u A_{u,i}²) · sqrt(Σ_u A_{u,j}²) ),   (1)

where u ranges over the set of all users and each entry in A is normalized by first subtracting the row average (user average) and then dividing the difference by the row standard deviation. After the similarities between items are computed, the prediction on a certain item for a user is computed as the weighted sum of the ratings given by this user on other items.

4.2 Computing Perturbation Guidance and Perturbations

We now show how the perturbation guidance and the perturbations are computed. At any given time, denote the current perturbed rating matrix kept by the server as Ac. To compute the current disclosure level and perturbation guidance, the server first performs singular value decomposition on Ac as

  Ac = U S V^T,

in which U and V are two matrices composed of the left singular vectors and right singular vectors respectively, and S is a diagonal matrix composed of singular values s_1, ..., s_m in decreasing order. The server then sets the disclosure level k to the minimum k that satisfies s_k / s_1 ≤ µ, in which µ is a predetermined parameter. The server can choose a small µ in order to provide highly accurate recommendations, or a large µ to help protect users' privacy. The current perturbation guidance Vk consists of the top k right singular vectors, i.e., the first k columns of V. Thus Vk is an m-by-k matrix, and Vk^T Vk is the k-by-k identity matrix. Before the first interaction with a user, Ac can be composed of data provided by privacy-careless users, or Vk can be generated from a small group of users using the privacy-preserving scheme in [5].

The intuition behind this choice of disclosure level and perturbation guidance is to preserve the information in A^T A. By Eq. (1), we have

  sim(i, j) = (A^T A)_{i,j} / sqrt( (A^T A)_{i,i} (A^T A)_{j,j} ).   (2)

This equation indicates that the cosine similarities between items are determined by the values in A^T A. In other words, the accuracy of the similarity obtained from the perturbed data Â can be improved by a better approximation Â^T Â to A^T A. With basic linear algebra, it is clear that Vk is composed of the first k eigenvectors (corresponding to the largest k eigenvalues) of Ac^T Ac, since

  Ac^T Ac = V S U^T U S V^T = V S² V^T.

In real cases, because the first k eigenvectors of Ac^T Ac converge quickly to the first k eigenvectors of A^T A, Vk can be considered an estimate of the first k eigenvectors of A^T A. Therefore, the perturbation guidance Vk is a transformation matrix that embeds the original data into a k-dimensional subspace in which Â^T Â retains maximum information from A^T A. Furthermore, the disclosure level k ensures that the retained information is necessary for accurate learning. As we will show later, the level k balances the trade-off between recommendation accuracy and privacy and thus provides flexibility to users with various privacy concerns: a user with a high level of concern about privacy can choose a small k to preserve privacy, while a user unconcerned about privacy can choose a large k to prioritize collaborative filtering.

Once a user obtains the perturbation guidance Vk from the server, she first validates the guidance by checking whether Vk^T Vk is an identity matrix. She then generates perturbed data from her original normalized rating profile as

  R(p) = p Vk Vk^T.

The original data p is an m-dimensional vector in which each unrated entry is filled with 0, and the resulting perturbed vector R(p) is also of length m. After receiving the perturbed data from the user, the server updates Ac. To efficiently compute the new Vk, the server can use incremental SVD algorithms [3, 4], which quickly update the singular vectors when a new row is appended to the matrix. Out of concern for efficiency and privacy, the server may delay updating Vk and k until it has received a number of new perturbed profiles.
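As a concrete illustration, the server-side guidance computation and the user-side perturbation can be sketched in NumPy as follows. The function names are our own, and `Ac` and `p` here are random stand-in data, not real rating profiles:

```python
import numpy as np

def perturbation_guidance(Ac, mu):
    """Server side: SVD of the current perturbed matrix Ac. The disclosure
    level k is the smallest k with s_k / s_1 <= mu, and the guidance Vk is
    the matrix of top-k right singular vectors (m-by-k, Vk^T Vk = I_k)."""
    _, s, Vt = np.linalg.svd(Ac, full_matrices=False)
    small = s / s[0] <= mu
    k = int(np.argmax(small)) + 1 if small.any() else len(s)
    return Vt[:k].T, k

def perturb(p, Vk):
    """User side: validate the guidance (Vk^T Vk must be the identity),
    then project the normalized rating vector: R(p) = p Vk Vk^T."""
    assert np.allclose(Vk.T @ Vk, np.eye(Vk.shape[1]))
    return p @ Vk @ Vk.T

rng = np.random.default_rng(0)
Ac = rng.normal(size=(50, 8))        # stand-in current perturbed matrix
Vk, k = perturbation_guidance(Ac, mu=0.8)
p = rng.normal(size=8)               # stand-in normalized rating profile
Rp = perturb(p, Vk)                  # perturbed vector, still length m = 8
```

Because R(p) already lies in the subspace spanned by Vk, projecting it again leaves it unchanged; this fixed-point property is exactly what the privacy analysis in Section 4.4 exploits.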

4.3 Accuracy Analysis

A theoretical analysis of recommendation accuracy is given in this section. Recall that after the server collects perturbed data from all users, it computes the item similarity ŝim(i, j) for each pair of items i and j from Â^T Â using Eq. (2). Later, if a user wants to compute the prediction on a certain item, the server sends her the similarities corresponding to this item, and she computes the weighted sum using her ratings on other items. This indicates that the accuracy of predictions is determined by the accuracy of the similarity computations; our accuracy analysis therefore focuses on the error in the item similarity computation.

Definition 1. The error in the similarity between items i and j is defined as the absolute difference between the similarity computed from the original data A and the one computed from the perturbed data Â, i.e., |ŝim(i, j) − sim(i, j)|.

We first prove the following lemma in order to derive an upper bound on this error.

Lemma 1. If s_{k+1} is the (k+1)th singular value of A, we have

  ‖A_i‖ − s_{k+1} ≤ ‖Â_i‖ ≤ ‖A_i‖,
  |(A^T A − Â^T Â)_{i,j}| ≤ s²_{k+1},

where A_i is the ith column of A.

Proof. As Vk is an estimate of the first k right singular vectors of A, Â = A Vk Vk^T = U_k U_k^T A is the optimal rank-k approximation to A. We have

  ‖Â_i‖ = ‖U_k U_k^T A_i‖ ≤ ‖U_k‖₂ ‖U_k^T‖₂ ‖A_i‖ = ‖A_i‖,
  ‖A_i − Â_i‖ = ‖(A − Â)_i‖ ≤ ‖A − Â‖₂ = s_{k+1},

and the lower bound ‖Â_i‖ ≥ ‖A_i‖ − s_{k+1} follows from the triangle inequality. Moreover, Â^T Â is the optimal rank-k approximation to A^T A, by the following:

  Â^T Â = Vk Vk^T A^T A Vk Vk^T = Vk Vk^T V S² V^T Vk Vk^T = Vk S² Vk^T.

As s²_{k+1} is the (k+1)th largest eigenvalue of A^T A, we have

  |(A^T A − Â^T Â)_{i,j}| ≤ ‖A^T A − Â^T Â‖₂ = s²_{k+1}.

Theorem 1. The upper bound of the error |ŝim(i, j) − sim(i, j)| of the similarity between items i and j is given as follows.

4.4 Privacy Analysis

We consider the possible disclosure of private information along both communication paths: from users to the server and from the server to users. The privacy disclosure from users to the server is considered first. Assume that the server (attacker) can maliciously choose the perturbation guidance to compromise users' privacy, and let Wk be the manipulated guidance. We first show that the server cannot derive a better approximation of the original normalized rating vector p from the perturbed vector R(p) it receives. According to the perturbation process, R(p) = p Wk Wk^T. Since det(Wk Wk^T) = 0, p cannot be deduced from R(p) deterministically. Moreover, the Moore-Penrose pseudoinverse of Wk Wk^T is Wk Wk^T itself, because Wk^T Wk is equal to the identity matrix (as verified by the user). Thus, given R(p), the least-squares approximation of p is

  R(p) Wk Wk^T = p Wk Wk^T Wk Wk^T = p Wk Wk^T = R(p).

Therefore, no better approximation of p can be derived from the received R(p).

We now analyze the degree of privacy guaranteed to users.

Definition 2. The degree of privacy is defined as the minimum conditional entropy of the original data A given the perturbed data Â when the server uses an arbitrary m-by-k matrix Wk as the perturbation guidance matrix. That is, the degree of privacy is

  min_{Wk} H(A | Â),

where H(·) denotes information entropy.

Because I(A; Â) = H(A) − H(A | Â) (where I(A; Â) is the mutual information between A and Â) and H(A) is fixed, our definition of the degree of privacy is consistent with using mutual information as a privacy loss measure, as in [1].

Theorem 2. In our scheme, the degree of privacy can be approximated as follows:

  min_{Wk} H(A | Â) = (1/2) log [ 2πe (s²_{k+1} + ··· + s²_n) / (mn − 1) ].

Proof. The perturbed data Â is equal to A Wk Wk^T. As the rank of Wk is no larger than k, the rank of Â is no larger than k either.
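The two bounds in Lemma 1 are easy to check numerically. The sketch below, using a random stand-in matrix rather than real rating data, projects A onto its top-k right singular vectors and verifies both inequalities:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 10))              # stand-in normalized rating matrix
_, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 4
Vk = Vt[:k].T                              # top-k right singular vectors
A_hat = A @ Vk @ Vk.T                      # optimal rank-k approximation

# ||A_i|| - s_{k+1} <= ||A_hat_i|| <= ||A_i|| for every column i
for i in range(A.shape[1]):
    n_a = np.linalg.norm(A[:, i])
    n_h = np.linalg.norm(A_hat[:, i])
    assert n_a - s[k] - 1e-9 <= n_h <= n_a + 1e-9

# |(A^T A - A_hat^T A_hat)_{ij}| <= s_{k+1}^2 entrywise
assert np.abs(A.T @ A - A_hat.T @ A_hat).max() <= s[k] ** 2 + 1e-9
```

Note that `s[k]` is s_{k+1} in the paper's 1-indexed notation; the entrywise bound follows because A^T A − Â^T Â is positive semidefinite with spectral norm s²_{k+1}.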
