Some Techniques for Privacy in Ubicomp and Context-Aware Applications

John Canny
University of California, Berkeley
[email protected]
Abstract. The emergence of ubiquitous computing opens up radical new possibilities for acquiring and sharing information. But the privacy risks from widespread use of location or environmental sensing are unacceptable to many users. This paper describes a new methodology that provides much finer control over information exchange: only the information needed for the collaboration is shared, everything else is protected, and the protection is provably strong. This allows us to explore collaborative applications in ubicomp settings that are exciting but would be difficult or impossible without the techniques we propose. Specifically, we are developing a ubiquitous information-sharing service. This service provides recommendations for places, events, and many other items and services, using recommendations from a community of users. The recommendations are both explicit, from user ratings, and implicit, inferred from log data recording a user's presence or use of a service. The service is intended for location-enabled devices like cell phones and PDAs with GPS.
1 Introduction
Our work here builds on the recent papers [1] and [2], which introduced techniques for aggregating private user data in a peer-to-peer setting. The applications presented there were to SVD (Singular Value Decomposition) and collaborative filtering, i.e. to providing recommendations to a community of users based on their shared preferences. We believe such techniques are highly applicable to ubiquitous computing applications, and we explore that further in this paper. Ubiquitous computing can gather rich information about users and their activity, and we believe there are many useful ways to use this information, especially for collaborative work. For instance, by tracking their location, users might obtain recommendations about restaurants, shops, places to see and things to do. But gathering such information creates great risks to privacy. The general thread of our work is to explore cryptographic and AI techniques to compute from user data only the information needed for a particular task, and to protect the rest of the data. This paper makes three contributions over previous work: (i) it extends the collaborative filtering algorithm from [2] to handle metadata as well as collaborative data, (ii) it gives a new and simplified method for privacy protection, and (iii) it describes the extra components needed for location-based collaborative filtering services. The result is the design of a system for collaborative location-based services. We have not fully implemented the service, but we describe the parts that we have implemented and give practical complexity analysis for the others.

1.1 Background
In [1], the author described techniques for encrypting user data, aggregating it, and revealing the result. The aggregation process described there was a peer-to-peer protocol, which succeeds if a large enough fraction of users are honest. It is also robust against a fraction of clients being offline. Each client acquires a copy of the aggregate by the end of the protocol, and can use it to get recommendations from their personal preferences. The aggregate provides good protection of the original user information.

The paper [1] also presented a new algorithm for model-based collaborative filtering (CF) based on SVD. The model is a compact description of all users' preferences from which individual predictions can be quickly made. Our goal in [1] was to come up with an algorithm which was compatible with the privacy-preserving protocol and which still gave acceptable recommendations. Fortunately, the CF algorithm compared quite well with the best published algorithms to date on a standard test dataset. However, we felt the algorithm in [1] was not as accurate as it could be. In particular it did not deal well with very sparse data, and that means most CF data. This difficulty with sparseness is a problem shared by virtually all current CF algorithms. In [2] we described a new algorithm which is compatible with our privacy protocol and which is more accurate than other CF algorithms on standard test data. It is also very space-efficient, providing a ten- to one-hundred-fold (typical) compression of the original sparse user data. This compression has the secondary benefit of hiding user information in an information-theoretic sense. Finally, the algorithm has speed advantages over previous algorithms and it is incremental, adapting rapidly to small changes and additions to user ratings.

The papers [1] and [2] provide the foundation for our exploration of collaborative filtering in ubicomp settings. We believe such ubiquitous sharing of knowledge has great potential. By adding user data for location, time and purchases if available, we can extend collaborative filtering from commerce to "everyday knowledge-sharing". To illustrate the possibilities we give some scenarios:

- Restaurant: Recommend a good nearby restaurant, specifying type of cuisine.
- Cafe: Recommend a good cafe nearby which is open now.
- Safe walk: What is the safest route across the park? What hours are safest?
- Nice sunset: Recommend a good place to watch sunsets.
- Cheap gas: Where is the cheapest gas near here?

We describe techniques for answering these queries, based on activity data from location sensors, time, and purchase records. Some of these data types are implicit in the queries above (e.g. a sunset recommendation is characterized using a time query). Our solution is a step towards combining collaborative data with other types of metric data, where the metrics encode other notions of preference.

1.2 Paper Outline
In section 2 we review the collaborative filtering method from [2]. The experiments in [2] suggest that this algorithm is probably the most accurate CF algorithm on continuous ratings data, and it has other advantages in speed and storage. We then explain how to extend the method to make use of metadata. Metadata for an item can allow accurate recommendations when user ratings for the item are unavailable. For example, metadata for a restaurant would include type of cuisine, price range, professional reviews, and even the text of the menu. In section 3 of this paper, we describe how to perform aggregation (vector addition) of many users’ data while preserving privacy. This is the basic step in all our CF algorithms. Our presentation here is client-server, and is different from that given in [1] which was peer-to-peer. In fact the type of aggregation scheme is independent of the inference algorithms used on top of it. So the CF/metadata algorithm from this paper can be used with either peer-to-peer or client-server aggregation. The client-server version is much simpler, and it is the scheme that we are implementing at this time in a testbed. The peer-to-peer version has some important advantages, but its complexity has steered us toward the simpler version for the time being. In section 4, we explain how to make use of location, time and purchase data to provide a rich set of recommendations. We close with a summary of related work in progress and future work.
2 CF by Sparse Factor Analysis
In collaborative filtering we want to extrapolate ratings for a given user from their known ratings and the ratings of others. We use low-dimensional linear models to do this; the rationale for such models is given in [2]. CF data is extremely sparse. The EachMovie dataset described later, which contains only 3% of the possible ratings, is considered a "dense" dataset. Other datasets may have 0.1% or even 0.01% of the possible ratings. A good CF method should deal with missing data in a principled way, and not, e.g., by filling missing values with defaults. For these reasons we chose a linear factor analysis model. Factor analysis (FA) [3] is a general probabilistic formulation of linear fit. Singular Value Decomposition (SVD) and linear regression are special limiting cases of FA. Factor analysis and SVD are sometimes confused, but FA is the more accurate method in general. SVD and linear fit are appropriate in certain limiting cases, but they are not very accurate in CF applications. Furthermore, FA can work with sparse data without heuristic fill-in of missing values.

Suppose there are $n$ items to be rated and $m$ users, and suppose we model user preferences with a $k$-dimensional linear space ($k$ is small, typically 4-20). Let $Y = (Y_1, \ldots, Y_n)$ be a random variable representing user ratings for the $n$ items, where $Y_j$ is the rating for item $j$. An observation (lower case) $y_i$ is a particular set of ratings by user $i$, so $y_{ij}$ is user $i$'s rating of movie $j$, say $y_{ij} = 5$ on a scale of 0 to 10. Let $X = (X_1, \ldots, X_k)$ be a random variable representing a user's $k$ canonical preferences, a kind of user profile. For instance, $X_1$ could be an individual's affinity for blockbuster movies; the other $X_j$'s would be affinities for other movie types. We do not need to know the meanings of the dimensions at any stage of the algorithm, and the model is generated automatically. The linear model for user preferences is

$$Y = \Lambda X + N \qquad (1)$$
where $\Lambda$ is an $n \times k$ matrix and $N = (N_1, \ldots, N_n)$ is a random variable representing "noise" in user choices. The linear model we are looking for comprises $\Lambda$ and the variances $\mathrm{VAR}(N_j)$ of the noise variables. $X$ is automatically scaled so that it has unit variance, and therefore we don't need to find $\mathrm{VAR}(X)$. For the time being, we assume all the $\mathrm{VAR}(N_j)$'s are the same, $\mathrm{VAR}(N_j) = \psi$. We also assume that $X$ and $N$ have Gaussian probability distributions. If we could observe $X$ and $Y$ at the same time we would have a classical linear regression problem. We can't observe $X$, so instead we have a factor analysis problem.

From the model, we can clarify the distinction between factor analysis and SVD or least squares. Factor analysis is a full probabilistic model: it models both the distribution of $X$ and the distribution of $N$. SVD or least squares seek only to minimize the variance of $N$. This is a reasonable approximation only if the variance of $X$ is large enough. In other cases, SVD or least squares will tend to overfit noisy data. In typical collaborative filtering applications, the variances of $X$ and $N$ are comparable, and SVD/least squares is much less accurate than factor analysis.

There are several algorithms available for factor analysis. We chose an EM (Expectation Maximization) approach [4] for two reasons: firstly because it has a particularly simple recursive definition which can be combined with our privacy method, and secondly because it can be adapted to sparse data. Our approach to dealing with sparseness is based on [5]. We skip the details of the derivation, which are given in [2]. Because user $i$ hasn't rated every item, we need to introduce an $n \times n$ "trimming" matrix $I_i$. The matrix $I_i$ is diagonal, with 1 in position $(j, j)$ when $j$ is an item that user $i$ has rated, and zero everywhere else. The matrix $I_i$ allows us to restrict the formulas for user $i$ to the items they have rated. We have

$$\Lambda_i = I_i \Lambda, \qquad M_i = (\psi I + \Lambda_i^T \Lambda_i)^{-1}, \qquad x_i = M_i \Lambda_i^T y_i \qquad (2)$$
where $I$ without the subscript is a $k \times k$ identity matrix. Using the derivation given in [2], to compute the factor analysis $(\Lambda, \psi)$ we evaluate the quantities:

$$A_i = \frac{1}{n_i} I_i \otimes (x_i x_i^T + \psi M_i), \qquad B_i = \frac{1}{n_i} I_i y_i x_i^T, \qquad C_i = \frac{1}{n_i} y_i^T I_i y_i \qquad (3)$$

where $\otimes$ is the Kronecker product (e.g. kron() in Matlab), and $n_i$ is the number of items actually rated by user $i$. It is important to notice that missing ratings "disappear" from equations (2) and (3): missing ratings in $y_i$ are always multiplied by a zero from the trimming matrix $I_i$. So the equations are independent of missing data, and there is no need to assign default values to the missing ratings. The user sends the results in encrypted form (see section 3) to the server. The server totals all the contributions, and computes:

$$L(\Lambda^{(p)}) = \Big(\sum_{i=1}^m A_i\Big)^{-1} \sum_{i=1}^m L(B_i), \qquad \psi^{(p)} = \frac{1}{m} \sum_{i=1}^m \big(C_i - \mathrm{trace}(\Lambda^{(p)} B_i^T)\big) \qquad (4)$$

where $L(A)$ is the matrix $A$ written as a vector with its columns stacked one on top of another (A(:) in Matlab). The server then decrypts the results and sends them to all the clients. The iterative procedure goes in rounds, with back-and-forth communication between clients and the server using equations (2), (3) and (4). We call these the distributed EM-FA equations. The complete procedure is to randomly initialize $\Lambda$ and $\psi$, run a few (say 10) iterations of linear regression (see [2]) to move $\Lambda$ nearer the maximum-likelihood value, and then run iterations of EM-FA until convergence is obtained. In practice, the EM-FA procedure converges so reliably that we use a fixed number of iterations (typically 15-25) of EM-FA. From then on, we continue to run iterations periodically, which keeps the model fit to the data as users change or add preferences.
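To make the round structure concrete, here is a minimal NumPy sketch of one distributed EM-FA round, with the encryption layer omitted. The function names (em_fa_client_step, em_fa_server_step) and the representation of A_i as an n x k x k block array (exploiting the fact that I_i is diagonal, so the Kronecker product is block-diagonal) are our own choices, not from the paper.

```python
import numpy as np

def em_fa_client_step(y, rated, Lam, psi):
    """E-step on one client. y: length-n ratings vector (zeros where unrated);
    rated: boolean mask of rated items; Lam: n x k loading matrix; psi: scalar noise.
    Returns the per-user quantities A_i, B_i, C_i of equation (3); A_i is stored
    block-wise, block j being I_i[j,j] * (x_i x_i^T + psi M_i) / n_i."""
    n, k = Lam.shape
    ni = int(rated.sum())
    Lam_i = Lam * rated[:, None]                       # Lambda_i = I_i Lambda
    M_i = np.linalg.inv(psi * np.eye(k) + Lam_i.T @ Lam_i)
    x_i = M_i @ Lam_i.T @ y                            # equation (2)
    S = np.outer(x_i, x_i) + psi * M_i                 # the k x k block
    A_i = rated[:, None, None] * S / ni
    B_i = np.outer(rated * y, x_i) / ni                # (1/n_i) I_i y_i x_i^T
    C_i = float(y @ (rated * y)) / ni                  # (1/n_i) y_i^T I_i y_i
    return A_i, B_i, C_i

def em_fa_server_step(A_sum, B_sum, C_sum, m):
    """M-step on the server from the (decrypted) totals over m users (equation 4)."""
    n, k, _ = A_sum.shape
    ridge = 1e-9 * np.eye(k)                           # guards items nobody rated
    Lam_new = np.stack([np.linalg.solve(A_sum[j] + ridge, B_sum[j]) for j in range(n)])
    psi_new = (C_sum - np.sum(Lam_new * B_sum)) / m    # C - trace(Lambda B^T), totaled
    return Lam_new, psi_new
```

In the deployed protocol the client would encrypt A_i, B_i and C_i as described in section 3, and the server would only ever see their homomorphic totals.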
2.1 Obtaining Recommendations
The result of the iteration above is a model $(\Lambda, \psi)$ for the original dataset $y$. It can be used to predict a user's ratings from any subset of that user's ratings. Let $y_i$ be user $i$'s ratings. User $i$ downloads $\Lambda$ and $\psi$ and then locally computes $x_i$ using equation (2). From $x_i$, we compute $\Lambda x_i$, which is a vector of expected ratings for all items. This includes predictions for all items which user $i$ has not rated.
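As a small illustration (our own sketch, continuing the NumPy conventions above), prediction is entirely local to the client:

```python
import numpy as np

def predict_ratings(y, rated, Lam, psi):
    """Expected ratings for all n items, given this user's known ratings only."""
    k = Lam.shape[1]
    Lam_i = Lam * rated[:, None]
    M_i = np.linalg.inv(psi * np.eye(k) + Lam_i.T @ Lam_i)
    x_i = M_i @ Lam_i.T @ y        # user profile, equation (2)
    return Lam @ x_i               # covers rated and unrated items alike
```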
2.2 Adding Metadata
Metadata is very useful for determining similarity between items. This can help make collaborative filtering more accurate when the user ratings data is sparse [6]. Several other authors have developed means for combining ratings and metadata [7-9]. Importantly, metadata also allows rating estimates for items that have not been rated at all by other users. This is inevitably the case for new items. Metadata comprises descriptors such as category labels, synopses, abstracts, or, if the item is itself a text, the full text of the item ("metadata" is not normally used to describe full text, but it is the best descriptor for us). We call each such descriptor an "attribute". Let $z_{ij}$ be the value of the $i$th attribute for item $j$. Some attributes may be binary-valued, in which case we use 1 to represent true and 0 to represent false. For instance, we captured movie genre data from the web, where movies were classified in 20 genres. The classification was not unique, so some movies were assigned to more than one genre. Each genre corresponds to an attribute. So $z_{ij} = 1$ if movie $j$ was classified in genre $i$, and $z_{ij} = 0$ otherwise. If there are $N$ attributes in total, then $i$ ranges over $1, \ldots, N$.

Each attribute can be thought of as a virtual user who has a pure preference for that attribute, i.e. as a user who gives items with the attribute their maximum vote, and all other items zero. This allows us to model attributes by simply adding rows to the matrix $y_{ij}$ before we compute the factor analysis. There are some extra details to make this work, but this is the basic approach. As we shall see later, it gives very useful improvements in accuracy. It requires only slight modification of the EM-FA algorithm, and it preserves the sparseness of both the ratings data and the metadata.

There is another important advantage to this approach. Because EM-FA handles missing data, it can immediately handle incomplete or missing metadata. E.g. suppose we have two sets of movies $M_1$ and $M_2$, and that they have been assigned to genres in two different movie sites $S_1$ and $S_2$. Unless the genre classifications happen to be the same, movies in $M_1$ that are not also in $M_2$ will not have a genre assigned from $S_2$'s genre system. This is exactly the kind of missing information EM-FA is designed to handle. Both genre classifications can be used, and movies that have not been assigned to a genre in one of the classifications are treated as missing data. This allows liberal use of metadata without having to worry about its availability for all items.

The key step to making this work is to adjust the model variance for the metadata items. In general, metadata will have much lower noise than normal user ratings. The noise can be estimated by running the EM-FA algorithm on an array of metadata and recording its noise estimate. Whereas before we assumed that the noise variance $\psi$ was the same for all users, it now becomes a function $\psi_i$ of the attribute in question. But in most cases, $\psi_i$ will be the same for a set of attributes: for instance, all the genre attributes should have the same $\psi_i$. Also, $\psi_i$ is a property of the metadata and not of a particular set of user data, so it can be estimated ahead of time. Modifications are needed to equations (2) and (3) for the quantities affected by $\psi_i$, and they become:

$$M_i = (\psi_i I + \Lambda_i^T \Lambda_i)^{-1}, \qquad A_i = \frac{1}{n_i} I_i \otimes (x_i x_i^T + \psi_i M_i) \qquad (5)$$

Substituting the noise variance of attribute $i$ for $\psi_i$ and evaluating the rest of the equations as before gives the combined model using user ratings and metadata.
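A minimal sketch of the "virtual user" construction, under our own assumptions about data layout (an m x n ratings array with a boolean mask, an N x n binary genre matrix, and a maximum vote of 10); the returned per-row noise vector would then be used in place of the single psi in the modified equations (5):

```python
import numpy as np

def add_metadata_rows(Y, rated, Z, psi_ratings, psi_meta, max_vote=10.0):
    """Y: m x n ratings (zeros where unrated); rated: m x n boolean mask;
    Z: N x n binary attribute (genre) matrix. Returns the stacked data, mask,
    and a per-row noise variance for the modified EM-FA of equation (5)."""
    meta_rows = Z * max_vote                    # attribute present -> maximum vote
    meta_mask = np.ones_like(Z, dtype=bool)     # genre data here is dense
    Y_all = np.vstack([Y, meta_rows])
    mask_all = np.vstack([rated, meta_mask])
    psi_rows = np.concatenate([np.full(Y.shape[0], psi_ratings),
                               np.full(Z.shape[0], psi_meta)])
    return Y_all, mask_all, psi_rows
```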
2.3 Experiments
The most common evaluation metric for CF is the MAE, or Mean Absolute Error, between predicted and actual ratings for a set of users. We used MAE exclusively in our experiments; this choice was justified by our own experience and by the observation from [10] that this measure and others "track each other closely". All the code for our algorithms, along with our implementations of earlier algorithms such as Pearson correlation, is available in MATLAB from the project website www.cs.berkeley.edu/~jfc/'mender/.

For evaluation, we used the EachMovie dataset from Compaq Corporation. Originally created by Digital Equipment Corp., EachMovie was compiled from a movie recommendation site on the web. EachMovie comprises ratings of 1628 movies by 72916 users. The dataset has a density of approximately 3%, meaning that 97% of possible ratings are missing. A number of experiments with EachMovie were presented in [2]. Those experiments show that EM-FA is probably the most accurate CF algorithm developed to date.

The metadata was genre data obtained from the Internet Movie Database. Each movie was classified into one or more of 20 genres. Thus 20 virtual users were created to model the metadata. These rows were dense: there were no missing values, because every movie was known to be either in or not in each genre.

We partitioned the ratings data into training and testing sets. The training set YT included a number of actual users which was either 0, 50, 100, 200, or 500 (see figure 1). YT also included the 20 metadata rows for the genre data. The factor analysis model was trained on YT, and then used to predict ratings for a disjoint set YP of 10000 users. For each user in YP, one of their ratings was withheld at random. The model was used to predict this rating, and the absolute error between predicted and actual rating was computed. We made several passes (typically 10) over the dataset YP in order to get accurate MAE estimates.

2.4 Discussion
Figure 1 shows that including metadata gives a substantial improvement in MAE when few ratings are available. To make sense of these numbers it is useful to have a "baseline" predictor. In this case, the per-user average, which is the average of a user's known ratings, can be used to predict missing ratings. The MAE for the per-user average is 1.28 on the EachMovie dataset. The figure shows plots of collaborative data + metadata for model dimension 6, and collaborative data only for dimensions 3 and 6, which are appropriate for this many users [2]. We also tried collaborative + metadata with model dimension 10, but the plot was identical to that for dimension 6.

The MAE using metadata only is 1.08, an improvement over baseline of about 15%. So even with no user data at all, metadata provides useful personalization information. With 50 users and model dimension k = 6, the MAE without metadata is 1.10; with metadata it is 1.05, an improvement of 5%. This may not sound like much, but collaborative filtering algorithms typically give improvements of only 10-30% over baseline predictors. To look at it another way, the MAE with metadata for 50 users is better than the MAE without metadata for 100 users, and the MAE with metadata for 100 users is comparable to the MAE without metadata for 200 users. Finally, the MAE score with 200 users and metadata matches the MAE scores reported in [11] and [12] for Pearson correlation prediction with 5000 users. So even simple metadata provides useful prediction. Larger gains may be possible with more elaborate metadata.

Fig. 1. MAE for the EM-FA algorithm with and without metadata (MAE, 0.96 to 1.10, versus number of training users, 50 to 500; curves: Collab + Metadata k=6, Collab only k=6, Collab only k=3).

2.5 Related Work on CF
Two recent survey papers on collaborative filtering [11], [12] compared a number of algorithms for accuracy on widely-available test data. Generally speaking, they found that correlation-based schemes did as well as any previous algorithm. Since the surveys, there have been a few papers which gave comparable or better results than Pearson correlation on some datasets. The first uses SVD [10], which gives a linear least-squares fit to a dataset. SVD was also used in our first paper [1] on CF with privacy. There are differences in the method of generating recommendations from the SVD, however: our scheme from [1] is based on a MAP (Maximum A-Posteriori probability) estimator, and was more accurate than the scheme from [10] in experiments. The latest version of our method (EM-FA) differs from both [1] and [10] by using the same probabilistic formulation to generate recommendations from a model and to construct the model itself. It is always more accurate than either of the SVD schemes. Another recent paper uses a probabilistic method called "personality diagnosis" (PD) [13], which gives better accuracy than correlation but requires all the user ratings to generate recommendations.

A number of recent papers have considered the use of metadata and ratings data together. In [7], the authors present a hybrid recommender. In [8], the authors use content-based agents to fill in missing ratings data. The paper [6] uses separate collaborative and content-based (metadata) recommenders and then combines the results with a weighted average. A very recent paper [9] uses a probabilistic aspect model to combine collaborative data and metadata.
3 Preserving Privacy
Having shown that we can reduce factor analysis to an iteration based on vector addition of per-user data $A_i$, $B_i$ and $C_i$, we next sketch how to do vector addition with privacy. Putting the two procedures together gives us factor analysis with privacy. The scheme we use for vector addition differs from [1], which gave a peer-to-peer protocol. We assume that a fraction $\alpha$ of users are honest; the value $\alpha$ must be at least 0.5. The goal of the protocol is that the server should gain no information about an individual user's data $y_i$, except that the user's data is within the allowed range (no ballot-stuffing).

The method uses a property of several common encryption schemes (RSA, Diffie-Hellman, ECC) called homomorphism. If $M_1$ and $M_2$ are messages, and $E(\cdot)$ is an encryption function, it turns out that $E(M_1)E(M_2) = E(M_1 + M_2)$, where multiplication is ring multiplication for RSA, or element-wise for DH or ECC. By induction, multiplying encryptions of several messages gives us the encryption of their sum. This seems to get us halfway there: we can add up encrypted items by just multiplying them. But how do we decrypt the total?

The decryption scheme is somewhat more involved. It relies on key sharing. The key needed to decrypt the total is not owned by anyone; it does not exist on any single machine. Instead it is "shared" among all the users. Like a jigsaw puzzle, if enough users put their shares together, we would see the whole key. There is some redundancy for practical reasons: we would not want to require all the users to contribute their shares in order to get back the key, or we could probably never get it back. Because the item that has been shared among the users is a decryption key, they can use it to create a share of the decryption of the total. To clarify this, everyone has a copy of the encrypted total $E(T)$. Each person can decrypt $E(T)$ with their share of the key, and the result turns out to be a share of the decryption of $T$. By putting these shares together, the users can compute $T$.

Now to the method in detail. We assume that each of $m$ users has an $M$-vector of data $G_i \in \mathbb{R}^M$ for $i = 1, \ldots, m$. The algorithm from the last section requires us to total $A_i$, $B_i$ and $C_i$, so $G_i$ is one of those quantities, and $M$ is set to the size of that quantity. We assume that every user data item $G_{ij}$ is an integer restricted to a small number of bits, say 10 bits. Since user ratings are typically quantized to very few bits and are very noisy, this is not a significant restriction. We further assume that a fraction $\alpha > 0.5$ of users are honest. The goal is to compute the vector sum $G = \sum_{i=1}^m G_i$ without disclosing any individual user data.

Preserving user privacy is not enough to make this protocol realistic. If users know their data is protected, they could make arbitrarily large contributions to the total in order to bias it. We prevent this by requiring users to give a zero-knowledge proof that their data is in the right range. We explain this in section 3.3. Our scheme follows the general architecture of the election scheme of Cramer, Gennaro, and Schoenmakers [14]. We will describe only ElGamal encryption, although the scheme can be extended to other public-key systems using ideas from [15].

3.1 Key Sharing
The goal of this step is to create a globally-known El-Gamal public key, and a matching private key which is held by no-one and instead secret-shared among all the clients. The key generation protocol of Pedersen [16] does this. After applying Pedersen's protocol, each client has a share $s_i$ of the decryption key $s$, and $s$ can be reconstructed from any set of at least $t + 1$ shares.

We need a bit more information about the fields used. Let $p$ and $q$ be large primes such that $q \mid p - 1$, and let $G_q$ denote the subgroup of $\mathbb{Z}_p^*$ of order $q$. In normal El-Gamal encryption, a recipient chooses $g \in G_q$ and a random secret key $s$, and publishes $(g, h = g^s)$ as their public key. In our case, the secret key $s$ is held by no-one and instead secret-shared among all the players. We assume that $p, q, g, h$ are known to all participants after Pedersen's protocol, as well as another generator $\gamma \in G_q$ needed for the homomorphism described in the next section. $\gamma$ can be chosen randomly by the server and broadcast to all clients. We also assume that each user publishes a public key $(g, h_i = g^{s_i})$ corresponding to $s_i$, which is needed to verify their decryption of data. We choose the encryption threshold to be greater than the number of untrusted users, which is $(1 - \alpha)m$. Taking $\alpha = 0.6$, for instance, gives a threshold of $t = 0.4m$, which allows the scheme to work correctly even when a significant fraction of clients are offline.
3.2 Value Encryption/Homomorphism
Each user $i$ has $M$ data values $G_{ij}$, $j = 1, \ldots, M$. To encrypt, user $i$ chooses $M$ random values $r_{ij}$, $j = 1, \ldots, M$ from $\mathbb{Z}_q$. The encryption of the data is then

$$\Gamma_{ij} = E(G_{ij}, r_{ij}) = (x_{ij}, y_{ij}) = (g^{r_{ij}},\ \gamma^{G_{ij}} h^{r_{ij}}) \pmod p$$

for $j = 1, \ldots, M$. In other words, each value is a standard El-Gamal encryption of the exponentiation of a vote, $\gamma^{G_{ij}}$. User $i$ sends these $M$ values to the server. The key property of this map is that it is a homomorphism: it allows us to compute the encryption of a sum of votes by simply multiplying the encryptions component-wise. That is,

$$E\Big(\sum_{i=1}^m G_{ij}, \sum_{i=1}^m r_{ij}\Big) = \prod_{i=1}^m E(G_{ij}, r_{ij}) \pmod p$$

for $j = 1, \ldots, M$. So after the server has multiplied together all the encryptions $\Gamma_{ij}$ it has received, it has an El-Gamal encryption of the sum of the $G_{ij}$'s.
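The following is a minimal, self-contained sketch of this encoding and its additive homomorphism. The parameters are toy-sized primes chosen only for illustration (a real deployment would use 1024-bit or larger keys), and the decryption at the end uses the full secret key purely to check the arithmetic; in the actual scheme the key is shared and never reassembled, as described in sections 3.1 and 3.4.

```python
# Toy exponential El-Gamal: Gamma = (g^r, gamma^G * h^r) mod p.
# Parameter choices and function names are our own; not a secure parameter set.
import random

p, q = 2039, 1019                    # small primes with q | p - 1
g = pow(2, (p - 1) // q, p)          # generator of the order-q subgroup
gamma = pow(3, (p - 1) // q, p)      # second generator, used for the vote exponent
s = random.randrange(1, q)           # secret key (shared among clients in the real scheme)
h = pow(g, s, p)                     # public key

def encrypt(G):
    r = random.randrange(1, q)
    return (pow(g, r, p), (pow(gamma, G, p) * pow(h, r, p)) % p)

def aggregate(ciphertexts):
    """Component-wise product of encryptions = encryption of the sum of the votes."""
    X, Y = 1, 1
    for x, y in ciphertexts:
        X, Y = (X * x) % p, (Y * y) % p
    return X, Y

# Three users contribute one coordinate each; the server sees only ciphertexts.
votes = [4, 7, 2]
X, Y = aggregate([encrypt(v) for v in votes])
# Sanity check with the full key (illustration only): recovers gamma^(sum of votes).
assert (Y * pow(pow(X, s, p), -1, p)) % p == pow(gamma, sum(votes), p)
```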
3.3 ZK Proof of User Data Validity
The weakness of the scheme so far is that, because the clients know that no-one can see their input, they may try to bias the result by sending arbitrarily large values to the server. To prevent this, we require clients to prove that their input is valid before it will be accepted. They do this using Zero-Knowledge Proofs (ZKPs). The ZKP is nothing more than an auxiliary set of large integers which proves that $(\Gamma_{i1}, \ldots, \Gamma_{iM})$ represents a valid input. An expensive way to do this is to give a ZKP for every element $G_{ij}$ that bounds its size, but this is neither efficient nor desirable. The amount of influence a single user has over the aggregate is more accurately modeled by the 2-norm of $G_i$. The squared 2-norm is just the sum of the squares of the elements of $G_i$, so we can bound the 2-norm of $G_i$ by bounding the sum of the squares of its elements. By proving this bound in zero-knowledge, we prove that $G_i$ is valid without disclosing any other information about it. The proof uses ideas from [15], and is given in Appendix II of [1]. As shown in that paper, the ZKP bounding the size of $G_i$ requires 7 large integers per element, which is $7M$ large integers. Of the values to be totaled, $A_i$, $B_i$ and $C_i$, $A_i$ has the most elements ($k^2 n$). So the size of the ZKP that proves one user's contribution is valid is about $7k^2 n$ large integers.
3.4 Tallying and Threshold Decryption
The server computes, for each item $j = 1, \ldots, M$, the product of all the homomorphic images that it receives:

$$X_j = \prod_{i=1}^m x_{ij}, \qquad Y_j = \prod_{i=1}^m y_{ij} \pmod p$$

and we notice that $Y_j = \gamma^{T_j} h^{R_j}$ and $X_j = g^{R_j}$, where

$$T_j = \sum_{i=1}^m G_{ij} \qquad\text{and}\qquad R_j = \sum_{i=1}^m r_{ij}$$

so $(X_j, Y_j)$ is an El-Gamal encryption of the desired sum $T_j$. To decrypt, we broadcast $X_j$ to all clients. Each client that receives $X_j$ applies their share $s_i$ of the secret key to it, and sends $X_j^{s_i} \pmod p$ to the server. Assume that for each $j$, the server receives at least $t + 1$ responses from some set $K$ of clients. The server computes:

$$P_j = \prod_{i \in K} (X_j^{s_i})^{L_{i,K}} = g^{s R_j} = h^{R_j} \pmod p$$

where $L_{i,K}$ is the Lagrange coefficient needed for interpolation. Finally, the server computes $Y_j P_j^{-1} = \gamma^{T_j} \pmod p$. Although computing $T_j$ from $\gamma^{T_j}$ requires taking a discrete log, the values of $T_j$ will be small enough ($10^6$ to $10^9$) that a baby-step/giant-step approach is practical. This can be done by many of the clients in parallel to speed up the process. In $O(\sqrt{|T_j|})$ steps, the value of $T_j$ will be found, and the client can send this value directly to the server for verification, since it is public.
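Below is a self-contained sketch of the tallying-side arithmetic: Shamir shares of the key, partial decryptions combined with Lagrange coefficients, and recovery of T_j by baby-step/giant-step. For brevity a trusted dealer creates the shares, whereas the paper relies on Pedersen's dealerless protocol [16]; the toy parameters and all function names are our own.

```python
import math, random

p, q = 2039, 1019                      # toy primes with q | p - 1
g, gamma = pow(2, 2, p), pow(3, 2, p)  # generators of the order-q subgroup
s = random.randrange(1, q)             # the (conceptually shared) decryption key
h = pow(g, s, p)

# A sample aggregated ciphertext (X_j, Y_j) encrypting T_j = 13 with randomness R_j.
T_true, R = 13, random.randrange(1, q)
X, Y = pow(g, R, p), (pow(gamma, T_true, p) * pow(h, R, p)) % p

def share_key(secret, t, m):
    """Shamir-share `secret` among clients 1..m; any t+1 shares reconstruct it."""
    coeffs = [secret] + [random.randrange(q) for _ in range(t)]
    return {i: sum(c * pow(i, e, q) for e, c in enumerate(coeffs)) % q
            for i in range(1, m + 1)}

def lagrange_at_zero(i, K):
    """Lagrange coefficient L_{i,K} for interpolating the polynomial at 0, mod q."""
    num = den = 1
    for j in K:
        if j != i:
            num = (num * -j) % q
            den = (den * (i - j)) % q
    return (num * pow(den, -1, q)) % q

def threshold_decrypt(X, Y, partials, K):
    """partials[i] = X^{s_i} returned by client i; K holds at least t+1 responders."""
    P = 1
    for i in K:
        P = (P * pow(partials[i], lagrange_at_zero(i, K), p)) % p   # P_j = h^{R_j}
    return (Y * pow(P, -1, p)) % p                                  # gamma^{T_j}

def discrete_log_bsgs(target, bound):
    """Baby-step/giant-step: find T in [0, bound) with gamma^T = target (mod p)."""
    step = math.isqrt(bound) + 1
    baby = {pow(gamma, b, p): b for b in range(step)}
    giant = pow(gamma, -step, p)
    cur = target
    for a in range(step + 1):
        if cur in baby:
            return a * step + baby[cur]
        cur = (cur * giant) % p
    return None

shares = share_key(s, t=2, m=5)
partials = {i: pow(X, shares[i], p) for i in shares}   # each client's partial decryption
gamma_T = threshold_decrypt(X, Y, partials, K=[1, 3, 5])
assert discrete_log_bsgs(gamma_T, bound=q) == T_true   # recovers T_j
```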
3.5 Computation and Communication Costs
We start with the time for a user to encrypt their information (without the zero-knowledge proof). In each round, the user computes $A_i$, $B_i$ and $C_i$ and sends them to the server. The size of $A_i$ is much larger than the others ($k^2 n$ entries). For EachMovie data with $n = 1600$ items and model dimension $k = 10$, $A_i$ has 160000 entries. Using recent benchmarks for the Crypto++ toolkit, a single exponentiation with 1024-bit keys takes about 10 ms, and encrypting a single item takes twice this long. So the time for a user to encrypt their information is about 3200 seconds, which is a little under an hour. This is not unreasonable: user ratings are added quite slowly, and it is very reasonable to run just one iteration of EM-FA per day. Once the model has been created, it can track the user data as it slowly increases. If we want better performance, we can also back away from EM-FA and go to a simpler least-squares regression method (see [2]). Least-squares regression requires only data of size $kn$ from each user. For EachMovie and $k = 10$, that drops the time to encrypt each user's data by a factor of ten, down to 320 seconds, or about 5 minutes.

On the server side, only multiplication of user data is needed. Aggregating the information from each user takes about 1.6 seconds for the EM-FA equations, or 0.16 seconds for the less accurate regression equations. So for 1000 users, the time to aggregate is less than half an hour (EM-FA) or about 3 minutes (regression). A single server could realistically handle 10000 users. Aggregation by multiplication is embarrassingly parallel, so the number of users supported scales linearly with the number of servers.

The crunch comes with checking ZKPs. Creating or checking a ZKP requires about 10 exponentiations per data item (see [1]). That means a user would need about 10 hours to generate a ZKP for an EM-FA iteration, or 1 hour for regression. The former time is excessive, although the latter is not unreasonable. Things are very difficult on the server end, however. The server is supposed to check the ZKPs, but this takes just as long as generating them (10 hours or 1 hour per user). One server cannot possibly check all the ZKPs from even a small community of users.
We are currently exploring improvements to the ZKPs. One promising approach is to use a "weak" ZKP on every round for every user, and then request a full ZKP from users who fail the weak test several times. In the weak ZKP, the server sends random weight vectors to each client, and asks the client to prove that the inner product between the weight vector and their data is small. Such a proof is very small, and could be checked in a few seconds. Thus the server could apply weak checks on every round to every client, and reserve the full-blown ZKPs for clients that appear suspicious. We also hope to be able to improve the time for the full-blown ZKP.

These computation times apply to a state-of-the-art PC; they are beyond the capabilities of mobile devices. On the other hand, if the user has a powerful machine available on the network (such as their home computer), they can communicate with it from their portable device over a secure channel such as SSL. The actual calculations and the transfer from PC to server could happen overnight while the network load is light. This scheme has other advantages, because the user's persistent and sensitive data is stored in a safe place (their home) rather than on a mobile device which might be lost or stolen. Other very trustworthy providers (e.g. data "banks") may provide the archival and client computation service in the future, in lieu of the home server.
4 Location, Time, and Purchase Records
The infrastructure developed so far supports collaborative filtering on "items" that users have rated either explicitly or implicitly. In a ubicomp setting, we want to construct sensible generalizations to "places" and "events" that may be localized in space and time. Developing these generalizations is our next step towards ubiquitous location-based recommender services. We also want to provide complementary information about the purchases the user has made, from a variety of sources. Everything in this section is future work.

4.1 Dealing with Location
GPS is an increasingly popular service for portable devices. It is available as an add-on for most PDAs, and it will soon be integrated into cell phones and combined phone/organizers. But location and place are not the same thing. At a minimum, we would like to know the names and significant metadata for commercial or public places that the user is in or near. An ideal data structure would be a "map" of a region (such as a city) which accepts location information from GPS and returns the identity of any business the user is in, or the street they are on, or the address of the private residence they appear to be in. It should also return such items within a specified distance of the user.

The reason for focusing on commercial and public places is that they are the most likely sites for collaboration. They are open to everyone, and the visitors to them will often be strangers to each other. Thus there is a good chance of gathering collaborative information that users would not otherwise have. Private residences, on the other hand, have much smaller communities of visitors, and those visitors normally form a tight social network; information-sharing is probably already happening there. The number of commercial and public places is also more manageable than the number of private residences. We would like to keep the number of "items" for collaboration small, say in the tens or hundreds of thousands. This should be enough to cover the number of businesses and public places even in a medium-sized city. Larger cities can always be segmented into neighborhoods, but by dealing with segments we will be exposing a little more user information (namely that a user is in a particular neighborhood).

Map lookup services of this kind are available from several companies today (Go2, Wcities, Onstar, Infospace). Simple geometric algorithms such as planar Voronoi diagrams can be used to partition the map and rapidly answer queries about the business that contains, or is nearest to, a given location (a small sketch of this lookup primitive appears at the end of this subsection). We have designed and written code for several kinds of Voronoi diagrams [21-24]. The construction of such diagrams is efficient in theory (pseudo-linear time) and in practice, and querying is also fast (log time) in practice. Often the main challenges are in dealing with degeneracies. In the planar case, these are usually manageable; if not, we have published an efficient and general scheme for dealing with them in [25].

The next step is to deal with notions of nearness. In this context, the primary notion of nearness is a perceptual one: what distances do users describe as "near" or "very near" or "far", etc.? We will rely on user studies to try to quantify this notion. The problem is really one of developing a metric: a function that maps actual distance to an abstract, perceptual notion of nearness (which might still be a numerical value) that can be used to trade off distance against other considerations: how should the system balance distance against user preference when making recommendations? It is very reasonable to learn such a function. In fact, it could be argued that the metric should be exactly the latent (hidden) function that best explains observed user behavior when making such tradeoffs. This will be our initial approach. Such a learning problem should be easy because the metric is a simple one-argument function that can be modeled with very few parameters.
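Here is a minimal sketch of the map-lookup primitive: given a GPS fix, return the nearest known business or public place, and everything within a "near" radius. A brute-force scan stands in for the Voronoi point-location structure described above (nearest-site queries are exactly Voronoi point location); the class name, the equirectangular distance approximation, the example radius, and the commented-out coordinates are our own illustrative assumptions.

```python
import numpy as np

class PlaceIndex:
    def __init__(self, names, lonlat):
        self.names = list(names)
        self.xy = np.asarray(lonlat, dtype=float)   # (num_places, 2) lon/lat pairs

    def _dist_m(self, lon, lat):
        # Equirectangular approximation, adequate at city scale.
        dx = (self.xy[:, 0] - lon) * 111_320 * np.cos(np.radians(lat))
        dy = (self.xy[:, 1] - lat) * 110_540
        return np.hypot(dx, dy)

    def nearest(self, lon, lat):
        d = self._dist_m(lon, lat)
        j = int(np.argmin(d))
        return self.names[j], float(d[j])

    def within(self, lon, lat, radius_m=400.0):
        d = self._dist_m(lon, lat)
        order = np.argsort(d)
        return [(self.names[j], float(d[j])) for j in order if d[j] <= radius_m]

# Illustrative usage (made-up coordinates):
# idx = PlaceIndex(["cafe:a", "bookstore:b"], [(-122.2585, 37.8686), (-122.2603, 37.8663)])
# print(idx.nearest(-122.259, 37.868))
```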
4.2 The Frontier
To be able to deal with large numbers of ratable places or times, we introduce a concept called the rating "frontier". The frontier is an integer-valued vector $F$. For elements in the frontier, we maintain only a count of the number of users that have rated them, not a model of user ratings. So let $F_j$ be the count of the number of users who have rated item $j$. The set of items in the frontier $F$ is typically much larger than the set of items modeled in $\Lambda$. For instance, if $F$ contains $k$ times as many items as $\Lambda$, then the vector $F$ and the matrix $\Lambda$ will both contain $kn$ elements. With easy extensions to the protocol we described in section 3, the counts in $F$ can all be maintained without disclosing user data. The set of items actually handled in the aggregate $\Lambda$ at each iteration would then be the subset of the $n$ most frequently-rated items from among the items counted in $F$. It would be an even smaller subset if there are fewer than $n$ items whose count lies above a cardinality threshold (e.g. $2k$), imposed for accuracy purposes. In this way, $\Lambda$ would only model ratings of reasonably popular items (items with at least $2k$ raters).

As well as protecting privacy and avoiding inaccurate extrapolations, this scheme allows a much larger set of items to be handled by the system with a small impact on storage and computation. For instance, for the EachMovie dataset with 1600 items and $k = 20$, maintaining a frontier with $kn = 32000$ items would only double the storage needed, and less than double the computation, compared to the basic protocol. Given the typically Zipf-like distribution of the number of raters per item, most of the items in the frontier will have very few raters and would not meet the cardinality threshold. Thus we could not provide accurate extrapolations for them. We can recognize this fact from the values in $F$, and advise the user of it.
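A toy sketch of the frontier bookkeeping (the class name and dictionary representation are our own; in the deployed system the counts would be accumulated through the homomorphic-addition protocol of section 3 rather than in the clear):

```python
from collections import Counter

class Frontier:
    def __init__(self, n_model, min_raters):
        self.counts = Counter()        # F_j: number of users who have rated item j
        self.n_model = n_model         # at most n items get rows in the aggregate Lambda
        self.min_raters = min_raters   # cardinality threshold, e.g. 2k

    def record(self, rated_items):
        self.counts.update(rated_items)

    def modelled_items(self):
        """The (at most n) most frequently rated items above the threshold."""
        return [item for item, c in self.counts.most_common(self.n_model)
                if c >= self.min_raters]

frontier = Frontier(n_model=1600, min_raters=2)   # a real deployment would use ~2k raters
frontier.record(["cafe:strada", "park:tilden"])
frontier.record(["cafe:strada"])
print(frontier.modelled_items())                  # ['cafe:strada']
```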
4.3 Dealing with Time
Our high-level goal is to allow users to share information about places, events and services. Events, such as concerts, movies, gallery openings etc., have temporal as well as spatial patterns. We have not yet found any potentially-collaborative events that have temporal patterning only (1), so temporal structure is best modelled as subordinate to spatial patterning. Temporal patterns can be very complex, but the most common ones are by day of week and by time of day. Time of day, in turn, is most often quantized by hour or half-hour. We know this from everyday experience, and we see it embedded in the user interfaces of popular calendar and scheduling programs.

The most naive approach to temporal patterning would be to split a place into, say, one-hour place-time chunks. Visits to the place at different times would count as separate items. This is mathematically fine, but practically impossible: it multiplies the number of places under consideration by the number of hours in a week, and it ignores the fact that some activities recur at the same time every day, or at the same time on several days of the week. We believe a sensible approach is to focus attention on the most popular times of the day or week, and ignore the others. This extends the use of a frontier in the following ways (see also the sketch at the end of this subsection):

- We assume that places are being monitored for user traffic. The total set of places forms a frontier Fp as in section 4.2. The most popular places are in an aggregate, and this popular set is perhaps 10 to 20 times smaller than the frontier Fp.
- For places that are in the aggregate (i.e. the most popular places), we create two new frontiers, Fd and Fh, which split the visits to the place by day of week and by hour of day respectively. These new frontiers would be about the same size as the original frontier Fp.
- The most popular 1/20th of the place-time combinations (over all such combinations) can then be further split by the other time measure, in order to discover activities that are patterned only by hour of the week. This generates a new frontier Fdh of similar size.
- Finally, the most popular place-time combinations from each of Fd, Fh, and Fdh are added to the aggregate and collaboratively mined.

The size of the aggregate is not increased significantly by this process, because splitting and selection of the most popular items are always used together. In this way we can focus the collaborative inference on the most popular recurring events, without having to process a much larger set of items.

There are many other issues with time that we could pursue. One-hour boundaries are not universal, and there is no reason not to include both larger and smaller durations. Care needs to be taken in how this is done so as not to inflate the number of place-time combinations, but again the meta-principle is to concentrate on only the most popular combinations. We believe that everyday life, much like the web and English-language texts, is "Zipf-like": most of people's energy is concentrated on the most popular events and places. We will be able to check this hypothesis, and in fact to quantify the Zipf exponent if the model fits, from our testbeds.

(1) One could argue that TV and radio programs have temporal but not spatial patterning, because they can be viewed or listened to anywhere. But a sensible response is that the channels are virtual places, and it is not useful to know the time of viewing without knowing the channel.
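A compact sketch of the hierarchical time frontiers, under our own data-layout assumptions: visits to aggregate places are counted by day of week (Fd) and hour of day (Fh), and only already-popular place-time combinations are refined further into Fdh.

```python
from collections import Counter
from datetime import datetime

F_d, F_h, F_dh = Counter(), Counter(), Counter()   # place-day, place-hour, place-day-hour

def record_visit(place, when, aggregate_places, popular_combos=frozenset()):
    """Count a visit; only places already in the aggregate are split by time."""
    if place not in aggregate_places:
        return
    day, hour = when.strftime("%a"), when.hour
    F_d[(place, day)] += 1
    F_h[(place, hour)] += 1
    # Refine by the other time measure only for already-popular combinations.
    if (place, day) in popular_combos or (place, hour) in popular_combos:
        F_dh[(place, day, hour)] += 1

def refine(counter, fraction=1 / 20):
    """Roughly the most popular 1/20th of the combinations counted so far."""
    keep = max(1, int(len(counter) * fraction))
    return {key for key, _ in counter.most_common(keep)}

record_visit("cafe:strada", datetime(2002, 5, 17, 9), {"cafe:strada"})
popular = refine(F_d) | refine(F_h)           # periodically recomputed
record_visit("cafe:strada", datetime(2002, 5, 24, 9), {"cafe:strada"}, popular)
```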
4.4 Dealing with Purchases
While many activities do not involve a payment, many others do. Payment records can provide information which is much more specific than the location or time of a transaction. For example, purchase records identify specific items bought from a business, which cannot be inferred from position information (e.g. which CD did the user buy? which movie did they see?). They include the vendor name, the date of the transaction, and the amount.

Payment records are accessible from several sources. First of all, digital wallets have been proposed and may become popular at some future time. They will be embedded in a PDA or smart-phone, and will provide an immediate record of the transaction. If a popular standard emerges during the lifetime of this project, we will attempt to write an interface between it and our CF system. Even now, it is possible to get purchase records from almost all credit card providers through their web-based customer services. At worst, this information needs to be extracted (screen-scraped) from the web pages returned from user queries. It is quite likely that in the near future, as with many other business and individual services, this information will be standardized into XML format, and it will then be relatively easy to import it into a user's activity record. In the near term, we will write converters for some common credit card providers.

Assuming purchase information is available, there is still the problem of using it. The number of products that could be considered by our system is daunting, many orders of magnitude larger than the location and time data. Our meta-principle for addressing this problem is the same as before. Although users could potentially buy a huge range of products, there are practical limits to what they actually buy, and relatively few products are popular enough to merit CF. Simple counting arguments show that if n users buy 20 products on average in a week, there will be fewer than n products bought by 20 or more users (there are only 20n purchases in total to distribute among them). In practice the number of popular items will be much smaller. The challenge is to discover these efficiently. Once again using the notion of a hierarchy of frontiers, it should be possible to adapt the focus of our system to the most popular items. We could, for instance, use existing hierarchies of product types, recursively refining categories that are popular among our community.
4.5 Testbeds
We plan to develop the system on two testbeds: a PDA platform (currently the HP Jornada) equipped with GPS, and a GPS-enabled cell phone. As noted before, these platforms cannot run the CF-with-privacy protocol themselves. We will configure each device to communicate through a secure channel with a normal networked PC which is the user's "home" machine (in our lab). That PC will collect user data and run the CF protocol. We also recently received a grant from Qualcomm Inc. to develop applications for their BREW platform (Binary Runtime Environment for Wireless, www.qualcomm.com/brew/developer/). Qualcomm supports the development of the gpsOne positioning system. This system is not yet available in current BREW phones, but should be available soon. So we hope reasonably soon to have at least two devices equipped with GPS and programmable enough to implement location capture and communication with a client PC.
References

1. Canny, J.: Collaborative filtering with privacy. In: IEEE Security and Privacy Conference, Berkeley, CA (2002)
2. Canny, J.: Collaborative filtering with privacy via factor analysis. In: ACM SIGIR 2002, Tampere, Finland (2002)
3. Frey, B.: Turbo factor analysis. Advances in Neural Information Processing (1999) (submitted)
4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39 (1977) 1-38
5. Ghahramani, Z., Jordan, M.I.: Learning from incomplete data. Technical Report AIM-1509, MIT AI Lab (1994)
6. Claypool, M., Gokhale, A., Miranda, T., Murnikov, P., Netes, D., Sartin, M.: Combining content-based and collaborative filters in an online newspaper. In: Proc. ACM SIGIR Workshop on Recommender Systems (1999)
7. Basu, C., Hirsh, H., Cohen, W.W.: Recommendation as classification: Using social and content-based information in recommendation. In: AAAI/IAAI (1998) 714-720
8. Good, N., Schafer, J.B., Konstan, J.A., Borchers, A., Sarwar, B.M., Herlocker, J.L., Riedl, J.: Combining collaborative filtering with personal agents for better recommendations. In: AAAI/IAAI (1999) 439-446
9. Popescul, A., Ungar, L., Pennock, D., Lawrence, S.: Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In: 17th Conference on Uncertainty in Artificial Intelligence, Seattle, Washington (2001)
10. Sarwar, B.M., Karypis, G., Konstan, J.A., Riedl, J.: Application of dimensionality reduction in recommender system - a case study. In: ACM WebKDD 2000 Web Mining for E-Commerce Workshop (2000) Full-length paper
11. Breese, J., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. Technical report, Microsoft Research (1998)
12. Herlocker, J., Konstan, J., Borchers, A., Riedl, J.: An algorithmic framework for performing collaborative filtering. In: Proc. ACM SIGIR (1999)
13. Pennock, D., Horvitz, E.: Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. In: IJCAI Workshop on Machine Learning for Information Filtering, International Joint Conference on Artificial Intelligence (IJCAI-99), Stockholm, Sweden (1999)
14. Cramer, R., Gennaro, R., Schoenmakers, B.: A secure and optimally efficient multi-authority election scheme. European Transactions on Telecommunications 8 (1997) 481-490
15. Cramer, R., Damgård, I.: Zero-knowledge for finite field arithmetic, or: Can zero-knowledge be for free? In: Proc. CRYPTO '98. Volume 1462, Springer-Verlag LNCS (1998) 424-441
16. Pedersen, T.: A threshold cryptosystem without a trusted party. In: Eurocrypt '91. Volume 547, Springer-Verlag LNCS (1991) 522-526
17. Gould, J.D.: How to design usable systems. In Baecker, R.M., Grudin, J., Buxton, W.A., Greenberg, S., eds.: Readings in Human-Computer Interaction: Toward the Year 2000. Morgan Kaufmann Publishers, San Francisco, CA (1995) 93-122
18. Nielsen, J.: Usability Engineering. Academic Press (1993)
19. Nielsen, J., Mack, R.L., eds.: Usability Inspection Methods. John Wiley and Sons, New York (1994)
20. Rettig, M.: Prototyping for tiny fingers. Comm. ACM 37 (1994) 21-27
21. Canny, J.: A Voronoi method for the piano movers' problem. In: IEEE Conference on Robotics and Automation (1985)
22. Canny, J., Donald, B.: Simplified Voronoi diagrams. In: ACM Symposium on Computational Geometry (1987)
23. Canny, J., Lin, M.: An opportunistic global path planner. In: IEEE Conference on Robotics and Automation (1990) 1554-1561
24. Mirtich, B., Canny, J.: Using skeletons for nonholonomic path planning among obstacles. In: IEEE Conference on Robotics and Automation (1992) 2533-2540
25. Emiris, I., Canny, J., Seidel, R.: Efficient perturbations for handling geometric degeneracies. Algorithmica 19 (1997) 219-242