Determining User Privacy Preferences by Asking the Right Questions: An Automated Approach

Keith Irwin
North Carolina State University
[email protected]
ABSTRACT

As the Internet is increasingly used for sensitive transactions, the need to protect user privacy becomes more and more important. One fundamental aspect of user privacy is respecting the privacy preferences that users have. A clear prerequisite to doing this is accurately gauging what users' privacy preferences are. Current approaches either offer limited privacy options or have so many choices that users are likely to be overwhelmed. We present a framework for modeling user privacy preferences in terms of a hierarchy of questions which can be asked. We describe two means of dynamically choosing which questions should be asked to efficiently determine what a user's privacy preferences are.
1. INTRODUCTION
Ting Yu
North Carolina State University
[email protected]

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 2002 ACM X-XXXXX-XX-X/XX/XX ...$5.00.

As distributed electronic systems, such as the Internet, are increasingly used for commerce, the importance of user privacy grows steadily. Initially, user privacy was primarily a question of security and anonymity. If users could know that their private actions were secret and that they could control which public actions could be connected to them, then they could be assured of total privacy. But as commerce becomes involved, it becomes increasingly necessary for users to disclose personal information about themselves in order to establish trust and to conduct more serious transactions, such as medical or financial transactions. It is well established that users care about their privacy, and specifically about how their private data is handled. Before they hand over their private information, users wish to know what it will be used for, whether it will be shared, and how long it will be stored. It is important to users to know that their data will be dealt with in a manner they consider acceptable. One way to assure users of this is to inform them of the way that a site handles private data. Then the user can decide whether or not to release their data. To help inform users on these sorts of issues, many web sites publish privacy policies which describe how users' private information will be handled. These privacy policies are written in natural language and are, unfortunately, often quite long and difficult to understand. They often contain legal, business, or technical language which users have difficulty understanding. To help ameliorate these problems, the World Wide Web Consortium (W3C) has proposed a machine-readable format for privacy policies called the Platform for Privacy Preferences (P3P) [2]. IBM has also released a machine-readable standard for privacy policy exchange called EPAL (Enterprise Privacy Authorization Language) [?], to be used for both expressing and enforcing privacy policies. However, these standards are only a partial solution to the problem. Having machine-readable privacy policies still requires that there be software agents working on behalf of users which can read and interpret these policies and match them against the users' preferences. A second potential approach is for sites to query the user about their privacy preferences, so that they can store the privacy preferences along with the data and treat the data in accordance with them. In this case, the onus would be on the site to reject the data if it cannot meet the user's preferences. Both of these approaches still leave open an important question: how do we determine what a user's privacy preferences are? We know that different users have wildly divergent privacy preferences. Some users are almost paranoid, wanting no information at all released except under the most stringent promises. Other users care very little about what will be done with their information. Most users fall somewhere on the continuum between those two points. So far, there have been two basic approaches to determining user privacy preferences. The first is to ask the user to provide an XML description of their privacy preferences. This can be very complex to use, but also very powerful, meaning that it can express very detailed and flexible user preferences.
The second approach is to give the user fairly limited options to choose from, often just a few check boxes. This is much easier to use, but not very powerful. We give examples of these two approaches in section 2. We propose a new approach for helping users specify their privacy preferences. Our approach is based on two observations. First, privacy preference elements can often be organized into hierarchies [?]. For example, in P3P, specific data items belong to different categories which are further organized into subcategories, forming a hierarchy. EPAL also provides mechanisms to structure the purposes of data usage into hierarchies. The second observation is that users’
privacy preferences tend to have some degree of locality. For example, a user will often have similar privacy preferences for two types of data that belong to the same data category. Therefore, it is possible to derive users' privacy preferences through a series of simple, dynamically generated questions. The derived privacy preferences can then be encoded automatically into machine-readable languages for agents to use. This approach is simple and easy for a user to understand while also being flexible enough to accommodate the preferences of a wide variety of users. In the next section, we introduce a variety of systems which in some way address this same problem, along with other related work. Then, in section 3 we describe the specifics of our approach. Finally, in section 4 we discuss our plans for the future implementation of our work.
2. RELATED WORKS
One means of specifying privacy preferences is through the W3C's APPEL (A P3P Preference Exchange Language) standard [?], an XML language for expressing privacy preferences which can be evaluated against P3P policies. The European Union's Joint Research Center (JRC) has created an APPEL evaluator module [?] which applies rule sets written in APPEL to P3P policies to see whether or not those user preferences are compatible with a site's P3P policy for a given resource. This simplifies the task of matching privacy preferences against P3P policies: all that is needed is for the user to write an APPEL file. Unfortunately, APPEL is not a simple language to write [?][?]. APPEL is not so complex that people cannot write it, but it is sufficiently complex that most people are not going to write it. Even computer programmers would need to spend some time dedicated to learning it in order to write non-trivial rules in APPEL. There is another tool from the JRC, the Ruleset editor [?], which provides a GUI interface for creating APPEL rules. This tool eases things somewhat. For instance, it removes the user's need to worry about syntax. However, it does not hide most of the underlying complexity of the APPEL standard. A further problem with APPEL is that it is very difficult to read. The intention of the W3C is that people craft APPEL rules and then share them with each other as needed. But if people cannot read and understand the rules themselves, they should be hesitant to accept them. There is a field in the specification to explain a rule's behavior in human-readable language; the need for such a field makes it clear that APPEL cannot easily be read by people. XPref [?] is another preference specification language which can be automatically matched against P3P statements. It was introduced to fix certain shortcomings of APPEL concerning the ability to accurately write a ruleset which implements a particular preference.
Although it does aim for some simplification of certain logical aspects of reasoning about rules, it also introduces more complicated syntax. It is possible that XPref is easier to write than APPEL, but it’s still a task for a computer programmer. And, like APPEL, it is also quite difficult to read and understand XPref statements once they have been written. AT&T’s Privacy Bird tool[?][?], which is a plug-in for Internet Explorer that monitors P3P policies for the user, takes the other primary approach. It has a simple, easy to use, easy to understand interface. Its options, however, are
limited. In the "Health or Medical Information" section, for example, there are only two check boxes, hence four possible settings. The "Financial or Purchase Information" and "Non-Personally Identifiable Information" sections also each have two check boxes. Only the "Personally Identifiable Information" section is significantly fleshed out, with six check boxes. Internet Explorer also uses P3P to make judgments concerning which cookies it should accept. The user interface for it is a slider with six settings. The only fine-grained control available is that sites can be added to a list of exceptions. Neither Privacy Bird nor Internet Explorer generates any output which could allow an autonomous user agent access to the user's privacy preferences in situations other than simple web browsing. Internet Explorer does accept some form of privacy preferences via an "Import" button, but what type of privacy preferences file is accepted, or how it would be created, is not clear.
3. ASKING THE RIGHT QUESTIONS
Our approach is to ask a series of dynamically-generated questions which the user can answer to inform agents about their privacy preferences. This allows us to cover a large number of options without overwhelming the user. In asking our questions, we have a number of goals. The first is that, for whatever we define as our set of options, the questions we ask should result in complete knowledge about the user's preferences for those options. The second is that the questions be approachable. This means that they should be written in clear, accessible language, that they should be multiple-choice questions, and that simpler questions are preferred to more complex ones. And our final goal is that the number of questions be as few as possible. In this paper, we focus primarily on how to achieve the third goal, the minimization of the number of questions. If a tool requires users to answer a long series of questions, they are unlikely to use it. We begin with the observation that there are likely to be patterns among the preferences of users. A given user's preferences are unlikely to be completely random. Users who do not want to share their address for marketing purposes are also likely to not share their phone number for marketing purposes. Also, users, viewed as a whole, are not going to have an even distribution of preferences. Some choices are going to be more likely than others. As such, we can make guesses about which questions we should be asking, and then, based on the responses, we can dynamically generate additional questions to ask. A set of questions generated dynamically for a particular user in response to their answers to previous questions should be able to achieve a greater level of specificity than asking a similar number of statically generated questions.
3.1 Model
In order to ask the right questions, the first thing we need is a model of the user’s preferences. In general, we are going to use a model in which there are a series of private data use scenarios and we are going to assume that each of these scenarios is either acceptable or not acceptable to the user. In the examples in this paper, a data use scenario will be composed of a data item and a usage for that item. Our overall approach allows for much more complex data models, but
for the sake of simplicity and brevity we choose to restrict ourselves to only data type and purpose. We choose data type and purpose as the components of a data use scenario in particular because this pairing seems most likely to represent the user's concerns and is the same pairing which is used in other systems such as Hippocratic databases [1]. We assume that each element of a data use scenario (data type, purpose, etc.) can be represented by some hierarchical taxonomy of values and categories. For instance, data items can be arranged into categories such as "contact information", "financial information", and so forth. Likewise, purposes can be arranged into a hierarchy. For example, "marketing" might be subdivided into "direct marketing" and "third party marketing". These might then be further divided into specific marketing types such as "telemarketing", "email marketing", and "direct mail marketing". Note that it is possible for some data items or purposes to fall into more than one category. For example, a cell phone number may be considered both a part of personal contact information and business contact information, since most people have only one cell phone which is used for both purposes. The hierarchy, therefore, is not strictly a tree, although it will generally be roughly tree-shaped. These hierarchies of data use scenario elements can be represented using a directed acyclic graph where the nodes represent values or categories and an edge from a node a to a node b represents that b is contained within category a. In general, we will refer to this as a parent/child relationship, although nodes may have multiple parents. At the top of the hierarchy must be a node which represents the special category "Any" which contains all of the values. As such, there must be precisely one node with no parents, i.e. with in-degree zero, which we shall call the root. The nodes with no children, i.e.
with out-degree zero, are the nodes which represent values rather than categories. These we shall call the leaf nodes. Given a set of hierarchies of data use scenario elements, we can create a hierarchy of data use scenarios out of tuples of nodes from the hierarchies of data use scenario elements. We will create one node in the new hierarchy for every possible tuple of nodes from the old hierarchies. There will be a parent-child relationship between two nodes (each a tuple) in the new hierarchy if they differ only in the value of a single data use scenario element and there is a parent-child relationship between those values in the hierarchy for that data use scenario element. See figure 1 for a simple example of a combined hierarchy.

Figure 1: Example of Combining Hierarchies (data hierarchy: "Any Data" with children "Name" and "Address"; purpose hierarchy: "Any Purpose" with children "Research" and "Sales")

For example, if we combine the data hierarchy and the purpose hierarchy, we get a hierarchy in which the leaf nodes represent the data use scenarios and every node represents a question which we could ask. An example leaf node might be
"Is it okay if a site uses your phone number for telemarketing purposes?" An intermediary node might be "Is it okay if a site uses your contact information for support purposes?" The root node, for example, represents the question "Is it okay if a site uses any of your data for any purpose?" We assume absolute answers for the leaf nodes: every data use scenario will either be acceptable or unacceptable to the user. Although there might be some advantage to allowing for more flexible options like "prompt me before allowing this", it complicates the model significantly. So we present only two possible answers for leaf nodes, "Yes" and "No". Intermediary nodes are more complicated. If all of the leaf nodes descended from an intermediary node have the same answer, then this answer should also be the answer to the intermediary node. This is because an intermediary node represents all data use scenarios which fall into the described categories. If all of them are acceptable, or all are unacceptable, then the categories, and hence the intermediary node, should be likewise. However, an intermediary node's descendant leaf nodes will often be mixed, so we need an additional "Maybe" option. What we call the "Yes", "No", and "Maybe" options for intermediary nodes correspond more closely to the English-language answers "This is always acceptable," "This is never acceptable," and "This is acceptable for some data and purposes." A complete set of answers to leaf nodes forms a complete set of user preferences. The values of non-leaf nodes can be determined from the values of leaf nodes. So, fundamentally, our goal is to find out the values for all of the leaf nodes. To accomplish this, we need to choose nodes and ask the corresponding questions.
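As a sketch of this model, the combined hierarchy can be built as tuples of element-hierarchy nodes, and an intermediary node's answer derived from its descendant leaves. The names and data structures here are illustrative, mirroring the small hierarchy of figure 1.

```python
# Illustrative element hierarchies (parent -> children), as in figure 1.
DATA = {"Any Data": ["Name", "Address"]}
PURPOSE = {"Any Purpose": ["Research", "Sales"]}

def combined_children(node):
    """Children of a (data, purpose) tuple differ in exactly one element,
    stepping one level down that element's hierarchy."""
    data, purpose = node
    return ([(c, purpose) for c in DATA.get(data, [])] +
            [(data, c) for c in PURPOSE.get(purpose, [])])

def leaves_under(node):
    """All leaf data use scenarios descended from (or equal to) node."""
    kids = combined_children(node)
    if not kids:
        return {node}
    return set().union(*(leaves_under(k) for k in kids))

def answer(node, leaf_answers):
    """A node is 'Yes'/'No' if all its descendant leaves agree, else 'Maybe'."""
    values = {leaf_answers[leaf] for leaf in leaves_under(node)}
    return values.pop() if len(values) == 1 else "Maybe"

# An illustrative user: sales use of the name is the only unacceptable scenario.
prefs = {("Name", "Research"): "Yes", ("Name", "Sales"): "No",
         ("Address", "Research"): "Yes", ("Address", "Sales"): "Yes"}
```

With these preferences, the root and ("Name", "Any Purpose") come out "Maybe", while ("Any Data", "Research") comes out "Yes", matching the rule that an intermediary node's answer is definite only when its leaves agree.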
3.2 Choosing Questions
In choosing questions we begin with two assumptions. The first is that the user will be able to answer questions consistently. If two nodes share descendants, the user will not answer "Yes" for the first and "No" for the second. Although we make this assumption in general, we do discuss how to handle inconsistency if it does occur in section 3.3. The second assumption is that the user will be able to answer any question we give them accurately. They will not answer "Yes" if the actual answer is "Maybe" or "No". The user can, however, answer "Maybe" if the actual answer is "Yes" or "No" without our approach being disrupted. One simple choice of questions would be to ask the questions corresponding to all the leaf nodes. Although this would offer complete flexibility to the user, it would also involve a very large number of options for the user to deal with. Another option is to choose a fixed set of nodes which provide fairly good coverage of likely user preferences and just ask those as "Yes" or "No" questions. This is effectively what the previous approaches which offer a limited set of options do. It makes things fairly easy for the user, but does not offer very much flexibility. Instead, we want to try to achieve flexibility without overwhelming the user, so our approach dynamically generates a series of questions for each user. Our general goal is to minimize the number of questions asked. Since no information about leaf nodes is gained if a user answers "Maybe" to a question, we want to attempt to minimize the number of "Maybe" answers.
3.2.1 Optimal Strategy
Now, if we knew ahead of time what all of a user's answers would be, we could figure out the optimal set of questions to ask. First, we would not ask any question which will be answered "Maybe". Then, from the remaining questions with "Yes" or "No" answers, we could choose the smallest set of nodes which cover all of the leaf nodes. The nodes which comprise this minimal cover could then be asked in any order. For instance, if we had a user for the example hierarchy from figure 1 who we knew would answer the leaf nodes (in left-to-right order) "Yes", "No", "Yes", and "Yes", then we could exclude the root node, the ⟨Any,Sales⟩ node, and the ⟨Name,Any⟩ node, because they would all have been answered "Maybe". Then we could see that an optimal cover set is ⟨Any,Research⟩, ⟨Address,Any⟩, and ⟨Name,Sales⟩. We refer to this as the optimal strategy. Similar to the optimal strategy for cache replacement, it is a theoretical strategy which cannot be implemented, but is instead introduced for purposes of comparison. The problem of determining the minimal set of covering questions is NP-complete. We prove this in Appendix A via a reduction from the minimum set covering problem. That it is NP-complete is somewhat unfortunate from the standpoint of efficiency. However, the running time is not of great concern, since the optimal strategy is only a theoretical baseline which will not be computed outside of experimental settings.
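To make the optimal strategy concrete, here is a small brute-force sketch that finds a minimal cover for the example user above. It is exponential in the number of nodes, as the NP-completeness result suggests; the node set is hard-coded for the figure 1 hierarchy and all names are illustrative.

```python
from itertools import combinations

# Leaf answers for the example user (left to right: Yes, No, Yes, Yes).
leaf_answers = {("Name", "Research"): "Yes", ("Name", "Sales"): "No",
                ("Address", "Research"): "Yes", ("Address", "Sales"): "Yes"}
leaves = set(leaf_answers)

# Descendant-leaf sets for every node of the combined hierarchy in figure 1.
covers = {("Any", "Any"): set(leaves),
          ("Any", "Research"): {("Name", "Research"), ("Address", "Research")},
          ("Any", "Sales"): {("Name", "Sales"), ("Address", "Sales")},
          ("Name", "Any"): {("Name", "Research"), ("Name", "Sales")},
          ("Address", "Any"): {("Address", "Research"), ("Address", "Sales")}}
covers.update({leaf: {leaf} for leaf in leaves})

def node_answer(node):
    values = {leaf_answers[leaf] for leaf in covers[node]}
    return values.pop() if len(values) == 1 else "Maybe"

# The optimal strategy never asks a question that would be answered "Maybe".
usable = [n for n in covers if node_answer(n) != "Maybe"]

def minimal_cover():
    """Smallest set of definite-answer nodes covering all leaves (brute force)."""
    for size in range(1, len(usable) + 1):
        for combo in combinations(usable, size):
            if set().union(*(covers[n] for n in combo)) == leaves:
                return combo

best = minimal_cover()
```

For this user the search confirms that no two definite-answer nodes cover all four leaves, so the minimal cover has size three, as in the ⟨Any,Research⟩, ⟨Address,Any⟩, ⟨Name,Sales⟩ example above.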
3.2.2 General Greedy Strategy

Let us say, instead, that we do not know all of the user's answers, but we know the distribution of the answers for all potential users. Let us assume that we know the exact joint distribution of all answers to leaf nodes. That is, for any possible combination of answers to leaf nodes, we know exactly how likely that combination is. Then we can use this to calculate the odds, for any given intermediary node, of it being a "Yes", a "No", or a "Maybe" by calculating the odds for each possible combination of values for its leaves. If we sum the "Yes" and "No" probabilities, this gives us the odds that we will gain information which determines the value of leaf nodes by asking this question. By then multiplying this by the number of leaf nodes which are descendants of the given node and whose answers we do not yet know, we can calculate the expected number of leaf nodes whose value will be determined if we ask the question associated with the node. We call this the expected leaf node determination. We could then form a simple greedy strategy by calculating the expected leaf node determination of each node and selecting the node with the highest expected leaf node determination to ask. Once we ask it, we adjust the known answers. If the node is answered "Yes" or "No", then we mark all of its descendants as being the same. If it is answered "Maybe", then we do not adjust the answers of any of its descendants. We could then readjust the expected leaf node determination of each node by utilizing our new knowledge and the joint distribution. This newly adjusted expected leaf node determination could be used to make a new selection. By repeating this process we could determine all leaf node answers. This is the basic form that each of our two practical strategies will take. We describe the steps in figure 2. However, knowing the exact distribution is impossible, and instead we must approximate it. Especially thorny is how we approximate the joint distribution of leaf node answers. We want not just to know the initial distribution, but also the distribution of the remaining undetermined nodes given the answers to the nodes which we have determined.

Figure 2: General Greedy Algorithm
1. For all nodes with unknown answers, calculate the expected leaf node determination, utilizing the approximation of the distribution.
2. Select the node with the highest expected leaf node determination. In the event of a tie, select the node closest to a leaf, because nodes closer to a leaf are likely to be simpler questions. If still tied, select arbitrarily.
3. Ask the question associated with that node.
4. Update answers in the hierarchy based on the received answer.
5. If unanswered nodes remain, go to step 1.
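The loop of figure 2 might be sketched as follows, with a pluggable estimate_odds function standing in for whichever approximation of the distribution is used. Both the hard-coded hierarchy and the constant-odds estimator below are illustrative, not part of the paper's strategies.

```python
# Combined hierarchy of figure 1: node -> set of descendant leaves.
LEAVES_OF = {("Any", "Any"): {("Name", "Research"), ("Name", "Sales"),
                              ("Address", "Research"), ("Address", "Sales")},
             ("Any", "Research"): {("Name", "Research"), ("Address", "Research")},
             ("Any", "Sales"): {("Name", "Sales"), ("Address", "Sales")},
             ("Name", "Any"): {("Name", "Research"), ("Name", "Sales")},
             ("Address", "Any"): {("Address", "Research"), ("Address", "Sales")}}
for leaf in list(LEAVES_OF[("Any", "Any")]):
    LEAVES_OF[leaf] = {leaf}

# Distance from the root; deeper nodes are closer to the leaves.
DEPTH = {node: 2 - list(node).count("Any") for node in LEAVES_OF}

def greedy_questioning(estimate_odds, ask):
    """Steps 1-5 of figure 2. estimate_odds(node, known) -> (p_yes, p_no);
    ask(node) -> "Yes"/"No"/"Maybe". Assumes leaf questions get definite answers."""
    known, asked = {}, set()
    while set(known) != LEAVES_OF[("Any", "Any")]:
        def expected(node):  # step 1: expected leaf node determination
            p_yes, p_no = estimate_odds(node, known)
            return (p_yes + p_no) * len(LEAVES_OF[node] - set(known))
        candidates = [n for n in LEAVES_OF
                      if n not in asked and LEAVES_OF[n] - set(known)]
        # Step 2: highest expectation; ties go to the node closest to a leaf.
        best = max(candidates, key=lambda n: (expected(n), DEPTH[n]))
        asked.add(best)
        reply = ask(best)                   # step 3
        if reply in ("Yes", "No"):          # step 4: "Maybe" determines nothing
            for leaf in LEAVES_OF[best]:
                known[leaf] = reply
    return known                            # step 5: loop until all determined

# Simulated user with fixed preferences; a constant-odds estimator for brevity.
prefs = {("Name", "Research"): "Yes", ("Name", "Sales"): "No",
         ("Address", "Research"): "Yes", ("Address", "Sales"): "Yes"}
def ask(node):
    values = {prefs[leaf] for leaf in LEAVES_OF[node]}
    return values.pop() if len(values) == 1 else "Maybe"
recovered = greedy_questioning(lambda node, known: (0.5, 0.3), ask)
```

Even with this deliberately uninformative estimator, the loop terminates and recovers the user's full leaf preferences; a better estimator only changes how few questions that takes.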
3.2.3 Database Strategy
Since we do not have all of the user data ahead of time, we instead have to find a way to make reasonable guesses. Here we utilize our earlier observation that there is likely to be a pattern amongst user answers and use this to help us form our guesses. A good source of information about the patterns would be to have a survey of user preferences as a starting point. The results of this would form a database of user preferences. Here is an example database which we could have collected from users for our example hierarchy.
User        ⟨Name,Research⟩  ⟨Name,Sales⟩  ⟨Address,Research⟩
Alex        Yes              No            Yes
Barb        Yes              Yes           No
Carlos      No               No            No
Darla       No               Yes           No
Emil        Yes              Yes           Yes
Frita       No               No            No
Guillermo   Yes              Yes           No
Henrietta   Yes              No            Yes
Inigo       Yes              Yes           Yes
Joy         Yes              Yes           No
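As a sketch, the initial odds and the expected leaf node determination can be computed directly from such a survey database. The code below uses the three answer columns shown above; the helper names are our own.

```python
# Survey database: user -> answers for the leaf scenarios tabulated above.
DB = {
    "Alex":      ("Yes", "No",  "Yes"),
    "Barb":      ("Yes", "Yes", "No"),
    "Carlos":    ("No",  "No",  "No"),
    "Darla":     ("No",  "Yes", "No"),
    "Emil":      ("Yes", "Yes", "Yes"),
    "Frita":     ("No",  "No",  "No"),
    "Guillermo": ("Yes", "Yes", "No"),
    "Henrietta": ("Yes", "No",  "Yes"),
    "Inigo":     ("Yes", "Yes", "Yes"),
    "Joy":       ("Yes", "Yes", "No"),
}
COLUMNS = [("Name", "Research"), ("Name", "Sales"), ("Address", "Research")]

def node_answer(leaf_set, row):
    """One user's answer for a node: Yes/No if its leaves agree, else Maybe."""
    values = {row[COLUMNS.index(leaf)] for leaf in leaf_set}
    return values.pop() if len(values) == 1 else "Maybe"

def odds(leaf_set):
    """Fraction of surveyed users answering Yes / No / Maybe for a node."""
    counts = {"Yes": 0.0, "No": 0.0, "Maybe": 0.0}
    for row in DB.values():
        counts[node_answer(leaf_set, row)] += 1
    return {k: v / len(DB) for k, v in counts.items()}

def expected_determination(leaf_set):
    o = odds(leaf_set)
    return (o["Yes"] + o["No"]) * len(leaf_set)

name_any = {("Name", "Research"), ("Name", "Sales")}
```

For instance, odds(name_any) comes out 0.5 "Yes", 0.2 "No", and 0.3 "Maybe", with an expected leaf node determination of 1.4, agreeing with the ⟨Name,Any⟩ row of the table below.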
Given such a database we can compute the initial distribution of the user answers, which should approximate the initial distribution of future users. Then we have two different approaches to generate the joint distributions. The first approach is to use the database directly. After each question, we can use a database query to find those users whose preferences match the preferences of our user so far. In the event that there are no users whose answers precisely match the answers we have received so far, we select a set of users whose answers are closest to the answers we have received, forming an approximate match. Users whose answers match the known answers of this user should be more likely to match the user's future answers as well. As such, we use this to approximate the joint distribution. Then this smaller sample set can be used to rebuild the distribution and choose the node with the highest expected leaf node determination, and the general greedy strategy can be used. We call this the database strategy. For example, if we are using the example hierarchy from figure 1, the following table gives the initial "Yes", "No", and "Maybe" odds and the expected leaf node determination for every node.
Node                  Yes   No    Maybe  Expected
⟨Any,Any⟩             0.1   0.2   0.7    1.2
⟨Any,Research⟩        0.4   0.3   0.3    1.4
⟨Any,Sales⟩           0.3   0.3   0.4    1.2
⟨Name,Any⟩            0.5   0.2   0.3    1.4
⟨Address,Any⟩         0.2   0.4   0.4    1.2
⟨Name,Research⟩       0.7   0.3   0      1.0
⟨Name,Sales⟩          0.7   0.3   0      1.0
⟨Address,Research⟩    0.4   0.6   0      1.0
⟨Address,Sales⟩       0.4   0.6   0      1.0

Based on this information, we would choose to ask ⟨Any,Research⟩ or ⟨Name,Any⟩, because they tie for the highest expected leaf node determination and are both the same distance from the leaves. If we choose ⟨Name,Any⟩, and our user answers "Yes", then we recompute, excluding the answers from the users who would have answered "No" or "Maybe" (Alex, Carlos, Darla, Frita, and Henrietta). This leaves us instead with the following.

Node                  Yes   No    Maybe  Expected
⟨Any,Any⟩             0.2   0     0.8    0.4
⟨Any,Research⟩        0.4   0     0.6    0.8
⟨Any,Sales⟩           0.4   0     0.6    0.8
⟨Name,Any⟩            1.0   0     0      0
⟨Address,Any⟩         0.2   0.4   0.4    0.8
⟨Name,Research⟩       1.0   0     0      0
⟨Name,Sales⟩          1.0   0     0      0
⟨Address,Research⟩    0.4   0.6   0      1.0
⟨Address,Sales⟩       0.4   0.6   0      1.0

We continue this process until we have determined the value of all leaf nodes.

3.2.4 Correlation Approximation Strategy

The database strategy has a potential downside, though. The size of the sample set is likely to decrease fairly quickly in successive rounds of questioning, and as it decreases, the accuracy of the approximation is also likely to decrease. It is entirely conceivable that, even with a significant initial database size, there would be no sample users who had given the same answers as our current user. As a result, a very large database would be necessary. Such a large database might be problematic to acquire, since it may be that certain types of users are not willing to allow us access to their privacy preferences. Also, it would be preferable to have a full stand-alone version of whatever tool we create to ask the questions, so that it could be downloaded and run locally by potential users. Obviously, having a large database which must accompany the client code would be detrimental. Because of these shortcomings we also have a second means of approximation. This one also begins with the sample database; however, we merely want to extract a more limited amount of helpful information from it and then use this to make our guesses. That way we can save space, but hopefully still achieve results of similar quality. The first step is to collect the distribution of answers for each leaf node from the database. Then we wish to use the information we have about the probabilities for the children nodes to determine the probabilities of their parents. The easy way to do this would be to use an assumption of independence, but that would destroy all correlation between different questions, and hence underestimate the true odds of getting a determining answer for a parent. It would be too space-consuming to attempt to capture all possible correlations between leaf nodes, but we can measure the degree of correlation between leaf nodes and their ancestors. To do this, we first collect the distribution of the ancestor node's answers from the database. Then we calculate two approximate indexes of correlation, Cy and Cn, for each intermediary node in the hierarchy. Cy is the positive index of correlation, which increases above one as the odds of all the leaf nodes descended from a given intermediary node being answered "Yes" exceed the odds of this happening if they were all independent. Cn is the negative index of correlation, which similarly increases as the odds of the leaf nodes all being answered "No" increase. If a given intermediary node p has k leaf nodes descended from it, named l1, l2, ..., lk, then

Cy = ( Pr(p = Yes) / ∏_{i=1..k} Pr(li = Yes) )^{1/k}

Cn = ( Pr(p = No) / ∏_{i=1..k} Pr(li = No) )^{1/k}

This implies that Pr(p = Yes) = ∏_{i=1..k} Cy · Pr(li = Yes) and Pr(p = No) = ∏_{i=1..k} Cn · Pr(li = No). The intuition behind these values is that, compared to the case where all values are independent, there is some extra probability of getting a given answer, caused by the correlation. We want to measure this extra probability and allow it to continue to influence the parent node's odds. But as questions are determined, the amount of this extra probability must decrease, so that we do not end up with a probability greater than one. The way we deal with this is to divide this extra probability amongst all of the nodes, so that each of them carries some of it. That way, when they are determined and effectively leave the equation, they take their share of the extra probability with them. There are many possible ways to divvy up this extra probability. We have chosen to divide it evenly among all leaves, because this is simple but seems likely to produce good results. Reusing our previous example, here is the table with Cy and Cn added to all non-leaf nodes.

Node                  Yes   No    Maybe  Expected  Cy    Cn
⟨Any,Any⟩             0.1   0.2   0.7    1.2       1.06  2.48
⟨Any,Research⟩        0.4   0.3   0.3    1.4       0.9   1.29
⟨Any,Sales⟩           0.3   0.3   0.4    1.2       1.03  1.29
⟨Name,Any⟩            0.5   0.2   0.3    1.4       1.01  1.49
⟨Address,Any⟩         0.2   0.4   0.4    1.2       1.12  1.11
⟨Name,Research⟩       0.7   0.3   0      1.0
⟨Name,Sales⟩          0.7   0.3   0      1.0
⟨Address,Research⟩    0.4   0.6   0      1.0
⟨Address,Sales⟩       0.4   0.6   0      1.0

It is possible to gain definite answers to one or more of an intermediary node's leaf nodes without gaining a definite answer to the intermediary node itself, either by asking a question about a node beneath the intermediary node or as a result of asking about other ancestor nodes which are not also ancestors of this intermediary node. Once this happens, we must modify the probability values for the intermediary node. Specifically, if a node now has leaves with a mix of "Yes" and "No" answers, the new probability of the intermediary node being "Maybe" must be 1 and the "Yes" and "No" probabilities must go to 0. If, however, all of the determined leaves are either "Yes" or "No", then the opposite odds go to 0, and we need to adjust our approximation of Pr(p = Yes) or Pr(p = No). To do this, we apply a modified form of the equation from before. Let us define U as the subset of {l1, l2, ..., lk} which have not yet had their answers determined. Then we define our new value as Pr(p = Yes) = ∏_{li ∈ U} Cy · Pr(li = Yes) or Pr(p = No) = ∏_{li ∈ U} Cn · Pr(li = No). In our example, we would again ask ⟨Name,Any⟩ and get an answer of "Yes". This would result in recomputing our "Yes", "No", and "Maybe" guesses. Cy and Cn do not change.

Node                  Yes   No    Maybe  Expected  Cy
⟨Any,Any⟩             0.17  0     0.83   0.34      1.06
⟨Any,Research⟩        0.36  0     0.64   0.36      0.9
⟨Any,Sales⟩           0.41  0     0.59   0.41      1.03
⟨Name,Any⟩            1.0   0     0      0         1.01
⟨Address,Any⟩         0.2   0.4   0.4    1.2       1.12
⟨Name,Research⟩       1.0   0     0      0
⟨Name,Sales⟩          1.0   0     0      0
⟨Address,Research⟩    0.4   0.6   0      1.0
⟨Address,Sales⟩       0.4   0.6   0      1.0
These are only approximations, but this allows us to avoid the assumption of independence and hopefully calculate reasonable results. This will help us most if the correlations between different leaf nodes tend to be larger when those leaf nodes share fairly recent ancestors. If the most significant correlations in the user preferences are between leaf nodes which are distantly related, then this approach will not be helpful. For instance, we expect that in our example the values of ⟨Name,Sales⟩ and ⟨Address,Research⟩ will not be strongly correlated with each other. We call this approach the correlation approximation strategy.
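As a small numeric sketch, the correlation indexes and the updated parent odds can be computed from marginal probabilities like those in the tables above. The helper names are our own, and the inputs are taken from the ⟨Name,Any⟩ and ⟨Any,Any⟩ rows.

```python
from math import prod

def correlation_indexes(p_yes, p_no, leaf_yes, leaf_no):
    """Cy = (Pr(p=Yes) / prod_i Pr(li=Yes))^(1/k), and symmetrically Cn."""
    k = len(leaf_yes)
    cy = (p_yes / prod(leaf_yes)) ** (1 / k)
    cn = (p_no / prod(leaf_no)) ** (1 / k)
    return cy, cn

def updated_parent_yes(cy, undetermined_leaf_yes):
    """Pr(p=Yes) re-approximated from the still-undetermined leaves only."""
    return prod(cy * p for p in undetermined_leaf_yes)

# Node <Name,Any>: Pr(Yes)=0.5, Pr(No)=0.2; its two leaves each have
# Pr(Yes)=0.7 and Pr(No)=0.3.
cy, cn = correlation_indexes(0.5, 0.2, [0.7, 0.7], [0.3, 0.3])

# Root node <Any,Any>: Pr(Yes)=0.1, Pr(No)=0.2 over four leaves.
cy_root, cn_root = correlation_indexes(0.1, 0.2,
                                       [0.7, 0.7, 0.4, 0.4],
                                       [0.3, 0.3, 0.6, 0.6])
# After the two Name leaves are determined "Yes", the root's Pr(Yes) is
# rebuilt from the two undetermined Address leaves (each Pr(Yes)=0.4).
root_yes_after = updated_parent_yes(cy_root, [0.4, 0.4])
```

This gives Cy ≈ 1.01 and Cn ≈ 1.49 for ⟨Name,Any⟩, and a root Pr(Yes) of about 0.18 after the Name leaves are determined, in line with the worked tables above up to rounding.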
3.3 Discussion
In practice, we expect that the database strategy will generally outperform the correlation approximation strategy on reasonably sized sample sets. However, the correlation approximation strategy remains worthwhile for reasons of space efficiency, and we could also form a hybrid strategy by using the database for most users but falling back to the correlation approximation strategy for users whose answers do not match anyone else’s in the database. An issue which we have not yet discussed is what to do in the event that we receive contradictory answers. This can happen in any of our strategies if nodes with overlapping sets of leaves are chosen. Although nodes which already have some descendants determined are less likely to be chosen, that can be outweighed by strong odds of obtaining a definite answer. As a result, there may well be questions asked about nodes with overlapping children, and
this may lead to contradictory answers. In that situation, we plan to first determine the conflicting sub-hierarchy, i.e., the set of nodes which are descendants of both questions, and then resolve it. A simple way would be to reset this sub-hierarchy to an untouched state, removing all answers from its nodes and restoring their “Yes” and “No” probabilities to their original values. But simply resetting them could allow some other high-level question, very much like the conflicting ones, to determine the answer. A better approach is probably to tell the user that a conflict has occurred and then ask a series of questions from within that sub-hierarchy to determine the answer. This will result in more questions, but the user is unlikely to object so long as they understand the reason. Further, the presence of a conflict is not just an indication of user error, but potentially an indicator that the user himself is conflicted. As such, it is wiser to delve into the conflict than to risk continuing to ask higher-level questions.
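Finding the conflicting sub-hierarchy amounts to intersecting the leaf sets beneath the two conflicting questions. A minimal sketch, assuming the hierarchy is stored as a child map (the node names and the `children` structure below are illustrative, echoing the paper's running example, not an actual implementation):

```python
def leaves_under(node, children):
    """Return the set of leaf nodes beneath `node`.

    children: dict mapping each node to a list of its children;
    a node absent from the dict (or with no children) is a leaf.
    """
    kids = children.get(node, [])
    if not kids:
        return {node}
    result = set()
    for child in kids:
        result |= leaves_under(child, children)
    return result

def conflicting_subhierarchy(q1, q2, children):
    """Leaves governed by both conflicting questions q1 and q2."""
    return leaves_under(q1, children) & leaves_under(q2, children)

# Hypothetical hierarchy mirroring the example in the paper.
children = {
    '(Any,Any)': ['(Name,Any)', '(Address,Any)', '(Any,Research)', '(Any,Sales)'],
    '(Name,Any)': ['(Name,Research)', '(Name,Sales)'],
    '(Address,Any)': ['(Address,Research)', '(Address,Sales)'],
    '(Any,Research)': ['(Name,Research)', '(Address,Research)'],
    '(Any,Sales)': ['(Name,Sales)', '(Address,Sales)'],
}
```

For example, if ⟨Name,Any⟩ and ⟨Any,Research⟩ gave contradictory answers, the conflicting sub-hierarchy would consist of the single shared leaf ⟨Name,Research⟩, and the follow-up questions would be drawn from there.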
4. CONCLUSIONS AND FUTURE WORK
We have presented a significant problem for protecting user privacy: how to adequately determine users’ privacy preferences. We described a general model for reasoning about user privacy preferences in terms of hierarchies. We have presented the theoretically optimal strategy for selecting questions to ask from that hierarchy, and we have presented two different practical approaches to generating dynamic questions to probe user privacy preferences. These methods represent a new approach to probing user privacy preferences which offers simplicity without sacrificing flexibility. The work which is currently underway is to build simulations of some simple user profiles and use these to validate the strategies by comparing them to the optimal strategy for those profiles. The work still to be done is to measure how well these approaches work with real users and how well they compare to the optimal strategy for those users. A good deal of work remains to accomplish this. We must complete the implementation of the tool, including dealing with the issues necessary to ensure that the questions are easy to read and understand. We must collect data from actual users to serve as the database which determines the approximate distribution. Then we must test the tool on real users and formulate a means of determining how accurately the tool has captured their preferences. For this last step, we may ask users to read sample website privacy policies and see whether their perceptions of their privacy preferences match up with their recorded privacy preferences. That is, does the output of our tool allow them to access the web sites that they would expect it to from reading the sites’ privacy policies? We may also employ experimental economics to measure how accurately we capture users’ privacy preferences relative to the value that they place on different data items.
5. REFERENCES
[1] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Hippocratic Databases. In 28th International Conference on Very Large Data Bases, Hong Kong, August 2002.
[2] W3C. Platform for Privacy Preferences (P3P) Specification. http://www.w3.org/TR/WD-P3P/Overview.html.
APPENDIX A. NP-COMPLETENESS OF MINIMUM NODE COVER
The problem of determining the minimal set of covering questions is NP-complete. Demonstrating this is fairly easy, as it is quite similar to the minimum set cover problem, a known NP-complete problem. We demonstrate that it is in NP by showing that the decision version of the problem is in NP. The decision version is “Does there exist a set of nodes of size N such that this set covers all leaves?” That this version is in NP is clear, since there is a short certificate when the answer is yes: namely, a set of nodes of size N which covers all the leaves. Since the decision problem is in NP, the optimization problem can be solved as well, by asking the decision problem no more than n times, where n is the number of nodes in the graph. To show that the problem is NP-hard, we show that the minimum set cover problem (MSC), which is NP-hard, can be reduced to it. Given an MSC instance with a set of elements S = {a1, . . . , an} and a set of subsets of S, {S1, . . . , Sk}, we construct a hierarchy with n + 1 leaf nodes {a0, . . . , an}, a root node R, and k intermediary nodes {S1, . . . , Sk}. All of the intermediary nodes are children of the root node. Each node ai for i ∈ {1, . . . , n} is a child of intermediary node Sj if and only if ai is in Sj. The node a0 is a child of the root node and only of the root node. The answer for a0 is “No”; the answer for all other nodes a1, . . . , an is “Yes”. By introducing a0, we have ensured that the root node will never have a definite answer, and hence it is not a candidate to be part of the minimum question set. Clearly, then, the minimum question set will be equal to the minimum set cover, since every leaf is covered if and only if it is in the set associated with a chosen intermediary node. Hence we can solve MSC by finding the minimum question set.
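The reduction itself is mechanical enough to sketch in code. This is an illustrative construction under the conventions above ('R' for the root, 'a0' for the extra leaf); the function name and data layout are our own assumptions, not part of the paper.

```python
def msc_to_hierarchy(elements, subsets):
    """Build the question hierarchy for the reduction from minimum
    set cover.

    elements: list of element names a1..an.
    subsets:  dict mapping each subset name Sj to the set of elements
              it contains.
    Returns (children, answers): the child map of the hierarchy and
    the forced answers for the leaves.
    """
    # Every intermediary node Sj, plus the special leaf a0, hangs
    # directly off the root R.
    children = {'R': list(subsets) + ['a0']}
    for name, members in subsets.items():
        children[name] = sorted(members)

    # a0 is a child only of R; its "No" answer guarantees that R never
    # has a definite answer, so R cannot join the minimum question set.
    answers = {'a0': 'No'}
    for element in elements:
        answers[element] = 'Yes'
    return children, answers
```

With this construction, any minimum question set consists solely of intermediary nodes Sj whose leaves all answer “Yes”, so it corresponds exactly to a minimum set cover of the original instance.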