Proceedings of the 2006 IEEE Workshop on Information Assurance United States Military Academy, West Point, NY
Aspects of Personal Information Theory

Sabah S. Al-Fedaghi
Computer Engineering Department, Kuwait University, P.O. Box 5969 Safat 13060 Kuwait (phone: (965) 4987412; fax: (965) 483946; e-mail: [email protected])

Abstract— This paper demonstrates that there exists a ground for building a personal information theory through the exploration of several notions such as personal information privacy, security, sharing, and mining. It introduces a methodology for developing a conceptualization of these notions in the personal information context. To illustrate unique techniques that can be applied only to personal information, we develop a general model for sharing personal information. A protection strategy, based on separating non-personal information from its proprietors, is introduced and applied to personal information.

Index Terms— Privacy, informational privacy, personal information
I. INTRODUCTION
The notion of privacy is becoming an important feature in all aspects of modern society. First, privacy is related to concern about the disclosure of confidential information. In many cases, the release of personal behavior and information causes embarrassment, even when there is no blame attached to the action. Protection of privacy is also necessary against the inappropriate utilization and unlawful use of personal information. Nevertheless, the appetite for personal information is increasing in all aspects of life. According to the 2005 report of the Privacy Commissioner of Canada, "New technologies designed for, or capable of, surveillance of individuals are widespread and are used not only by law enforcement and national security agencies. Businesses, and individuals … are gathering personal data …" [11]. Second, governments are collecting more personal information with the assistance of improved technology: "As law enforcement and national security agencies collect more information, from more sources, about more individuals, the probability increases that authorities will make decisions based on information of questionable accuracy or take information out of context" [11]. Third, privacy is related to transborder data flows. This implies that privacy is not only a "local problem"; it also concerns international parties that hold and process personal information [11]. In this paper, we concentrate our discussion on informational privacy, which we refer to as personal information privacy, as described later. Typically, informational privacy limits privacy to matters involving information. It is said to involve the establishment of rules governing the collection and handling of personal information.
II. PERSONAL INFORMATION THEORY

There is a long history related to the practice of collecting, storing, and analyzing information about individuals, their associates, and their activities. A personal information (PI) flow model (Fig. 1) provides a systematic method of understanding related notions and explains a broad variety of cases by illustrating the relationships between the different actors on personal information. According to Kang [9], "privacy involves the control of the flow of personal information in all stages of processing—acquisition, disclosure, and use." In general, personal information has "a tendency to propagate far from the initial context of its disclosure and to persist for long periods of time" [12].
[Figure] Fig. 1: Personal information flow model. Four phases (creating, collecting, processing (storing, mining), and disclosing personal information) with numbered flow points 1-7 linking proprietors, non-proprietors, and uses of the information.
The personal information flow model divides functionality into four modules or phases that include informational privacy entities and processes, as shown in Fig. 1. New personal information is created at points 1, 2, and 6 in the figure by
proprietors, non-proprietors (e.g., medical diagnostic procedures performed by physicians), or deduced by someone (e.g., data mining that generates new information from existing information). The created information is used either at point 5 (e.g., used in decision making), at point 4 (collected), or it is immediately disclosed (point 3). Not every act of creating personal information is considered gathering, as in the case of writing in one's own diary. Existing personal information is gathered at points 4 and 7 in Fig. 1, either after it is created or after it is disclosed: whenever you gather information, you gather it either from someone who has created it or from a source that is not necessarily its creator. The processing phase of personal information involves acting on PI (e.g., storing, data mining, marketing) for whatever purpose it was collected. For example, this might include building into the system the ability to challenge the accuracy, completeness, and updatability of the stored data (e.g., as required in the EU 1995 privacy directive). The disclosure phase involves releasing PI to insiders or outsiders; for example, this function is concerned with access control and security of PI. We can imagine personal information moving in this model from one phase to another.

"Personal information" is envisioned here as "informational objects" that have an "existence" within the information realm [4]. Traumas suffered by identity theft victims who have to clear their names time after time are symptoms of this feature. Further studies may connect this type of information to the so-called "meme," "a hypothetical unit of cultural transmission conceived not as an inert object but as a quasi-organic entity endowed with the capacity of self-replication …" [6]. The flow model makes PI visible as soon as it is physically produced from someone's brain (e.g., proprietors) or from systems (e.g., mining software). Then, in most cases, PI moves beyond the control of its creator, repeatedly propagating through the phases of the model: replicated, exchanged from hand to hand, changing forms (e.g., different codes), etc. That is why the model does not include an explicit "disposal" sink, which in simple create-destroy cases would indicate the disappearance of PI. The erasure of (one copy of) PI is implicitly included in the uses box in Fig. 1.

Our "model" reflects the personal information pattern that guides and restricts relationships among objects (e.g., proprietors, processors, miners) and phases. The purpose is to show relationships between processes in order to recognize, understand, and manipulate personal information. It complements other descriptions, such as the EU data protection directive, as an explicit representation of personal information flow in reality. For example, the EU directive lumps together all "processing of personal data" to mean "collection, recording, organization, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, blocking, erasure or destruction" [7]. In spite of this large number of terms bundled together, the EU directive does not distinguish the creation of PI as a distinct act. To give
an instance of the effect of this omission, exemptions from some of its provisions are given for "the processing of personal data carried out solely for journalistic purposes or the purpose of artistic or literary expression only if they are necessary to reconcile the right to privacy with the rules governing freedom of expression" [7]. This gives a blank check to "generate" new personal information. In our case, it is possible to restrict the exemptions to a certain phase or phases in the model.

Personal information theory is also based on an ethical foundation. Potential abuse of personal/private information raises many ethical, legal, and economic issues. One aspect of personal information theory is personal information ethics (PIE), which is based on the thesis that personal information itself has an intrinsic moral value. Recognition of the intrinsic ethical value of personal information does not imply prohibiting acting upon the information. Rather, it means that, while others may have a right to utilize personal information for legitimate needs and purposes, this should not be done in a way that devalues personal information as an object of respect. The human-centred significance of personal information derives from its value to a human being as something that hides his/her secrets, feelings, and embarrassing facts, and something that gives him/her a sense of identity, security, and, of course, privacy [1].

The notion of security in this context concerns how personal information is protected from malicious users while it moves through the four phases of the PI flow model. For example, the typical countermeasure against attacks in the processing phase involves enforcing access-permission policies. When malicious users gain access to personal data, the database system is responsible for protecting the personal information. Do personal information features affect security methods? What is the relationship between personal information and the general notion of privacy? Personal information privacy involves acts in reference to personal information; creating, collecting, processing, and disclosing, as reflected in the PI flow model, are examples of these acts.

The topic of "personal or personal information security" is frequently mentioned in the context of information security. Commonly used terms include "health data privacy and security" and "personal information security and privacy" [3]. There is also a large amount of legislative material about the Personal Information Security Act and the Personal Information Privacy Act, implying that PI has two separate aspects: security and privacy. We may grasp the difference between privacy and security in the context of personal information from the Health Insurance Portability and Accountability Act (HIPAA), which may be said to be a comprehensive venture in the direction of privacy. According to HIPAA, "Security refers to the specific measures and efforts taken to protect privacy and to ensure the integrity of personal information. Security is the ability to prevent unauthorized breaches of privacy, such as might occur
if data are lost or destroyed by accident, stolen by intent, or sent to the wrong person in error. The HIPAA privacy rules mandate that basic security measures be in place, and the forthcoming security regulations prescribe a comprehensive set of requirements with implementation features that must be in place to assure that individual health information remains secure" [5].

It seems from the first part of the definition that "security of personal information" is what everyone understands as the meaning of the basic term "security." Certainly, the security of non-personal information also includes such events as the following: "data are lost or destroyed by accident, stolen by intent, or sent to the wrong person in error," without these being labeled "unauthorized breaches of privacy." According to Spafford [10], "Recent events have increasingly focused public attention on issues of information privacy, computer and network security, cybercrime and cyber terrorism. Yet despite all of this attention, there is some confusion about what is actually encompassed by those terms." "Information security" focuses "not on computers and networks, but on the information..."

To clarify these notions, we need a definition of personal information. Defining personal information as "information identifiable to the individual" does not mean that the information is "especially sensitive, private, or embarrassing. Rather, it describes a relationship between the information and a person, namely that the information—whether sensitive or trivial—is somehow identifiable to an individual" [9].

A personal information theory includes a universal set of personal information agents, Z = V ∪ N, of two fundamental types of entities: individual and non-individual. Individual represents the set of natural persons V, and non-individual represents the set of non-persons N in Z [2].

Definition: Personal information is any linguistic expression that has referent(s) of type individual. Assuming that p(X) is a sentence such that X is the set of its referents, there are two types of personal information: (1) p(X) is atomic personal information if |X ∩ V| = 1; that is, atomic personal information is an assertion that has a single human referent. (2) p(X) is compound personal information if |X ∩ V| is greater than 1; that is, compound personal information is an expression that has more than one human referent.

The relationship between individuals and their own atomic personal information is called proprietorship. If p is a piece of atomic personal information of v ∈ V, then p is proprietary personal information of v, and v is its proprietor. Proprietorship gives "permanent" rights to the proprietor of the personal information. A single piece of atomic personal information may have many possessors, and its proprietor may or may not be among them. A possessor is any entity that knows, stores, or owns the information. Any compound personal statement is privacy-reducible to a set of atomic personal statements [2]. In this paper we will not examine the issue of different degrees of sensitivity of personal information.
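To make the atomic/compound distinction concrete, the following minimal Python sketch (our own illustration; the sets V and N and the helper classify are hypothetical, not part of the theory's formal apparatus) classifies an expression by counting its human referents:

    # Hypothetical illustration of the atomic/compound definition above.
    # V: natural persons (individuals); N: non-persons (agencies, companies).
    V = {"John", "Alice", "Robert"}
    N = {"X Inc.", "Mount Sinai Hospital"}

    def classify(referents):
        """Classify p(X) by the size of X intersected with V, per the definition."""
        humans = referents & V
        if len(humans) == 1:
            return "atomic personal information"    # single human referent
        if len(humans) > 1:
            return "compound personal information"  # several human referents
        return "non-personal information"           # no human referent

    print(classify({"John"}))                   # atomic
    print(classify({"John", "Alice"}))          # compound
    print(classify({"Mount Sinai Hospital"}))   # non-personal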
III. PERSONAL INFORMATION SECURITY

We can distinguish two types of information security: (1) personal information security (PIS), and (2) non-personal information security. As in the general field of information security, PIS can be defined in terms of private data and private communication integrity, secrecy (confidentiality), authentication, accessibility, and non-repudiation. Personal information security involves the protection of 'personal information' in and between the phases of the PI flow model from direct and indirect inferences, utilizing such protection mechanisms as authentication, encryption, anonymization, and randomization. In line with this paper's thesis that personal information lends itself to unique techniques applicable in different areas, we restrict PIS to the special security aspects that pertain only to personal information. Certainly, all information security tools can be applied to personal information security; however, PIS has unique characteristics, as discussed in this paper.
IV. BASIC MODEL

PIS is concerned with the protection of privacy in the context of information. This protection includes several aspects, such as preventing malicious personal information mining (PIM), as described in Section VIII. We use the notion of 'information sharing' as our basic environment to be protected. Security in this sense deals with methods for maintaining two facets of this sharing: the private sharing of information and/or the sharing of personal information. The basic PIS model of information sharing, shown in Fig. 2, includes five actors who may participate in the sharing process: proprietors, possessors, sharers, casters, and attackers.
[Figure] Fig. 2: PIS permits legal participants in sharing PI. The participants are proprietors (if the information is private), casters (individuals), information sharers, and possessors (individuals or non-individuals); attackers are excluded.
The casters are individuals who handle (e.g., communicate) personal information and are thus involved in the sharing of information (the arrows in the PI flow model). It is assumed that the first four actors are legal participants in information
sharing. PIS aims at excluding attackers from this process, as shown in Fig. 2, in order to protect the private sharing of information (e.g., identities of individuals who share non-personal information) and the sharing of personal information (e.g., identities of proprietors when they share personal information). In Fig. 2, 'sharing' can be classified as:

Personal sharing of information: The actors in this type of sharing are limited to individuals (i.e., persons); however, 'information' in this type of sharing entails two kinds of information: (a) Non-personal information: For example, if John and Robert share sensitive information (e.g., pornographic), then they want to preserve the privacy of this sharing. This sharing is called 'private communication' in the context of the classical communication model. (b) Personal information: For example, if John and Robert share compound personal information, then they want to preserve the privacy of this sharing as in (a), and also the privacy of their compound personal information.

Non-private sharing of information: The actors in this type of sharing are limited to non-individuals (i.e., companies, government agencies, etc.); however, 'information' in this type of sharing entails two types of information: (a) Personal information: For example, if two hospitals share personal information of their patients, the 'privacy' aspect is not the hospitals'; rather, it is the privacy of the patients. Our definition of privacy applies only to human beings. For example, "Mount Sinai Hospital in NY is very expensive" is not personal information, because the assertion does not include a referent of type person. If there is some information that a hospital does not want to reveal to others, then this is secret information, not personal information. (b) Non-personal information: For example, two companies share their technical information. The security in this situation is not directly a privacy-related issue.

The 'sharers' are the actors that participate in some type of information sharing. For example, in the classical communication model, the sender and receiver are sharers regardless of whether they are proprietors. 'Sharers of type individual' are of special importance in PIS because they are the persons whose privacy is to be protected. They can be classified into two non-exclusive types:
• Sharers who are proprietors, and
• Casters.
The casters are agents of type individual who deal with information in the information-sharing 'game' reflected in the PI flow model. Proprietors may be casters (e.g., agents acting on their own personal information), or they may not be casters (e.g., patients whose personal information other agents act on). Also, proprietors are sharers by virtue of the fact that their personal information is being shared. Similarly, (person) possessors may not be casters, and are considered sharers only because others' information in their possession is being shared.
Example: Consider the case of a person referred to as 'Deep Throat,' who works with X Inc., asking the reporter, John, to tell the FBI agent, Alice, that X Inc.'s owners, Robert and Jim, cheat on their taxes. Fig. 3 shows the actors in this personal information 'feast.' All actors are sharers in the process of information sharing. Notice that casters are responsible for "moving" the personal information, while possessing and proprietorship relationships may be static.

[Figure] Fig. 3: The set of sharers of the example. Casters {Deep Throat, John, Alice}, whose protection is the privacy of sharing; possessor X Inc.; and proprietors {Robert, Jim} of the assertion "X Inc. owners Robert and Jim cheat on their taxes," whose protection is the sharing of private information.
Casters also have, as proprietors, 'privacy' to be protected. Imagine a game of chess in which the pieces are persons. In this case, an assertion (e.g., John (the castle) has been knocked out of the game) about any piece-person is personal information, because it refers to a person. Furthermore, the persons who play the chess game (not the piece-persons) are the casters, so any assertion (e.g., Robert has moved John (the castle) to position a2) about these players is also personal information. Casters' privacy is the privacy of sharing information. In the usual communication model, it is the privacy of the communicators, regardless of the type of communicated data. However, being a caster does not imply being a (person) possessor of personal information. In the communication model, the (person) receiver is a caster even if, for some reason, the personal information has not actually arrived at its destination. The sharing in this case is potential possession of information: the receiver/caster participates in sharing the personal information even though the sharing may not be realized.

We adopt the notion of 'sharing' instead of, for example, the usual model of transmitting personal information from one actor to another, because sharing is a directionless, multi-agent concept. Sharing of personal information can be accomplished through communication, observation, participation (e.g., a bulletin board), etc. This implies making information available by providing access to information sources. It covers the cases of transferring information, in addition to the four notions (creating, collecting, processing, and disclosing) described in the PI flow model. Thus, personal information methodologies, such as the one proposed in this paper, can be
applied to the 'anonymized transferring' of personal information or to the building of constraint-based mining techniques in privacy-enhanced database systems. Fig. 4 shows the two aspects of the security of 'sharing personal information' and 'private sharing of information,' categorized according to the type of information.
SHARING \ INFORMATION   Personal                                Non-personal
Private                 (Case 1) PIS protects proprietors       (Case 3) PIS protects casters
                        and casters
Non-private             (Case 2) PIS protects proprietors       (Case 4) Non-PIS security

Fig. 4: Personal information security is not concerned with the non-private sharing of non-personal information.
Cases (1) and (2) in Fig. 4 are situations in which there is a need to protect the sharing of personal information. Cases (1) and (3) are situations in which there is a need to protect the private sharing of information.

Examples:
(a) Sharing of personal information (case (2)):
Reuter → John divorced his wife, Alice → Newsweek
Reuter and Newsweek are non-individual sharers (non-casters). However, John and Alice are also sharers, because the operation of sharing the information involves their personal information. This sharing of personal information does not involve casters.
(b) Personal sharing of information (case (3)):
Bob → Hurricane Katrina caused 20 billion dollars in damage → Sam
In this case, the sharers are individuals (casters), so the operation of sharing non-personal information is privacy-related. Bob and Sam are identifiable individuals; hence, from a PIS perspective, there may be interest in protecting the privacy of the act of sharing information between these casters.
(c) Personal sharing of private information (case (1)):
Bob → John divorced his wife, Alice → Sam
Bob and Sam are casters, while John and Alice are proprietors. In this case, we need to protect the privacy of the casters who share the information and the privacy of the proprietors whose personal information is shared.
On the other hand, personal information security is not concerned with the following sharing:
(d) Reuter → Hurricane Katrina caused 20 billion dollars in damage → Newsweek
This case (4) involves neither private sharing of information nor sharing of personal information.

Consequently, personal information security (PIS) is concerned with the protection of information sharing when: Individuals ∩ Sharers ≠ ∅ OR Proprietors ≠ ∅. That is, PIS is involved when at least one of the sharers is a person, or the information is personal information (about persons). "Security" in this case means "the security of privacy": guarding information about persons.
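The condition above can be read as a small predicate. The following sketch (our own illustration with hypothetical names) checks the four cases of Fig. 4:

    # Hypothetical sketch: PIS applies iff (Individuals intersect Sharers) is
    # non-empty, or the set of proprietors is non-empty.
    def pis_is_concerned(sharers, individuals, proprietors):
        return bool(sharers & individuals) or bool(proprietors)

    individuals = {"Bob", "Sam", "John", "Alice"}

    # Case (2): non-individual sharers, personal information of John and Alice.
    print(pis_is_concerned({"Reuter", "Newsweek"}, individuals, {"John", "Alice"}))  # True
    # Case (3): individual sharers (casters), non-personal information.
    print(pis_is_concerned({"Bob", "Sam"}, individuals, set()))                      # True
    # Case (4): non-individual sharers, non-personal information.
    print(pis_is_concerned({"Reuter", "Newsweek"}, individuals, set()))              # False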
V. ANATOMY OF PERSONAL INFORMATION

What exactly is the form of personal information that is to be secured in sharing personal information? Answering this question identifies an example of the unique features of personal information that can be applied to several fields besides security. Since all personal information is reducible to atomic personal information, the security problem of personal information can be projected in one form: how to protect pieces of atomic personal information and their relationships. As we will discuss later, the security problem, even though it is projected in atomic form, may not necessarily always be implemented in terms of atomic personal information.

The process of protecting a piece of atomic PI involves: (1) protection of the identity of the proprietor, and (2) protection of the non-private portion. Protecting compound personal information additionally involves protecting the relationships among the pieces of its constitutive atomic personal information. For example, John loves Alice embeds the atomic assertions John loves someone and Someone loves Alice. Thus, it is necessary to preserve the set {John loves someone, Someone loves Alice} to preserve the embedded semantics. The method used to reconstruct the original compound assertion is a non-security problem. Rather, we are concerned with the integrity of the compound assertion, such that the two atomic assertions do not mix with other atomic assertions in the database, such as Robert loves someone. If we have the set {John loves someone, Someone loves Alice, Robert loves someone}, then we might wrongly conclude that Robert loves Alice. Clearly, any compound personal information can be converted into a set of pieces of atomic personal information and relationships among these pieces.

Definition: The canonical form of a piece of personal information T is (TA, PT), where TA is its anonymized version and PT = {P1, P2, …, Pn} is the set of proprietors.

Anonymization here means the absence of any type of proprietor in the assertion. If we can map the personal information to its proprietor by any means, then the assertion is not an anonymized assertion. The protection of the identities of the proprietors involves the protection of PT, and the protection of the non-private portion of T involves the protection of TA. Of course, all information security tools, such as encryption, can be applied in this context. We will investigate other methods that specifically target personal information as previously defined, utilizing the unique structure of personal information.
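As a small illustration of the canonical form, the sketch below (our own; the naive string replacement merely stands in for real anonymization, which must also remove indirect identifiers) reduces a piece of PI to (TA, PT):

    # Hypothetical sketch: reduce personal information T to canonical form (TA, PT).
    # The replacement below is for illustration only; real anonymization must
    # remove any means of mapping the assertion back to its proprietors.
    def canonical_form(assertion, proprietors):
        ta = assertion
        for p in proprietors:
            ta = ta.replace(p, "someone")  # strip proprietor references
        return ta, set(proprietors)        # (TA, PT)

    print(canonical_form("John loves Alice", ["John", "Alice"]))
    # -> ('someone loves someone', {'John', 'Alice'})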
VI. PROTECTION METHOD

Partitioning data according to some criteria (e.g., horizontally) is a known technique in the area of privacy-preserving data mining. Privacy-preserving mining also utilizes anonymization of private data (e.g., k-anonymization) to facilitate useful mining of such data. Using personal information features, we envision partitioning and anonymization as usual 'states' of personal information as it is stored or exchanged. Partitioning is performed according to the unique structure of personal information, and anonymized data are utilized in reconstructing the original data. The notion of partitioning is also connected with the known concept of 'unlinkability,' which involves separating people's identities from non-personal information. Our approach introduces a refinement of unlinkability that distinguishes 'unlinkability to identities' and 'unlinkability across identifications' (proprietors of compound personal information). Notice that our approach is an information-based methodology in which private assertions are the sole factor in determining privacy-related notions such as linkability and anonymity, in contrast to non-private informational, communication-based factors such as devices, services, and activities (e.g., probabilities).

Let S denote the set of sharers; the PIS model of sharing personal information is defined as the tuple (S, (TA, PT)).

Examples: Suppose that John, Albert, and Josef gossip about Alice by saying that Alice is obnoxious. This case can be represented as: ({John, Albert, Josef}, (Someone is obnoxious, {Alice})). Suppose John sends an e-card to Alice that says John and Alice are in love forever. This can be represented as: ({John, Alice}, (Someone and someone are in love forever, {John, Alice})).

The protection of compound personal information T can be reduced to the protection of the corresponding set of atomic assertions {T1, T2, …, Tm} and their relationships. Each piece of atomic personal information Ti can be represented as (TAi, Pi), where TAi is the anonymized version of Ti and Pi is its proprietor. Now, the security problem with respect to T can be expressed in terms of protecting each (TAi, Pi). Protecting any (TAi, Pi) necessitates protecting: (1) TAi, (2) Pi, and (3) the mapping between TAi and Pi. Such a taxonomy of the objects-of-protection can be utilized in building the protection mechanism. The mere separation of TAi, Pi, and the mappings is a form of protection in addition to the usual protection methods (e.g., encryption, data hiding). The attacker's efforts will be divided among several components. This 'separating strategy' involves the separate sharing of pieces of anonymized information, lists of identifiers, and mapping lists, such that the sharers reconstruct the 'whole' of the original personal information from these components. If the personal information includes several compound personal information assertions C′, C′′, …, then the general form of the model is: (S, ((TC′A1, PC′1), (TC′A2, PC′2), …, (TC′Am, PC′m), (TC′′A1, PC′′1), …)). In actual implementation, it is possible to have an incomplete reduction of the original compound personal information to atomic assertions, as in the earlier example John and Alice are in love forever, which was kept as a single anonymized assertion instead of being reduced to two atomic assertions.
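A rough sketch of the separating strategy follows (our own illustration; the reduction to atomic assertions is done by hand, and all identifiers are hypothetical). The compound assertion is split into anonymized assertions, proprietor identities, and mapping lists, each of which could be stored or transmitted separately:

    # Hypothetical sketch of the 'separating strategy' for John loves Alice.
    # Hand-made reduction to canonical atomic parts (TA_i, P_i):
    atomic_parts = [
        ("Proprietor loves someone", "John"),   # (TA1, P1)
        ("Someone loves proprietor", "Alice"),  # (TA2, P2)
    ]

    # Component 1: anonymized assertions under opaque identifiers.
    anonymized = {f"t{i}": ta for i, (ta, _) in enumerate(atomic_parts)}
    # Component 2: proprietor identities under their own opaque identifiers.
    proprietors = {f"p{i}": p for i, (_, p) in enumerate(atomic_parts)}
    # Component 3: the mappings, plus the link that keeps the compound's integrity.
    mapping = {"t0": "p0", "t1": "p1"}
    links = [("t0", "t1")]  # so the atomic parts do not mix with other assertions

    # Legal sharers holding all components can reconstruct the original PI;
    # an attacker capturing a single component learns little on its own.
    def reconstruct(anonymized, proprietors, mapping):
        return {t: (ta, proprietors[mapping[t]]) for t, ta in anonymized.items()}

    print(reconstruct(anonymized, proprietors, mapping))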
This model of grouping proprietors of different pieces of compound personal information introduces interesting variations on the notion of linkability/unlinkability within and between sets. To simplify our model, we will assume that there are only two actors who share personal information. This is the classical communication model, in which the sender and the receiver exchange personal information.

VII. SECURITY ANALYSIS OF CERTAIN CASES

The PIS model introduced in the previous sections can be taken as a base for analyzing different types of security concerns that involve the sharers.

7.1 Only Proprietors

Let us assume that S = PT in the basic model (S, (TA, PT)); i.e., the sharers are all proprietors. This situation is typical of email, phone conversations, etc. We restrict this case to two proprietors exchanging compound personal information. In the classical communication model, the sender and the receiver are proprietors who exchange 'their' personal information. This model is represented as ({P1, P2}, (TA, {P1, P2})), where P1 and P2 are the proprietors. They are also the casters and the sharers. Assume that the given information T is compound personal information. Its atomic personal information version is: ({P1, P2}, (TA1, {P1})) and ({P1, P2}, (TA2, {P2})).

Example: Suppose that the given information is John told Alice he loves her. It can be represented as: ({John, Alice}, (Someone loves someone, {John, Alice})). Or, it can be represented as two atomic assertions: ({John, Alice}, (Proprietor loves someone, {John})) and ({John, Alice}, (Someone loves proprietor, {Alice})).

The security problem is how to protect the proprietors when they share their personal information with each other. It involves protecting TA and {P1, P2}. Protecting {P1, P2} requires protecting P1 and P2 as proprietors and as casters (e.g., communicators). This involves: (1) protecting the identities of the casters, (2) protecting the identities of the proprietors, (3) protecting the anonymized information, and (4) protecting the relationship between the anonymized information and the proprietors. In the communication model, this can be accomplished by transmitting each component separately. Similar discussions apply to the case in which the personal information is represented by two atomic assertions. One advantage of this method is that when an attacker captures one part of the personal information, it does not reveal the embedded personal information. In this case, there are several alternative situations:
If the attacker does not know the identities of the casters S in (S, (TA, PT)):
(1) Capturing only {P1, P2}, the attacker can conclude that there is something in common that involves P1 and P2, but he/she does not know the nature of this common thing. (Notice that we concentrate on the identities of the casters and the content of the transmitted message; we do not deal with typical data mining analysis, such as partial information about which parties are likely to engage in which actions.)
(2) Capturing only TA is useless for the attacker from the privacy point of view.

If the attacker knows the identities of the casters:
(1) Capturing only {P1, P2}, the attacker can conclude that there is something in common that involves P1 and P2, but he/she does not know its nature.
(2) Capturing TA may uncover the personal information.

The best possible PIS result is making ({P1, P2}, (TA, {P1, P2})) appear as ({P1′, P2′}, (TA, {P1, P2})), where {P1′, P2′} ∩ {P1, P2} = ∅. That is, it involves 'fooling' the attacker into not linking the communicating parties with the proprietors.

Example: The following is part of a letter: "Sweet, incomparable …, what a strange effect you have on my heart! Are you angry? Do I see you looking sad?" Suppose that the attacker knows that this letter was sent from Napoleon Bonaparte to Josephine. Then the attacker concludes that there is an intimate relationship between Napoleon and Josephine. Suppose instead that the attacker knows only that the letter was exchanged between a confidant on Napoleon's staff, say Morris, and Josephine's maid, Mary. In this case, the sharing of information may represent any of the following: ({Morris, Mary}, (TA, {Morris, Mary})), ({Morris, Mary}, (TA, {Napoleon, Josephine})), ({Napoleon, Josephine}, (TA, {Napoleon, Josephine})), among other possibilities. This is a type of anonymization that involves Morris, Mary, Napoleon, and Josephine.

7.2 Gossip-based Cases

This case can be represented for two proprietors as ({P1, P2}, (TA, {P1, P2} ∪ Q)), where Q ≠ ∅ is a set of other proprietors. It is a variation of the previous case with respect to the attacker. The interesting case is when TA is anonymized only with respect to {P1, P2}. This is the classical gossip news: personal information shared between non-proprietors. There may be a situation that requires the protection of gossip; the attacker may be a competing newspaper gossiper who tries to steal the news from another gossiper. So, if TA is completely anonymized, (TA, Q) is the sensitive part of ({P1, P2}, (TA, {P1, P2} ∪ Q)) that needs further protection. Here, we propose to apply the separation of the compound information, giving two pieces of compound personal information: ({P1, P2}, (TA1, {P1, P2})) and ({P1, P2}, (TA2, Q)).
Example: Suppose that reporter Robert sends the following email to his boss, Sam: To Sam from Robert: Alice secretly married Jim. It can be divided into two pieces as follows: ({Robert, Sam}, (To someone from someone, {Robert, Sam})) and ({Robert, Sam}, (Someone secretly married someone, {Alice, Jim})).

Example: Suppose we remove our assumption that S = PT. Consider the following compound personal information: REUTER reports to PUBLIC that John told columnist Sam that Alice secretly married Jim. This can be represented as: ({REUTER, PUBLIC}, (Someone told someone, {John, Sam})) and ({John, Sam}, (Someone married someone secretly, {Alice, Jim})). REUTER is an agency, while PUBLIC denotes all actors in the system.

7.3 No Casters

Suppose that casters = ∅. This is the classical case of an agency that shares its collection of personal information with another agency, where each piece of personal information is (N, (TA, PT)), with N a set of non-individual sharers. Typically, information about these non-individual sharers and the anonymized assertion TA are not sensitive. Thus, the user of such data knows the hospital and the anonymized data but does not know the sensitive portion that relates to the identities of the patients. The security problem here involves only protecting the sharing of personal information; it does not involve the private sharing of information.

Example: Suppose that hospital H1 sends a medical record of John to hospital H2 such that TA and PT are transmitted separately. This involves: (1) creating TA, i.e., anonymizing the patient's medical record, and (2) sending TA and PT separately. The mapping from the set of proprietors to TA is a technical problem that does not present difficulty. Consider the compound personal information The patient whose name is John has a doctor whose name is Robert. It can be reduced to two atomic assertions: Patient's name is John and Doctor's name is Robert. The compound PI is shown in Fig. 5 using the richer language OWL.
[Figure] Fig. 5: Compound PI represented in an OWL graph as two triples: a blank node x with property 'patient of' pointing to John, a blank node y with property 'doctor of' pointing to Robert, and x linked to y by owl:sameAs. The x and y nodes are blank nodes, while the other two nodes are resources of type proprietor.
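A plain-triple rendering of Fig. 5 is sketched below (our own illustration; the property names patientOf and doctorOf are hypothetical labels for the figure's edges, and the identifier list is an invented example of the separate transmission described above):

    # Hypothetical rendering of Fig. 5 as subject-predicate-object triples.
    # "_:x" and "_:y" are blank nodes; John and Robert are proprietors.
    triples = [
        ("_:x", "patientOf", "John"),    # atomic: patient's name is John
        ("_:y", "doctorOf", "Robert"),   # atomic: doctor's name is Robert
        ("_:x", "owl:sameAs", "_:y"),    # link joining the two atomic statements
    ]

    # Anonymization for transmission: proprietor names become identifiers,
    # which hospital H2 receives separately from the graphs and the link.
    identifiers = {"John": "p1", "Robert": "p2"}
    anonymized = [(s, p, identifiers.get(o, o)) for (s, p, o) in triples]
    print(anonymized)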
The x and y variables in each statement must be uniquely declared. The graph is converted into a representation of the two anonymized atomic statements: Someone is the patient, and someone is his/her doctor. Hospital H2 would receive separately (a) the two anonymized graphs, (b) their identifiers, and (c) the link between the two atomic statements.

VIII. APPLICATIONS

'Personal information (data) mining' (PIM) is one possible operation on personal information in the processing phase of the PI flow model. It is an important feature because it may create new personal information through deduction based on possessed personal information. PIM lies at the intersection of the fields of data mining and personal information theory and is used to uncover the privacy aspects of information. It includes analysis of personal information for privacy relationships and privacy-based correlations. For example, according to our model, (S, (T, ∅)), i.e., non-individual sharers with non-personal information, is outside the field of concern of PIM. In PIM, a data-mining attack can be classified according to the type of target information as follows:

(1) Determining the identity of proprietor(s) from non-personal information. For example, determining the identity of the patient from anonymized information that gives age, sex, and zip code in health records.

(2) Determining atomic PI from a set of atomic assertions. For example, if the attacker knows the type of drug taken by a patient, the attacker may deduce the type of disease. An attack such as this targets the personal information of one proprietor to deduce more personal information about the same proprietor. In our methodology, the proprietor's identity and the non-personal information are separated both in storage and in communication, as described previously. Furthermore, deduced personal information is monitored in PIM.

Example: Assume a database that contains two relations, PAYROLL (NAME, SALARY) and EMPLOYEE (NAME, ADDRESS). It is possible that "individuals with an exceptionally high salary may not want their payroll information in the same record as their address… Such detailed records may contain enough information to identify them" [8]. However, the operations of join and projection may uncover the address of, say, the president of the company, John, whose salary is 250K. In [8], preventing such an operation is determined by "the owner of the data," who gives or withholds permission to use the data in a join operation. Suppose that the PAYROLL record is (John, 250K) and the EMPLOYEE record is (John, 227 Maple Street). The 250K in (John, 250K) represents (the attribute) The person whose salary is 250K. Since it is a unique value in PAYROLL, it is an identifier of the individual, John. Thus, the tuple (250K, 227 Maple Street), produced by join/projection operations, is the atomic personal information The person whose salary is 250K lives at the address 227 Maple Street. PIM knows it is a personal assertion because The person whose salary is 250K refers to a single referent of type person in the database.
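The join/projection inference can be reproduced in a few lines. In this sketch (hypothetical data; the extra rows are ours, added to show that only a unique value betrays its proprietor), a salary unique in PAYROLL re-identifies John:

    # Hypothetical reproduction of the join/projection inference from the example.
    PAYROLL = [("John", "250K"), ("Alice", "60K"), ("Bob", "60K")]
    EMPLOYEE = [("John", "227 Maple Street"), ("Alice", "12 Oak Ave"), ("Bob", "5 Elm St")]

    # Join on NAME, then project (SALARY, ADDRESS), dropping NAME.
    joined = [(sal, addr) for (n1, sal) in PAYROLL
                          for (n2, addr) in EMPLOYEE if n1 == n2]

    # A value unique in PAYROLL still identifies its proprietor:
    salaries = [sal for (_, sal) in PAYROLL]
    for sal, addr in joined:
        if salaries.count(sal) == 1:  # unique value acts as an identifier
            print(f"Atomic PI: the person whose salary is {sal} lives at {addr}")
    # Only the 250K tuple is flagged; the two 60K tuples stay ambiguous.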
(3) Determining compound personal information from a set of personal assertions. This type of attack reflects the classical inference problem in the context of privacy, as in the case of the so-called association problem. For example, compound personal information about John and Alice can be used to infer a relationship between them. In PIM, these types of mining have special characteristics. For example, in the world of atomic personal information, it is not possible to deduce (real) compound personal information (linkability of proprietors). Suppose that the set of atomic assertions is {John loves someone, Alice loves someone, Robert loves someone, …}; then it is not possible to deduce that John loves Alice or Alice loves Robert, etc. Of course, we can produce pseudo-compound personal information such as John loves someone, and Alice loves someone, but such personal information does not produce new information.

IX. CONCLUSION

This paper proposes a distinct field of inquiry, personal information theory, that is centered on personal information (PI) and related notions such as PI security and PI mining. The theory can also be applied to such fields as law, privacy guidelines, the semantic web, and privacy-enhancing technology.

REFERENCES
[1] Al-Fedaghi, S., "Crossing Privacy, Information, and Ethics," 17th International Conference of the Information Resources Management Association (IRMA 2006), Washington, DC, USA, May 2006.
[2] Al-Fedaghi, S., "How to Calculate the Information Privacy," Proceedings of the Third Annual Conference on Privacy, Security and Trust, St. Andrews, New Brunswick, Canada, October 12-14, 2005.
[3] Acquisti, A. and Grossklags, J., "Losses, Gains, and Hyperbolic Discounting: An Experimental Approach to Information Security Attitudes and Behavior," 2nd Annual Workshop on Economics and Information Security, May 2003.
[4] Bouissac, P., "Information vs. meaning: From ecology as semiotic utopia to evolution as entropy," in C. Dreyer, H. Espe, H. Kalkofen, I. Lempp, P. Pellegrino, and R. Posner (Eds.), Zeichen im Leben der Menschen/Signs within Human Life, Hildesheim: Olms, 1994.
[5] CFR Parts 160, 162, and 164, Health Insurance Reform: Security Standards; Final Rule, Department of Health and Human Services, February 20, 2003. http://www.cms.hhs.gov/hipaa/hipaa2/regulations/security/03-3877.pdf
[6] Dawkins, R., The Selfish Gene, Oxford: Oxford University Press, 1976.
[7] EU Directive 95/46/EC - The Data Protection Directive. http://www.dataprotection.ie/viewdoc.asp?m=&fn=/documents/legal/6aii-2.htm#5
[8] Dufay, G., Felty, A., and Matwin, S., "Privacy-Sensitive Information Flow with JML," Twentieth International Conference on Automated Deduction, Springer-Verlag LNCS, July 2005. http://www.site.uottawa.ca/~stan/papers/2005/p2.pdf
[9] Kang, J., "Information Privacy in Cyberspace Transactions," 50 Stanford Law Review 1193, 1212-20, April 1998.
[10] Spafford, G., "What Is Information Security?" ACM SIGCSE Annual Conference, Norfolk, VA, March 2004.
[11] Stoddart, J., Annual Report to Parliament 2004-2005, Office of the Privacy Commissioner of Canada, 2005. http://www.privcom.gc.ca/information/ar/200405/200405_pa_e.asp
[12] Strandburg, K. J., "Privacy, Rationality, and Temptation: A Theory of Willpower Norms," 57 Rutgers Law Review 1237, 2005.