2010 Second International Symposium on Data, Privacy, and E-Commerce
Ontology-Enhanced Interactive Anonymization in Domain-Driven Data Mining Outsourcing
Brian C.S. Loh and Patrick H.H. Then, School of Engineering, Computing and Science, Swinburne University of Technology, Sarawak Campus, Kuching, Malaysia, {bloh, pthen}@swinburne.edu.my
Abstract—This paper focuses on a domain-driven data mining outsourcing scenario whereby a data owner publishes data to an application service provider who returns mining results. To ensure data privacy against an un-trusted party, anonymization, a widely used technique capable of preserving true attribute values and supporting various data mining algorithms, is required. Several issues emerge when anonymization is applied in a real world outsourcing scenario. The majority of methods have focused on the traditional data mining paradigm; therefore they neither implement domain knowledge nor optimize data for domain-driven usage. Furthermore, existing techniques are mostly non-interactive in nature, providing little control to users while assuming their natural capability of producing Domain Generalization Hierarchies (DGH). Moreover, previous utility metrics have not considered attribute correlations during generalization. To successfully obtain optimal data privacy and actionable patterns in a real world setting, these concerns need to be addressed. This paper proposes an anonymization framework for aiding users in a domain-driven data mining outsourcing scenario. The framework involves several components designed to anonymize data while preserving meaningful or actionable patterns that can be discovered after mining. In contrast with existing works for traditional data mining, this framework integrates domain ontology knowledge during DGH creation to retain value meanings after anonymization. In addition, users can implement constraints based on their mining tasks, thereby controlling how data generalization is performed. Finally, attribute correlations are calculated to ensure preservation of important features. Preliminary experiments show that an ontology-based DGH manages to preserve semantic meaning after attribute generalization. Also, using Chi-Square as a correlation measure can possibly improve attribute selection before generalization.
Keywords: Outsourcing, privacy, domain-driven data mining, data publishing, anonymization

I. INTRODUCTION

Data mining, also known as knowledge discovery in databases (KDD), allows for the extraction of knowledge from various domains including medical, financial, marketing, etc. KDD seeks to discover relationships and global patterns that are present within large databases but may be hidden within vast quantities of data [20]. A typical data mining process involves data collection by a data warehouse, data preprocessing to remove errors, and knowledge discovery through the use of various algorithms. Most applications of KDD focus on discovering data patterns to solve problems related to a specific field. In the medical field, data mining plays an important role by enhancing the quality and efficacy of healthcare. For instance, a hospital performs classification analysis on a subset of patients to determine their probability of having heart disease. Through this procedure, appropriate actions can be taken based on a patient's condition, thus enhancing treatment as well as saving time and costs.

In the real world, data mining is highly constraint-based, as opposed to traditional data mining which is a data-driven trial-and-error process [3], [4], [5], [6]. In traditional data mining, knowledge discovery is usually performed without the aid of domain intelligence, thus affecting model or rule actionability for real business needs. Furthermore, there exist gaps between academic objectives and business goals as well as academic outputs and business expectations. For example, academic researchers rarely consider business environment or needs, and only focus on discovering patterns satisfying expected technical significance. Because of this, discovered rules, although possessing high confidence or support, may lack actionability. Domain-driven data mining aims to bridge these gaps by involving domain experts and knowledge to obtain actionable patterns or models applicable to real world business requirements.

A. Data Mining Outsourcing

Data mining outsourcing involves two parties, a data owner or publisher (hospital, clinic, etc.) who provides input data and an application service provider (ASP) who returns service results (patterns or models) [2], [7]. There exist several beneficial reasons for organizations to adopt outsourcing as their data mining option. They include reduced mining cost, decrease in resource demand, and effective centralized mining [8], [23], [24]. Through outsourcing, an organization utilizes minimal computational resources since mining shall be performed by the service provider. Moreover, assume that an organization owns several hospitals in multiple locations. All patient records can be sent to the service provider who would compute patterns local to individual hospitals, or global for the whole organization. As opposed to in-house data mining, service providers are usually seen as a faster, more cost-effective solution, thus being a favorable choice for data owners.

Although outsourcing offers these advantages compared to in-house mining, there are certain privacy and security concerns. Three main issues encountered in an outsourcing scenario are: the data owner's willingness to share sensitive data, an un-trusted service provider, and laws forbidding the sharing of individually identifiable data [1], [7], [13], [23], [24]. Consider the previous example regarding classification of heart disease, in an outsourcing scenario. Due to privacy concerns or fear of leakage, the hospital may be unwilling to share patient data though it still wishes to investigate the occurrences of heart disease. To promote sharing, data protection would be needed to reduce privacy risks while outsourcing. Additionally, the service provider would most likely be an un-trusted entity who should be denied access to certain private or sensitive information. The assumption here is that trust between data owner and service provider is unattainable, thus the objective remains to protect personal data rather than to create trust. Finally, in certain countries, laws prevent the sharing of identifiable data and removal is required through the use of data protection techniques.
B. Data Privacy

Data owned by a hospital may contain three types of attributes, which can be divided into the following categories: explicit identifiers, which can clearly identify an individual (e.g. name or social security number); quasi-identifiers (QID), whose values when taken together can potentially identify individuals (e.g. zip code, gender or date of birth); and sensitive attributes, which represent private information of individuals (e.g. disease or salary). Guidelines such as the US Health Insurance Portability and Accountability Act (HIPAA) aim to preserve individual privacy by removing protected health information through de-identification. Although this can prevent direct identification of patients from a medical dataset, Sweeney has shown that 87% of US individuals can be uniquely identified based on a set of QID attributes which includes zip code, gender, and date of birth [21].

Privacy threats occur when an adversary links an individual in published data to their record or sensitive attribute. These attacks are referred to as identity linkage, record linkage, and attribute linkage. In all cases, it is assumed that an adversary knows an individual's QID attributes. Furthermore, it is assumed that an attacker is aware of the individual's existence in the released data. In an identity linkage attack, if a record is very specific whereby only few patients match it, an adversary with background knowledge can identify that particular individual. Record linkage is similar, whereby an adversary attempts to identify an individual by linking their QID with externally available information. Attribute linkage happens when sensitive values occur recurrently with a specific set of QID attributes, which allows inferences to be made without exact matches.

C. Anonymization

To prevent inference or individual identification when outsourcing, data can be protected via anonymization techniques. Anonymization allows concealment of patient identities or sensitive data, assuming that this information is required for data analysis [9]. In general, the steps involved in an anonymization process are as follows. First, the data owner groups each attribute in their dataset into explicit, QID or sensitive attributes. Next, an anonymization algorithm and privacy requirement is chosen based on the data mining purpose or potential linkage attacks. The QID attributes would then be generalized according to the chosen algorithm, privacy requirement, and their respective Domain Generalization Hierarchy (DGH). After anonymization, a utility measure is used to determine data quality as compared to the original dataset or result accuracy for a particular mining task.

The majority of anonymization methods have focused on the traditional data mining paradigm, protecting data for general purposes or specific mining tasks such as classification. Although models obtained from these datasets may possess similar accuracy with their original counterparts, they do not necessarily contain actionable rules usable in real world settings. Furthermore, existing techniques provide little user interaction, only allowing the selection of privacy parameters [19], [37]. Users are also assumed to be fully capable of creating DGHs based upon their own knowledge [28], [33]. Lastly, previous studies on utility metrics have not considered attribute correlations during anonymization, thus leading to possible over-generalization of important attributes [29].
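The generic workflow above can be illustrated with a short Python sketch. This is not the framework proposed in this paper: the attribute grouping, the one-level DGH mappings, and the function names (anonymize, is_k_anonymous) are hypothetical, and privacy is checked with a simple k-anonymity count over QID equivalence classes.

from collections import Counter

# Hypothetical attribute grouping for a small patient table.
EXPLICIT = {"name"}                    # removed entirely before release
QID = ["zip", "gender", "age"]         # generalized through DGH mappings
SENSITIVE = {"disease"}                # kept as-is because it is needed for analysis

# Illustrative one-level DGH mappings: original value -> more general value.
DGH = {
    "zip": lambda z: z[:3] + "**",                               # 93010 -> 930**
    "gender": lambda g: "person",                                # suppress gender
    "age": lambda a: f"{(a // 10) * 10}-{(a // 10) * 10 + 9}",   # 42 -> 40-49
}

def anonymize(records, levels):
    """Drop explicit identifiers and generalize each QID whose level is set to 1."""
    released = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in EXPLICIT}
        for attr in QID:
            if levels.get(attr, 0) >= 1:
                row[attr] = DGH[attr](row[attr])
        released.append(row)
    return released

def is_k_anonymous(records, k):
    """Every combination of QID values must occur at least k times."""
    groups = Counter(tuple(r[attr] for attr in QID) for r in records)
    return all(count >= k for count in groups.values())

if __name__ == "__main__":
    data = [
        {"name": "A", "zip": "93010", "gender": "M", "age": 42, "disease": "heart disease"},
        {"name": "B", "zip": "93012", "gender": "F", "age": 47, "disease": "flu"},
        {"name": "C", "zip": "93015", "gender": "M", "age": 44, "disease": "heart disease"},
    ]
    released = anonymize(data, levels={"zip": 1, "gender": 1, "age": 1})
    print(released)
    print("2-anonymous:", is_k_anonymous(released, k=2))

On these three toy records every QID generalizes to a common value, so the released table satisfies 2-anonymity while the sensitive disease attribute remains available for mining.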
D. Contribution

This paper studies the issues of data anonymization for domain-driven data mining in an outsourcing scenario. It proposes an anonymization framework capable of protecting data while maintaining rule "meaningfulness" after mining. The term meaningfulness refers to the semantic meaning and actionability of a particular rule. First, the role of DGHs in the preservation of value meanings is examined. Second, user constraints for interactive generalization are discussed. Lastly, the use of Chi-Square as a measure for attribute correlation is examined.

The paper is organized as follows. Section II reviews existing literature regarding previous anonymization techniques, user constraints and utility metrics. Section III presents the proposed framework for data anonymization in a domain-driven data mining scenario. Section IV examines preliminary experimental results. Section V concludes the paper and discusses future work.

II. RELATED WORK
Anonymization methods including k-anonymity, ℓ-diversity, and t-closeness have been created to preserve privacy through generalization [16], [18], [21]. These techniques were designed to work with static datasets, meaning that whenever new data was published, previous releases of the dataset were not considered during anonymization. In a real world data mining outsourcing scenario, data is constantly changing within a dynamic environment. Therefore, recent dynamic anonymization methods including m-Invariance, Ɛ-inclusion, and m-Distinct
have been created to deal with insertions as well as deletions of new records or values [15], [22], [25]. Both the static and dynamic techniques mentioned work as a "one size fits all" solution, meaning that they support any mining operation though data quality may not be optimal for each task. Several researchers have developed techniques to anonymize data based on specific applications or workloads to ensure the best data utility for a particular task [33], [14], [10], [12], [11]. Although applicable to the traditional data mining paradigm, these methods may produce data unsuitable for domain-driven needs.

Another approach for data anonymization, through the implementation of constraints or preferences, has been the focus of several researchers [35], [27], [36]. These works study the need for user-specified requirements to control the generalization process. Both [35] and [36] suggest the use of attribute constraints, which impose a limit on the level of generalization allowed for a particular attribute. Furthermore, [36] introduces the use of value constraints, which specify allowable generalizations for a chosen attribute value. Reference [27], on the other hand, enabled users to create preferences based on their mining purpose. Depending on the task, each attribute would be associated with priority weights that determined which are to be preserved. In both cases, the user is assumed to be capable of constraining appropriate attributes in order to obtain better results.

Numerous metrics have been proposed for both general and data-specific purposes. General-purpose metrics, which include average equivalence class size [18], discernability metric [38], minimal distortion [17], and information loss [37], aim to measure utility loss caused by generalizations during anonymization. Although useful for capturing similarities between datasets, these measures do not necessarily indicate quality with respect to a particular mining task [14]. For instance, unmodified data containing noise often has worse classification compared to generalized data where noise has been masked [10]. Because of this, general-purpose metrics may indicate reduced utility after generalization when in fact, data mining utility has risen. Data-specific metrics avoid this error by measuring the ability of an anonymized dataset to build accurate models. These existing metrics manage to compare generalization levels or mining accuracy, but even if they indicated high utility, rules obtained from anonymized data may not necessarily be meaningful.

III. FRAMEWORK

The proposed framework for domain-driven data mining in an outsourcing scenario comprises three components: an ontology-based DGH, user-specified constraints, and a correlation-based anonymization algorithm. The main motivation for this framework is to preserve privacy through the k-anonymity model while maintaining utility for domain-driven data mining. To achieve this, domain knowledge is integrated into the DGH creation process to retain attribute meanings after generalization. Additionally, users may specify task constraints before anonymization to ensure less important attributes are generalized first. During the generalization process, attributes are ranked based on the Chi-Square measure to determine correlations.

A. Ontology-based Domain Generalization Hierarchy

Numerous anonymization techniques have been created and most utilize pre-defined DGHs while others implement hierarchy-free models [28], [35]. Pre-defined DGHs are used to describe appropriate mappings between specific and general values of a particular attribute. In previous works, users are assumed to be capable of manually creating hierarchies without the aid of domain knowledge. These user-generated hierarchies, although capable of introducing semantic meanings, may also cause over-generalization with the possibility of reducing data precision [28]. Because of this, domain knowledge plays an important role in improving the semantic meanings of DGHs while preventing utility loss. In data mining, domain knowledge includes comprehension of a dataset, variable relationships, variable ranges, known causal relations, etc. [30]. Domain ontologies represent a promising source of knowledge as they express domain concepts and relationships in a way comprehensible to a particular professional community [31], [32]. One of the world's most comprehensive medical domain ontologies, the Unified Medical Language System (UMLS), is a suitable choice for semantically mapping attributes to ranges or concepts while preserving meaning. Take for instance the commonly found age attribute. Table I describes a UMLS semantic mapping for age with eight concepts and their year ranges.
TABLE I. UMLS SEMANTIC MAPPING

Attribute | Definition          | Concept
Age       | Age [1 – 23 months] | Infant
Age       | Age [2 – 5 years]   | Child, Preschool
Age       | Age [6 – 12 years]  | Child
Age       | Age [13 – 18 years] | Adolescent
Age       | Age [19 – 44 years] | Adult
Age       | Age [45 – 64 years] | Middle Aged
Age       | Age [65 – 79 years] | Elderly
Age       | Age [80 and over]   | Ages, 80 and over
By employing this semantic mapping during hierarchy construction, the DGH for age can be improved. Fig. 1 illustrates a basic DGH which discretizes age into ranges of five years or more, while Fig. 2 displays an ontology-based DGH which discretizes values into appropriate ranges following the UMLS concepts. One difference of the ontology-based DGH is that there are both numerical ranges and categorical concepts to which values can be generalized. The benefit of such an approach can be seen through a simple example where the attribute age, with a value of "42", needs to be generalized. Based on the basic DGH, it can be generalized to either range "40-45" or "1-50". On the other hand, if the ontology-based DGH is applied, "42" can be generalized to either "38-44" or "adult". According to the UMLS semantic mapping for age, an adult is between 19 and 44 years old while a middle aged person is between 45 and 64. The basic DGH fails to capture this semantic information since the range "40-45" can mean either adult or middle aged. In this case, although the basic DGH provides more specific ranges (five years as opposed to seven), the ontology-based DGH retains more meaning.
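The age example can be made concrete with a small Python sketch. The UMLS concepts from Table I are encoded as the ontology-based DGH, whose first level returns a UMLS year range and whose second level returns the concept name, while a basic DGH buckets values into fixed five-year ranges. The exact hierarchy levels and the intermediate ranges of Fig. 2 (such as "38-44") are approximated here by the full UMLS ranges, and all names are illustrative.

# UMLS age concepts from Table I (in years; the 1-23 month infant range is omitted
# because the toy values are whole years).
UMLS_AGE_CONCEPTS = [
    (2, 5, "Child, Preschool"),
    (6, 12, "Child"),
    (13, 18, "Adolescent"),
    (19, 44, "Adult"),
    (45, 64, "Middle Aged"),
    (65, 79, "Elderly"),
    (80, 200, "Ages, 80 and over"),
]

def basic_dgh(age):
    """Basic DGH, first level: fixed five-year ranges such as 40-45 (as in Fig. 1)."""
    low = (age // 5) * 5
    return f"{low}-{low + 5}"

def ontology_dgh(age, level):
    """Ontology-based DGH: level 1 gives the UMLS year range, level 2 the concept name."""
    for low, high, concept in UMLS_AGE_CONCEPTS:
        if low <= age <= high:
            return f"{low}-{high}" if level == 1 else concept
    return "unknown"

if __name__ == "__main__":
    print(basic_dgh(42))              # 40-45: ambiguous between Adult and Middle Aged
    print(ontology_dgh(42, level=1))  # 19-44
    print(ontology_dgh(42, level=2))  # Adult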
Figure 1. Basic Domain Generalization Hierarchy
Figure 2. Ontology-based Domain Generalization Hierarchy

B. User-Specified Constraints

Previous anonymization techniques have mostly restricted users to specifying privacy requirement parameters for a particular dataset. After parameter specification, data generalization would be performed according to a chosen algorithm with the objective of maintaining a balance between privacy and utility. The objective of such a strategy was to obtain anonymous data while conserving as much information as possible based on the algorithm used. The produced anonymized dataset, while sufficiently protected and satisfying the algorithm's utility or information loss metric, may still be unusable for intended mining tasks if important attributes have been over-generalized.

Before outsourcing, a publisher would already have a mining purpose in mind for their dataset. Therefore, it is essential that control be given to the user for determining which attributes are to be generalized first. For instance, picture a scenario where a hospital outsources a dataset with the intention of predicting heart disease occurrences. Several attributes are present in the dataset including age, gender, cholesterol, heart rate, etc. The study aims to discover the relationship of two attributes, "cholesterol" and "heart rate", towards heart disease prevalence. Both attributes are considered important to the task and should be preserved. Previous algorithms which automatically select attributes based on metric scores may choose to generalize both attributes first due to low scores. Hence, there is a need for user-specified constraints, such as attribute weightings, that would be considered by the anonymization algorithm during generalization. By providing such an option, instead of relying solely on privacy-utility tradeoff metrics, a user can interactively constrain the attribute generalization process.
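One simple way to realize such attribute weightings is sketched below; the weights, maximum generalization levels, and function names are assumptions rather than the paper's implementation. The selection step always generalizes the attribute with the lowest task weight that has not yet reached its maximum level, so task-critical attributes such as cholesterol and heart rate are generalized last.

# User-specified priorities: higher weight = more important for the mining task,
# so the attribute should be generalized later (or not at all).
TASK_WEIGHTS = {"age": 0.2, "gender": 0.1, "cholesterol": 0.9, "heart_rate": 0.8}

# Number of levels available in each attribute's DGH (illustrative values).
MAX_LEVEL = {"age": 3, "gender": 1, "cholesterol": 2, "heart_rate": 2}

def next_attribute_to_generalize(current_levels):
    """Pick the lowest-weighted attribute that can still be generalized further."""
    candidates = [a for a, lvl in current_levels.items() if lvl < MAX_LEVEL[a]]
    if not candidates:
        return None
    return min(candidates, key=lambda a: TASK_WEIGHTS[a])

if __name__ == "__main__":
    levels = {attr: 0 for attr in TASK_WEIGHTS}
    # Generalize step by step until, for example, a k-anonymity check elsewhere passes.
    for _ in range(3):
        attr = next_attribute_to_generalize(levels)
        levels[attr] += 1
        print("generalize", attr, "->", levels)

With these weights the loop first exhausts gender, then repeatedly generalizes age, leaving cholesterol and heart rate untouched for as long as possible.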
C. Correlation-based Anonymization Algorithm

Even with user constraints, generalization still depends on the particular anonymization algorithm and the metrics it adheres to. Past works have adopted task-independent and task-dependent metrics which measure information loss or quality based on data applications [29]. Recent algorithms for classification tasks have implemented metrics involving Information Gain or Gain Ratio, which are reminiscent of decision tree construction [14], [10], [12]. These techniques, however, do not consider attribute correlations when determining attribute utility. Referring to the previous scenario, imagine a case where no constraints are specified and the user relies purely on the anonymization algorithm. Metrics such as Information Gain focus on the purity of an attribute instead of correlation, therefore important attributes related to the target may be over-generalized. To overcome this issue, Chi-Square, an information measure which is often used to discover dependencies or relationships between variables, should be applied. By doing so, attributes would be chosen based on their connection with the target, thus possibly improving mining results.
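A minimal sketch of this Chi-Square ranking, assuming pandas and SciPy are available: each candidate attribute is cross-tabulated against the class attribute and scored with the chi-square statistic from a contingency-table test, and the least correlated attribute becomes the next candidate for generalization. The column names and toy records are placeholders.

import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_scores(df, attributes, target):
    """Score each attribute by its chi-square statistic against the target class."""
    scores = {}
    for attr in attributes:
        table = pd.crosstab(df[attr], df[target])   # contingency table
        scores[attr] = chi2_contingency(table)[0]   # first element is the statistic
    return scores

if __name__ == "__main__":
    # Toy records standing in for the QID attributes of a published dataset.
    df = pd.DataFrame({
        "age_group": ["adult", "adult", "middle aged", "elderly", "middle aged", "elderly"],
        "gender": ["M", "F", "M", "F", "M", "F"],
        "disease": ["no", "no", "yes", "yes", "yes", "yes"],
    })
    scores = chi_square_scores(df, ["age_group", "gender"], target="disease")
    # Generalize the attribute least correlated with the target first.
    print(sorted(scores, key=scores.get))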
IV. PRELIMINARY EXPERIMENTS & RESULTS

Preliminary experiments were performed to evaluate the proposed framework. The two main objectives were to determine the advantages of creating an ontology-based DGH and to compare the effectiveness of Chi-Square with previous scoring metrics. We utilized the Top-Down Specialization (TDS) algorithm found in [11] and used the Cleveland Heart Disease dataset obtained from the UC Irvine Machine Learning Repository. Table II describes the attributes contained within the dataset.
TABLE II. CLEVELAND HEART DISEASE DATASET

Attributes | Description            | Values
age        | Age in years           | 29-77
sex        | Sex                    | Male, female
cp         | Chest pain type        | Typical-angina, atypical angina, non-anginal pain, asymptomatic
trestbps   | Resting blood pressure | 94-200
chol       | Serum cholestoral      | 126-564
fbs        | Fasting blood sugar    | >120,
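As a rough, hedged illustration of the second experimental objective (not the actual experiment), the following sketch compares a Chi-Square ranking of the Table II attributes with an Information Gain ranking. It assumes the Cleveland data has been saved locally as cleveland.csv with header names matching Table II plus the usual num diagnosis column; the file name, equal-width binning, and target derivation are assumptions.

import math
import pandas as pd
from scipy.stats import chi2_contingency

def entropy(series):
    """Shannon entropy of a categorical column."""
    return -sum(p * math.log2(p) for p in series.value_counts(normalize=True) if p > 0)

def information_gain(df, attr, target):
    """IG = H(target) - H(target | attr) over discretized columns."""
    conditional = sum(
        (len(group) / len(df)) * entropy(group[target])
        for _, group in df.groupby(attr)
    )
    return entropy(df[target]) - conditional

def chi_square(df, attr, target):
    """Chi-square statistic between an attribute and the target class."""
    return chi2_contingency(pd.crosstab(df[attr], df[target]))[0]

if __name__ == "__main__":
    # Assumed local copy of the Cleveland data with headers matching Table II
    # plus a "num" diagnosis column (0 = no disease, >0 = disease).
    df = pd.read_csv("cleveland.csv")
    df["target"] = (df["num"] > 0).astype(int)
    # Crude equal-width binning so both measures see categorical values.
    for col in ["age", "trestbps", "chol"]:
        df[col] = pd.cut(df[col], bins=5).astype(str)
    attrs = ["age", "sex", "cp", "trestbps", "chol", "fbs"]
    by_chi = sorted(attrs, key=lambda a: chi_square(df, a, "target"), reverse=True)
    by_ig = sorted(attrs, key=lambda a: information_gain(df, a, "target"), reverse=True)
    print("Chi-Square ranking:      ", by_chi)
    print("Information Gain ranking:", by_ig)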