Procedia Computer Science 63 (2015) 569–574
The 2nd International Workshop on Privacy and Security in HealthCare (PSCare 2015)
A Theoretical Multi-Level Privacy Protection Framework for Biomedical Data Warehouses

Fida K. Dankar* and Rashid Al Ali
Sidra Medical and Research Center, Doha, Qatar
Abstract

A critical element in designing a biomedical research platform is the design of a responsible data-sharing mechanism. Such a mechanism should provide high-utility data in an efficient and timely manner. Existing platforms do not deliver on both timeliness and utility. In this paper, we present a theoretical privacy protection framework. The framework proposes a smart honest broker system that provides data requests with a privacy protection that matches their posed risk. It promises to deliver on both fronts.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Program Chairs.

Keywords: Privacy; de-identification; biomedical data; institutional review boards.
1. Introduction

Biomedical data holds the potential for great knowledge discovery through research and the promise to drive medicine to a more personalized level. Vast biomedical data warehouses are now under construction in various parts of the world to facilitate secondary use of this rich data and to provide proper tools to manipulate and analyse it. A critical element in designing these platforms is the design of efficient and responsible data-sharing mechanisms. The use of biomedical data for research is governed by privacy regulations1,2. These generally mandate the approval of an Institutional Review Board (IRB) for the use of any identifiable data in research. Thus, most existing biomedical platforms lightly de-identify their data warehouse and regulate investigators' access to the de-identified data through an IRB3–5. However, exclusive reliance on the IRB for research regulation has proven to be inefficient6. IRB processes worldwide are often lengthy enough to hinder timely discoveries.
* Corresponding author. Tel.: +974-4404-1958; fax: +974-4404-2158. E-mail address: [email protected]
Generally, when data is de-identified (i.e. when protected health information (PHI) is removed from the dataset), it can be exempt from IRB approval. Thus, to remedy the problem of timeliness, a few platforms opted to employ an IRB-approved Honest Broker System (HBS)7,8. The HBS provides de-identified data to individual investigators: the data is de-identified and then stored in a data warehouse, investigators request subsets of the de-identified data through the HBS, and the requested data is exported to investigators after they sign a data use agreement. Such a schema, while timely, suffers from several shortcomings:
• Investigators may not be able to request all the variables they need through the HBS, as required variables could be omitted during the de-identification process.
• The decision on data requests is binary: the requested data is either granted to the investigator in question or not, and the criteria upon which the decision is based are implicit and subjective.
• All granted data requests are treated equally (with respect to de-identification) without regard to the particular privacy risk posed by each request. In fact, some investigators have a good track record and pose less privacy risk than others, particularly if they belong to a research institution that follows stringent privacy and security policies. On the other hand, different data subsets vary with respect to their capacity for privacy invasion: HIV-related data poses a greater potential for injury than flu-related data.
A sensible way to overcome the problems above is to offer a multi-level privacy protection model through the HBS. The model should provide data requests with a level of protection that matches their posed risks. In other words, the risk posed by a data request should be automatically evaluated and matched with a protection level that meets the calculated risk. In this manuscript, we present the first attempt toward the proposition and formalization of a multi-level privacy protection Honest Broker System. If properly designed, such an HBS has the potential to eliminate reliance on IRBs by formalizing complex scenarios such as the following (a sketch of how such rules could be encoded appears after the list):
a. HIV data can only be accessed by investigators belonging to highly trusted institutions with no history of breaches, provided that subjects consented to research. Such access should be performed from the premises of the institution in question, and the exported data should be PHI-free.
b. Data regarding sexually transmitted diseases can only be accessed by highly trusted research institutions. Such data should never leave the premises of the data holder, and it should be de-identified as per the Safe Harbor standard.
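As an illustration only (our own sketch; the class, field and threshold names below are hypothetical assumptions, not part of the framework specification), such scenarios could be encoded as machine-readable access rules:

```python
# Hypothetical sketch: encoding IRB-style access scenarios as explicit rules.
# All names and trust levels are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AccessRule:
    data_category: str        # e.g. "HIV", "STD"
    min_trust: int            # minimum institution trust level (higher = more trusted)
    requires_consent: bool    # subjects must have opted in to research
    access_mode: str          # "local" or "export"
    deid_level: str           # e.g. "phi_free", "safe_harbor"

RULES = [
    # Scenario (a): HIV data, highly trusted institution, consent required,
    # access from the institution premises, exported data must be PHI-free.
    AccessRule("HIV", min_trust=3, requires_consent=True,
               access_mode="local", deid_level="phi_free"),
    # Scenario (b): STD data, highly trusted institution, data never leaves
    # the data holder, de-identified per the Safe Harbor standard.
    AccessRule("STD", min_trust=3, requires_consent=False,
               access_mode="local", deid_level="safe_harbor"),
]

def evaluate(category: str, institution_trust: int, has_consent: bool):
    """Return the matching rule's protection settings, or None if access is denied."""
    for rule in RULES:
        if (rule.data_category == category
                and institution_trust >= rule.min_trust
                and (has_consent or not rule.requires_consent)):
            return rule.access_mode, rule.deid_level
    return None
```

Encoding the rules in this way makes the access criteria explicit and auditable, which is precisely the subjectivity that the HBS aims to limit.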
To realize our proposed HBS, we need to institute (A) an objective way to evaluate the privacy risk posed by a data request, (B) a de-identification scale with multiple data protection levels, and (C) a set of decision rules to match computed risks with appropriate protection levels:
(A) Inspired by the work of Adams9, we identify three independent factors that affect the privacy risk posed by a data request (referred to as the Risk Score or RS):
• Requestor Risk (RR), or the level of risk posed by the data requestor. This will be defined as the risk posed by the research institutes that employ the investigators rather than the investigators themselves, as we adopt the view that the institution is accountable for its employees.
• Data Sensitivity (DS), or the sensitivity of the information being disclosed, and
• Study Purpose (SP), or the use the recipient will make of the information. Note that if detailed (informed) consent from the data subjects is available and codified, then the purpose of the study could be matched against subjects' consent for conformance. However, in this paper, we assume that consent is one-dimensional, i.e. subjects either consent to research (opt in) or not, and that all studies with "research" as the stated purpose will qualify for data access.
(B) For the multiple data protection levels, we offer (i) multiple data access modes with varying degrees of accessibility, and (ii) a de-identification scale with multiple de-identification levels offering variable degrees of re-identification risk.
(C) The decision rules for matching risk scores with de-identification levels and access modes should be set by the concerned IRBs; nonetheless, we provide recommendations on how to set these rules in the next section.
Our HBS schema presents several advantages: (i) it formalizes the data access framework, makes it more transparent and limits subjectivity; (ii) it provides trusted investigators with accurate data, thereby (potentially) reducing the burden on IRBs; and (iii) more importantly, it will encourage higher privacy and security standards at research institutions. In the next section, we present an overview of our projected platform and of the data access framework, followed by a discussion of future directions.

2. System Description

Our projected data-request process provides investigators with a choice of applying to the HBS, the IRB, or both. An application to the HBS makes sense when de-identified data is satisfactory; otherwise, if more detailed data is required, investigators should apply to the IRB. An application to both allows the investigator to formulate their hypothesis and perform preliminary work while the IRB application runs its course. Fig. 1 depicts our projected data request process:
• Investigators requesting a data set from the HBS are assigned a Risk Score based on two factors: the institution they belong to (RR) and the data they requested (DS). Each risk score will be matched with a default de-identification level and a default data access mode (a sketch of this matching step is given after this overview). Investigators will be granted timely access to data de-identified according to their default level.
• If a higher level of precision is required, the investigator needs to go through the IRB process. The IRB application procedure is an extensive one, possibly involving multiple re-submissions and reviews. The review and decision process is completely governed by the IRB committee.
The final IRB/HBS decision is fed to a data extraction tool along with specifications detailing the data granted as well as the mode of access allowed. The data extraction tool extracts the data from the data warehouse into a data mart. If the IRB/HBS specifications require additional de-identification, then a de-identification tool applies the proper de-identification to the extracted data mart.
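As one possible reading of this matching step (a minimal sketch under assumed thresholds and level names; the actual rules are to be set by the concerned IRB):

```python
# Hypothetical sketch of the HBS default-protection lookup: a normalized risk
# score is mapped to a default de-identification level and access mode.
# Threshold values and level names are illustrative assumptions.
def default_protection(risk_score: float) -> tuple[str, str]:
    """Map a risk score in [0, 1] to (de-identification level, access mode)."""
    if risk_score < 0.33:
        return "limited_dataset", "export"   # low risk: richest data, exportable
    elif risk_score < 0.66:
        return "intermediate", "export"      # medium risk: dates shifted
    else:
        return "safe_harbor", "local"        # high risk: strongest de-identification, local access only
```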
Fig. 1. Data request flowchart process. (The figure shows the investigator submitting either an IRB application, which goes through review, or an HBS data request, which is assigned a risk score. The resulting decision, with its specifications, drives a data extraction tool that builds a local data mart; a de-identification tool is applied to the mart as required, and the data is then either accessed locally through the platform or exported to the investigator premises under a data use agreement.)
In what follows, we give an overview of existing de-identification mechanisms and suggest de-identification scales that are appropriate for our context. The scales will be based on the de-identification mechanisms that have been used for biomedical data sharing. We then present the different data access modes offered by the platform, and we conclude this section with a presentation of how to evaluate the privacy risk posed by a data request.
2.1. De-identification scales

Privacy-preserving mechanisms can be divided into two broad categories: process-driven and data-driven. In process-driven mechanisms, the dataset is held by a trusted server; users query the data through the server, and privacy is built into the algorithms that access the data. Differential privacy is the most popular process-driven privacy model10. It perturbs records in a random but controlled manner. Differential privacy provides strong proofs of privacy, but often leads to low utility, particularly in studies that rely on rare events (such as association studies that look at rare genetic events)11,12. In such cases, it could lead to strange data representations and erroneous associations13. However, because it provides strong proofs of privacy, it can be used with less trusted investigators, i.e. in cases where privacy supersedes utility.

Data-driven approaches suppress or modify the variables that could allow an attacker to determine precisely the owner of a particular record; these are referred to as potentially identifying variables (PIVs). However, isolating such variables in a general context is a very challenging task: it requires knowledge of the variables that potential attackers might use to link a subject record with an identity, i.e. it requires modeling the knowledge and capacity of potential (but unknown) data recipients/attackers. To overcome these challenges, some data-driven mechanisms revert to using a fixed set of PIVs (these consist of the commonly known identifiers and of attributes that have been used in the past to breach privacy). These mechanisms are known as heuristic de-identifications. Heuristic de-identifications are easy to implement as they do not rely heavily on the underlying dataset; however, they do not always offer adequate protection for the data subjects13,14. Nonetheless, all operating biomedical data warehouses use heuristic methods for data sharing3–5,15.

The most popular heuristic de-identifications are the Safe Harbor and the Limited Dataset standards of the HIPAA Privacy Rule2. The Safe Harbor standard stipulates the removal of 18 variables from the dataset. These variables include names, geographical locations of population size smaller than 20,000, dates of higher granularity than years, and all uniquely identifying variables. The Limited Dataset standard stipulates the removal of 16 of the 18 Safe Harbor variables: all direct identifiers are removed, while dates and geographical locations are kept. A third commonly used static de-identification lies between the two standards presented above; we will refer to it as the intermediate standard. It is similar to Safe Harbor, but instead of omitting dates, it shifts them between 1 and 364 days into the past. The shifting value is different across records and identical within the same record; thus, research that requires temporal analyses would be possible using this standard4. A heuristic de-identification scale could be defined by these three standards.
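To make the intermediate standard concrete, the following is a minimal sketch (our own illustration; the field names and helper are assumptions, not a prescribed implementation) of record-level date shifting:

```python
# Hypothetical sketch of the "intermediate standard": all dates in a record are
# shifted into the past by the same random offset (1-364 days); the offset
# differs across records. Field names are illustrative.
import random
from datetime import date, timedelta

def shift_record_dates(record: dict, date_fields: list[str]) -> dict:
    """Return a copy of the record with every date field shifted by one
    per-record random offset of 1 to 364 days into the past."""
    offset = timedelta(days=random.randint(1, 364))
    shifted = dict(record)
    for field in date_fields:
        if isinstance(shifted.get(field), date):
            shifted[field] = shifted[field] - offset
    return shifted

# Example: temporal intervals within a record are preserved after shifting.
patient = {"admission": date(2014, 3, 10), "discharge": date(2014, 3, 15)}
print(shift_record_dates(patient, ["admission", "discharge"]))
```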
Despite the aforementioned challenges, some data-driven methods try to model the background knowledge of potential recipients by introducing assumptions about the kind of information that attackers may possess, and they set the PIVs accordingly; these are known as statistical de-identification mechanisms. Opponents of these methods claim that the assumptions will never be accurate or comprehensive, while proponents argue that their methods perform well in real-life applications. The most commonly used PIVs are demographics16 (commonly known as quasi-identifiers); however, it has been argued that diagnosis codes could also be used by adversaries to identify subjects. In fact, one GWAS study showed that more than 96% of 2,700 patients had a unique combination of diagnosis codes17.

Statistical de-identification modifies the PIVs so as to lower the re-identification risk to an acceptable (and pre-defined) level. k-anonymity is the most popular of these techniques18. It works by suppressing or generalizing PIVs so that at least k records in the database are identical on their PIVs; the aim is to hide every individual in a crowd of k look-alikes19. k-anonymity is not yet implemented in any existing biomedical data warehouse; however, it is widely used in individual cases of data sharing13,18,20–23. Different de-identification levels could be set through the value of k: the higher the k value, the lower the precision of the data (and the probability of re-identification). Therefore, another scale could be based on k-anonymity, with different levels realized by varying the k value. Commonly used values range from 2 to 1516. Additional de-identification scales could be defined using a combination of the presented de-identification mechanisms.
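As an illustrative aside (our own sketch, not a tool from the cited works), checking whether a de-identified table satisfies k-anonymity over a chosen set of PIVs reduces to verifying the smallest group size:

```python
# Hypothetical sketch: check whether a table satisfies k-anonymity on a set of
# potentially identifying variables (PIVs). Records are dicts; names illustrative.
from collections import Counter

def satisfies_k_anonymity(records: list[dict], pivs: list[str], k: int) -> bool:
    """True if every combination of PIV values appears in at least k records."""
    groups = Counter(tuple(r[p] for p in pivs) for r in records)
    return all(count >= k for count in groups.values())

# Example: 2-anonymity on (age group, 3-digit ZIP).
data = [
    {"age": "30-39", "zip3": "200", "dx": "flu"},
    {"age": "30-39", "zip3": "200", "dx": "asthma"},
    {"age": "40-49", "zip3": "201", "dx": "flu"},
]
print(satisfies_k_anonymity(data, ["age", "zip3"], k=2))  # False: one group has size 1
```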
2.2. Access modes

Two main modes of data access will be offered: an export mode and a local access mode. In the export mode, investigators receive a copy of the requested data, and it thus becomes completely their responsibility. In the local access mode, the generated data mart is stored locally in the platform. Investigators must access the data through the platform, and such access will be regulated by the HBS. The local access scheme will enable access audits and enforce restrictions on data sharing; as such, it can be used with less trusted institutions. Several variations of the local access mode could be offered as well; these would restrict the time and location of access.
2.3. Risk scores

In what follows, we discuss the quantification of the risk posed by a data request along two dimensions: the institution and the data.

The security and privacy practices at the investigator's institution are the major factors that shape the RR level. El Emam et al.20 introduced a checklist that can be used to evaluate the extent to which such practices have been implemented. The checklist contains important but straightforward questions about the privacy and security controls implemented at the investigator's institution. It inquires whether security and privacy policies and protocols are followed at the institution, whether the institution employs a privacy officer to enforce the policies, whether it offers mandatory training regarding the handling/sharing of private information, whether surprise audits from an independent third party are allowed, and whether the data will be placed in secure locations. The checklist also inquires about the geographical location of the institution, to make sure that similar privacy laws are enforceable. Completion of the checklist is a straightforward task. The total number of unchecked items is one way of quantifying the institution's risk; a more robust quantification would assign varying weights to the different checklist items. Such weights would be best set by the concerned IRB; however, items related to the institution's location and third-party audits should bear high weights. The checklist should be filled in by a representative from the institution. Through independent third-party audits, the HBS can establish the truthfulness of the responses, and over time it will build up profiles for the different institutions.

The sensitivity of the requested data is a critical risk factor. Patients' data tend to have variable sensitivity rates; for example, HIV and mental disorders are regarded as highly sensitive attributes24. Assigning sensitivities to the different data attributes is a subjective exercise, because these sensitivities vary between individuals, cultures and situations. As such, they should be created by experts on social perceptions or by gathering such social perceptions through other means, such as surveys25. Once gathered, these perceptions should be applied to the different attributes through a rating that specifies the sensitivity level. The overall sensitivity of a requested data subset would be a function of the individual sensitivities of the requested data items. Having sensitivity ratings for the different fields within the dataset facilitates the risk calculation, eliminates subjectivity and delivers on the timeliness requirement (a sketch of one such calculation is given below).
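To illustrate, a minimal sketch under assumed weights and scales (the actual weights are to be set by the concerned IRB, and the checklist item names merely paraphrase the controls discussed above; the combination rule is our own assumption):

```python
# Hypothetical sketch of a risk score combining institution risk (RR), derived
# from a weighted privacy/security checklist, with data sensitivity (DS).
# Weights, item names and the combination rule are illustrative assumptions.

CHECKLIST_WEIGHTS = {
    "policies_in_place": 1.0,
    "privacy_officer": 1.0,
    "mandatory_training": 1.0,
    "third_party_audits_allowed": 2.0,   # high weight, as suggested in the text
    "secure_storage": 1.0,
    "comparable_privacy_laws": 2.0,      # institution's jurisdiction
}

def requestor_risk(checked_items: set[str]) -> float:
    """RR in [0, 1]: weighted fraction of checklist items NOT satisfied."""
    total = sum(CHECKLIST_WEIGHTS.values())
    missing = sum(w for item, w in CHECKLIST_WEIGHTS.items() if item not in checked_items)
    return missing / total

def data_sensitivity(requested_fields: list[str], ratings: dict[str, float]) -> float:
    """DS in [0, 1]: here taken as the maximum sensitivity of the requested fields."""
    return max(ratings.get(f, 0.0) for f in requested_fields)

def risk_score(rr: float, ds: float) -> float:
    """RS in [0, 1]: a simple average of the two dimensions (one possible choice)."""
    return 0.5 * rr + 0.5 * ds

# Example with illustrative sensitivity ratings.
ratings = {"diagnosis_hiv": 1.0, "diagnosis_flu": 0.2, "age": 0.3}
rr = requestor_risk({"policies_in_place", "privacy_officer", "secure_storage"})
ds = data_sensitivity(["age", "diagnosis_flu"], ratings)
print(risk_score(rr, ds))
```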
3. Discussion and Future Work

Contrary to common belief, de-identification does not completely eliminate re-identification risk. More importantly, the residual re-identification risk of a de-identified dataset cannot be quantified precisely. In fact, it is not possible to measure the exact likelihood of the data falling into malicious hands (however small it is), nor the capacity of these malicious hands to perform a privacy breach. Therefore, it is important to understand the risks posed by data requests from all possible dimensions and to use additional protection methods alongside de-identification (such as data access modes).

Currently, to realize our model, we are tackling the following issues: (i) assigning sensitivities to the different data attributes, (ii) extending the current risk score definition to account for the risk posed by the individuals requesting the data (the investigators), in terms of their motivation and ability to re-identify the data, and (iii) collaborating with the IRB at Sidra Medical and Research Center to define an initial set of broad risk classes and to match each class with a level of privacy protection.

In the near future, we hope to extend the HBS system to provide different forms of genetic data. For that, we are trying to identify the different types of genetic data that can be generated and to study the re-identification power of each form of this data. The purpose is to build a genomic privacy risk meter that would sort genetic data based on its identifying power. Investigators could then be provided with the genomic data type that fits their risk score.
Acknowledgements

We would like to thank Dr. Radja Badji for her useful and insightful comments on earlier versions of this paper.

References

1. Federal Policy for the Protection of Human Subjects ('Common Rule').
2. Sweeney, L. Data sharing under HIPAA: 12 years later. In Workshop on the HIPAA Privacy Rule's De-Identification Standard (2010).
3. McCarty, C. A. et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. (2011).
4. Roden, D. M. et al. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin. Pharmacol. Ther. 84, 362–369 (2008).
5. Murphy, S. N. et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J. Am. Med. Inform. Assoc. 17, 124–130 (2010).
6. Graham, D. G., Spano, M. S. & Manning, B. The IRB challenge for practice-based research: strategies of the American Academy of Family Physicians National Research Network (AAFP NRN). J. Am. Board Fam. Med. 20, 181–187 (2007).
7. Dhir, R. et al. A multi-disciplinary approach to honest broker services for tissue banks and clinical data: a pragmatic and practical model. Cancer 113, 1705–1715 (2008).
8. Erdal, B. S. et al. A database de-identification framework to enable direct queries on medical data for secondary use. Methods Inf. Med. 51, 229 (2012).
9. Adams, A. The implications of users' multimedia privacy perceptions on communication and information privacy policies. In Proceedings of the Telecommunications Policy Research Conference (1999).
10. Dwork, C. Differential privacy: a survey of results. In Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, 1–19 (Springer-Verlag, 2008).
11. Dankar, F. K. & El Emam, K. Practicing differential privacy in health care: a review. Trans. Data Priv. 6, 35–67 (2013).
12. Dankar, F. K. & El Emam, K. The application of differential privacy to health data. 158–166 (2012).
13. Heatherly, R. D. et al. Enabling genomic-phenomic association discovery without sacrificing anonymity. PLoS ONE 8, e53875 (2013).
14. Benitez, K. & Malin, B. Evaluating re-identification risks with respect to the HIPAA privacy rule. J. Am. Med. Inform. Assoc. 17, 169–177 (2010).
15. Pulley, J., Clayton, E., Bernard, G. R., Roden, D. M. & Masys, D. R. Principles of human subjects protections applied in an opt-out, de-identified biobank. Clin. Transl. Sci. 3, 42–48 (2010).
16. El Emam, K. & Dankar, F. K. Protecting privacy using k-anonymity. J. Am. Med. Inform. Assoc. 15, 627–637 (2008).
17. Loukides, G., Liagouris, J., Gkoulalas-Divanis, A. & Terrovitis, M. Disassociation for electronic health record privacy. J. Biomed. Inform. 50, 46–61 (2014).
18. Sweeney, L. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10, 557–570 (2002).
19. Gelfand, A. Privacy and biomedical research: building a trust infrastructure.
20. El Emam, K., Dankar, F. K., Vaillancourt, R., Roffey, T. & Lysyk, M. Evaluating the risk of re-identification of patients from hospital prescription records. Can. J. Hosp. Pharm. 62, 307 (2009).
21. Dankar, F. K., El Emam, K., Neisa, A. & Roffey, T. Estimating the re-identification risk of clinical data sets. BMC Med. Inform. Decis. Mak. 12, 66 (2012).
22. El Emam, K., Dankar, F. K., Neisa, A. & Jonker, E. Evaluating the risk of patient re-identification from adverse drug event reports. BMC Med. Inform. Decis. Mak. 13, 114 (2013).
23. Malin, B., Benitez, K. & Masys, D. Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA Privacy Rule. J. Am. Med. Inform. Assoc. 18, 3–10 (2011).
24. Government of Canada, Treasury Board of Canada. Guidance Document: Taking Privacy into Account Before Making Contracting Decisions (2010).
25. Fule, P. & Roddick, J. F. Detecting privacy and ethical sensitivity in data mining results. In Proceedings of the 27th Australasian Conference on Computer Science, Volume 26, 159–166 (Australian Computer Society, Inc., 2004).