Ann. N.Y. Acad. Sci. ISSN 0077-8923

ANNALS OF THE NEW YORK ACADEMY OF SCIENCES
Issue: Data Science, Learning, and Applications to Biomedical and Health Sciences

An overview of human genetic privacy

Xinghua Shi1 and Xintao Wu2

1Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, North Carolina. 2Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, Arkansas

Address for correspondence: Xinghua Shi, Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223. [email protected]

The study of human genomics is becoming a Big Data science, owing to recent biotechnological advances that have made available millions of personal genome sequences, which can be combined with biometric measurements from mobile apps and fitness trackers and with human behavior data monitored from mobile devices and social media. With increasing research opportunities for integrative genomic studies through data sharing, genetic privacy emerges as a legitimate yet challenging concern that needs to be carefully addressed, not only for individuals but also for their families. In this paper, we present potential genetic privacy risks and relevant ethics and regulations for sharing and protecting human genomics data. We also describe the techniques for protecting human genetic privacy from three broad perspectives: controlled access, differential privacy, and cryptographic solutions.

Keywords: genetic privacy; ethics; regulations; Big Data

Introduction

The biological data deluge, owing to recent advances in biotechnology, has fundamentally transformed biomedical research into data science and health care into Big Data health. In the coming years, millions of individuals are expected to have their genomes sequenced for research, clinical, and/or personal use. The unprecedented availability of personal genomes provides a rich resource for data mining and integration toward further understanding of human biology and translation of research findings into clinical practice. On the one hand, such endeavors call for sharing human genomic data widely and efficiently, which is vital for facilitating data-driven basic biological and translational research. On the other hand, there is a growing concern that sharing and releasing genomic sequences poses risks to genetic privacy, that is, privacy related to the identification of a human individual and/or the disclosure of confidential or sensitive information about his/her personal traits from anonymous genome sequences. Adversarial attacks could potentially be deployed to strategically breach genetic privacy by disclosing the personal identities or private traits of individuals. By nature, the genome encodes a sensitive yet heritable signature of an individual that is marked by

genetic variation that reflects one's ancestry and discloses one's susceptibility to health conditions and diseases. With the ubiquitous availability of DNA information, together with the metadata and digital profiles of individuals (e.g., voting registries, medical records, mobile apps, fitness trackers, and social media), genetic privacy emerges as a legitimate yet challenging concern that needs to be carefully addressed not only for individuals but also for their families. Data privacy has been extensively studied in other domains, such as the Internet, finance, and trading. However, genetic privacy has unique features that require new method development, mainly because of the characteristics of human genomes. For example, human DNA encodes three billion base pairs of four nucleotides, where nearby sequences are highly correlated and form nonrandom linkage disequilibrium blocks, since these bases are inherited together; they thus reflect an individual's ancestry and potentially contribute to variation in human traits, disease susceptibility, and treatment response. In this paper, we provide an overview of genetic privacy from various perspectives.

doi: 10.1111/nyas.13211. © 2016 New York Academy of Sciences. Ann. N.Y. Acad. Sci. 1387 (2017) 61–72.


Figure 1. An overview of the genomic data from private/controlled and public data domains; privacy breaches via identity and trait attacks on genotypes/statistics, DNA sequences, and metadata; and privacy protection through ethics education, laws and regulations, techniques of controlled access, differential privacy, and cryptography solutions.

As illustrated in Figure 1, we observe a rapid accumulation of human genomic data from private and public domains, which highlights the importance of sharing genomic data. We will describe risks to genetic privacy by reviewing previous privacy breaches involving not only genotypes and statistics but also DNA sequences and metadata. We will then delineate strategies for protecting genetic privacy, including ethics and regulations, controlled data access, differential privacy, and cryptographic solutions. We will also present future directions in protecting genetic privacy in Big Data health.

Genomic data sharing and potential risks to genetic privacy

Genomic data sharing
Open data science is called for in genomic studies, where genomic data are freely and publicly shared to promote and translate genomic research.1 With the goal of ensuring broad and responsible sharing of genomic data, the National Institutes of Health (NIH) issued the Genomic Data Sharing Policy,2 effective January 25, 2015, which states that large-scale genomic data of all types generated by NIH-funded research are required to be shared, including genome-wide association study (GWAS), single-nucleotide polymorphism

(SNP) microarray, genome sequence, transcriptomic, epigenomic, and gene expression data.2 Furthermore, the Personal Genome Project (PGP)3 has been initiated as an open repository for willing volunteers to create, aggregate, and share public genome data and corresponding health and trait data for the greater good. The Genes for Good Project4 is another recent project that recruits volunteers from Facebook to build a large database of health and genetic data. With aggregated data from many individuals, this project aims to provide valuable information for studying, treating, and preventing common diseases. In addition, the DNA Land Project5 is a new initiative that collects genetic data and related traits from volunteers whose genomic data have been assayed by third parties (e.g., 23andMe), and openSNP6 offers an opportunity for customers to openly share their genotypes and traits. In order to promote data sharing across international institutions, the Global Alliance for Genomics and Health (GA4GH)7 has recently been established to provide an anonymous-access web service (i.e., a beacon) for genomic researchers, with the underlying genomic data coming from various institutions and studies. This project offers queries only for allele-presence information, and, thus, data holders


are expected to not have to worry about data privacy or security.

Genetic privacy and potential risks
Along with the emergence of Big Data in human genetics and genomics, genetic privacy has recently raised significant interest and concern among both the general public and research communities.8–13 Such concerns call for research and methodology development in this new frontier of human genomics and data privacy. Figure 1 illustrates the key elements in genomic data sharing and genetic privacy studies. In general, genomic data are distributed across various projects and frameworks in private, controlled, and public domains. Techniques have been proposed to breach genetic privacy by mounting identity or trait attacks that aggregate data such as genotypes, GWAS statistics, DNA sequences, and metadata. Hence, privacy protection mechanisms have been deployed through ethics education, laws and regulations, and techniques of controlled access, differential privacy, and cryptographic solutions.

Some privacy breaches result from noncompliance with the terms and conditions of data usage and/or data transfer agreements. These scenarios pose generic challenges of data sharing that are not specific to genetic privacy. One recent accidental disclosure is the exposure of the full names of participants in the PGP3 from filenames in the database. The PGP, an open repository of human genomes and related traits, allowed participants to upload 23andMe genotyping files to public profile webpages, and many users simply kept the default file-naming convention, which uses the first and last names of the user and is thus explicitly identifying. These genetic privacy leaks were quickly fixed by changing the file naming system in the PGP, but they point to potential privacy issues when aggregating genomic data from multiple resources. However, there are many cases where such generic data privacy techniques do not apply or do not suffice for genetic privacy protection. In particular, deidentification has been a widely used method for publishing sensitive data, including genomic and health data. Many organizations, such as biobanks, hospitals, research consortia, and companies, collect and publish deidentified DNA sequence and genotype data. For example, the 1000 Genomes Project (1000GP)14,15


provides the public with free services, including browsing and downloading DNA sequences, SNP genotypes and other types of genetic variation, and population, geographical location, and gender information from anonymous participants in different populations. Deidentified genomic data are typically published with additional metadata, such as basic demographic details, inclusion and exclusion criteria, pedigree structure, and health conditions that are crucial to the study. Nonetheless, these pieces of metadata can be exploited to trace the identity of unknown genomes. For example, the combination of birthdate, gender, and five-digit zip code can uniquely identify 87% of U.S. residents.16 There are extensive public resources, such as voter registries, public record search engines, and social media, that link demographic quasi-identifiers to individuals. One study reported the identification of 30% of PGP3 participants through demographic profiles, including zip codes and birthdates.17 Information hiding, that is, removing or hiding partial data, is another technique for preserving data privacy. However, it is insufficient to hide genetic information at disease risk loci by simply removing the genotypes or sequences at these loci. As one of the first openly accessible personal genomes, the public release of Dr. James Watson's genome sequence18 omitted all gene information about apolipoprotein E (APOE). This decision stemmed from Dr. Watson's wish to prevent prediction of his risk for late-onset Alzheimer's disease conveyed by APOE risk alleles. However, linkage disequilibrium (i.e., nonrandom associations) between other polymorphisms and APOE can be used to predict APOE status using advanced computational methods.19 A recent study20 developed a likelihood-ratio test that uses only the presence or absence of alleles reported by a beacon to predict whether an individual genome is present in the beacon database of the GA4GH project. This study suggested that anonymous-access beacons are susceptible to reidentification attacks and can thus be exploited to infringe genetic privacy. Since the Beacon project includes data with known phenotypes, such as cancer and autism, this reidentification also potentially discloses phenotype information about individuals whose genomes are present in the beacon.
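To make the attack surface concrete, the following is a minimal, self-contained sketch of the kind of allele-presence membership inference described above. It is a toy simulation rather than the likelihood-ratio test of Ref. 20; all genomes, the carrier frequency, and the beacon size are synthetic and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy beacon: 50 genomes, each a boolean vector of rare-variant carrier status.
n_variants, carrier_freq = 200_000, 0.001

def sample_genome():
    # True = carries the rare allele at that site (simulated, hypothetical data).
    return rng.random(n_variants) < carrier_freq

beacon_members = [sample_genome() for _ in range(50)]

def beacon_has_allele(variant_index):
    """Anonymous-access beacon: answers only 'is this allele present in the data set?'"""
    return any(g[variant_index] for g in beacon_members)

def membership_score(genome, max_queries=1000):
    """Fraction of the target's rare alleles that the beacon confirms.
    Beacon members tend to score near 1.0; non-members score much lower."""
    carried = np.flatnonzero(genome)[:max_queries]
    return float(np.mean([beacon_has_allele(i) for i in carried]))

member, non_member = beacon_members[0], sample_genome()
print("member score    :", membership_score(member))      # ~1.0
print("non-member score:", membership_score(non_member))  # ~0.05 in this toy setting
```

Even though each query leaks only one bit, a member's rare alleles are confirmed essentially every time, whereas most of a non-member's rare alleles are absent from the beacon, so a few hundred queries already separate the two groups in this toy setting.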


This study also showed that individuals and their first-degree relatives can be identified in the 1000GP, and the authors demonstrated the reidentification of a single genome from PGP3 participants by querying the existing beacons roughly 1000 times.20 These risks can be mitigated by revised strategies for user queries, such as requiring users to set up an account before querying and limiting the number of sequential and simultaneous queries a user can submit in a given period.

Genetic privacy attacks on GWAS statistics were first demonstrated by the well-known Homer's attack,21 which showed that publicly released GWAS statistics can be used to estimate a GWAS participant's disease status given his/her genotypes at certain risk loci. Thus, deidentification, a common practice in research and clinical settings, is insufficient for protecting genetic privacy and confidentiality. This study motivated the NIH to move genotype and phenotype data from the public domain to controlled access through the database of Genotypes and Phenotypes (dbGaP).22 The European Bioinformatics Institute implements a similar strategy, providing regulation-based data access with controls from a review committee. Follow-up studies23 reported that even reduced sets of released statistics can still be exploited for privacy disclosure. As illustrated in Figure 2, our previous work24–26 further showed that summary statistics in public GWAS catalogs,27,28 even when they are differentially private,26 can be mined together with background information in a Bayesian network framework to mount attacks on the personal traits and identities of both GWAS participants and other individuals. A recent study demonstrated that expression data could be used to predict individuals' genotypes at particular loci.29 Another study illustrated a potential leak of genetic privacy achieved by statistically linking phenotypes to genotypes using information from public quantitative trait loci and other anonymized phenotype datasets.30 These studies raise another genetic privacy concern, as RNA sequencing has become a mature technology for characterizing gene expression levels across many human cell types and tissues. Combined with DNA sequencing and the release of expression quantitative trait loci statistics, open access to gene expression and RNA sequencing data is yet another potential source of privacy disclosure. RNA sequencing data are currently openly accessible, and there is no planned control of data access or guidelines regarding sharing such data.
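The intuition behind Homer's attack described above can be conveyed with a simplified, hypothetical sketch: a target's genotypes are compared against released case-pool allele frequencies and reference population frequencies, and study participants drift measurably toward the case pool. The statistic below is a toy variant of Homer's D, and the cohort, SNP count, and frequencies are all simulated.

```python
import numpy as np

rng = np.random.default_rng(1)
n_snps, n_cases = 20_000, 200

# Reference population allele frequencies and a simulated case cohort (toy data).
pop_freq = rng.uniform(0.05, 0.95, n_snps)
case_genotypes = rng.binomial(2, pop_freq, size=(n_cases, n_snps))  # 0/1/2 alt-allele counts
case_freq = case_genotypes.mean(axis=0) / 2        # the 'released' GWAS summary statistic

def membership_statistic(genotype):
    """Toy variant of Homer's D: positive values mean the person sits closer to the
    released case-pool frequencies than to the reference, suggesting participation."""
    y = genotype / 2
    return float(np.sum(np.abs(y - pop_freq) - np.abs(y - case_freq)))

participant = case_genotypes[0]
outsider = rng.binomial(2, pop_freq, n_snps)
print("participant D:", round(membership_statistic(participant), 2))  # clearly positive
print("outsider    D:", round(membership_statistic(outsider), 2))     # close to zero
```

The separation grows with the number of released SNP statistics and shrinks with cohort size, which is why aggregate GWAS statistics over many markers can still leak membership information.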

The Heidelberg Report31 was recently released as a summary of discussions from the European Science Foundation–sponsored workshop on the use of nonanonymized human genome data in research. The key topics included how to achieve anonymity, trust, and protection of data generated from new genomic technologies and research. Additionally, the report describes several issues raised for future consideration, such as international legal harmonization in relation to genetic data protection, the complex issues surrounding the individual nature of personal genomic data, and the growing commercial value of such data. All of these points center on the utility of, and concerns about, utilizing and propagating such a rich collection of individual-level human data that capture human ancestry, behaviors, and personal life.

Ethics of genetic privacy and current regulations

One important step toward privacy protection is ethics education and the enforcement of pertinent laws and regulations. Over the years, the research community has walked a fine line between genetic privacy and open data access. Technological advances are followed and accompanied by concerns, debates, and controversies on a wide range of topics in ethics, regulations, and laws regarding the protection and preservation of genetic privacy. Although in many cases these regulations and laws have lagged behind, they do have a significant impact on research, education, and the clinical practice of personal genomics. As indicated by recent privacy and identity infringement work, protecting genomic anonymity becomes next to impossible as researchers increasingly combine patient data with many types of data, from social media posts to entries in genealogy databases. Therefore, it is necessary to enhance and broaden educational efforts to train students, researchers, medical practitioners, and the general public about the available resources for genetic privacy protection. Many large human genome projects provide ethics education, such as the Human Genome Project,32 the HapMap Project,33 the 1000 Genomes Project,34 and the Personal Genetics Education Project.35 The scientific community is also developing ethical norms around genetic counseling, prenatal screening, and incidental findings.


Figure 2. An example of a genetic privacy leak through reasoning on a Bayesian network constructed from public genome-wide association study (GWAS) catalog and background knowledge. Such a scheme can be utilized to introduce genetic privacy risks for not only GWAS participants but also for any individuals who are not GWAS participants.

One milestone regulation of health information protection is the Standards for Privacy of Individually Identifiable Health Information (i.e., the Privacy Rule) in the Health Insurance Portability and Accountability Act of 1996 (HIPAA). The Privacy Rule was established to address the use and disclosure of individuals' health information (i.e., protected health information) by covered entities (i.e., organizations subject to the Privacy Rule), and it provides standards for individuals' rights to understand and control the use of their health information. Note that the covered entities of the Privacy Rule include health providers, insurers, data clearinghouses, and their business partners. Health and medical information not originating from these covered entities is not covered by HIPAA. Therefore, newly established commercial sequencing and genetic screening companies (e.g., 23andMe) generate a large amount of health information that includes sensitive and identifiable data but is not regulated by HIPAA. The emerging consumer-generated health information from home paternity tests, fitness trackers, health mobile apps, and social media is likewise not protected by HIPAA. Additionally, the Privacy Rule only protects identifiable health information, and there are no restrictions on the use or disclosure of deidentified health information.

Metadata, such as age, race, and geographical region, can be publicly accessible. As illustrated by the potential risks to genetic privacy discussed above, such supposedly deidentified information and metadata can be used to disclose sensitive information about individuals. Anecdotes and accidents related to sharing genomic data have accompanied the design and modification of regulations on data access ever since genomic data became public. For example, as mentioned above, Homer's attack21 motivated the NIH to move genotype and phenotype data from the public domain to controlled access through dbGaP.22 In addition, the study that inferred the identities of 1000GP individuals by triangulating age, geographical location, and surnames retrieved from Y-chromosome genetic signatures36 prompted leaders of the 1000GP consortium to publish an article reviewing the ethical implications of this study and to move the age information of 1000GP individuals from the public domain to controlled access. As we have seen from the above discussion, protections for participants have been cobbled together in the wake of past controversies and accidents.37 Suggestions have been proposed to revise HIPAA regulations to reflect recent technological and


service advances. It has also been suggested that the human genetics community should revise and redevelop consent forms, the cornerstone of data collection in genomic studies, to reflect the advances of genomic research and the potential genetic privacy risks, and to promote the return of research findings and feedback to individual subjects. Studies have reported that informed consent is, in fact, often not well informed, posing a broad problem in genomic and biomedical studies.37

Techniques for protecting genetic privacy

The technical development of protecting genetic privacy is essentially about finding a good tradeoff between utility and privacy. Different techniques have been proposed to protect genetic privacy, and we summarize them from three broad perspectives: controlled access, differential privacy preservation, and cryptographic solutions.

Controlled access
Similar to the current dbGaP regime,22 a classic model for access control allows users to download data only after approval has been granted and under defined conditions (e.g., users store data in a secure location and will not attempt to identify individuals). An alternative model uses a trust-but-verify approach, in which users cannot download the data directly but, on the basis of their privileges, may execute certain types of queries, which are recorded and audited by the system. Such monitoring has the potential to deter malicious users and to facilitate early detection of adverse events. Recent studies have proposed trustworthy frameworks9 that facilitate trust in genetic research and provide a sustainable solution reconciling genetic privacy with data sharing. Another model of access control allows the original participants to grant access to their data instead of delegating this responsibility to a data access committee. In this participant-based access-control setting, a private-access service manages access rights and mediates the communication between researchers and participants without revealing the identity of the participants. A trusted agent holds the participants' health data, offers stewardship regarding privacy preferences, and grants access to the data on the basis of the participants' decisions.
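A minimal sketch of the trust-but-verify model is given below: the data never leave the server, only whitelisted query types are answered, each user has a query budget, and every request is appended to an audit log. The class and method names are illustrative, not those of any deployed system such as dbGaP.

```python
import time
from collections import defaultdict

class AuditedQueryGateway:
    """Toy 'trust-but-verify' access layer: individual-level data stay on the server,
    only approved query types run, every request is logged for later auditing,
    and each user has a bounded query budget."""

    ALLOWED = {"allele_frequency"}

    def __init__(self, genotypes, max_queries_per_user=100):
        self._genotypes = genotypes            # list of {snp_id: 0/1/2}, server-side only
        self._budget = max_queries_per_user
        self._usage = defaultdict(int)
        self.audit_log = []                    # (timestamp, user, query_type, params)

    def query(self, user, query_type, **params):
        if query_type not in self.ALLOWED:
            raise PermissionError(f"query type {query_type!r} is not permitted")
        if self._usage[user] >= self._budget:
            raise PermissionError("query budget exhausted; request a manual review")
        self._usage[user] += 1
        self.audit_log.append((time.time(), user, query_type, params))
        # Only an aggregate leaves the server, never individual-level records.
        counts = [g.get(params["snp"], 0) for g in self._genotypes]
        return sum(counts) / (2 * len(counts))

# Hypothetical usage with made-up data.
gateway = AuditedQueryGateway([{"rs123": 1}, {"rs123": 2}, {"rs123": 0}], max_queries_per_user=2)
print(gateway.query("alice", "allele_frequency", snp="rs123"))   # 0.5
print(len(gateway.audit_log))                                     # 1
```

The audit log and per-user budget are what enable the "verify" part: misuse can be detected after the fact and privileges revoked, at the cost of the central bottleneck discussed next.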

Controlled access of data for privacy protection relies on a central control of the data and may involve a large amount of paperwork, a long verification and auditing time, and a potential bottleneck for downloading and transferring data.

Differential privacy preservation
Differential privacy38,39 is a paradigm of postprocessing the output of queries such that the inclusion or exclusion of a single individual from the data set makes no statistical difference to the results. Differential privacy is agnostic to the auxiliary information that an adversary may possess and provides a guarantee against arbitrary attacks. Traditional privacy-preserving data mining models, such as k-anonymity,40 l-diversity,41 and t-closeness,42 and their sanitization approaches,43 such as suppression, generalization, randomization, permutation, and synthetic data generation, cannot guarantee rigorous privacy protection, since they cannot completely prevent adversaries from exploiting various types of auxiliary information to breach privacy. Differential privacy focuses on comparing the risk to an individual when she/he is included versus not included in the database, which differs from prior work comparing an adversary's prior and posterior views of an individual.

Definition 1 ($\epsilon$-differential privacy).38 A mechanism $K$ is $\epsilon$-differentially private if, for all databases $x$ and $x'$ differing on at most one element and any subset of outputs $S \subseteq \mathrm{Range}(K)$,
$$\Pr[K(x) \in S] \le \exp(\epsilon) \times \Pr[K(x') \in S].$$

Differential privacy has been widely studied from the theoretical perspective (e.g., see Refs. 44–49). Different types of mechanisms (e.g., the Laplace mechanism,38 smooth sensitivity,50 the exponential mechanism,51 and perturbation of the objective function45) have been studied to enforce differential privacy. There are extensive studies on the applicability of enforcing differential privacy in real-world applications (e.g., collaborative recommendation,52 logistic regression,45,53,54 publishing contingency tables55 or data cubes,56 privacy-preserving integrated queries,57 deep learning,58,59 and computing graph properties such as degree distributions,60 clustering coefficients,61 and spectral properties62 in social network analysis).

Enforcing differential privacy on genomic data is an emerging research area. Work in this area63,64 shows how to preserve differential privacy when


releasing genomic data statistics (e.g., the allele frequencies of cases and controls, chi-square statistics, and P values) and building logistic regression models. A suite of privacy-preserving technologies could be explored and adopted in advanced genomic data analysis, with the goal of preserving utility as much as possible while satisfying the differential privacy requirements. We briefly introduce these technologies below.

One widely used mechanism for achieving differential privacy is the Laplace mechanism, which computes the sum of the true answer and random noise generated from a Laplace distribution. The magnitude of the noise is determined by the sensitivity of the computation ($\Delta f$) and the privacy parameter ($\epsilon$) specified by the data owner. The sensitivity of a computation bounds the possible change in the computation output over any two neighboring databases (differing in at most one record). The privacy parameter controls the amount by which the output distributions induced by two neighboring databases may differ (smaller values enforce a stronger privacy guarantee).

Theorem 1.38 For $f : D \to \mathbb{R}^d$, the mechanism $K_f$ that adds independently generated noise with distribution $\mathrm{Lap}(\Delta f/\epsilon)$ to each of the $d$ output terms satisfies $\epsilon$-differential privacy, where the sensitivity $\Delta f$ is $\Delta f = \max_{x,x'} \|f(x) - f(x')\|_1$ for all $x, x'$ differing in at most one element.

Different noise addition approaches should be developed for releasing summary statistics of genomic data with a very large number of attributes (e.g., SNPs and genes) but a small number of individuals. In a typical genomic study, the number of attributes (e.g., correlations between SNPs) is several orders of magnitude greater than the number of participants. The amount of random perturbation that must be applied to mask the contribution of any single individual scales with the number of outputs and often renders the perturbed results unusable. For example, an SNP data set $x = \{x_1, x_2, \ldots, x_{n_c+n_t}\}$ contains $n_c$ cases and $n_t$ controls, and each SNP profile $x_i$ contains $N$ SNPs. One typical task of GWAS studies is to discover the $K$ SNPs that are most significantly related to the trait under study.
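As a concrete illustration of Theorem 1, the sketch below releases alternate-allele counts for $d$ SNPs under the Laplace mechanism. It assumes that the released vector consists of $d$ counts from the same cohort and that one diploid individual can change each count by at most 2, giving an L1 sensitivity of $2d$; the cohort size, allele frequency, and privacy budget are arbitrary toy values, and this is an illustrative sketch rather than the calibrated procedures of Refs. 63 and 64.

```python
import numpy as np

rng = np.random.default_rng(2)

def dp_allele_counts(genotypes, epsilon):
    """Release alternate-allele counts for d SNPs under epsilon-differential privacy
    via the Laplace mechanism. genotypes: (n_individuals, d) array of 0/1/2 values.
    One individual can change each of the d counts by at most 2, so the L1
    sensitivity of the whole released vector is 2 * d."""
    true_counts = genotypes.sum(axis=0).astype(float)
    d = genotypes.shape[1]
    sensitivity = 2.0 * d
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=d)
    return true_counts + noise

# Toy cohort: 500 individuals, 5 SNPs, alternate-allele frequency 0.3.
genotypes = rng.binomial(2, 0.3, size=(500, 5))
print("true counts   :", genotypes.sum(axis=0))
print("private counts:", np.round(dp_allele_counts(genotypes, epsilon=1.0), 1))
```

With only a handful of released counts the added noise is modest relative to the counts themselves; the difficulty discussed in the text arises when $d$ grows to thousands or millions of SNP-level outputs.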


One naive approach for the differentially private release of the $K$ most significant SNPs based on a given statistic $f$ (e.g., the chi-square statistic) is to add Laplace noise $\mathrm{Lap}(N\Delta f/\epsilon)$ to the true statistic value of each of the $N$ SNPs and then output the $K$ SNPs with the most significant perturbed statistic values. However, this naive approach is infeasible in GWASs because the noise magnitude of $\mathrm{Lap}(N\Delta f/\epsilon)$ is very large owing to the large number of SNPs ($N$) and the potentially large sensitivity value $\Delta f$ for some statistics $f$ (e.g., the odds ratio).

One potential approach is to add noise calibrated to the smooth sensitivity.50 Unlike the Laplace mechanism, where the noise magnitude depends on the global sensitivity of $f$, the smooth sensitivity approach exploits the function $f$'s typical insensitivity to individual inputs and adds instance-specific noise with smaller magnitude than the worst-case noise used in the Laplace mechanism. It introduces a class of smooth upper bounds on the local sensitivity of $f$. Note that the local sensitivity of $f$ at the current database $x$ is defined as $\max_{x'} \|f(x) - f(x')\|_1$ over all $x'$ differing from $x$ in at most one element, and the local sensitivity is usually much smaller than the global sensitivity. The use of smooth sensitivity for releasing GWAS statistics and conducting advanced genetic analysis is promising because the smooth sensitivity framework for a variety of statistical functions can perform well on many instances where the Laplace mechanism based on the global sensitivity yields unacceptably high noise levels.

A divide-and-conquer strategy can also be used to preserve differential privacy when releasing complex summary statistics. In genomic data analysis, test statistics (e.g., chi-square) often have a high sensitivity. The divide-and-conquer strategy can help derive accurate sensitivity values for various test statistics. Generally, each genomic statistic and advanced analytical method can be decomposed into a sequence of unit queries or computational modules. For each unit query or computational module, noise can be introduced into the calculation in order to maintain its differential privacy requirement. For example, if we decompose a computation $f$ into a sequence $f_1, \ldots, f_m$, we can achieve $\epsilon$-differential privacy of $f$ by running each $f_i$ with noise distribution $\mathrm{Lap}(\sum_{i=1}^{m} \Delta f_i/\epsilon)$. One problem with this straightforward adaptation is that it can lead to poor performance, since (unnecessarily) large noise is added. To further improve performance, correlations among unit queries or computational modules can be exploited to determine the (approximately) optimal decomposition so that the sensitivity of the complex computation $f$ can be accurately estimated.
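The straightforward divide-and-conquer composition just described can be sketched as follows: each unit answer receives Laplace noise of scale $(\sum_i \Delta f_i)/\epsilon$, so the per-unit privacy losses sum to $\epsilon$. The example decomposes a case–control allele-frequency difference into two unit counts; the sensitivities and cohort sizes are hypothetical, and, as noted in the text, this naive split can be conservative.

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_decomposed_release(unit_values, unit_sensitivities, epsilon):
    """Divide-and-conquer release: a complex statistic is decomposed into unit
    computations f_1..f_m with sensitivities df_1..df_m. Adding Laplace noise of
    scale (sum_i df_i)/epsilon to every unit answer makes the whole sequence
    epsilon-differentially private by sequential composition."""
    scale = sum(unit_sensitivities) / epsilon
    return [v + rng.laplace(0.0, scale) for v in unit_values]

# Toy example: allele-frequency difference between cases and controls, decomposed
# into two unit counts (case allele count, control allele count), each with
# sensitivity 2 for a diploid individual. This split can be conservative, as the
# text notes, because it ignores how the units actually share individuals.
case_count, control_count = 420.0, 310.0
n_cases = n_controls = 500
noisy_case, noisy_control = dp_decomposed_release(
    [case_count, control_count], [2.0, 2.0], epsilon=1.0)
diff = noisy_case / (2 * n_cases) - noisy_control / (2 * n_controls)
print("private frequency difference:", round(diff, 4))
```

Because the final difference is computed only from the noisy counts, it is obtained by postprocessing and inherits the same privacy guarantee without spending additional budget.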


Another idea is to use query transformation approaches to reduce noise when multiple statistics or computation tasks are present.

Specific differentially private GWAS statistic calculation approaches can also be developed. For example, in one study,44 the authors developed an effective differentially private method for releasing the most significant patterns together with their frequencies in the context of frequent pattern mining. The authors of another study63 adapted this method to GWASs and achieved good performance, as the magnitude of the added noise is proportional to the number of significant SNPs $K$ rather than to the total number of SNPs $N$.

Advanced differential privacy-preserving mechanisms, such as objective function perturbation45,53 and the exponential mechanism,51 should also be investigated for genomic data analysis. Genomic studies often involve continuous data (e.g., continuous phenotypic traits, such as blood lipid levels or height) in addition to categorical SNP genotypes. Various regression models (e.g., linear regression, logistic regression, and lasso models) have been developed and adopted in genomic studies. It has been shown that an individual's participation in a study can be inferred when regression coefficients from quantitative phenotypes are available.65 Hence, it is imperative to develop differential privacy-preserving regression models for genomic data analysis. Most regression methods iteratively optimize some objective function under various constraints. For example, a linear regression model assumes that the response is a linear function of the attributes (i.e., there exists a coefficient vector $w \in \mathbb{R}^d$ and a random residual error $\delta$ such that $y = w^T x + \delta$). Fitting the model then amounts to estimating the $\hat{w}$ that minimizes the objective function $\sum_{i=1}^{n}(y_i - w^T x_i)^2$. However, the classic approach of directly perturbing the output coefficients of the regression algorithm requires an explicit sensitivity analysis, which is often infeasible. A theoretical framework has been proposed for perturbing the objective function and then releasing the derived model parameter $\bar{w}$ that minimizes the perturbed objective function.45

In one study,53 the authors further developed the functional mechanism, which adds noise to the coefficients of the polynomial representation of the objective function and uses a Taylor expansion for those functions whose polynomial forms contain terms with unbounded degree. The functional mechanism removes the requirement of convexity of the objective function imposed by the original objective function perturbation approach.45

The exponential mechanism51 is a differential privacy-preserving mechanism designed for non-real-valued functions (e.g., "Which is the most significant SNP?") or functions whose utility does not depend on the distance from their exact answers. The idea of the exponential mechanism is to apply a score function $q : D \times \mathcal{R} \to \mathbb{R}$ that assigns a value to each possible pair of input database and output. The exponential mechanism $\mathcal{E}_q^{\epsilon}$ has output distribution $\Pr[\mathcal{E}_q^{\epsilon}(D) = r] \propto e^{q(r,D)\epsilon/2}$. For each function, an appropriate score function can be designed so that the correct output is returned with high probability under the exponential mechanism. In a study64 that developed a framework for exploratory GWAS data analysis based on the exponential mechanism, the authors defined a general score function that works for arbitrary output spaces and supports a variety of GWAS exploration tasks (e.g., answering questions such as how many and which SNPs are significant, computing the number and location of SNPs that are significantly associated with a disease, and determining the block structure of correlations).

Despite being a powerful method in other domains, differential privacy techniques are still at an early stage in their application to protecting genetic privacy. One challenge is that human genomic data are usually large in scale and carry considerable noise, making it difficult to develop fast yet effective differential privacy-protection algorithms that strike a balance between utility and privacy in Big Data genomic research.
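A minimal sketch of the exponential mechanism for a selection task such as "Which SNP is most significant?" is shown below. It samples an output with probability proportional to $\exp(\epsilon q / (2\Delta q))$, where $\Delta q$ is the sensitivity of the score; the association scores, the assumed score sensitivity, and the privacy budget are made up for illustration and do not reproduce the score function of Ref. 64.

```python
import numpy as np

rng = np.random.default_rng(4)

def exponential_mechanism(scores, sensitivity, epsilon):
    """Pick one output (here, a SNP index) with probability proportional to
    exp(epsilon * score / (2 * sensitivity)). High-scoring SNPs are selected
    with high probability, yet no single individual's genotype can change the
    selection distribution by more than a bounded factor."""
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(scores), p=probs)

# Toy association scores for 6 SNPs; the score sensitivity is assumed to be 1.0.
assoc_scores = [1.2, 0.4, 9.8, 2.1, 8.9, 0.7]
picks = [exponential_mechanism(assoc_scores, sensitivity=1.0, epsilon=2.0) for _ in range(1000)]
print("selection frequencies:", np.bincount(picks, minlength=6) / 1000)
```

In this toy run the two strongest SNPs absorb nearly all of the probability mass, while weaker SNPs are still occasionally returned, which is exactly the randomness that protects individual contributors.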

Cryptographic solutions
Cryptographic studies have considered the task of outsourcing computation on genetic information to third parties without revealing any genetic information to the service provider. Recent cryptographic work has suggested homomorphic encryption66,67 for secure genetic interpretation. In this method, users send encrypted versions of their genomes to the third party for interpretation. The interpretation service cannot read the plain genotypic values (because it does not have the key) and executes the analytical algorithms on the encrypted genotypes directly.


Figure 3. An outlook on human genetic privacy risks in Big Data genomics. Controlled-access or private data can be combined with a large body of public data about human genetics, activity, and biometric measurements, as well as background information, to infringe genetic privacy.

Because of the special mathematical properties of the homomorphic cryptosystem, the user simply decrypts the results returned by the interpretation service to obtain the analytical results. In the whole process, the user does not expose genotypes or any other sensitive information to the service provider, and interpretation companies can offer their service to users who are concerned about their privacy.

Another technological basis is secure multiparty computation, which allows two or more entities, each holding some private genetic data, to execute a computation on these private inputs without revealing the inputs to each other or disclosing them to a third party. A method was developed67 to identify genetic relatives without compromising privacy by using a cryptographic construct called a secure genome sketch (GS), related to the theory of error-correcting codes, as a representation of an individual's genome segments. The authors then reduced the comparison of individuals' genomes to segment matching of these secure GSs, and the computation of a set distance between two GSs is performed only when their distance is within some threshold. Applications of this method to human genomic data showed that it can identify first-order parent–child relationships in the HapMap data68 and second-order relationships in the 1000GP.14
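The flavor of secure multiparty computation can be conveyed with a toy additive secret-sharing example, in which two sites jointly compute a total allele count without either revealing its private count. This is only an illustration of the underlying idea, not the genome-sketch protocol of Ref. 67 or a full homomorphic encryption scheme; the field size and counts are arbitrary.

```python
import secrets

PRIME = 2_147_483_647  # field size for the shares (toy parameter)

def share(value):
    """Split a private integer into two additive shares modulo PRIME.
    Either share alone is uniformly distributed and reveals nothing about the value."""
    r = secrets.randbelow(PRIME)
    return r, (value - r) % PRIME

def reconstruct(share_a, share_b):
    return (share_a + share_b) % PRIME

# Two institutions each hold a private allele count and want only the total.
count_site1, count_site2 = 137, 205
a1, b1 = share(count_site1)      # site 1 keeps a1 and sends b1 to site 2
a2, b2 = share(count_site2)      # site 2 keeps b2 and sends a2 to site 1

partial_site1 = (a1 + a2) % PRIME     # site 1 sums the shares it holds
partial_site2 = (b1 + b2) % PRIME     # site 2 sums the shares it holds

# Only the two partial sums are exchanged; their reconstruction is the joint count.
print(reconstruct(partial_site1, partial_site2))   # 342, with neither input revealed
```

Real protocols add malicious-security checks, support richer operations than addition, and, as the text notes, pay a substantial computational price for these guarantees.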

In the settings of genetic testing and genomic medicine, there are privacy concerns not only for individuals who search for drugs matching their genetic features, but also for pharmaceutical companies that are reluctant to reveal the specific genetic biomarkers targeted by their drugs. Several studies69–71 developed cryptographic protocols for these queries, assuming that a user holds a copy of his/her genome sequence and a pharmaceutical company holds a biomarker substring to be matched against the user's DNA sequence. The query of whether the user carries the genetic biomarkers, and thus can use the drug, can then be mapped to a problem of private string matching, with the guarantee that the user learns nothing about the biomarker substring from the pharmaceutical company and the pharmaceutical company retains no record of the user's genome. Cryptographic methods for privacy protection introduce a new perspective on minimizing genetic privacy risks using classical methods with sound theoretical foundations. However, many of these algorithms are computationally expensive, making them impractical for Big Data genomic studies.


New scalable methods that consider the characteristics of human genomic data are desirable for providing robust cryptographic solutions for genetic privacy protection.

Conclusions

The field of human genetics and genomics has recently become an active battlefield of data science in the effort to further our understanding of human health and disease. With the availability of unprecedented data about human individuals, there is a challenging opportunity to incorporate such rich data into integrative analyses. Therefore, it is critical to promote open science and data sharing to realize the full potential of Big Data genomics. In this paper, we provide an overview of genomic data sharing, potential genetic privacy risks, and relevant ethics and regulations for protecting human genomics data. We also describe techniques for protecting human genetic privacy from three broad perspectives: controlled access, differential privacy, and cryptographic solutions. As summarized in Figure 3, a whole panel of data, including controlled-access and public data, can be combined with background knowledge to introduce potential genetic privacy risks. Although the data access agreements of most controlled-access databases would prohibit such data linkage, genetic privacy leaks through partial data linkage are possible. Hence, we should respect valid privacy concerns of individuals and institutions, and provide appropriate yet efficient privacy protection mechanisms. With the ubiquity of biometric measurements from mobile apps and fitness trackers, and access to human behavior data monitored from various mobile devices and social media, genetic privacy becomes a serious problem when sharing not only genomic data but also social and mobile data. We foresee that a variety of complementary methods for privacy protection are necessary and should be thoroughly investigated and carefully deployed.

Acknowledgment

The authors acknowledge support from the National Science Foundation (NSF) to X.S. (DGE1523154 and IIS-1502172), and from the NSF (DGE1523115 and IIS-1502273) and the National Institutes of Health (NIH 1R01GM103309) to X.W. The authors are grateful for Jia Wen's help with formatting.

Conflicts of interest

The authors declare no conflicts of interest.

References

1. Vayena, E. & U. Gasser. 2016. Between openness and privacy in genomics. PLoS Med. 13: e1001937.
2. The National Institutes of Health. 2014. The Genomic Data Sharing (GDS) Policy. August 27, 2014. https://gds.nih.gov/.
3. Harvard University. 2016. The Personal Genome Project (PGP). http://www.personalgenomes.org/.
4. The University of Michigan. 2016. Genes for Good. http://genesforgood.sph.umich.edu/.
5. DNA.Land. 2015. https://dna.land/.
6. openSNP. 2016. https://opensnp.org/.
7. Global Alliance for Genomics and Health (GA4GH). 2016. http://genomicsandhealth.org/.
8. Erlich, Y. & A. Narayanan. 2014. Routes for breaching and protecting genetic privacy. Nat. Rev. Genet. 15: 409–421.
9. Erlich, Y., J.B. Williams, D. Glazer, et al. 2014. Redefining genomic privacy: trust and empowerment. PLoS Biol. 12: e1001983.
10. Akgün, M., A.O. Bayrak, B. Ozer, et al. 2015. Privacy preserving processing of genomic data: a survey. J. Biomed. Inform. 56: 103–111.
11. Greenbaum, D. & M. Gerstein. 2008. Genomic anonymity: have we already lost it? Am. J. Bioeth. 8: 71–74.
12. Greenbaum, D. & M. Gerstein. 2009. Social networking and personal genomics: suggestions for optimizing the interaction. Am. J. Bioeth. 9: 15–19.
13. Greenbaum, D., A. Sboner, X.J. Mu & M. Gerstein. 2011. Genomics and privacy: implications of the new reality of closed data for the field. PLoS Comput. Biol. 7: e1002278.
14. The 1000 Genomes Project Consortium, G.R. Abecasis, A. Auton, et al. 2012. An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
15. The 1000 Genomes Project Consortium, A. Auton, L.D. Brooks, et al. 2015. A global reference for human genetic variation. Nature 526: 68–74.
16. Sweeney, L. 2000. Uniqueness of simple demographics in the US population. Technical report. Pittsburgh: Data Privacy Laboratory, School of Computer Science, Carnegie Mellon University.
17. Sweeney, L., A. Abu & J. Winn. 2013. Identifying participants in the Personal Genome Project by name. Available at SSRN 2257732. Cambridge: Data Privacy Lab, IQSS, Harvard University.
18. Wheeler, D.A., M. Srinivasan, M. Egholm, et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452: 872–876.
19. Nyholt, D.R., C.-E. Yu & P.M. Visscher. 2009. On Jim Watson's APOE status: genetic information is hard to hide. Eur. J. Hum. Genet. 17: 147–149.
20. Shringarpure, S.S. & C.D. Bustamante. 2015. Privacy risks from genomic data-sharing beacons. Am. J. Hum. Genet. 97: 631–646.
21. Homer, N., S. Szelinger, M. Redman, et al. 2008. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 4: e1000167.
22. National Center for Biotechnology Information. 2016. The Database of Genotypes and Phenotypes (dbGaP). http://www.ncbi.nlm.nih.gov/gap.
23. Wang, R., Y.F. Li, X. Wang, et al. 2009. Learning your identity and disease from research papers: information leaks in genome wide association study. In Proceedings of the 16th ACM Conference on Computer and Communications Security. 534–544. ACM.
24. Wang, Y., X. Wu & X. Shi. 2013. Using aggregate human genome data for individual identification. In 2013 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 410–415. IEEE.
25. Wang, Y., X. Wu & X. Shi. Infringement of individual privacy via mining GWAS statistics. Technical Report No. DPL-2014-004. University of Arkansas, Fayetteville, AR.
26. Wang, Y., J. Wen, X. Wu, et al. 2016. Infringement of individual privacy via mining differentially private GWAS statistics. In Proceedings of the 2nd International Conference on Big Data Computing and Communication (BIGCOM 2016). Lect. Notes Comput. Sci. 9784: 355–366. Springer.
27. Welter, D., J. MacArthur, J. Morales, et al. 2014. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42: D1001–D1006.
28. National Human Genome Research Institute. 2016. A catalog of published genome-wide association studies. http://www.genome.gov/gwastudies.
29. Schadt, E.E., S. Woo & K. Hao. 2012. Bayesian method to predict individual SNP genotypes from gene expression data. Nat. Genet. 44: 603–608.
30. Harmanci, A. & M. Gerstein. 2016. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat. Methods 13: 251–256.
31. McGonigle, I. & N. Shomron. 2016. Privacy, anonymity and subjectivity in genomic research. Genet. Res. 98: e2.
32. National Human Genome Research Institute. 2015. All about the Human Genome Project (HGP). http://www.genome.gov/10001772.
33. The International HapMap Consortium. 2004. Integrating ethics and science in the International HapMap Project. Nat. Rev. Genet. 5: 467–475.
34. Rodriguez, L.L., L.D. Brooks, J.H. Greenberg, et al. 2013. The complexities of genomic identifiability. Science 339: 275–276.
35. Personal Genetics Education Project (pgEd). 2016. http://www.pged.org/.
36. Gymrek, M., A.L. McGuire, D. Golan, et al. 2013. Identifying personal genomes by surname inference. Science 339: 321–324.
37. Hayden, E.C. 2012. A broken contract. Nature 486: 312–314.
38. Dwork, C., F. McSherry, K. Nissim, et al. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography. Lect. Notes Comput. Sci. 3876: 265–284. Springer.
39. Dwork, C. 2011. A firm foundation for private data analysis. Commun. ACM 54: 86–95.
40. Samarati, P. & L. Sweeney. 1998. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report. SRI International, Menlo Park, CA.
41. Machanavajjhala, A., D. Kifer, J. Gehrke, et al. 2007. l-diversity: privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data (TKDD) 1: 3.
42. Li, N., T. Li & S. Venkatasubramanian. 2007. t-closeness: privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering. 106–115. IEEE.
43. Aggarwal, C.C. & P.S. Yu. 2008. A general survey of privacy-preserving data mining models and algorithms. In Privacy-Preserving Data Mining: Models and Algorithms. C.C. Aggarwal & P.S. Yu, Eds.: 11–52. Boston, MA: Springer.
44. Bhaskar, R., S. Laxman, A. Smith, et al. 2010. Discovering frequent patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 503–512. ACM.
45. Chaudhuri, K. & C. Monteleoni. 2009. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems. 289–296. Neural Information Processing Systems Foundation, Inc.
46. Hay, M., V. Rastogi, G. Miklau, et al. 2010. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow. 3: 1021–1032.
47. Kifer, D. & A. Machanavajjhala. 2011. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. 193–204. ACM.
48. Lee, J. & C. Clifton. 2012. Differential identifiability. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1041–1049. ACM.
49. Ying, X., X. Wu & Y. Wang. 2013. On linear refinement of differential privacy-preserving query answering. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. 353–364. Springer.
50. Nissim, K., S. Raskhodnikova & A. Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing. 75–84. ACM.
51. McSherry, F. & K. Talwar. 2007. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07). 94–103. IEEE.
52. McSherry, F. & I. Mironov. 2009. Differentially private recommender systems: building privacy into the net. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 627–636. ACM.
53. Zhang, J., Z. Zhang, X. Xiao, et al. 2012. Functional mechanism: regression analysis under differential privacy. Proc. VLDB Endow. 5: 1364–1375.
54. Wang, Y., C. Si & X. Wu. 2015. Regression model fitting under differential privacy and model inversion attack. In Proceedings of the 24th International Joint Conference on Artificial Intelligence. 1003–1009. Palo Alto: AAAI Press.
55. Xiao, X., G. Wang & J. Gehrke. 2011. Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng. 23: 1200–1214.
56. Ding, B., M. Winslett, J. Han, et al. 2011. Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. 217–228. ACM.
57. McSherry, F.D. 2009. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. 19–30. ACM.
58. Phan, N., Y. Wang, X. Wu, et al. 2016. Differential privacy preservation for deep auto-encoders: an application of human behavior prediction. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. 12–17. Palo Alto: AAAI Press.
59. Shokri, R. & V. Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 1310–1321. ACM.
60. Hay, M., C. Li, G. Miklau, et al. 2009. Accurate estimation of the degree distribution of private networks. In 2009 9th IEEE International Conference on Data Mining. 169–178. IEEE.
61. Rastogi, V., M. Hay, G. Miklau, et al. 2009. Relationship privacy: output perturbation for queries with joins. In Proceedings of the 28th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. 107–116. ACM.
62. Wang, Y., X. Wu & L. Wu. 2013. Differential privacy preserving spectral graph analysis. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. 329–340. Springer.
63. Fienberg, S.E., A. Slavkovic & C. Uhler. 2011. Privacy preserving GWAS data sharing. In 2011 IEEE 11th International Conference on Data Mining Workshops. 628–635. IEEE.
64. Johnson, A. & V. Shmatikov. 2013. Privacy-preserving data exploration in genome-wide association studies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1079–1087. ACM.
65. Im, H.K., E.R. Gamazon, D.L. Nicolae, et al. 2012. On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy. Am. J. Hum. Genet. 90: 591–598.
66. Djatmiko, M., A. Friedman, R. Boreli, et al. 2014. Secure evaluation protocol for personalized medicine. In Proceedings of the 13th Workshop on Privacy in the Electronic Society. 159–162. ACM.
67. He, D., N.A. Furlotte, F. Hormozdiari, et al. 2014. Identifying genetic relatives without compromising privacy. Genome Res. 24: 664–672.
68. The International HapMap Consortium, D.M. Altshuler, R.A. Gibbs, et al. 2010. Integrating common and rare genetic variation in diverse human populations. Nature 467: 52–58.
69. Baldi, P., R. Baronio, E. De Cristofaro, et al. 2011. Countering GATTACA: efficient and secure testing of fully-sequenced human genomes. In Proceedings of the 18th ACM Conference on Computer and Communications Security. 691–702. ACM.
70. De Cristofaro, E., S. Faber, P. Gasti, et al. 2012. Genodroid: are privacy-preserving genomic tests ready for prime time? In Proceedings of the 2012 ACM Workshop on Privacy in the Electronic Society. 97–108. ACM.
71. De Cristofaro, E., S. Faber & G. Tsudik. 2013. Secure genomic testing with size- and position-hiding private substring matching. In Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society. 107–118. ACM.