
Digital Healthcare Empowering Europeans R. Cornet et al. (Eds.) © 2015 European Federation for Medical Informatics (EFMI). This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License. doi:10.3233/978-1-61499-512-8-424

Blinded Anonymization: a method for evaluating cancer prevention programs under restrictive data protection regulations

Sebastian BARTHOLOMÄUS a,1, Hans Werner HENSE b and Oliver HEIDINGER a

a Epidemiological Cancer Registry of North Rhine-Westphalia, Germany
b Institute of Epidemiology and Social Medicine, University of Münster, Germany

Abstract. Evaluating cancer prevention programs requires collecting and linking data at the case-specific level from multiple sources of the healthcare system. This requires compliance with data protection regulations, which are restrictive in Germany and will likely become stricter across Europe. To facilitate the mortality evaluation of the German mammography screening program, with more than 10 million eligible women, we developed a method that does not require written individual consent and complies with existing privacy regulations. Our setup is composed of different data owners, a data collection center (DCC) and an evaluation center (EC). Each data owner uses dedicated software that preprocesses plain-text personal identifiers (IDAT) and plain-text evaluation data (EDAT) in such a way that only irreversibly encrypted record assignment numbers (RAN) and pre-aggregated, reversibly encrypted EDAT are transmitted to the DCC. The DCC uses the RANs to perform a probabilistic record linkage based on an established and evaluated algorithm. For potentially identifying attributes within the EDAT (‘quasi-identifiers’), we developed a novel process named ‘blinded anonymization’. It allows selecting a specific generalization from the pre-processed and encrypted attribute aggregations to create a new data set with assured k-anonymity, without using any plain-text information. The anonymized data are transferred to the EC, where the EDAT are decrypted and used for evaluation. Our concept was approved by German data protection authorities. We implemented a prototype and tested it with more than 1.5 million simulated records containing realistically distributed IDAT. The core processes worked well with regard to performance parameters. We created different generalizations and calculated the respective suppression rates. We discuss modalities, implications and limitations for large data sets in the cancer registry domain, as well as approaches for further improvements such as l-diversity and automatic computation of ‘optimal’ generalizations.

Keywords. Data Linkage, Data Aggregation, Confidentiality, Data Protection, Program Evaluation, Registries

Introduction

To evaluate the performance and outcomes of routine cancer prevention programs, it is necessary to collect and link data at the case-specific level from multiple sources of the healthcare system. This is a challenging task, especially in Germany, where legislative authorities have strengthened individual rights through restrictive regulations for collecting and processing sensitive data in the context of scientific research. If individual written informed consent is not available, existing regulations in Germany require the use of anonymized or, if that is not feasible, pseudonymized data wherever possible. Deviating from these directives requires explicit justification that outlines the predominant public interest in the research project and provides evidence that the goals cannot be accomplished with anonymized or pseudonymized data. In the future, given the pending General Data Protection Regulation (GDPR) of the European Union [1], research institutions in Europe may generally face more restrictions whenever large administrative databases are to be linked.

In the context of an evaluation study on the impact of the German mammography screening program on breast cancer mortality, commissioned by the Federal Office for Radiation Protection in Germany (BfS), we developed a data flow and processing model that does not require individual consent and that nevertheless complies with existing legal regulations in Germany.

1 Corresponding Author.

S. Bartholomäus et al. / Blinded Anonymization

1. Methods

At its core, our concept is a modification and enhancement of the data linking processes that have been in use at the Epidemiological Cancer Registry of North Rhine-Westphalia (EKR-NRW) since 2005. The EKR-NRW combines client-side preprocessing of reports with a downstream pseudonymization service and a probabilistic record-linkage process that mainly uses encrypted identifiers [2]. The setup includes various data owners (DO), a pseudonymization service (PSS), a data collection center (DCC), an evaluation center (EC) and one or multiple research groups (RG) that use anonymized records provided by the EC (Figure 1).

Figure 1. The overall structure of DO, DCC, EC and RG was defined by the study design.

The goal is for the DCC to link data received from different sources into a common dataset that complies with a predefined degree of anonymity. Our approach is to perform the anonymization without actually knowing the content of the records. We therefore chose k-anonymity [3] as the measure of anonymity: it only requires knowing whether two values of an attribute are equal, which can also be checked on deterministically encrypted data because the same input is always mapped to the same cryptogram. Before sending data, the data owners have to use dedicated software (the reporting tool) that pre-processes the records. The identifying data (IDAT) are split up and normalized; for name attributes this also includes the generation of phonetic codes.
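The equality-preserving property of deterministic encryption can be illustrated with a minimal Python sketch. It uses a keyed hash (HMAC-SHA256) purely as a stand-in for a deterministic mapping; the actual cipher, the key handling and the names `SECRET_KEY`, `deterministic_token` and `smallest_group` are assumptions for illustration and are not taken from the described system.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical shared secret; the paper does not specify key handling or the cipher.
SECRET_KEY = b"demo-key-not-for-production"


def deterministic_token(value: str, key: bytes = SECRET_KEY) -> str:
    """Map a plain-text value to a token; equal inputs always give equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def smallest_group(tokens: list[str]) -> int:
    """Size of the smallest group of identical tokens, i.e. the k that is achieved."""
    return min(Counter(tokens).values())


# Group sizes are computed on tokens only; the plain text is never inspected.
plain = ["Q4.1944", "Q4.1944", "Q3.1950", "Q3.1950", "Q3.1950"]
tokens = [deterministic_token(v) for v in plain]
print(smallest_group(tokens))  # 2
```

Because only equality of tokens is used, the same counting works for a party that holds no key at all.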


The up to 31 attributes derived from the IDAT are individually encrypted into person cryptograms (PCG) using a deterministic one-way function. The evaluation data (EDAT) also contain those IDAT attributes that are necessary for evaluation purposes. In the context of the German mammography screening program these are, for example, the date of birth and the zip code of residence. For each EDAT attribute the software tool creates multiple reasonable levels of aggregation. To comply with the principle of data economy, very low aggregation levels containing very specific information (e.g. a full date of birth) are left out, as are very high aggregation levels whose information is insufficiently precise for the evaluation. Each of the remaining aggregated EDAT (EDATAGG) items is then separately and deterministically encrypted ((EDATAGG)EC) (Figure 2). This is done in such a way that only the EC can decrypt the EDAT in the eventually anonymized data set.

EDATAGG        Date of Birth    ZIP      Date of Diagnosis
Level 0        06.10.1944       66879    15.09.2010
Level 1        10.1944          6687     09.2010
Level 2        Q4.1944          668      Q3.2010
Level 3        H2.1944          66       H2.2010
Level 4        1944             6        2010

(EDATAGG)EC    Date of Birth    ZIP             Date of Diagnosis
Level 1        gtz54D230oi34    f3409gkn439     i2nf23ng43u89
Level 2        54g4w5h5676u     43t89u43gn4     h44u7jsfglkeds
Level 3        32j3fk65ds       r302q94ufk5i

Figure annotations: lower bound: data avoidance; reasonable aggregations; upper bound: evaluability; deterministically encrypted.

Figure 2. The reporting tool generates different aggregation levels and encrypts each value for the EC.
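The aggregation levels shown in Figure 2 can be reproduced with a short Python sketch. The function names and the choice of levels 1 to 3 as the retained range are assumptions derived from the example values in the figure; the actual reporting tool may be configured differently.

```python
from datetime import date


def date_levels(d: date) -> list[str]:
    """Return aggregation levels 0 to 4 for a date, from full date to year."""
    quarter = (d.month - 1) // 3 + 1
    half = 1 if d.month <= 6 else 2
    return [
        d.strftime("%d.%m.%Y"),  # Level 0: full date
        d.strftime("%m.%Y"),     # Level 1: month and year
        f"Q{quarter}.{d.year}",  # Level 2: quarter and year
        f"H{half}.{d.year}",     # Level 3: half-year and year
        str(d.year),             # Level 4: year only
    ]


def zip_levels(zip_code: str) -> list[str]:
    """Return levels 0 to 4 for a five-digit postal code by dropping trailing digits."""
    return [zip_code[: len(zip_code) - i] for i in range(5)]


def reasonable(levels: list[str], lower: int = 1, upper: int = 3) -> list[str]:
    """Keep only the levels between the data-avoidance and evaluability bounds."""
    return levels[lower : upper + 1]


print(reasonable(date_levels(date(1944, 10, 6))))  # ['10.1944', 'Q4.1944', 'H2.1944']
print(reasonable(zip_levels("66879")))             # ['6687', '668', '66']
```

Each retained value would then be encrypted separately for the EC before transmission.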

The PCGs and the (EDATAGG)EC are transmitted to the PSS, which in turn uses a secret key to apply a further encryption to the PCGs of each record, transforming them into record assignment numbers (RAN). This is necessary to protect the PCGs against cryptanalytic attacks such as rainbow tables [4], and to prevent direct communication about specific records between the data holders and the DCC (‘six eyes principle’). The PSS then transmits the RANs and the (EDATAGG)EC to the DCC.

The DCC uses the RANs to perform a probabilistic record linkage based on the Fellegi-Sunter model, which the EKR-NRW has used since 2005. The accuracy of this process has been evaluated by an independent research group: the observed synonym error rate was 0.18% and the homonym error rate was 0.015% [5].

Periodically, the DCC creates a data export to the EC. Before the data are sent, the RANs are replaced by random case numbers and the blinded anonymization process is executed. In the course of this process, the DCC selects a generalization, i.e. the necessary aggregation level for each potential quasi-identifier attribute in the EDAT, such that there are always at least k records that are identical with regard to their quasi-identifiers. As there are always individual records that cannot comply with the predefined degree of anonymity, a certain number of records has to be suppressed in order to achieve a predefined k-value. The suppression rate is an important measure of the suitability of the resulting dataset for evaluation purposes. Finally, only the random case numbers and the selected aggregation levels of the quasi-identifiers are transferred to the EC (Figure 3). The EC decrypts the exported EDAT to obtain an anonymous plain-text dataset, which is stored permanently and can be used by different research groups.
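How a Fellegi-Sunter style linkage can operate on RANs alone may be sketched as follows. Field-wise agreement is decided purely by token equality. The field names, the m- and u-probabilities and the two thresholds are invented for illustration; in practice they would be estimated from data, and the sketch does not reproduce the EKR-NRW implementation.

```python
import math

# Illustrative (m, u) probabilities per field: m = P(agree | same person),
# u = P(agree | different persons). These values are assumptions.
FIELDS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "birth_date": (0.98, 0.001),
}


def match_weight(ran_a: dict, ran_b: dict) -> float:
    """Sum of log2 likelihood ratios over all fields, using token equality only."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if ran_a[field] == ran_b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight: float, lower: float = 0.0, upper: float = 10.0) -> str:
    """Two-threshold decision rule; the middle band goes to manual review."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "manual review"


a = {"last_name": "x1", "first_name": "y1", "birth_date": "z1"}
b = {"last_name": "x1", "first_name": "y1", "birth_date": "z2"}
print(classify(match_weight(a, a)))  # match
print(classify(match_weight(a, b)))  # manual review
```

The middle band between the two thresholds corresponds to record pairs that a person has to inspect.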


(EDATAGG)EC    Date of Birth    ZIP             Date of Diagnosis
Level 1        gtz54D230oi34    f3409gkn439     i2nf23ng43u89
Level 2        54g4w5h5676u     43t89u43gn4     h44u7jsfglkeds
Level 3        32j3fk65ds       r302q94ufk5i

Select minimal levels of aggregation that satisfy a predefined k value.

Selected data that are sent to the EC:

Attribute            Selected level    (EDATAGG)EC
Date of Birth        Level 2           54g4w5h5676u
ZIP                  Level 2           43t89u43gn4
Date of Diagnosis    Level 1           i2nf23ng43u89

Figure 3. Choosing the aggregation levels that satisfy predefined k (and l) values.
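The selection step in Figure 3 can be sketched in Python under simplifying assumptions: each record holds one opaque token per attribute and aggregation level, and a generalization is a mapping from attribute to chosen level. The record layout and the function name `blinded_anonymize` are hypothetical; the sketch only shows that grouping, suppression and export need nothing beyond token equality.

```python
import secrets
from collections import Counter


def blinded_anonymize(records, levels, k):
    """Apply one generalization and enforce k-anonymity on encrypted tokens.

    records: list of dicts mapping attribute -> {level: encrypted token}
    levels:  dict mapping attribute -> chosen aggregation level
    Returns the export rows and the resulting suppression rate.
    """
    attrs = sorted(levels)
    keys = [tuple(rec[a][levels[a]] for a in attrs) for rec in records]
    group_sizes = Counter(keys)  # token equality is all that is needed
    export = [
        {"case_number": secrets.token_hex(8), **dict(zip(attrs, key))}
        for key in keys
        if group_sizes[key] >= k  # records in smaller groups are suppressed
    ]
    rate = 1 - len(export) / len(records) if records else 0.0
    return export, rate


records = [
    {"birth": {1: "t1", 2: "t9"}, "zip": {1: "u1", 2: "u9"}},
    {"birth": {1: "t2", 2: "t9"}, "zip": {1: "u1", 2: "u9"}},
    {"birth": {1: "t3", 2: "t8"}, "zip": {1: "u2", 2: "u7"}},
]
export, rate = blinded_anonymize(records, {"birth": 2, "zip": 1}, k=2)
print(len(export), round(rate, 3))  # 2 0.333
```

The random case numbers replace the RANs, so the exported rows carry no linkage identifier.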

2. Results

Our concept has been approved by German data protection authorities. We have implemented a proof-of-concept prototype containing all key processes and tested it against a simulated data set with more than 1.5 million records. The simulated records contained realistically distributed IDAT for women aged 50 to 69 living in North Rhine-Westphalia (NRW) and some potentially quasi-identifying EDAT attributes, such as the date of diagnosis. We generated different generalizations and analyzed the suppression rates resulting from different k-values. Figure 4 depicts the suppression rates for four different generalizations. With k = 5, an aggregation level of a four-digit postal code and a date of birth as MMYYYY (A) proved barely practical for an evaluation dataset because the suppression rate was above 2.5%. Raising the aggregation level of one of the attributes by one step resulted in a markedly lower suppression rate: e.g., using date of birth by quarter of a year, QYYYY (C), the suppression rate was below 0.1%, and values up to k = 11 would still allow a suppression rate of less than 1.0%.

Figure 4. Suppression rates for combinations of 3/4 digit zips and monthly/quarterly date of birth.

In the actual study we may have to add more attributes to the set of quasi-identifiers, which will lead to higher suppression rates. However, the key advantage of our approach is that all aggregation levels will be available in the DCC. The DCC can thus create whatever generalizations are required (or desired) for the respective research questions and calculate the suppression rates on the basis of empirical data. This allows us to suggest to data protection authorities sets of quasi-identifiers and required k-values such that the final generalizations yield suppression rates compatible with the evaluation purpose; i.e. our approach is able to balance information depth against data protection on demand.
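Because every aggregation level is available at the DCC, candidate generalizations can be compared empirically. The following Python sketch enumerates all level combinations by brute force and keeps those whose suppression rate stays within a limit. The record layout, the function names and the ranking by the sum of levels are assumptions for illustration; this is not one of the optimized search algorithms from the literature.

```python
from collections import Counter
from itertools import product


def suppression_rate(records, levels, k):
    """Share of records that fall into groups smaller than k."""
    attrs = sorted(levels)
    keys = [tuple(rec[a][levels[a]] for a in attrs) for rec in records]
    sizes = Counter(keys)
    return sum(1 for key in keys if sizes[key] < k) / len(records)


def feasible_generalizations(records, available_levels, k, max_rate):
    """List (levels, rate) pairs within max_rate, least aggregated first."""
    attrs = sorted(available_levels)
    feasible = []
    for combo in product(*(available_levels[a] for a in attrs)):
        levels = dict(zip(attrs, combo))
        rate = suppression_rate(records, levels, k)
        if rate <= max_rate:
            feasible.append((levels, rate))
    # Rank by total aggregation (sum of levels), then by suppression rate.
    feasible.sort(key=lambda item: (sum(item[0].values()), item[1]))
    return feasible


records = [
    {"birth": {1: "a1", 2: "A"}, "zip": {1: "z1", 2: "Z"}},
    {"birth": {1: "a2", 2: "A"}, "zip": {1: "z1", 2: "Z"}},
    {"birth": {1: "a3", 2: "B"}, "zip": {1: "z2", 2: "Z"}},
    {"birth": {1: "a4", 2: "B"}, "zip": {1: "z2", 2: "Z"}},
]
options = feasible_generalizations(
    records, {"birth": [1, 2], "zip": [1, 2]}, k=2, max_rate=0.0
)
print(options[0])  # ({'birth': 2, 'zip': 1}, 0.0)
```

The number of combinations grows multiplicatively with each additional quasi-identifier, so exhaustive enumeration is only practical for small attribute sets.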


We are currently developing the actual software suite for the project and are testing it in the model region of NRW. Our reporting tool (SecuNym-RT) has already been deployed at one of our data holders, and we have received over 80,000 simulated records that were derived from real data.

3. Discussion

An important precondition for the usability of our approach is the quality of the primary data. As the main data processing happens at the data holders, all imported data have to be standardized and of high quality and validity. We facilitate this by providing plausibility checks in SecuNym-RT and an integrated editor for implausible reports.

The main technical limitation of our approach is the number of primary data holders. Because all data holders require identical secret keys, their number has to be strictly limited to prevent the keys from being compromised. We are currently examining different encryption schemes and algorithms to deal with this limitation.

Another limitation is the remaining manual processing. Although we aim for a high level of automation, some processes still require manual intervention. Currently the largest workload is caused by the record linkage process, which presently involves manual post-processing of around 5% of all reports. We are working on approaches to reduce the number of records that require manual processing without losing too much quality in the linkage process.

Although k-anonymity is a comparatively weak measure of anonymity [6], we nevertheless employ it because it can also be used with deterministically encrypted data. The same applies to the slightly stronger l-diversity [7], which we plan to add to our approach. So far, aggregation levels have to be configured manually and acceptable suppression rates have to be found by trial and error. We are currently examining algorithms, e.g. Incognito [8], that allow a highly automated search for optimal generalizations.
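That l-diversity, like k-anonymity, needs only equality comparisons can be illustrated with a minimal Python sketch of the simplest variant (distinct l-diversity). The input layout, pairs of quasi-identifier tokens and one sensitive-attribute token, and the function name are assumptions for illustration.

```python
from collections import defaultdict


def groups_violating_l_diversity(records, l):
    """Return quasi-identifier groups with fewer than l distinct sensitive tokens.

    records: list of (quasi_identifier_tokens, sensitive_token) pairs, where all
    values are deterministically encrypted and compared by equality only.
    """
    sensitive_by_group = defaultdict(set)
    for quasi_identifiers, sensitive in records:
        sensitive_by_group[quasi_identifiers].add(sensitive)
    return [qi for qi, values in sensitive_by_group.items() if len(values) < l]


records = [
    (("t9", "u1"), "s1"),
    (("t9", "u1"), "s1"),  # same sensitive token: this group is not 2-diverse
    (("t8", "u2"), "s1"),
    (("t8", "u2"), "s2"),
]
print(groups_violating_l_diversity(records, l=2))  # [('t9', 'u1')]
```

Groups returned by the check would have to be generalized further or suppressed, in the same way as groups that are smaller than k.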

References

[1] European Commission, “Proposal for a Regulation of the European Parliament and of the Council on the protection of individuals with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation),” 2012.
[2] V. Krieg, H. Hans Werner, and L. Marting, “Record Linkage mit kryptographierten Identitätsdaten in einem bevölkerungsbezogenen Krebsregister – Entwicklung, Umsetzung und Fehlerraten,” Gesundheitswesen, no. 63, pp. 376–382, 2001.
[3] L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertain. Fuzziness Knowl.-Based Syst., vol. 10, no. 05, pp. 557–570, 2002.
[4] P. Oechslin, “Making a faster cryptanalytic time-memory trade-off,” in Advances in Cryptology - CRYPTO 2003, Springer, 2003, pp. 617–630.
[5] I. Schmidtmann, G. Hammer, M. Sariyar, and A. Gerhold-Ay, “Evaluation des Krebsregisters NRW Schwerpunkt Record Linkage.” http://www.krebsregister-nrw.de/index.php?id=121
[6] J. Domingo-Ferrer and V. Torra, “A Critique of k-Anonymity and Some of Its Enhancements,” in Availability, Reliability and Security, 2008. ARES 08. Third International Conference on, 2008, pp. 990–993.
[7] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, “l-diversity: Privacy beyond k-anonymity,” ACM Trans Knowl Discov Data, vol. 1, no. 1, Mar. 2007.
[8] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan, “Incognito: efficient full-domain K-anonymity,” in Proceedings of the 2005 ACM SIGMOD international conference on Management of data, New York, NY, USA, 2005, pp. 49–60.
