Methodology for chaining sensitive data while preserving anonymity: an application to the monitoring of medical information

Catherine Quantin*, Béatrice Gouyon*, François-André Allaert**, Olivier Cohen***

Record linkage allows the compiling of same-person records from various source files, and can thus improve the feasibility of epidemiological research such as population-based studies. Compliance with European legislation on data privacy and data security is, however, essential. Our article describes one way to achieve this: a computerised record hash coding and linkage procedure to chain medical information for the purpose of epidemiological monitoring, developed by the DIM¹ of the CHU² at Dijon. Before their extraction, files are rendered anonymous, using a one-way hash coding based on the Secure Hash Algorithm (SHA) function. Once the patient information has been anonymised using the ANONYMAT software, it can be linked by means of a mixture model that takes several identification variables into account. Applications of this anonymous record linkage procedure were carried out at the national and regional levels. The applications illustrate how the use of the ANONYMAT program makes it possible to respect data confidentiality legislation without impeding data availability.
Aside from the uses imposed by the Social Security and State services (computerised treatment forms, the Medicalisation of Information Systems Programme (“Programme de Médicalisation des Systèmes d’Information”: PMSI)), it is possible to foresee other uses and the circulation of information for doctors, for example, as part of the healthcare networks. Nevertheless, the compiling of same-patient medical information by linking the various existing files must comply with French and European legislation protecting individual freedoms with regard to the automated processing of personal data. We will show that respecting the legislation leads to the following paradox: it is possible to link the various parts of the same patient’s file without access to the patient’s identity. We will see how encryption techniques, such as the anonymity and chaining procedure developed by the Medical Information Department (“Département d’information médicale”: DIM) of the University Medical Centre (“Centre hospitalo-universitaire”: CHU) at Dijon, provide a solution to this paradox.
Legislation on the security of nominative information

The European directive of October 24, 1995, on the protection of individuals with regard to the processing of personal data and the free circulation of this data, replaces the concept of nominative information of the law of January 6, 1978 on information technology, data files and civil liberties with that of “personal data”, i.e. “any information concerning an identified or identifiable individual”. “Individuals are considered to be identifiable when they can be identified directly or indirectly, especially by reference to an identification number or to one or more elements specific to their physical, physiological, psychic, economic, cultural or social identity.” This extremely comprehensive definition makes it possible for any database to be considered indirectly nominative [Quantin et al., 1999]. This European directive has just been transposed into French law, leading to the law of August 6, 2004 on the protection of individuals with regard to the processing of personal data and modifying law no. 78-17 of January 6, 1978. All these considerations result in the idea
Courrier des statistiques, English series no.12, 2006
that a large amount of information involves nominative or personal data, even if the name is not shown and no table of correspondence between the undisguised identity and the substitution alphanumeric codes exists. At the statistical level, the risk of identifying an individual from apparently anonymous information is far from negligible, because of the possibility of cross-matching with a wide variety of existing or future files. Who would have imagined, not so long ago, that the Social Security identifier could be processed automatically by the tax authorities [law no. 98-1266, 1998]? Nevertheless, this concern, that the data subjects might be identified during the evaluation or
* The Medical Informatics and BioStatistics Office (“Service de Biostatistique et Informatique Médicale”), (Prof. QUANTIN), Dijon CHU, BP 1542, 21034 Dijon cedex.
** Chairman of TC/251/WGIII, the European Standardisation Centre (“Centre Européen de Normalisation”), CEN BIOTECH, BP 53077, 21030 Dijon cedex.
*** TIMC – IMAG Laboratory UMR 5525, CNRS, Joseph Fourier University, Grenoble.
1 Medical Information Department (“Département d’information médicale”).
2 University Medical Centre (“Centre hospitalo-universitaire”).
analysis of practices and activities of treatment and preventive care, is fully covered in article 41 of law no. 99-641 of July 27, 1999 concerning the creation of comprehensive healthcare coverage, modifying article 40-10 of the law of January 6, 1978.
Anonymity: the boom in encryption methods

Statistical methods that provide anonymity by perturbing the data [Sweeney, 1998; Willenborg et al., 1995; Quantin et al., 2000a] make it more difficult to identify people but, as a result, degrade the quality of the information. It therefore seems preferable to use encryption techniques to ensure the security of the information. These techniques, which make it possible to protect information by means of a secret code, generally rely on mathematical problems that are very difficult to solve without this code. These methods have existed for almost as long as the statistical methods for providing anonymity, but until recently their use was restricted by law for national defence reasons. Authorisations to use these methods were not, therefore, easy to obtain from the Central Service for Information System Security (“Service central de la sécurité des systèmes d’informations”: SCSSI), which is under the direct control
of the Prime Minister. The field of cryptography has been liberalised fairly recently, first of all in 1998 [decree no. 98-101, 1998; decree no. 98-206, 1998; decree no. 98-207, 1998] in the form of a simplification of the SCSSI declaration procedure. This relaxation in favour of users, who are generally not specialists in this field, continued in 1999 [decree no. 99-199, 1999], with the weight of the legislation shifted onto cryptology professionals. In particular, the use of high-security 128-bit keys has been made possible (until then, the limit was 40 bits), this change having become essential to satisfy electronic signature recognition requirements and to facilitate commercial transactions within the context of the Internet. This liberalisation has removed the obstacle to using encryption techniques to ensure the confidentiality of medical information that is directly or indirectly nominative and intended to be circulated over computer networks. Indeed, while the French National Commission on Information Technology and Civil Liberties (“Commission nationale de l’informatique et des libertés”: CNIL) accepts keys of only 40 bits for encrypting indirectly nominative information, it requires keys at least 56 bits long for directly nominative information.

To secure medical information circulating over a network, encryption methods can be used at three levels (figure 1). The first concerns respecting the confidentiality of the information during its transmission. According to the definition given by the European Standardisation Centre (“Centre Européen de Normalisation”)⁴, confidentiality [Fisher and Madge, 1996] is ensured when only duly authorised users have access to the information.
Encoding or encryption, the basis of confidentiality

Encoding a message means applying a transformation function to it that makes it unreadable to anyone other than the intended recipient. This function is applied by an encryption algorithm [Douglas, 1996; Beckett, 1990; Brassard, 1993]. A key is used, so as to individualise the encryption (figure 1). If we take the example of an exchange of information between a health-care institution and a medical practice, the hospital doctor can be sure that only the general practitioner for whom the message is meant will be able to access it because, as the legitimate recipient, he/she will be the only one who knows the decryption key. Assuming that the algorithm is public and that confidentiality is ensured only by the user’s key, this key must be difficult for even an experienced cryptanalyst to discover. A good encryption algorithm is one for which, without the key, the reverse calculation (corresponding to the decryption of the message) is believed to be possible only by exhaustive enumeration of the key values. An encryption algorithm is said to be a symmetric or a secret key algorithm when a single key is used both for encryption and decryption. For example, this is true for the Data Encryption Standard (DES) algorithm
adopted as the official standard by the American government in 1977. The problem with using this type of algorithm is that the sender and the recipient must share the encryption key. In contrast, asymmetric algorithms (or public key algorithms), which have been developed since 1976, are based on the use of two keys: the first is public, and everyone can use it to send an encrypted message to a given recipient; the second is private and is known only to the recipient, and it alone enables the message to be decoded. This procedure gets rid of the problem of sending a key. In effect, only the legitimate recipient, the holder of the private key, has the means of decrypting the message. The best-known public key algorithm is the RSA algorithm [Rivest et al., 1978; Zimmermann, 1986], whose security is based on the hypothesis that factorising a large number into its prime factors is a long and difficult process.
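The public/private key mechanism described above can be illustrated with a minimal sketch of textbook RSA. The primes, the public exponent and the message value are assumptions chosen for readability, not values from the article; keys this small offer no security, and real systems rely on much longer keys, a padding scheme and a vetted library.

```python
# Toy textbook RSA, for illustration only (hypothetical, tiny parameters).
p, q = 61, 53                # two small primes, assumed for the example
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, chosen coprime with phi
d = pow(e, -1, phi)          # private exponent: inverse of e modulo phi (Python 3.8+)


def encrypt(message: int) -> int:
    """Anyone may encrypt with the public key (e, n)."""
    return pow(message, e, n)


def decrypt(ciphertext: int) -> int:
    """Only the holder of the private exponent d can decrypt."""
    return pow(ciphertext, d, n)


message = 65                 # a message encoded as an integer smaller than n
ciphertext = encrypt(message)
recovered = decrypt(ciphertext)
print(ciphertext, recovered)
```

Recovering `d` from the public pair `(e, n)` requires knowing `p` and `q`, which is why the difficulty of factorising `n` underpins the security of the scheme.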
Digital signature and integrity check

The second level concerns the use of digital signature methods to enable the recipient doctor to authenticate the doctor sending the message. In the example we have just given, this means that the general practitioner could make sure that the message has really been sent by the hospital doctor indicated. The digital signature has been recognised as having legal value by French law no. 2000-230 of March 13, 2000 on adapting
[Figure 1: Encryption, decryption and cryptanalysis. Diagram labels: unencrypted text, encoding, encrypted text, cryptanalyst.]
the law of proof to information technologies and with regard to the digital signature. This mechanism comprises two procedures: the signing of a data unit and the verification of the said signature. The signing of a message is based on a key characteristic of the sending body. It is essential that the signatory is the only person able to produce the signature and that the verification procedure cannot be used to reproduce it. Generally, public key algorithms such as RSA are used. Use of the digital signature also makes it possible to guarantee the message’s integrity, i.e. to be sure that the message has not been altered while it was being sent. In figure 2, you can see that the sender has created a fixed-size imprint of the message, which is itself of variable size, by a hashing technique [Marsault, 1995].³ The use of hashing functions is fairly new in the world of modern cryptology. They have been developed above all to make secure digital signature techniques possible. Hashing functions are said to be one-way if the calculation of their inverse is considered to be unfeasible with current technology in a “reasonable” length of time. The hashing function transforms an unencrypted text of arbitrary length into a fixed-length hashing value⁴, often called the imprint. Among the many hashing functions proposed by cryptologists, the function that is considered to be the most secure is the Secure Hash Algorithm (SHA), recognised as the American standard by the National Institute of Standards and Technology (NIST). This hashing function is integrated into the DSA (Digital Signature Algorithm), which
3 The European Community’s “Quality and Security” Working Group III. 4 This may seem surprising: how can an initial text, whatever its length, once transformed, be a fixed length? This is due to the fact that the product of the hashing is a “compressed” text, and that this compression is such that it produces a result of a size that is independent of the size of the original text.
[Figure 2: Digital signature (confidentiality + authentication + integrity). Diagram labels: sender, recipient, Alice, Bob, public and private keys, secret relationship, message, hashing H, hashed message (imprint), encoded imprint, encoded message, public channel, comparison (“Same?”).]
NIST proposed in 1991. Initially, the message to be hashed is padded so that its size becomes a multiple of 512 bits. Each 512-bit block is then divided into 16 32-bit sub-blocks, themselves expanded into 80 32-bit words on which 80 operations are performed. The result of the SHA algorithm is an imprint, i.e. a fixed-size 160-bit message. The imprint is therefore specific to the message. In particular, a slight change to the message leads to a radically different imprint. The sender sends the unencrypted message and the encoded imprint at the same time. To make sure of the message’s origin and integrity, the recipient first recalculates the imprint of the message with the same hashing algorithm used by the sender, then compares the imprint obtained with the imprint that he/she has decoded. In this way, the recipient can make sure that the sender is really the signatory of the message received, because the sender is the only person who knows the secret key used to encode the imprint, and the corresponding public key is the only one that makes the decoding possible.
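The properties of the imprint described above (fixed 160-bit size, sensitivity to any change, recompute-and-compare integrity check) can be sketched with Python's standard `hashlib` module, whose `sha1` function implements the 160-bit SHA variant. The sample messages and function names are assumptions made for this illustration, and the step in which the imprint is encoded with the sender's private key is deliberately left out. SHA-1 is used here only because the article describes it; it is no longer regarded as collision-resistant.

```python
import hashlib


def imprint(message: str) -> str:
    """Return the SHA-1 imprint of a message as a hexadecimal string."""
    return hashlib.sha1(message.encode("utf-8")).hexdigest()


# Hypothetical messages differing by a single character.
original = "Patient discharged on day 3 with no complications."
altered = "Patient discharged on day 8 with no complications."

h1 = imprint(original)
h2 = imprint(altered)

# Each hexadecimal digit carries 4 bits, so 40 digits give 160 bits,
# whatever the length of the input message.
print(len(h1) * 4, len(imprint("x" * 10_000)) * 4)

# A one-character change in the message gives an unrelated imprint.
print(h1)
print(h2)


def integrity_ok(received_message: str, decoded_imprint: str) -> bool:
    """Recipient side: recompute the imprint and compare it with the one received."""
    return imprint(received_message) == decoded_imprint


print(integrity_ok(original, h1), integrity_ok(altered, h1))
```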
Using hashing techniques to ensure the anonymity of personal information

The third level of the use of encryption techniques concerns the collation of medical information in an organisation external to the treatment centre. Indeed, the problem of chaining nominative medical information to implement multi-centre epidemiological studies arises more and more frequently, for example as part of cooperative studies between local health-care facilities (doctors’ surgeries) and hospital health-care facilities. According to CNIL’s recommendations [Vuillet-Tavernier, 2000], it is preferable to use encryption techniques that guarantee an irreversible transformation of the data. After having tried to improve existing methods such as the one proposed by Thirion et al. [1988], in 1995 we proposed to CNIL the use of one-way hashing methods to provide this anonymity. In effect, unlike encryption methods that must be reversible in order that the legitimate recipient can decode the message, one-way hashing
techniques are irreversible. The result of the hashing operation is a code that is completely anonymous (it does not let you get back to the patient’s identity) but always the same for a given individual, so that data for the same patient can be collated. In agreement with the SCSSI, we have chosen the SHA algorithm which, to our knowledge, is the public domain hashing algorithm that is the safest when faced with decoding attempts. The procedure was declared to CNIL and the SCSSI in March 1996. At that time, although the legislation concerning encryption functions was very strict, the use of hashing functions came under the regime of the simple declaration. In fact, insofar as these functions are irreversible, they cannot be used by secret organisations seeking to exchange information outside of government control. Nevertheless, even though it is irreversible, the hashing operation does not guarantee the information’s complete security. As the algorithm is public, the hashing could be applied to a large number of identities. A given individual’s codes from the hashed file could be matched against the codes obtained and thus his/her identity could be found. This is known as a dictionary attack. To guard against this type of attack, we have decided not to use a single key, but instead to use a table of keys, so that the change introduced varies from one identity to another. In our study, the choice of key varied according to the identity to be hashed (depending on the characters contained in the identity and their position). In addition, we have proposed a double hashing operation. If, for example, you wanted to collate files coming from various sources, each sender of a file would use a first key table, called K1. This K1 “key”, used when identities are hashed at each information collection centre, protects the information from people who do not know the code and who are therefore not part of the study. 
Nevertheless, as all the centres taking part in the study have to use the same key, it is essential to ensure the security of the centralised information, even with regard to the collection centres that have the K1 key. The information received by the processing centre performing the collation of the files is therefore hashed again, by the same hashing algorithm but with a second key table, K2. After the identity data has been hashed twice, first at the collection centre level and then at the processing centre level, the anonymity of the files is definitively preserved.
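The dictionary attack and the two-level keyed hashing described above can be sketched as follows. This is not the ANONYMAT implementation: the article's key tables (where the key depends on the characters of the identity and their position) are not specified in enough detail to reproduce, so HMAC-SHA1 from the standard library is used as a stand-in for keyed hashing. The key values, the identity format and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secrets standing in for the article's key tables K1 and K2.
K1 = b"key-shared-by-all-collection-centres"
K2 = b"key-known-only-to-the-processing-centre"


def keyed_hash(text: str, key: bytes) -> str:
    """Keyed one-way hash (HMAC-SHA1 used here as an assumed stand-in)."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha1).hexdigest()


def normalise(last: str, first: str, birth_date: str) -> str:
    """Assumed normalisation so that the same person always yields the same string."""
    return f"{last.upper()}|{first.upper()}|{birth_date}"


def collection_centre_code(last: str, first: str, birth_date: str) -> str:
    """First hashing, performed locally with K1 before any transmission."""
    return keyed_hash(normalise(last, first, birth_date), K1)


def processing_centre_code(first_level_code: str) -> str:
    """Second hashing, performed centrally with K2 on the code received."""
    return keyed_hash(first_level_code, K2)


# The same patient seen in two centres receives the same final code...
code_a = processing_centre_code(collection_centre_code("Dupont", "Marie", "1970-05-12"))
code_b = processing_centre_code(collection_centre_code("DUPONT", "marie", "1970-05-12"))
# ...while a different patient receives a different one.
code_c = processing_centre_code(collection_centre_code("Dupond", "Marie", "1970-05-12"))

# Dictionary attack: hash candidate identities with the public, unkeyed algorithm.
candidates = [normalise("Dupont", "Marie", "1970-05-12"),
              normalise("Martin", "Paul", "1965-01-30")]
dictionary = {hashlib.sha1(c.encode("utf-8")).hexdigest(): c for c in candidates}

unkeyed_code = hashlib.sha1(candidates[0].encode("utf-8")).hexdigest()
print(dictionary.get(unkeyed_code))   # the unkeyed code is re-identified
print(dictionary.get(code_a))         # the double keyed code is not in the dictionary
```

A collection centre that knows K1 can still compute first-level codes, which is why the second hashing with K2 is performed out of its reach.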
A procedure to provide both the anonymity and the chaining of medical information

The anonymity and chaining procedure developed at the Dijon CHU DIM is a two-stage process. The first step involves the irreversible transformation (described above) of the identification variables [Quantin et al., 1998] (last name, first name, date of birth, sex, etc.) to obtain a completely anonymous code, which forms the reference for chaining. The second step is the collation of the files [Quantin et al., 2004; Quantin et al., 2005] to chain the information for the same person. The purpose of the chaining is to cross-match double-hashed files coming from different sources, to link the observations that refer to the same individual. Two types of errors [Brenner et al., 1997] may occur during the chaining process. The first corresponds to two observations concerning two separate individuals being chained and forms a “homonym” error: for example, if information concerning two individuals, called Dupond and Dupont respectively, is wrongly linked, because of an error in entering their identities. The second type of error corresponds to two observations about the same individual not being chained, and forms a “synonym” error: for example, when one record uses a woman’s maiden name and another uses the same woman’s married name. These errors may be due either to errors in the collecting of the identity data, or to the hashing
method itself. In particular, homonym errors may result from the existence of collisions during the hashing operation, i.e. the same code being obtained from the hashing of two different identities. In the case of the SHA algorithm selected for the hashing procedure, it turns out that the collision rate is very low (in the order of $10^{-48}$) and that the corresponding risk of homonym error, equal to this number, is therefore insignificant [Bouzelat, 1998, p. 97]. In order to reduce the impact of identity entry errors on the chaining, orthographic processing has been integrated into the anonymity procedure. The “AUTO-MATCH” chaining method proposed by Jaro [1995], widely used in the United States [Sugarman et al., 1996], has been adapted. It takes a number of identification variables into account at the same time: the last name, first name, maiden name, date of birth, sex and home post code. Of course, none of these variables alone identifies an individual uniquely, and we are faced with the known problem of the informational value of a symbol. Each variable is therefore weighted depending on the amount of information it provides. For example, we assign a greater value to the information provided by the date of birth than to that provided by the sex (since the probability that two individuals will have the same date of birth is a lot smaller than the probability that they will have the same sex). To determine whether two observations should be chained, a method of statistical analysis is applied that takes into account weighting coefficients for each variable used [Quantin et al., 2000b]. Let us look at the set of $n_A \times n_B$ pairs of records resulting from the systematic cross-matching of the files A (of size $n_A$) and B (of size $n_B$) to be chained. We can define a partition of the Cartesian product A × B into two sets M (for matched) and U (for unmatched). The set M contains all the pairs of records said to be concordant, i.e. where the two records correspond to the same individual. 
The set U contains all the pairs that remain, said to be non-concordant. Thus the
chaining procedure for the records consists of classing the various record pairs as belonging to M or U. If record pair j is concordant for identification variable i, i.e., for example, the names on the two records of the pair are identical, then the weight for this variable is given by formula (1):

$$w_{i,j} = \log\left(\frac{m_i}{u_i}\right) \qquad (1)$$

where the parameters $m_i$ and $u_i$ respectively represent the probability that two records corresponding to the same individual match on this variable (a probability known as “sensitivity”) and the probability that two records corresponding to two different individuals match on this variable (a probability whose complement to 1 is known as “specificity”) for the variable i considered. The weight assigned to this variable will therefore be greater when $m_i$ is close to 1 and $u_i$ is close to 0. If, on the other hand, pair j is non-concordant for variable i, i.e., for example, the names on the two records of the pair are different, then the weight is given by formula (2):

$$w_{i,j} = \log\left(\frac{1 - m_i}{1 - u_i}\right) \qquad (2)$$

For each variable, the “concordance” indicator, a dichotomous qualitative variable (1 where there is concordance between the two records, 0 where there is none), follows a Bernoulli law of parameter $m_i$ in set M and of parameter $u_i$ in set U. Fitting a mixture model that combines these two distributions to the collected data thus makes it possible to estimate the parameters $m_i$ and $u_i$ needed to calculate the weighting coefficients for each variable used. The decision to be taken in order to classify a pair of records depends on all the identification variables. Thus, an overall weight, known as the composite weight, equal to the sum of the weights corresponding to the different variables, is assigned to each pair of records. For each variable, the weight is positive where there is concordance between the two records (formula (1)) and negative where there is none (formula (2)).
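As a worked illustration of the agreement and disagreement weights and of the composite weight, here is a minimal sketch. The values of m and u are hypothetical, chosen only to show the arithmetic (in the procedure described they are estimated by the mixture model), and the base-2 logarithm is an assumption, since the article does not state the base; another base only rescales the weights. In the anonymised setting the fields compared would be hashed codes rather than clear identities.

```python
from math import log2

# Hypothetical (m_i, u_i) pairs for each identification variable.
PARAMS = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_date": (0.98, 0.003),
    "sex": (0.99, 0.50),
}


def variable_weight(agrees: bool, m: float, u: float) -> float:
    """Formula (1) when the pair agrees on the variable, formula (2) otherwise."""
    return log2(m / u) if agrees else log2((1 - m) / (1 - u))


def composite_weight(record_a: dict, record_b: dict) -> float:
    """Sum of the per-variable weights for one pair of records."""
    return sum(
        variable_weight(record_a[name] == record_b[name], m, u)
        for name, (m, u) in PARAMS.items()
    )


# Hypothetical records; the strings stand for anonymised codes.
rec_1 = {"last_name": "a91f", "first_name": "77c0", "birth_date": "e3b2", "sex": "F"}
rec_2 = {"last_name": "a91f", "first_name": "77c0", "birth_date": "e3b2", "sex": "F"}
rec_3 = {"last_name": "5d10", "first_name": "0c4e", "birth_date": "9aa7", "sex": "M"}

print(round(composite_weight(rec_1, rec_2), 2))   # all variables agree: positive
print(round(composite_weight(rec_1, rec_3), 2))   # no variable agrees: negative
```

With these illustrative values, agreement on the date of birth contributes far more than agreement on sex, which mirrors the weighting argument made in the text.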
After the distribution function of these weights is calculated, for set M and for set U, a pair of records is classed [Jaro, 1995]:
• to be chained if its composite weight exceeds threshold no. 2 (the weighting value for which the distribution function conditional on M is equal to 2.5%⁵);
• not to be chained if its composite weight is below threshold no. 1 (the weighting value for which the distribution function conditional on U is equal to 97.5%, the complement to 1 of 2.5%);
• in an undecided state if the composite weight lies between the two threshold values. This situation calls for manual checking of the data for non-concordant chaining data (cf. application to the perinatal network).

In practice, this checking can even be carried out on the data that has been made anonymous, since each centre providing the information can rightfully preserve the correspondence between the anonymity number and the patient’s identity. The study’s co-ordinator may therefore ask for data corresponding to a particular anonymity number to be checked or corrected. The centre providing the information then retransmits all the corrected records, after a new anonymity procedure.
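The three-way decision rule above can be sketched by enumerating every agreement pattern, computing its composite weight and its probability under M and under U, and reading the two thresholds off the cumulative distributions. The values of m and u are hypothetical, and the sketch assumes that the variables agree independently of one another within M and within U, an assumption that the article does not state explicitly. Pairs whose weight equals a threshold are treated as undecided here.

```python
from itertools import product
from math import log2

# Hypothetical parameters for three identification variables.
M = [0.97, 0.95, 0.97]   # P(agreement | same individual)
U = [0.01, 0.02, 0.10]   # P(agreement | different individuals)


def pattern_weight(pattern, m, u):
    """Composite weight of one agreement pattern (tuple of booleans)."""
    return sum(
        log2(mi / ui) if agrees else log2((1 - mi) / (1 - ui))
        for agrees, mi, ui in zip(pattern, m, u)
    )


def pattern_probability(pattern, probs):
    """Probability of the pattern, assuming independent variables."""
    result = 1.0
    for agrees, p in zip(pattern, probs):
        result *= p if agrees else 1 - p
    return result


# One row per pattern: (weight, probability under M, probability under U),
# sorted by increasing weight.
rows = sorted(
    (pattern_weight(p, M, U), pattern_probability(p, M), pattern_probability(p, U))
    for p in product([True, False], repeat=len(M))
)


def quantile(rows, column, q):
    """Smallest weight at which the cumulative probability reaches q."""
    cumulative = 0.0
    for row in rows:
        cumulative += row[column]
        if cumulative >= q:
            return row[0]
    return rows[-1][0]


threshold_2 = quantile(rows, 1, 0.025)   # distribution conditional on M at 2.5%
threshold_1 = quantile(rows, 2, 0.975)   # distribution conditional on U at 97.5%


def classify(weight):
    if weight > threshold_2:
        return "chain"
    if weight < threshold_1:
        return "do not chain"
    return "undecided"


for p in product([True, False], repeat=len(M)):
    print(p, round(pattern_weight(p, M, U), 2), classify(pattern_weight(p, M, U)))
```

With continuous or finer-grained weights the same logic applies; only the way the quantiles are estimated changes.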
Applications in the field of medicine

Creating regional or inter-regional databases

In the light of the prospects that hashing techniques offer, many players in medical research have been busy creating databases on specific themes. Only those developed in partnership with the Dijon CHU DIM are cited here. They concern the
5 The value 2.5% selected here corresponds to the normal confidence interval, but another value is possible to give more or less precision.
Loire department (study of the active inter-file of people with cancer), the Bourgogne region (Perinatal Network), the Bourgogne and Franche-Comté regions (hepatitis C network, monitoring suicide attempts, ESPOIR network for chronic kidney failure) or several regions simultaneously (the HC Forum platform). The main applications are detailed below.

• Study of the active inter-file of people with cancer for three hospital structures within the framework of regional planning in the Rhône-Alpes region

Following approval of the first General Regional Health Plan (“Schéma régional d’organisation sanitaire”: SROS) in 1994 [Abrial, 1998], the main hospital establishments in health-care district no. 6 of the Rhône-Alpes region, the Centre hospitalier régional et universitaire de Saint-Étienne (CHRUSE) and the Union départementale de la mutualité de la Loire (UDML) set up an inter-hospital union called the Institut de cancérologie de la Loire (ICL) to provide co-ordinated cancer treatment in this health-care district. In a letter dated 06/11/97 [Regional Hospitalisation Agency (“Agence régionale de l’hospitalisation”: ARH), 1997] the director of the ARH requested these establishments to “support the setting up of the Institut de cancérologie de la Loire” by research into “the drawing up of the cancerology active file” for each establishment and by studying the “active inter-file” between these establishments. It was therefore agreed to base this on the PMSI (the Medicalisation of Information Systems Programme or “Programme de Médicalisation des Systèmes d’Information”, see below), which, while it did not form a complete cancerology file, made it possible to determine how many people with cancer were treated by each establishment. But because of the anonymisation constraints imposed by CNIL on the PMSI data, there was the problem of counting the patients whose care was shared between these establishments. Since 1998, the Public Health and Medical
Information Department (“Service de santé publique et de l’information médicale”) (Prof. Rodrigues) at CHRUSE has applied the Anonymat software to the last names, first names and dates of birth of each of the records in the three databases output by the PMSI (as of the 1996 data) for the three establishments concerned, in order to make them anonymous and still be able to chain them so as to identify the patients in common between the different establishments [Quantin et al., 2000b]. In addition, the comparison between the number of Anonymat numbers obtained in each database and the number of patients calculated by the administration made it possible to estimate the duplication rate for each of the administrative databases.

• Developing a regional collection of perinatal indicators in the Bourgogne region

A perinatal network has been progressively developed in the Bourgogne region since 1992 [Gouyon et al., 1999]. This network includes the 18 public and private establishments providing pregnancy and neo-natal care in the region. A continuous regional collection of 42 indicators was implemented in 1998 on a voluntary basis for all births covered by the establishments of the Bourgogne region (approximately 18,000 births annually). The information is extracted from the PMSI, in the form of Medical Unit Reports (“résumés d’unité médicale”: RUMs). Indicators that do not exist in the PMSI, such as gestational age or psychosocial risk factors, are covered by an additional extraction from a record linked to the RUM, forming an “expanded RUM”. The chaining of the “expanded RUMs” at two different levels is essential to the processing of the medical data. Firstly, it must be possible for the “expanded RUMs” for the same person, mother or new-born baby, to be linked when there are successive hospitalisations in several units, even when this involves different establishments. Secondly, the mother’s “expanded RUMs” must be linked to her children’s,
even if they are hospitalised in a different establishment, in order to evaluate the postnatal impact of the pregnancy pathologies and risk factors. However, in compliance with the legislation, the files are only sent to the Dijon CHU DIM for use after they have been made anonymous. The hashing procedure is applied with the dual key system (described above), making it possible to ensure strict anonymity, i.e. to definitively break any link between the person and the data concerning them. The chaining of anonymous data has therefore been made possible by using the Anonymat software (authorised by CNIL in 1998) for six variables: the mother’s maiden name, first name and date of birth, the child’s first name and date of birth, and the mother’s home post code [Cornet et al., 2001]. These six items of information are entered in an identical way in the “expanded RUMs” of the mother and baby. In the case of multiple births, the first names of all the newborn babies have to be entered in the mother’s “expanded RUM”. These six nominative variables are used by the chaining program, after they have been made anonymous. To date, the 18 establishments carry out the collection of indicators routinely, representing all births in the Bourgogne region. Before transmission, the files
are checked in each establishment by comparing them against the department records (maternity and paediatric departments). In addition, the completeness and quality of the data collected for chaining is checked systematically in each establishment and in a centralised way by the co-ordinating team (Dr. Gouyon) processing the data at the CHU DIM (mother-child chaining tests, identification of non-chained records, error correction). Mother-child chaining is obtained for 86.3% of newborn babies before validation, and for 99.9% of newborn babies after all the manual and computerised error correction procedures.

• Monitoring genetic diseases: the HC Forum platform

Professor Cohen has implemented an application that can be accessed via the Internet by authorised researchers and genetics doctors, intended to bring the medical information for the same patient together. In this application, the doctor suggests to a patient being monitored for a genetic disease that he/she take part in this project, which will make it possible, by using a process of family chaining containing anonymised personal identifiers, to understand the evolution of his/her disease and that of his/her family better. For this, the patient is asked to provide his/her last name, first name and date of birth as well as those of his/her father and mother, after making sure that they consent, in order to be able to automatically define a numeric identifier with a family element, created jointly with the Dijon CHU DIM for this application and made anonymous by the Anonymat software. This anonymisation takes place locally, before transmission, so that only data already made anonymous is sent to HC Forum, the central platform [Cohen et al., 2001]. With the patient’s agreement, this identifier will then be transmitted by a secure system to the HC Forum platform with the related medical data. It will then undergo a second anonymisation, which makes the database completely non-identifying.
Courrier des statistiques, English series no.12, 2006
The application of the chaining procedure makes it possible to reconstruct the genealogical tree from a “vertical” point of view, i.e. an individual’s ancestors and descendants. This chaining also makes it possible to construct the genealogical tree from a “horizontal” point of view, i.e. within a single generation. When a patient is added, the chaining makes it possible to detect the presence of identical individuals in different families, by browsing the individuals already present in the HC Forum central database. The patient may thus benefit from a medical monitoring file that doctors caring for this patient can access, whatever centre the patient attends, a file that will be regularly updated with individual and family information. The whole procedure is subject to very strict security constraints in order to guarantee the confidentiality of patient and parent data. Clearly, the data cannot be used for a purpose other than that for which it was collected. In accordance with the provisions of articles 27 and 40 of the law of January 6, 1978 on Information Technology, Files and Civil Liberties, the patients, and their parents as well, have a right to access, correct or delete their data, which they may exercise through a genetics doctor who is a member of the HC Forum database. They may also end their participation at any time. They are informed of any change to the system’s access procedure. In view of the elements thus presented, the patient is asked to give his/her consent. The whole procedure has been validated by CNIL, which gave a favourable opinion during the deliberation of March 4, 2004 (opinion no. 04-006).
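The vertical and horizontal chaining described above can be illustrated with a small sketch. It is not the HC Forum implementation: the use of SHA-256, the field layout and the helper names (`anon`, `make_record`, `ancestors`, `siblings`) are assumptions made for the example. The only idea taken from the text is that each record carries anonymised identifiers for the patient, the father and the mother, so that family relations can be followed without any name being stored.

```python
import hashlib


def anon(last_name, first_name, birth_date):
    """Hypothetical one-way identifier built from name, first name and date of birth."""
    raw = f"{last_name.strip().upper()}|{first_name.strip().upper()}|{birth_date}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def make_record(person, father, mother):
    """Build an anonymous record; each argument is a (last, first, birth_date) tuple."""
    return {"id": anon(*person), "father": anon(*father), "mother": anon(*mother)}


def ancestors(person_id, records):
    """Vertical chaining: collect every anonymised ancestor reachable from person_id."""
    by_id = {r["id"]: r for r in records}
    found, stack = set(), [person_id]
    while stack:
        rec = by_id.get(stack.pop())
        if rec is None:  # this individual has no record of their own in the database
            continue
        for parent in (rec["father"], rec["mother"]):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def siblings(person_id, records):
    """Horizontal chaining: other records sharing the same two anonymised parents."""
    by_id = {r["id"]: r for r in records}
    me = by_id[person_id]
    return {
        r["id"]
        for r in records
        if r["id"] != person_id
        and (r["father"], r["mother"]) == (me["father"], me["mother"])
    }
```

Because the same person always yields the same anonymised identifier, an individual who appears as a patient in one family and as a parent in another is detected simply by an identifier that occurs in both records.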
Implementing information systems at the national level
A number of countries are interested in the application of hashing techniques, especially in the healthcare field, such as Luxembourg, or Switzerland, which has developed, in collaboration with this department, a system combining hashing and encryption techniques for hospital reports [Borst et al., 2001]. Only the information systems implemented in France will be described here.
• The Medicalisation of Information Systems Programme (“Programme de Médicalisation des Systèmes d’Information”: PMSI)
CESSI/CNAMTS6 “has designed and provided an anonymisation function called FOIN (Nominative Information Blanking Function or “Fonction d’Occultation d’Information Nominatives”) [Trouessin and Allaërt, 1997] since 1996, for putting in place the PMSI in private establishments, on the recommendation of CNIL (who suggested using the algorithm developed by the Dijon CHU DIM), at the request of MES/DH7 and after assessment by SCSSI. This anonymisation function makes it possible to replace patients’ identities with anonymity numbers, or chaining keys, that are durable over time (as long as the secret of one-way function is unchanged) and distance (for each of the several thousand private clinics)” [Trouessin, report]. The identifier made anonymous is the Social Security number of the insured person (or the person covered) as well as the patient’s date of birth and sex. This system was extended to all public establishments subject to PMSI in 2001 [Circular no. 106 of February 22, 2001].8
• SNIIR-AM, the Medical Insurance Information System (“Système d’Information de l’Assurance Maladie”)
The 1996 orders instituted the registration of beneficiaries of
6 The Information System Security Study Centre (“Centre d’études des sécurités du système d’information”: CESSI) of the National Medical Insurance Fund for Salaried Workers (“Caisse nationale de l’assurance maladie des travailleurs salariés”: CNAMTS). 7 The Hospitals Directorate (“Direction des hôpitaux”: DH) at the Ministry for Employment and Solidarity (“Ministère de l’emploi et de la solidarité”: MES), at that time the Ministry for Labour and Social Affairs (“Ministère du travail et des affaires sociales”). 8 The Research Centre for Studying and Monitoring Living Conditions (“Centre de recherche pour l’étude et l’observation des conditions de vie”).
medical insurance from birth or when they enter France. The law of 1999 instituting universal health-care coverage (“couverture maladie universelle”: CMU) gave this system the principal feature looked for in a demographic survey: completeness. These decisions were the preliminaries to setting up a register of beneficiaries, without duplication, the national inter-regime medical insurance register (“répertoire national inter-régimes de l’assurance maladie”: RNIAM) based on the national identification number (“numéro national d’identification”: NIR), which also identifies the local medical insurance office (“caisse primaire”) that currently manages the beneficiary’s file, in order to provide consistency between management and demography. After a long development process since the orders of April 1996, the medical insurance regimes have finally created the National Inter-regime Medical Insurance Information System (“système national d’information inter-régime de l’assurance maladie”: SNIIR-AM), which offers exceptional opportunities. It consists of a database holding personal data for patients, made anonymous, and bringing together various elements based on the RNIAM (making the operation of the inter-regimes provision possible): reimbursement data with the details of the coding for the treatment and medicines, the identifiers of the health-care professionals and health-care establishments involved in the care of the patients, information on the pathology treated for patients with long-term illnesses and occupational diseases. This data is chained with the data coming from the PMSI: a unique chaining key makes it possible to link the medicalised hospital data from the PMSI with the data from local health-care facilities, thus making it possible to establish the patient’s medicalised progress [Merlière, 2004]. CNIL has given authorisation to preserve this comprehensive personal data for a period of two years, plus the current year.
This linkage of secure data over time and between institutions is achieved by means of the identifier encrypted in an irreversible way according to the hashing technique described above (p. 34). An agreement request, under the terms of chapter V b of the law on information technology and civil liberties, is being investigated in order to allow samples to be generated from SNIIR-AM, over long periods. The SNIIR-AM panel will thus preserve, without time limit, the services received for a permanent sample of 600,000 beneficiaries. This sample represents a renewal of the permanent sample of social welfare beneficiaries (“échantillon permanent des assurés sociaux”: EPAS) set up in 1976 by the statistics department of CNAMTS in collaboration with the Medical studies division (“Division d’études médicales”) at CREDOC9. Unlike EPAS, the SNIIR-AM panel will be based on a real demographic unit, the beneficiary.
• Provision for monitoring 26 reportable diseases
According to the Institute of Public Health Surveillance (“Institut de veille sanitaire”: InVS), an “essential component of public health and epidemiology, the provision for monitoring reportable diseases is based on the transmission of personal data to the public health authority. It applies two procedures: reporting and notifying, and entails the close involvement of three groups of players: the declarers (biologists and doctors) who suspect and diagnose the reportable diseases; the public health medical inspectors and the Departmental Health and Social Affairs Directorates (“Directions départementales des affaires sanitaires et sociales”: Ddass) and their employees, who are responsible for monitoring these diseases at the departmental level; and the InVS epidemiologists” [InVS Letter no. 8, 2003]. The new provision implemented in 2003 “reconciles two important elements: increased effectiveness of the surveillance system of reportable diseases and a better respect for the rights of the individual, thanks to an anonymisation system that is unique in the world” [InVS Letter no. 8, 2003]. Dijon CHU DIM contributed
to the initial outline of this provision, which was based on the hashing algorithm (described above), drafted in collaboration with Doctor Denis Coulombier of InVS, and, at CNIL’s request, attended discussion meetings with CNIL and the Central Security and Information Services Division (“Direction Centrale de la Sécurité et des Systèmes d’Information”: DCSSI) which led CNIL to issue a favourable announcement, on October 3, 2001, for the implementation of a system anonymising at source elements identifying the person by the encryption technique known as “irreversible”. CNIL authorised the whole provision by deliberation no. 02-082 dated November 19, 2002. After a European call for tenders, the Bertin company carried out the development of the anonymisation tools using public hashing algorithms (cf. above). The procedure differs slightly depending on the pathology looked at. For HIV (HIV positive), AIDS10 and acute hepatitis B infections, the declaring doctor or biologist performs the anonymisation at source, before sending the notification record to Ddass. The anonymity code is generated in an irreversible way by the software from the first letter of the surname, the first name, date of birth and sex of the person. For other diseases, the practitioner
transmits a notification record to the public health medical inspector (Misp) indicating the first letter of the surname, the first name, sex and date of birth. This is sent in a confidential envelope marked “medical secret”. After checking the record, the Misp carries out the anonymisation using the software provided by InVS and only sends InVS the anonymous portion of the notification record. While entering records coming from all the departments in the national databases, InVS performs a second anonymisation (cf. above). “This creates an index from the first anonymity code and a secret key that only InVS holds. This second process definitively breaks any link between the person and the data relating to them” [InVS Letter no. 8, 2003].
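The two-stage anonymisation described above can be sketched as follows. This is a minimal illustration and not the software supplied by InVS: the choice of SHA-256 and HMAC, the field separator, the normalisation rules and the function names `first_code` and `second_code` are all assumptions made for the example. The text only states that a first irreversible code is derived from the first letter of the surname, the first name, the date of birth and the sex, and that a second process combines this code with a secret key held by InVS alone.

```python
import hashlib
import hmac
import unicodedata


def normalise(text):
    """Uppercase, trim and strip accents so that spelling variants hash identically."""
    decomposed = unicodedata.normalize("NFKD", text.strip().upper())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def first_code(surname, first_name, birth_date, sex):
    """First anonymisation, performed at source (assumed here to be a plain SHA-256)."""
    fields = [normalise(surname)[:1], normalise(first_name), birth_date, sex.strip().upper()]
    return hashlib.sha256("|".join(fields).encode("utf-8")).hexdigest()


def second_code(code, secret_key):
    """Second anonymisation: keyed hash of the first code with a key held centrally."""
    return hmac.new(secret_key, code.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    code = first_code("Durand", "Élodie", "1970-03-12", "F")
    index = second_code(code, b"hypothetical-secret-held-by-the-central-body")
    print(code != index)  # the stored index differs from the code sent by the declarer
```

In this sketch a declarer who knows the first code cannot recompute the stored index without the central key, which is the property the second stage is meant to provide; two notifications for the same person still yield the same index and can therefore be chained.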
Applications in other fields (social, education)
Type of observation of people entering and leaving RMI in Paris, implemented by CREDOC
The individuals receiving French income support (“revenu minimum d’insertion”: RMI) constitute a very varied group, and the length of time they qualify for this provision is variable. In order to find out more about this population and the flow
[Photograph: Pitié-Salpêtrière Hospital, medical consultation]
factors, CREDOC, at the request of the Department of Social Action, Childhood and Public Health (“Direction de l’action sociale, de l’enfance et de la santé”: DASES) of the Paris department and the Paris Ddass, has implemented a system for observing people entering and leaving RMI in Paris. It is based on an original methodology that matches data coming from nine administrative files, which include: the file from the Paris Family Allowance Office (“caisse d’allocations familiales”: CAF), the national control file from CNAF10, for exchanges between Paris and the other French departments, the history file from the ANPE11 for unemployment and employment details, the single hiring declarations collected by URSSAF12 for employment in the private sector, the management data from the National Centre for the Management of Agricultural Exploitations (“Centre national pour l’aménagement des structures des exploitations agricoles”), for training courses financed by the State or the region and the types of assisted contract, and information from the central coordination unit for integration contracts. The starting point for the observation is a list, provided by the Paris CAF, of 48,000 people qualifying for RMI in Paris (end of 2000-beginning of 2001) [Aldeghi and Simon, 2002; Aldeghi and Olm, 2004]. The observation searched for information about these people in the eight other administrative files mentioned above. “To authorise this linkage between sources, CNIL ensured that the only information circulating between partners was an encrypted identifier, created from the NIR or CAF registration number. The anonymisation was performed by means of the FOIN procedure
9 HIV: human immunodeficiency virus. AIDS: acquired immunodeficiency syndrome, caused by the virus. 10 National Family Allowance Office (“Caisse nationale d’allocations familiales”). 11 National Employment Agency (“Agence nationale pour l’emploi”). 12 Union for the Collection of Social Security Contributions and Family Allowance Payments (“Union pour le recouvrement des cotisations de sécurité sociale et d’allocations familiales”).
updated by CESSI-CNAMTS (cf. p. 38). It is impossible to use these encrypted identifiers to reconstitute the initial numbers” [Aldeghi and Simon, 2002; Aldeghi and Olm, 2004]. The linkage of the files was possible for nearly 38,000 people, who had a complete NIR. In 21% of cases, the NIR shown in the Paris CAF data was not complete, either because the person had never been personally registered with Social Security, or because he/she did not give this data to CAF. In these cases it was not possible to identify any periods of unemployment, training placements, private sector or assisted contract employment. By concentrating on people with a complete NIR, the observation underestimated the number of women, in particular those living with a partner, and of foreigners, whose NIR is very often incomplete. These investigations have shown the close relationship of RMI beneficiaries with employment, not only on leaving RMI but also throughout their period of RMI coverage (and especially the fact that people leaving RMI were often employed under assisted
employment schemes, confirming the importance of these employment schemes in boosting departures from RMI). This report [Aldeghi and Simon, 2002; Aldeghi and Olm, 2004] also confirms the supposed negative link between integration contracts and leaving RMI, even greater for integration contracts that only have a social aspect. Finally, these investigations made it possible to study the flow of people entering and leaving RMI, and to state that 40% of the people entering had already been beneficiaries of RMI in Paris.
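The linkage on an encrypted identifier described above can be sketched as follows. This is an illustration only and does not reproduce the FOIN procedure: the keyed SHA-256 hash, the shared key, the field names and the rule that a NIR counts as complete when it has 13 characters are assumptions of the example. The point shown is that each partner replaces the NIR with the same encrypted identifier before exchange, that records are matched on this identifier alone, and that records with an incomplete NIR cannot be linked.

```python
import hashlib
import hmac

SHARED_KEY = b"hypothetical-shared-secret"  # assumption: same secret used by every partner


def is_complete(nir):
    """Assumed completeness rule: a NIR is usable when all 13 characters are present."""
    return nir is not None and len(nir) == 13


def anonymise(records, key=SHARED_KEY):
    """Replace the NIR by an encrypted identifier (None when the NIR is incomplete)."""
    out = []
    for rec in records:
        nir = rec.get("nir")
        anon_id = None
        if is_complete(nir):
            anon_id = hmac.new(key, nir.encode("utf-8"), hashlib.sha256).hexdigest()
        payload = {k: v for k, v in rec.items() if k != "nir"}
        out.append({"anon_id": anon_id, **payload})
    return out


def link(base, other):
    """Attach to each base record the matching record from another anonymised file."""
    index = {r["anon_id"]: r for r in other if r["anon_id"] is not None}
    linked = []
    for rec in base:
        match = index.get(rec["anon_id"]) if rec["anon_id"] is not None else None
        linked.append({**rec, **(match or {})})
    return linked


def linkable_share(base):
    """Proportion of base records that carry a usable encrypted identifier."""
    return sum(r["anon_id"] is not None for r in base) / len(base)
```

With the figures quoted in the text, about 38,000 linkable people out of a starting list of 48,000 is roughly 79%, which is consistent with the 21% of cases reported as having an incomplete NIR.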
Student monitoring by the Ministry of National Education (“Ministère de l’éducation nationale”) The Anonymat software system, developed by Dijon CHU DIM, was made available to the Ministry of National Education for the purpose of encrypting the identifiers of individuals’ files in the statistical information system on students (SISE), following an agreement
between the Dijon CHU and the ministry, signed in October 2003. At the request of Alain Goy (in charge of the statistical service at that time), this involved making the national student identifier (“identifiant national étudiant”: INE) anonymous using the technique of hashing (described above), in order to enable the monitoring of students at the national level by the Ministry’s Evaluation and Forward Planning Department (“Direction de l’évaluation et de la prospective”) (director Mme Peretti at the time of writing this article), and in particular by the Centre for Statistical Information and Decision-Making Assistance (“Centre de l’informatique statistique et de l’aide à la décision”: CISAD, manager M. Dispagne), which is attached to it, whilst respecting the anonymity due to the students, according to the procedure authorised by CNIL on March 27, 2003 [agreement MENK0300893A, 2003]. A project extending this monitoring to secondary school students is in progress. These issues are developed by Alain Goy in his article in the current Dossier.
Acknowledgements
These projects were made possible thanks to the efforts of Professor Liliane Dusserre, who persuaded the key officials at CNIL, SCSSI and the Council of the Medical Association (“Conseil de l’ordre des médecins”) of the usefulness of anonymisation techniques in the context of storing patients’ medical information.
Bibliography
Abrial V., 1998. Les contrats d’objectifs entre les établissements publics de santé et l’agence régionale de l’hospitalisation: analyse d’environnement du CHU de St-Étienne. Doctoral thesis in Medicine. Franche-Comté University.
Agence Régionale de l’Hospitalisation de Rhône-Alpes, 1997. Mission d’enquête sur les dépenses médicales et pharmaceutiques: Lyons, November 6.
Aldeghi I. and Simon M.-O., 2002. Observatoire des entrées et sorties du RMI à Paris, first-wave report, reporting department – CREDOC – December 2002 no. 226.
Aldeghi I. and Olm C., 2004. Observatoire des entrées et sorties du RMI à Paris. In Pascal Ardilly (ed.), “Échantillonnage et méthodes d’enquêtes”, Dunod, Paris, pp. 342-348.
Agreement MENK0300893A of April 23, 2003, Bulletin officiel of the EN no. 18 of May 1st.
Beckett B., 1990. Introduction aux méthodes de cryptologie, Masson, Paris.
Borst F., Allaert F.-A. and Quantin C., 2001. The Swiss solution for anonymously chaining patient files. Proc. MEDINFO 2001; IMIA: 1239-41.
Bouzelat H., 1998. Anonymat et chaînage de fichiers médicaux en vue d’études épidémiologiques. Doctoral thesis for University specialist in Medical Informatics. Bourgogne University.
Brassard G., 1993. Cryptologie contemporaine, Masson, Paris.
Brenner H., Schmidtmann I. and Stegmaier C., 1997. Effects of record linkage errors on registry-based follow-up studies. Statistics in Medicine, 16(23), 2633-43.
Circular DHOS-PMSI-2001 no. 106 of February 22, 2001 on chaining stays in health-care establishments in the context of the medicalisation of information systems programme (PMSI).
Cohen O., Mermet M.-A. and Demongeot J., 2001. HC Forum®: a web site based on an international human cytogenetic database. Nucleic Acids Research, 9, pp. 305-307.
Cornet B., Gouyon J.-B., Binquet C., Sagot P., Ferdynus C., Métral P. and Quantin C., 2001. Évaluation régionale en périnatalité: mise en place d’un recueil continu d’indicateurs. Revue d’Épidémiologie et de Santé Publique, 49, pp. 583-593.
Decree defining the conditions under which declarations are undertaken and authorisations are granted with relation to cryptology services and means, no. 98-101 of February 24, 1998.
Decree fixing the list of cryptology services and means freed from any prior formality, no. 98-206 of March 23, 1998.
Decree fixing the list of cryptology services and means for which the declaration substitutes for the authorisation, no. 98-207 of March 23, 1998.
Decree no. 99-199 of March 17, 1999 defining the categories of cryptology means and services for which the prior declaration procedure is substituted for the authorisation procedure.
Douglas S., 1996. Cryptologie, théorie et pratique, International Thomson Publishing.
Fisher F. and Madge B., 1996. Data security and patient confidentiality: the manager’s role. International Journal of Biomedical Computer, 43, pp. 115-119.
Gouyon B., Métral P., Fromaget J., Sagot P., Gouyon J.-B., 1999. Réseau périnatal de Bourgogne. Technologie et Santé, 37, pp. 51-56.
Jaro M.-A., 1995. Probabilistic-linkage of large public health data files. Statistics in Medicine, 14, pp. 491-8.
Law no. 98-1266 of December 30, 1998 (article 107). Finance Law for the year 1999.
Letter of the Institut de Veille Sanitaire, prevalence, no. 8, July 2003.
Marsault X., 1995. Compression et cryptage des données multimédias, Hermès, Paris.
Merlière Y., 2004. “le SNIIR-AM”, communication to the Journées de Statistique, May 25, 2004, Montpellier.
Quantin C., Bouzelat H., Allaërt F.-A. et al., 1998. Automatic record hash coding and linkage for epidemiological follow-up data confidentiality. Methods of Information in Medicine, 37, pp. 271-277.
Quantin C., Allaert F.-A., d’Athis P., Dusserre L., 1999. Can a database be anonymous? MIE 99, Slovenia, August 22-26, 1999, pp. 297-301.
Quantin C., Allaert F.-A., Dusserre L., 2000a. Anonymous statistical methods versus cryptographic methods in epidemiology. International Journal of Medical Informatics, 60, pp. 177-83.
Quantin C., Allaert F.-A., Bouzelat H., Rodrigues J.-M., Trombert-Paviot B., Brunet-Lecomte P., Gremy F., Dusserre L., 2000b. La sécurité des réseaux d’informations médicales: application aux études épidémiologiques. Revue d’Épidémiologie et de Santé Publique, 48, pp. 89-99.
Quantin C., Binquet C., Bourquard K., Pattisina R., Gouyon B., Ferdynus C., Gouyon J.-B. and Allaert F.-A., 2004. Which are the best identifiers for record linkage? Medical Informatics and the Internet Medicine, 29 (3-4), pp. 221-227.
Quantin C., Binquet C., Allaert F.-A., Gouyon B., Pattisina R., Le Teuff G., Ferdynus C. and Gouyon J.-B., 2005. Decision analysis for the assessment of a record linkage procedure: application to a perinatal network. Methods of Information in Medicine, 44, pp. 72-79.
Rivest R.L., Shamir A. and Adleman L., 1978. A method for obtaining digital signatures and public key cryptosystems, CACM, 2, 120.
Sugarman J.-R., Holliday M., Ross A. et al., 1996. Improving American Indian cancer data in the Washington state cancer registry using linkages with the Indian health service and tribal records. American Cancer Society, 78 (7 suppl.), pp. 1564-8.
Sweeney L., 1998. Three Computational Systems for Disclosing Medical Data in the Year 1999. MEDINFO 98, IMIA, B. Cesnik, A. McCray, J.-R. Scherrer (Eds). IOS Press, Amsterdam, pp. 1124-1129.
Thirion X., Sambuc R., San Marco J.-L., 1988. Epidemiology and anonymity: a new method. Revue d’Épidémiologie et Santé Publique, 36, pp. 36-42.
Trouessin G. and Allaërt F.-A., 1997. FOIN: a nominative information occultation function. MIE, 3, pp. 196-200.
Trouessin G. Report “qualité diagnostique et thérapeutique en cancérologie: communication d’informations multimédia dans un réseau sécurisé multidisciplinaire. Sécurité de l’information médicale en télémédecine”, study by the “Ministère de la recherche” (Ministry for Research).
Vuillet-Tavernier S., 2000. Réflexion autour de l’anonymat dans le traitement des données de santé. Médecine et Droit, 40, pp. 1-4.
Willenborg L.C.R.J., de Wall A.G. and Keller W.J., 1995. Some Methodological Issues in Statistical Disclosure Control. Statistics Netherlands, Department of Statistical Methods. Second Cathy Marsh Memorial Seminar, November 7th, London.
Zimmermann P., 1986. A proposed standard format for RSA cryptosystems, Boulder Software Engineering, Computer, 9, 21.