
A fault-tolerant cryptographic protocol for patient record requests

Thomas Aden, Marco Eichelberg, Wilfried Thoben
Kuratorium OFFIS e. V., Oldenburg, Germany

Corresponding author: Thomas Aden, Escherweg 2, 26121 Oldenburg, Germany, [email protected]

The ARTEMIS research project aims to provide an interoperability framework for health care IT based on semantic Web services. One important issue is identifying patients across organisational boundaries so that medical data can be exchanged in conformance with privacy regulations. The ‘Patient Identification Process Protocol’, which is based on a method used in the Cancer Registry of Lower Saxony, defines a process and a protocol that enable participants within the ARTEMIS network to query for medical information about individuals by means of so-called control numbers. Control numbers represent pseudonymous information about individuals and are used at Record Linkage Services to identify definitive matches.

INTRODUCTION

Digital communication of medical information across hospital or even country boundaries is rarely seen in Europe, although most of this information is digitally acquired and stored. Reasons certainly include the multitude of coding and communication standards in this field as well as national data privacy regulations. The exchange of clinical information across organisational boundaries is often still paper based. In many cases, relevant information about previous treatments, current medication etc. is not accessible at all.

The EU research project ARTEMIS [1] tries to improve the interoperability of clinical information systems between different organisations, based on semantic Web services and suitable domain ontologies. A healthcare organisation can join the ARTEMIS peer-to-peer (P2P) network and advertise electronic services, such as access to a patient’s electronic healthcare record (given suitable authorisation) or access to different subsystems (e. g. patient admission or laboratory information systems). Within the ARTEMIS network further services might be invoked dynamically, for example to translate and map between different representations of healthcare information.
In ARTEMIS all participating healthcare organisations (peers) are loosely coupled via the ARTEMIS P2P network. Groups of participating organisations are coupled via so-called Super Peers, which are connected to each other. One crucial aspect in ARTEMIS is to find and retrieve clinical information about a particular patient from different healthcare organisations when the concrete sources are unknown. To complicate matters, in most countries there are no unique person identifiers that would be valid for the whole lifetime of an individual and used by all parties in healthcare and for all episodes of care. On the contrary, in many cases several identifiers for a patient exist even within a single organisation. Consequently, a protocol is needed that allows for the identification of patients by means of non-unique patient-related attributes.

Different solutions are conceivable. One might follow the Integrating the Healthcare Enterprise (IHE) integration profile named ‘Patient Identifier Cross-referencing’ (PIX) [2], which is intended for healthcare enterprises of a broad range of sizes. The PIX profile proposes a global repository (“cross-reference manager”) that holds plaintext information about patients provided by connected systems in different patient ID domains. Systems can report and request patient identifiers via Health Level Seven (HL7) messages. Whereas this model is suitable for scenarios in which all parties know and trust each other, it is not applicable within the ARTEMIS context for data privacy and data security reasons. In this paper we propose an approach for identifying patients across organisational as well as country borders that takes data privacy issues into account.

1. DATA PRIVACY REGULATIONS

Due to the progress of the Information Society in recent years, the ability to gather and monitor personal data easily has improved dramatically. Several directives, recommendations, laws, and standards concerning these topics have been published at European level. These documents have in common that they concern the protection of personal data against processing and that they formulate conditions and rules under which processing is allowed and how it may be carried out. Data privacy, namely the right to self-determine the disclosure of personal information, and the general principles of processing personal data are governed by the EU Directive 95/46/EC [3]. It addresses the issue of excessive processing of personal data and corresponding violations of data privacy.
The EU Directive 95/46/EC embodies and respects the fundamental right to privacy guaranteed by article 8 of the European Convention for the Protection of Human Rights and Fundamental Freedoms [4]. The objective of this directive is to secure for every individual respect for his or her right to privacy with regard to the automatic processing of personal data. The supplementary Directive 2002/58/EC [5] concerns the processing of personal data in the electronic communications sector. The protection of medical data that is collected and processed automatically is covered in particular by Recommendation R(97)5 of the Council of Europe [6]. ISO 22857 [7] describes guidelines for the trans-border transfer of personal health information with respect to the directives and recommendations addressed above as well as to several other documents related to this context.

The main conclusion from the regulations described above is that neither person identifying data nor medical data may be transmitted between different organisations for the purposes we intend unless either the individual agrees to share his/her personal and medical data for specific purposes or the vital interests of the individual are affected. Therefore, a patient has to be asked to authorise the transmission of his/her personal and medical data. The European Committee for Standardization has published a European pre-standard, ENV 13606-3:2000 [8], which seems to be applicable for defining distribution rules for automatic processing. These distribution rules are considered to be a controlling mechanism, enabling access to and further distribution of the components to which they are attributed.

In addition to distribution rules based on patient preference, the entity requesting health information must prove its entitlement to query and retrieve healthcare information for a specific patient, in accordance with data security and privacy regulations. If a query is received but either the patient has not authorised delivery of his/her healthcare information or the requesting entity is unable to show suitable access permissions, the holder of the data must refuse to answer so that no information is revealed. Even the fact that some entity holds healthcare information about a specific patient but is not allowed to share it must be considered confidential information, e. g. if the holder of the desired data is a hospital for mental health.

2. PATIENT IDENTIFICATION PROCESS

Concerning the data needed to identify an individual unambiguously, we have to consider that usually only the information that is already stored in a hospital information system (HIS) can be used for queries.
Additional information might be retrieved at the requestor’s side, e. g. from official documents or by interviewing the patient, but the receiver of a query will not be able to gather further information about the desired patient.

Constant details    Non-constant details    Country-specific details
name at birth       title                   unique person ID
date of birth       surname                 bank account no.
place of birth      given name(s)           insurance no.
sex                 nationality             …
                    address

Table 1: Classification of patient attributes

Table 1 sketches a classification of attributes that might be available for the query. We assume that at least the attributes listed in the first column are present in every HIS. These attributes are considered to be constant during the lifetime of an individual (except for the patient’s sex, which might change in rare cases), whereas attributes in the second column might change. In some countries there are specific identifiers that can be used to identify a patient unambiguously, but they are only available at national level. Column three lists some conceivable country-specific attributes.

To improve the selectivity of the identification process, we propose a few additional attributes representing formerly valid values of the attributes concerned; e. g. ‘former name’ would be such an attribute. In order to limit the number of attributes, this history should not go back more than one level. The attributes former name, former address (country, postal code, city, street, house number) as well as country-specific attributes that might change over time could be appropriate.

We assume that for trans-border identification processes attribute names and values are present in a Latin character set (strings) and Arabic digits (numbers). The process itself could certainly be extended to a concept like Unicode, but we have not yet examined the implications in detail.

Due to the data privacy regulations described above, person identifying data must be anonymised or at least made pseudonymous, but even so it must be possible to identify an individual across organisational boundaries. Further reasons for the use of pseudonymous data records of patients instead of plaintext records are:
• To ensure that a third party cannot listen to, record, and interpret patient identifying data during a lookup and identification process.
• To ensure that a third party cannot easily start a lookup process to gather clinical data about arbitrary individuals.
In order to fulfil these requirements, we adapted a system that is used in the Cancer Registry of Lower Saxony, Germany. This system uses a concept of so-called control numbers [9]. Control numbers represent a deterministic encryption of characters of personal data records.
The process of generating control numbers is split into two phases, a standardising and a ciphering phase. Within the standardising phase all missing attribute values are first expressed with well-defined codes for unknown or missing values. In a second step plaintext attributes are converted into a uniform presentation and split into parts if appropriate; for example, the name ‘Günther-Eheim zum Besten’ would be converted and separated into the three parts ‘GUENTHER’ + ‘EHEIM’ + ‘ZUM BESTEN’. The number of parts into which certain string attributes are decomposed is fixed; any further parts are concatenated into one trailing part. String attributes are decomposed to allow for comparisons even though parts have been omitted or interchanged during admission, e. g. to a hospital. Further standardising steps are conceivable, such as calculating the total sum of the digit values of each string attribute to detect character permutations. In a concluding standardising step phonetic codes are generated from all string attributes. Phonetic codes compensate for some spelling and hearing mistakes.
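The standardising steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of the Cancer Registry: the transliteration table, the `MISSING_CODE` value, and the function names `to_uniform` and `split_parts` are hypothetical, and the digit-sum and phonetic-code steps are left out.

```python
import re
import unicodedata

# Hypothetical transliteration table; all participants would have to
# agree on one shared mapping.
TRANSLITERATION = {"Ä": "AE", "Ö": "OE", "Ü": "UE"}
MISSING_CODE = "#MISSING#"  # assumed code for unknown or missing values


def to_uniform(text):
    """Upper-case a value and reduce it to the letters A-Z and spaces."""
    text = unicodedata.normalize("NFC", text).upper()  # "ß" becomes "SS"
    for source, target in TRANSLITERATION.items():
        text = text.replace(source, target)
    # Strip remaining diacritics (e.g. É -> E).
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Hyphens, digits and punctuation all become part separators.
    return re.sub(r"[^A-Z]+", " ", text).strip()


def split_parts(value, max_parts=3):
    """Split a string attribute into a fixed number of parts.

    Words beyond the fixed number are concatenated into one trailing
    part; absent parts are filled with the missing code.
    """
    words = to_uniform(value or "").split()
    head, tail = words[:max_parts - 1], words[max_parts - 1:]
    parts = head + ([" ".join(tail)] if tail else [])
    return parts + [MISSING_CODE] * (max_parts - len(parts))


print(split_parts("Günther-Eheim zum Besten"))
# ['GUENTHER', 'EHEIM', 'ZUM BESTEN']
```

Filling absent parts with an explicit code keeps every record at the same number of parts, so that later comparisons always operate on arrays of equal length.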

During the subsequent ciphering phase, the standardised attributes are enciphered. A message authentication code (MAC) algorithm is used first to create hash keys from most of the plaintext attributes. A few attributes, e. g. nationality, sex, and date of birth, are kept as plaintext and used for blocking during the record linkage (see below). In a second step the hash keys are encrypted with a symmetric encryption algorithm. These standardised, hashed, and encrypted attributes are called control numbers. It should be noted that chosen-plaintext attacks (in particular, dictionary-based attacks) against the control numbers are still possible for an entity that holds the encryption key.

To identify an individual on the basis of control numbers, a method called record linkage is used. Record linkage may be characterised as “the methodology of bringing together corresponding records from two or more files or finding duplicates within files” [10]. We propose the use of a probabilistic rather than a deterministic record linkage, cf. [11]. Within the context of our patient identification (PID) process we expect the record linkage to be:
• an enabling technology to find all records of a patient that are available.
• precise in finding exact and definitive matches for desired individuals, i. e. the rate of homonym and synonym errors must be very low.
• efficient regarding computational complexity.
• efficient regarding network bandwidth and latency.
Prima facie, these demands seem to be contradictory, but we assume the second aim to be more important than the first one, because in ARTEMIS the retrieval of information from other organisations should be done almost transparently and automatically by the underlying systems.

A probabilistic record linkage system [12] adapted for the needs of the patient identification computes weights for the similarity of two records that are represented by control numbers. Within the linkage process so-called matching variables are used. A set of matching variables is a subset of the control numbers a query record consists of. During the record linkage process two predefined probabilities are used for each matching variable. The first one expresses the probability that two equal control numbers represent the same entity, while the second one expresses the probability that they represent different entities. The comparison functions applied in our process do not need to be complex: control numbers are constructed using a hash function followed by an encryption algorithm, so any similarity between two control numbers is accidental and cannot be used to calculate similarity-based weights. Only the equality of strings has to be checked. However, we need to consider arrays comprised of a subset of the set of matching attributes. This can be done especially with name attributes to overcome or to limit the consequences of a changed order of name components. Figure 1 sketches a comparison on arrays in which a changed order of given name parts can be recognised by the record linkage algorithm. Indeed, if a changed order of name parts is recognised and a match on this attribute array is assumed, the weight must be lower than the weight for a match on an attribute array with the parts in the correct order, because a different order may be the only difference between two people regarding person identifying attributes.

[Figure: the given name array HANS, PETER, HEINRICH compared with the array HEINRICH, HANS, PETER]

Figure 1: Array comparison

Furthermore, attribute values that are present in a query record and that represent former values of attributes, such as former name, should be compared to current values of a candidate record, such as surname. In this way some older records might be found, for example a record of a patient who got married and changed his/her name after the record had been stored. As with the array comparison mentioned above, matches resulting from the comparison of former values to current values should lead to lower weights. In the end, precalculated threshold values are used to (semi-)automatically determine whether or not two records are likely to describe the same person.

3. PID PROTOCOL

First of all, we assume that participants of the ARTEMIS network understand and agree that not all medical records of a particular patient might be found, due to, for instance, the anonymisation of person identifying data, the nature of P2P communications, or deficiencies in the stored records that prevent a correct identification.
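The weighting used by the record linkage of section 2 can be made concrete with a small sketch. It assumes log-ratio weights in the style of probabilistic record linkage [12], where `m` and `u` are the two predefined probabilities per matching variable. The numeric values, the `REORDER_FACTOR` penalty for reordered name parts, the thresholds, and all names are hypothetical choices for illustration; the paper leaves the concrete algorithm and its configuration to further work.

```python
import math
from dataclasses import dataclass

REORDER_FACTOR = 0.5  # hypothetical: reordered parts earn half the weight


@dataclass(frozen=True)
class MatchingVariable:
    name: str
    m: float  # assumed P(control numbers agree | same person)
    u: float  # assumed P(control numbers agree | different persons)

    def agreement_weight(self):
        return math.log2(self.m / self.u)

    def disagreement_weight(self):
        return math.log2((1 - self.m) / (1 - self.u))


def array_weight(variable, query_parts, candidate_parts):
    """Weight for an array of control numbers, e.g. given name parts.

    Control numbers are opaque, so only equality is tested. The same
    parts in a different order count as a weaker match.
    """
    if query_parts == candidate_parts:
        return variable.agreement_weight()
    if sorted(query_parts) == sorted(candidate_parts):
        return REORDER_FACTOR * variable.agreement_weight()
    return variable.disagreement_weight()


def classify(total_weight, lower, upper):
    """Threshold decision; the middle band is left to a human reviewer."""
    if total_weight >= upper:
        return "match"
    if total_weight <= lower:
        return "non-match"
    return "manual review"


# Strings standing in for real control numbers.
given_names = MatchingVariable("given_names", m=0.95, u=0.01)
same_order = array_weight(given_names, ["cn1", "cn2", "cn3"], ["cn1", "cn2", "cn3"])
reordered = array_weight(given_names, ["cn1", "cn2", "cn3"], ["cn3", "cn1", "cn2"])
different = array_weight(given_names, ["cn1", "cn2", "cn3"], ["cn1", "cn2", "cn9"])

for weight in (same_order, reordered, different):
    print(round(weight, 2), classify(weight, lower=-2.0, upper=5.0))
```

In a full linkage the weights of all matching variables would be summed before the thresholds are applied; the single variable here only shows how a reordered array falls between a clean match and a mismatch.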

Figure 2: Patient Identification Process Protocol (message flow between Requestor, Trusted Third Party (TTP), Record Linkage Service (RLS), and Repository; arrows numbered 1 to 10)

Since the ARTEMIS network is a P2P network, there cannot be a global repository that stores control numbers for every patient among the participating hospitals and, therefore, the IHE ‘Patient Identifier Cross-referencing’ integration profile is not directly applicable. In the following we sketch our slightly different PID protocol (see Figure 2).

In order to locate medical records for a specific patient, a requesting entity has to generate control numbers using a random session key for encryption. The session key is temporarily stored at a Trusted Third Party (TTP) (arrow 1) along with additional descriptive information about the requestor. At the TTP a unique query identifier is generated, which is sent back to the requestor (arrow 2). In a subsequent step (3), a query is sent to a Record Linkage Service (RLS). It consists of control numbers, the plaintext attributes for blocking, the unique query identifier and additional information, such as constraints on types of hospitals to be queried or regional constraints. An RLS might be located at each Super Peer or next to a Super Peer and will never be able to decrypt the control numbers, because the RLS must not be able to obtain the session key from the TTP.

In the next step the RLS sends the plaintext attributes that represent so-called blocking variables to all repositories that are known to the RLS (4), along with administrative data such as the unique query identifier and information about the TTP. The blocking variables are used to limit the number of possible match candidates: a repository pre-selects records that exactly match these values. The selectivity of the blocking variables should be such that the number of possible matches is reduced but the identity of an individual cannot be detected. Each repository now contacts the TTP and retrieves the session key bound to the query identifier, along with information about the requestor that allows it to decide whether communication with this requestor is permissible at all (5). Now each repository is able to (pre-)select candidate records by means of the blocking variables. The candidates are anonymised by generating control numbers using the session key. The control numbers are then transmitted to the RLS, along with an additional unique temporary ID for each set of control numbers (6).

The RLS is now able to carry out the record linkage and identify possible match candidates. The RLS sends information about the requestor, the query itself and the candidate IDs, i. e. the temporary patient IDs of possibly matching records, back to the appropriate repository (7). The requestor is informed about the completion of the record linkage operation, too (8). Repository and requestor can now start to communicate, either to request or to deliver the desired information (9, 10).

4. CONCLUSIONS

In this paper we presented an approach to identify patients across organisational and country borders that takes data privacy and security into account. Due to the anonymisation and the nature of P2P communications, our approach will not be able to find all patient records available. However, we believe that it can improve the feasibility of locating and retrieving medical records across organisational and country borders even in the presence of incomplete and outdated patient identifying information. Further investigations will focus on the development of an appropriate record linkage algorithm and configurations for this process.
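The message flow of the PID protocol in section 3 can be traced with a small in-memory simulation. It is a sketch under several assumptions, not the ARTEMIS implementation: all class and function names, the `MAC_KEY`, and the choice of blocking and matching attributes are hypothetical. Because the Python standard library offers no symmetric cipher, a second HMAC keyed with the session key stands in for the encryption step, and exact equality stands in for the probabilistic record linkage.

```python
import hashlib
import hmac
import secrets
import uuid

MAC_KEY = b"hypothetical network-wide MAC key"
BLOCKING = ("sex", "date_of_birth")      # sent in plaintext
MATCHING = ("surname", "name_at_birth")  # sent as control numbers only


def control_number(value, session_key):
    """MAC the standardised value, then bind it to the session key.

    The second HMAC is a stand-in for symmetric encryption.
    """
    hash_key = hmac.new(MAC_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return hmac.new(session_key, hash_key, hashlib.sha256).hexdigest()


class TrustedThirdParty:
    def __init__(self):
        self._queries = {}

    def register(self, session_key, requestor):
        """Arrows 1 and 2: store the session key, return a query ID."""
        query_id = str(uuid.uuid4())
        self._queries[query_id] = (session_key, requestor)
        return query_id

    def lookup(self, query_id):
        """Arrow 5: meant for repositories only, never for the RLS."""
        return self._queries[query_id]


class Repository:
    def __init__(self, records, trusted_requestors):
        self.records = records  # plaintext records never leave this object
        self.trusted_requestors = trusted_requestors
        self.temp_ids = {}

    def candidates(self, ttp, query_id, blocking):
        """Arrows 4 to 6: block on plaintext, answer with control numbers."""
        session_key, requestor = ttp.lookup(query_id)
        if requestor not in self.trusted_requestors:
            return {}  # reveal nothing, not even that records exist
        result = {}
        for record in self.records:
            if all(record[name] == value for name, value in blocking.items()):
                temp_id = secrets.token_hex(8)
                self.temp_ids[temp_id] = record
                result[temp_id] = {
                    name: control_number(record[name], session_key)
                    for name in MATCHING
                }
        return result


def record_linkage(query_cns, candidates):
    """At the RLS: compare opaque control numbers for equality only."""
    return [tid for tid, cns in candidates.items() if cns == query_cns]


patient = {"surname": "MEIER", "name_at_birth": "SCHMIDT",
           "sex": "F", "date_of_birth": "1970-01-31"}
ttp = TrustedThirdParty()
repository = Repository(
    [dict(patient),
     {"surname": "MEYER", "name_at_birth": "MEYER",
      "sex": "F", "date_of_birth": "1970-01-31"}],
    trusted_requestors={"Hospital A"},
)

session_key = secrets.token_bytes(32)
query_id = ttp.register(session_key, "Hospital A")            # arrows 1, 2
query_cns = {n: control_number(patient[n], session_key) for n in MATCHING}
blocking = {n: patient[n] for n in BLOCKING}                  # arrow 3

candidates = repository.candidates(ttp, query_id, blocking)   # arrows 4 to 6
matches = record_linkage(query_cns, candidates)               # RLS, then 7 and 8
print(len(candidates), "candidates,", len(matches), "match")
```

Both stored records pass the blocking step, but only one yields the same control numbers as the query. Since the RLS sees only control numbers and temporary IDs, it learns which candidate matched without seeing a name.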

References

[1] ARTEMIS (IST-1-002103-STP), http://www.srdc.metu.edu.tr/webpage/projects/artemis/ (July 5, 2004)
[2] HIMSS and RSNA, Integrating the Healthcare Enterprise (IHE) – IT Infrastructure Technical Framework - Volume 1, ITI TF-1 Integration Profiles, Rev. 1.0, http://www.rsna.org/IHE/tf/ihe_tf_index.shtml (July 5, 2004)
[3] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data, OJ L, 23 Nov. 1995, http://europa.eu.int/comm/internal_market/privacy/law_en.htm (July 5, 2004)
[4] Council of Europe, European Convention for the Protection of Human Rights and Fundamental Freedoms, 4 November 1950, http://conventions.coe.int/Treaty/EN/v3MenuTraites.asp (July 5, 2004)
[5] Directive 2002/58/EC of the European Parliament and of the Council of 12 July 2002, concerning the processing of personal data and the protection of privacy in the electronic communications sector, OJ L 201/37, 31 July 2002, http://europa.eu.int/information_society/topics/ecomm/useful_information/library/legislation/index_en.htm (July 5, 2004)
[6] Council of Europe – Committee of Ministers, Recommendation No. R(97)5 of the Committee of Ministers to Member States on the Protection of Medical Data, Council of Europe Publishing, Strasbourg, 12 February 1997
[7] ISO/TC 215 - International Organization for Standardization, ISO/DIS 22857 (Draft International Standard): Health Informatics - Guidelines on data protection to facilitate trans-border flows of personal health information, 2003
[8] ENV 13606-3:2000, “Health Informatics – Electronic healthcare record communication – Part 3: Distribution Rules”, http://www.centc251.org/ (July 4, 2004)
[9] W. Thoben, H.-J. Appelrath and S. Sauer, Record linkage of anonymous data by control numbers. In: From Data to Knowledge: Theoretical and Practical Aspects of Classification, Data Analysis and Knowledge Organisation, W. Gaul and D. Pfeifer (eds.), pp. 412-419, Springer-Verlag, 1994
[10] W. E. Winkler, The State of Record Linkage and Current Research Problems. Technical Note, U. S. Bureau of the Census, 1999
[11] T. Blakely and C. Salmond, Probabilistic record linkage and a method to calculate the positive predictive value. International Journal of Epidemiology, 31:1246-1252, 2002
[12] M. A. Jaro, Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414-420, 1989