Error estimation in linking heterogeneous data sources

Health Informatics Journal http://jhi.sagepub.com

Error estimation in linking heterogeneous data sources H.K. Monga and Timothy B. Patrick HEALTH INFORMATICS J 2001; 7; 135 The online version of this article can be found at: http://jhi.sagepub.com/cgi/content/abstract/7/3-4/135

Published by: http://www.sagepublications.com

Additional services and information for Health Informatics Journal can be found at: Email Alerts: http://jhi.sagepub.com/cgi/alerts Subscriptions: http://jhi.sagepub.com/subscriptions Reprints: http://www.sagepub.com/journalsReprints.nav Permissions: http://www.sagepub.com/journalsPermissions.nav

Downloaded from http://jhi.sagepub.com at PENNSYLVANIA STATE UNIV on April 15, 2008 © 2001 SAGE Publications. All rights reserved. Not for commercial use or unauthorized distribution.

04 Q-HIJ7.3_Monga

4/2/02

2:34 pm

Page 135

Error estimation in linking heterogeneous data sources H.K. Monga and T.B. Patrick

There are many instances in an institution when there is a need to gather data from different departments. Record linkage is the technique used to bring together these separately compiled records into a useful format. Unfortunately, even with perfect data, error may be introduced when records from different sources are linked. We assume then, that at least for large-scale day-today operations, no data collection or repository produced by linking records from different sources should be declared error free. Thus, for quality assurance purposes, it will be useful to have methods that may be applied in day-to-day operations for estimating error produced by linking records. In this paper we describe our attempts to estimate the false negative error rate in record linkage serving periodic quality reporting. Keywords Error estimation, record linkage, medical records

BACKGROUND Record linkage is the process of bringing together related records that have been compiled separately [1]. Many types of studies have been conducted that have used different methods and approaches to link medical records obtained from heterogeneous data sources. For example, the use of administrative data for research purposes has led to considerable interest in computerized methods for linking medical records [3]. Techniques for record linkage are also important because they help in expanding and correcting databases that may be used for studying various diseases such as AIDS or diabetes and a wide range of treatments and procedures [4]. These techniques have allowed researchers to study the long-term outcomes of exposure to hazards [5]. Epidemiologists have depended on record linkage to examine the effects of immunization on subsequent morbidity and mortality [3]. Notwithstanding the potential informational value of record linkage, it may be a source of two kinds of error: homonym errors and synonym errors. Homonym errors occur when identifiers are too similar in distinct patients. Synonym errors occur when there is a failure to link multiple notifications of the same patients. Even starting with perfect data, these two types of error may be introduced when records from different systems are linked. The error rate may be compounded, however, if there are errors in the data to begin with. The majority of these pre-existing errors are related to data entry problems in the original data sources [6].

INTRODUCTION

H.K. Monga MD Postdoctoral Fellow Tel: +00 (1) 573 884 0332 Fax: +00 (1) 573 447 3356 Email: mongah@ health.missouri.edu Timothy B Patrick PhD Assistant Professor Tel: +00 (1) 573 884 8155 Fax: +00 (1) 573 447 3356 Email: patrickt@ health.missouri.edu Department of Health Management and Informatics School of Medicine 306 Clark Hall University of MissouriColumbia Columbia, MO 65203, USA

Information technology has been introduced in the US Healthcare System in a decentralized manner. Often, departments in an institution have taken the responsibility for organizing their own data. They have installed their own software and database systems to fulfil their short-term needs without realizing the full impact of this practice on the entire facility. For example, for a given patient there is an admission record, a discharge record, an outpatient record, laboratory data and a financial record. All these different pieces of data are captured at different sites with different databases and operating systems and are recorded by different people [1]. This situation becomes troublesome when there is a need to exchange and integrate data between departments [2]. Our goal in this project is to investigate methods for estimating the error associated with linking records from such heterogeneous sources.

METHODS Our general research agenda is to develop methods to estimate error rates, both false positive and false negative, for methods of record linkage. Our specific objective in this study was to determine the false negative error rate associated with a particular method of record linkage in an ongoing calculation of breast cancer screening indicators. The method for which the false negative error rate was to be estimated was called the ‘reference method’. We based our estimate of error on samples of records from two systems: the IDX Registration and Scheduling System and Medical Archival System (MARS) used in the Department of Radiology. The IDX system collects data from routine registration of patients coming to the hospital for inpatient or outpatient visits. The MARS system collects information about patients for routine X-ray, MRI or mammo-

Health Informatics Journal (2001) 7, 135-37 Downloaded from http://jhi.sagepub.com at PENNSYLVANIA STATE UNIV on April 15, 2008 2001 SAGEBuilding, Publications. All rights reserved. for 7NX commercial useLexington or unauthorized distribution. © The Continuum Publishing Group Ltd 2002, © The Tower 11 York Road, LondonNot SE1 and 370 Avenue, New York, NY 10017, USA.

04 Q-HIJ7.3_Monga

136

4/2/02

2:34 pm

Page 136

Health Informatics Journal

grams. Our samples covered the period January 1996 to December 1999. There were basic differences between the formats of the two data sets. The IDX file was in a comma-delimited text format with patient’s last and middle name as one field and first name as a second field while MARS had patients’ first, middle and last names in one field separated by commas. In spite of the fact that they were two different sections of the same institution, there was no common ID column. Neither Medical Record Numbers nor Social Security Numbers could be used to match records. Patient name and date of birth were the only two common fields for matching. For further analysis, the records were transferred into MSSQL7.0. Transact SQL (T-SQL) procedures were written to place a patient’s first, last and middle names in separate columns and remove extra spaces and commas to format the files uniformly. Our two-step approach was 1) to compare the results obtained with our reference method of record linkage with the results obtained with other methods; and then 2) to use the other methods as a way of estimating the false negative error rate of the reference method. Our reference method linked records from IDX and MARS by matching patient last name, first name and date of birth. The other methods we examined matched records on (a) last and first names; (b) last name and date of birth; (c) last, first and middle names; (d) last name, middle name and date of birth; and (e) last name, first name, middle name and date of birth. Record linking can be ‘all-or-none’ or it can be ‘probabilistic’. The linkage is ‘all or none’ if decisions are made that a pair of records either do or do not relate to the same person or event. Linkage is probabilistic if it is based on computed calculation of the probability that the records relate to the same person or event [1].

Fig. 1 Number of records by the different methods © The Continuum Publishing Group Ltd 2002.

We used the ‘all-or-none’ approach for each method. The initial samples of records from IDX and MARS consisted of 42 201 and 36 434 records respectively. After removing male patients and duplicates from both data sets, there were 17 145 and 19 160 records in the IDX and MARS data sets respectively.

RESULTS Step1 Figure 1 shows a comparison of the results obtained by the reference method and methods (a)–(e). As shown in Figure 1, our reference method (LFDob or last name, first name and date of birth) matched 3548 records. Method (a) (LF or last and first names) and method (b) (LDob or last name and date of birth) produced a higher number of matched records (4521 and 3625 respectively). Methods (a) and (b) produced supersets of the results of our reference method. They contained all the records from the reference method set as well as some extra records. The extra records included those patients with same last and first names but with different dates of births and last names. Method (c) (LFM or last name, first name, middle name), method (d) (LMDob or last name, middle name and date of birth), and method (e) (LFMDob or last name, first name, middle name and date of birth) all missed about 50% of the matches obtained with our reference method. Surprisingly, however, all these methods using middle names produced subsets of the results we obtained with our reference method.

Step 2 The second step of our study was to use methods (a)–(e) to estimate the false negative error rate of our reference method (LFDob or last name, first name and date of birth). Since, in Step 1, methods (c)–(e) produced results that were subsets of the results produced by the reference method, only methods (a) and (b) were used for estimating the false negative error rate associated with the reference method. Using the reference method, there were 13 597 and 15 612 unmatched records from IDX and MARS respectively. These records were matched again with method (a) (LF or last and first names) and method (b) (LDob or last name and date of birth). The results are shown in Figure 2.


04 Q-HIJ7.3_Monga

4/2/02

2:34 pm

Page 137

Error estimation in linking heterogeneous data sources

Fig. 2 Rematching of the unmatched records obtained after matching with our reference method

Method (a) produced 497 matches. Upon manual inspection, we found 112 correct matches that had not been produced by the reference method. Most missed matches were due to misspellings. Some missed matches were due to Y2K problems in the data. Some dates of birth were recorded incorrectly, for example, ‘1-1-1913’ was incorrectly recorded as ‘1-1-2013’. Method (b) produced 53 matches. After checking them manually, we found an additional 45 correct matches that were missed by the reference method because of errors in spelling first names. Using methods (a) and (b), we determined that the reference method found only 3548 out of 3705 possible matches. Relative to the results of methods (a) and (b), the false negative error rate of the reference method was calculated to be approximately 4%.

137

mining the errors associated with the method or methods of record linkage used. Our specific example of a record linkage project was a project to calculate breast cancer screening indicators. We chose this as our demonstration example because it is one of our real ongoing, operational projects, and we felt this work could contribute to those operational concerns. We also chose it because it is at least of a scale that makes comprehensive manual review to determine the true linkage error rate somewhat problematic. Clearly, ongoing quality studies of a much larger scale are possible, and for those studies comprehensive manual review to determine error rates would be impossible. Whatever method of record linkage is used in such projects will have some associated positive rate of error. Since comprehensive manual review to determine errors and correct them will not be possible in such cases, some method of estimating error due to record linkage is necessary. However, any method for estimating the rate of error associated with a particular method of record linkage will itself be a method of record linkage. Thus, in general, estimating the rate of error associated with a particular method of record linkage must always be understood as an estimate relative to another method of record linkage, which itself will be characterized by some rate of error.

ACKNOWLEDGMENT This work is supported, in part, by grant LM07089 from the National Library of Medicine. Portions of this research were presented as a poster at AMIA 2000.

DISCUSSION This paper illustrates a record linkage project and an attempt to estimate the rate of errors associated with a particular method of linking. Record linkage involves comparing records from two sources and matching them on the basis of user-determined criteria. In the context of decentralized control of data and record standards, there is a real potential for linkage errors due to data disparity. In addition, the potential for linkage errors due to data disparity may be compounded by prior data entry error. Thus, we take a conservative point of view and assume that at least for large-scale day-to-day operations, no data collection or repository produced by linking records from different sources should be considered error free. Since erroneous record matches may lead researchers and policy makers to faulty decisions, it is necessary to have methods for deter-

© The Continuum Publishing Group Ltd 2002.

REFERENCES [1] Gill L, Goldacre M, Simmons H, Bettley G, Griffith M. Computerised linking of medical records: methodological guidelines. Journal of Epidemiology and Community Health 1993; 47: 31619. [2] Wang C, Ohe K. A CORBA-based object frame work with patient identification translation and dynamic linking. Methods of Information in Medicine 1999; 38: 56-65. [3] Hawker GA, Coyte PC, Wright JG, Paul JE, Bombardier C. Accuracy of administrative data for assessing outcomes after knee replacement surgery. Journal of Clinical Epidemiology 1997; 50: 265-73. [4] Roos LL, Walld R, Wajda A, Bond R, Hartford K. Record linkage strategies, outpatient procedures, and administrative data. Medical Care 1996; 34(6): 57-582. [5] Shannon HS, Jamieson E, Walsh C, Julian JA, Fair ME, Buffet A. Comparison of individual follow-up and computerized record linkage using the Canadian Mortality Data Base. Canadian Journal of Public Health 1989; 80: 54. [6] Arndt S, Tyrrell G, Woolson RF, Flaum M, Anreasen NC. Effects of errors in a mutlicenter medical study: preventing misinterpreted data. Journal of Psychiatric Research 1994; 28: 447-59.


Error estimation in linking heterogeneous data sources

Error estimation in linking heterogeneous data sources

Suggest Documents

Integrating Heterogeneous Data Sources in

Combining Heterogeneous Data Sources through ... - Semantic Scholar

integrating heterogeneous data sources - Information Sciences Institute

Prototyping Digital Libraries Handling Heterogeneous Data Sources ...

Query Relaxation across Heterogeneous Data Sources

Combining heterogeneous data sources for accurate ... - ScienceOpen

Clustering Genes using Heterogeneous Data Sources - CiteSeerX

Exposing heterogeneous data sources as ... - Semantic Scholar

Mining Distributed and Heterogeneous Data Sources: A Project in the ...

Learning from Heterogeneous Data Sources: An Application in ... - PLOS

Estimation of coherent error sources from stabilizer measurements

Estimation in Single-Index Panel Data Models with Heterogeneous ...

Error estimation in recording

Architecture for querying heterogeneous data sources for on - CiteSeerX

Leveraging Mediator Cost Models with Heterogeneous Data Sources

integration of data from heterogeneous sources ... - Semantic Scholar

Using First-Order Logic to Query Heterogeneous Internet Data Sources

Information Mining from Heterogeneous Data Sources: A Case ... - MDPI

integration of data from heterogeneous sources ... - Semantic Scholar

Integration of heterogeneous data sources for gene function prediction ...

Integrating heterogeneous data sources for better freight flow analysis ...

Leveraging Mediator Cost Models with Heterogeneous Data Sources

Fault-Aware Resource Allocation for Heterogeneous Data Sources ...

Semantic Matching across Heterogeneous Data Sources - Google Sites