Creating a Large Database Test Bed with Typographical Errors for ...

3 downloads 6116 Views 111KB Size Report
database. OBJECTIVE. To enhance the existing methods in creating a database test bed for record linkage evaluation by developing a PHP program to: ▫ Create.
Creating a Large Database Test Bed with Typographical Errors for Record Linkage Evaluation Nawanan Theera‐Ampornpunt, MD, Boonchai Kijsanayotin, MD, PhD, Stuart M. Speedie, PhD Health Informatics, University of Minnesota, Minneapolis, Minnesota INTRODUCTION ƒ Health information exchange across multiple organizations requires a method or algorithm to optimally link records of the same individuals using demographic data. ƒ Selecting the best record linkage algorithm requires an evaluation to determine its sensitivity and specificity. ƒ This evaluation is facilitated by a large database test bed that closely reflects a real world population and takes into account the potential data entry errors that unfortunately occur in realworld databases.

First Name

Last Name

4,275 Female 1,219 Male (1990 Census)

88,799  Last Names (1990 Census)

Probabilistic

Probabilistic

Gender

Zip Code

System ID

Age Distribution  of MN Population

M/F

MN Zip Codes

Sequence  Number 1‐N

Probabilistic

Uniform

Uniform

Date of Birth

ƒ Model the real world distribution of key variables ƒ Allow users to introduce typographical errors that occur in real world due to imperfect data entry, with frequencies of error occurrence specified by the user

Large Master Database of Demographic Data (N = 950,000 records)

ƒ Randomly select a combination of first names and last names for each gender based on lists of names from 1990 U.S. Census publicly available and their frequencies of occurrence ƒ Generate date of birth based on the available age distribution of the Minnesota population and randomly select a zip code from MN zip codes using a uniform distribution

ƒ Split records into 2 databases with both common records and distinct records

Next Steps: Record Linkage Algorithm Evaluation (Not Part of Study) ƒ Employ record linkage algorithms of interest to produce anonymous identifiers for evaluation

Split records into 2 databases with common and distinct  records to allow algorithm evaluation

Database B

Database A Introduce errors with user specified frequencies

Database A with Errors

Introduce errors with user specified frequencies

Evaluation of Record Linkage  Algorithms (Not Part of Project)

Database B  with Errors

METHODS Master Database Creation

Database Test Bed Creation

ƒ Randomly introduce errors in each applicable variable based on the frequency of each type of errors specified by the user

OBJECTIVE

ƒ Create a sufficiently large database of demographic data that allows more robust and reliable evaluation of record linkage algorithms

ƒ Generate male and female records in equal numbers, and produce a system identifier unique for each record to allow algorithm evaluation

Error Introduction

ƒ This study investigated the synthesis of such a database.

To enhance the existing methods in creating a database test bed for record linkage evaluation by developing a PHP program to:

METHODS (Continued)

ƒ Common records across the 2 databases would be used to check if an algorithm produces the same anonymous identifiers as it is supposed to. ƒ Distinct records across the 2 databases would be used to check if an algorithm produces different anonymous identifiers as it is supposed to. ƒ Errors introduced into each database can be used to assess robustness of the algorithm compared to the ideal databases with no errors.

SUMMARY OF CONCLUSIONS

5 Types of Data Entry Errors Character Insertion Richard Ricthard Character Omission Sullivan Sulivan Character Substitution Robert Rodert Character Transposition 55414 55441 Gender Misclassification M F

ƒ A large database test bed is achieved, allowing evaluation of record linkage algorithms Acknowledgment This  project  was  funded  in  part  under  grant  number  UC1 HS16155 from  the  Agency  for  Healthcare  Research and Quality, U.S. Department of Health and  Human Services.

ƒ The demographic data generated reflect real world distribution ƒ Data entry errors were introduced to allow algorithm evaluation of imperfect dataset

Suggest Documents