Downloaded from jamia.bmj.com on May 24, 2013 - Published by group.bmj.com
Research and applications
A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation

Erel Joffe,1 Michael J Byrne,1 Phillip Reeder,1 Jorge R Herskovic,1,2 Craig W Johnson,1 Allison B McCoy,1,3 Dean F Sittig,1,3 Elmer V Bernstam1,4

▸ Additional material is published online only. To view please visit the journal online (http://dx.doi.org/10.1136/amiajnl-2013-001744).

1School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
2Office of the Senior VP of Academic Affairs, MD Anderson Cancer Center, Houston, Texas, USA
3Memorial Hermann Center for Healthcare Quality and Safety, The University of Texas Health Science Center at Houston, Houston, Texas, USA
4Division of General Internal Medicine, Department of Internal Medicine, Medical School, The University of Texas Health Science Center at Houston, Houston, Texas, USA

Correspondence to Dr Elmer Bernstam, School of Health Information Sciences, The University of Texas Health Science Center at Houston, 7000 Fannin, Suite 600, Houston, TX 77030, USA;
[email protected]

Received 20 February 2013
Revised 23 April 2013
Accepted 27 April 2013
ABSTRACT

Introduction Clinical databases require accurate entity resolution (ER). One approach is to use algorithms that assign questionable cases to manual review. Few studies have compared the performance of common algorithms for such a task. Furthermore, previous work has been limited by a lack of objective methods for setting algorithm parameters. We compared the performance of common ER algorithms using algorithmic optimization, rather than manual parameter tuning, on two-threshold classification (match/manual review/non-match) as well as single-threshold classification (match/non-match).

Methods We manually reviewed 20 000 randomly selected, potential duplicate record-pairs to identify matches (10 000 training set, 10 000 test set). We evaluated the probabilistic expectation maximization, simple deterministic, and fuzzy inference engine (FIE) algorithms. We used particle swarm optimization to set algorithm parameters for a single threshold and for two thresholds. We ran 10 iterations of optimization using the training set and report averaged performance against the test set.

Results The overall estimated duplicate rate was 6%. The FIE and simple deterministic algorithms required a smaller manual review set than the probabilistic method (FIE 1.9%, simple deterministic 2.5%, probabilistic 3.6%; p
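The two-threshold classification scheme described in the abstract can be illustrated with a minimal sketch: each record-pair receives a similarity score, and two thresholds partition pairs into automatic matches, automatic non-matches, and a middle band routed to manual review. The scores and threshold values below are illustrative assumptions, not the parameters or scoring functions used in the study.

```python
def classify(score, lower, upper):
    """Two-threshold classification of a candidate record-pair.

    Pairs scoring at or above `upper` are auto-matched, pairs
    scoring below `lower` are auto-non-matched, and everything
    in between is routed to manual review. Thresholds here are
    illustrative; in the study they were set by particle swarm
    optimization rather than chosen by hand.
    """
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "manual review"

# Illustrative similarity scores in [0, 1] for three record-pairs
scores = [0.95, 0.40, 0.72]
labels = [classify(s, lower=0.55, upper=0.90) for s in scores]
# labels -> ['match', 'non-match', 'manual review']
```

Under this scheme, shrinking the manual review band (the gap between the two thresholds) reduces reviewer workload, which is the quantity the abstract compares across algorithms.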