BMC Medical Informatics and Decision Making

BioMed Central

Open Access

Software

Development and evaluation of an open source software tool for deidentification of pathology reports

Bruce A Beckwith*1,2, Rajeshwarri Mahaadevan2, Ulysses J Balis2,3 and Frank Kuo2,4

Address: 1Department of Pathology, Beth Israel Deaconess Medical Center, 330 Brookline Ave., Boston, MA, USA; 2Department of Pathology, Harvard Medical School, 25 Shattuck Street, Boston, MA, USA; 3Department of Pathology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, USA; 4Department of Pathology, Brigham & Women's Hospital, 75 Francis Street, Boston, MA, USA

Email: Bruce A Beckwith* - [email protected]; Rajeshwarri Mahaadevan - [email protected]; Ulysses J Balis - [email protected]; Frank Kuo - [email protected]

* Corresponding author

Published: 06 March 2006
BMC Medical Informatics and Decision Making 2006, 6:12

doi:10.1186/1472-6947-6-12

Received: 12 December 2005 Accepted: 06 March 2006

This article is available from: http://www.biomedcentral.com/1472-6947/6/12

© 2006 Beckwith et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Electronic medical records, including pathology reports, are often used for research purposes. Currently, there are few freely available programs that remove identifiers while leaving the remainder of the pathology report text intact. Our goal was to produce an open source, Health Insurance Portability and Accountability Act (HIPAA) compliant, deidentification tool tailored for pathology reports. We designed a three-step process for removing potential identifiers. The first step is to look for identifiers known to be associated with the patient, such as name, medical record number, pathology accession number, etc. Next, a series of pattern matches look for predictable patterns likely to represent identifying data, such as dates, accession numbers and addresses, as well as patient, institution and physician names. Finally, individual words are compared with a database of proper names and geographic locations. Pathology reports from three institutions were used to design and test the algorithms. The software was improved iteratively on training sets until it exhibited good performance. 1800 new pathology reports were then processed. Each report was reviewed manually before and after deidentification to catalog all identifiers and note those that were not removed.

Results: 1254 (69.7%) of 1800 pathology reports contained identifiers in the body of the report. 3439 (98.3%) of 3499 unique identifiers in the test set were removed. Only 19 HIPAA-specified identifiers (mainly consult accession numbers and misspelled names) were missed. Of 41 non-HIPAA identifiers missed, the majority were partial institutional addresses and ages. Outside consultation case reports typically contain numerous identifiers and were the most challenging to deidentify comprehensively. There was variation in performance among reports from the three institutions, highlighting the need for site-specific customization, which is easily accomplished with our tool.

Conclusion: We have demonstrated that it is possible to create an open-source deidentification program which performs well on free-text pathology reports.



Table 1: Identifiers that must be removed to deidentify medical data per HIPAA.

Identifier Type
  Names
  Geographic subdivisions smaller than a State *
  All elements of dates (except year)
  All ages over 89 *
  Telephone numbers
  Fax numbers
  Electronic mail addresses
  Social security numbers
  Medical record numbers
  Health plan beneficiary numbers
  Account numbers
  Certificate/license numbers
  Vehicle identifiers and serial numbers, including license plate numbers
  Device identifiers and serial numbers
  Web Universal Resource Locators (URLs)
  Internet Protocol (IP) address numbers
  Biometric identifiers, including finger and voice prints
  Full face photographic images and any comparable images
  Any other unique identifying number, characteristic, or code *

* Additional details are given in the regulation text. Note that these categories refer only to identifiers that concern an individual, their relatives, employers or household members.

Background

The value of studying information contained within the medical records of patients has long been recognized. One of the issues related to using such medical information for research purposes has been protecting patient privacy. Currently, investigators wishing to use medical records for research purposes have three options: obtain permission from the patients, obtain a waiver of informed consent from their Institutional Review Board, or use a data set that has had all (a de-identified data set) or most (a limited data set) of the identifiers removed [1,2]. The Health Insurance Portability and Accountability Act (HIPAA) [2] specifies that a de-identified data set can be created by removal of nineteen specific types of identifiers (see Table 1). These identifiers include names, ages, dates, addresses, and identifying codes of patients, their relatives, household members and employers.

Each year, pathologists in the United States examine millions of tissue samples. This results in the creation and storage of vast numbers of paraffin-embedded tissue samples. These specimens have been examined and characterized by a pathologist, and a large proportion of recent samples have reports available in electronic format. In recognition of this situation, the National Cancer Institute proposed the development of the "Shared Pathology Informatics Network" (SPIN). The SPIN consortium has successfully demonstrated a prototype of a web-based, searchable, peer-to-peer network for identifying and locating pathologic tissue samples at various institutions throughout the country by searching information contained within pathology reports [3]. Since the ultimate goal of this network is to provide researchers throughout the country with access to tissue specimens, it is absolutely necessary to deidentify the contents of the surgical pathology reports that form the core of the information contained within the network.

To date, a number of automated text deidentifiers (scrubbers) have been described [4-11]. We are aware of four reports of systems that have been designed for, or tested upon, pathology reports [4-7]. However, one of the systems is proprietary [4], the second was not available when we started this project [5], the third was designed only for removing proper names [6], and the final system works in part by altering the contents of the text [7]. Therefore, we undertook to develop an open source text scrubber optimized for surgical pathology reports.

Implementation

The software was developed using the following open source tools: Java (Sun Microsystems, San Jose, CA) [12], MySQL (Uppsala, Sweden) [13], and JDOM (JDOM project) [14]. Our source code [see Additional file 2] is freely available at the SPIN website [15] under the GNU General Public License terms [16].

As part of the SPIN project, we defined an XML schema (available at the SPIN website) which can accommodate the various types of information contained within a pathology report.



Figure 1: Example surgical pathology report in SPIN XML format. This is an example of a simple (and fictitious) surgical pathology report which has been converted into the SPIN XML schema and deidentified by our scrubber. The schema supports a great amount of detail, but only a few of the elements are mandatory. This example shows the minimal elements needed to produce a valid XML file that the scrubber will process. The format is somewhat redundant, as the textual portions of the report are included twice: once in the individual text elements and once in the combined full text report element. Note that the original report contained the phrase 'Received in formalin labelled "John Doe"...'. The scrubber replaced "John Doe" with "xxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxx".



This format includes a header portion that contains demographic information about the patient from which the pathology specimen was obtained, such as name, medical record number, date of birth and social security number. In addition, it includes information about the pathology report, such as the accession number and the pathology department that generated the report. The advantage of using this standardized format is that reports from a variety of sources can be transformed into this common format, and then a single scrubber (and other tools, such as an automated UMLS coder) can easily process reports from many different source institutions. An example of a simple surgical pathology report in the minimal SPIN XML format is shown in Figure 1.

The scrubber was designed initially by examining the typical format of pathology reports from the three participating institutions. The scrubber was iteratively tested on small datasets, typically groups of 50–100 reports, until it could reliably remove common identifiers. Reports from all three institutions were used in this testing phase, but most of the reports used came from one institution, Department A.

The algorithmic structure of the scrubber involves three processes. First, the program takes advantage of identifying information that may be present in the header of the file. The scrubber searches for any occurrences of these identifiers in the textual portions of the XML file (i.e. the body of the pathology report) and removes them. At this point the header information could also be completely removed, but in our environment some of this information is required for later processing steps, so these fields are emptied just prior to the case being loaded into the SPIN network.

The second step involves searching for predictable patterns likely to represent identifying data, such as dates, accession numbers, addresses, and proper names that can be found by way of markers such as Dr., MD, PhD, etc. Additionally, names of institutions are identified by their common portions, such as hospital, medical center, clinic, etc. These pattern searches are implemented as 50 "regular expressions" in Java [see Additional file 1] and stored in a separate file as variables. This modular construction allows easy modification of patterns or addition of new patterns to adapt the scrubber to new data sources with different conventions. As would be expected, the order in which the pattern searches are executed plays an important role, since some expressions assume that certain patterns have already been removed.
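As a concrete illustration of this pattern-matching step, the sketch below shows how ordered regular expressions might be applied in Java. The class name and the patterns themselves are simplified examples for illustration only; they are not the 50 expressions distributed with the scrubber.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified illustration of the pattern-matching step. These patterns are
// examples only; the scrubber's own expressions are stored in a separate file
// and can be modified without changing the program code.
public class PatternScrubber {

    // LinkedHashMap preserves insertion order, because the order in which
    // patterns run matters (e.g. dates are removed before bare number patterns).
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<String, Pattern>();
    static {
        // Dates such as 3/6/2006 or 03-06-06
        PATTERNS.put("DATE",
            Pattern.compile("\\b\\d{1,2}[/-]\\d{1,2}[/-]\\d{2,4}\\b"));
        // Accession numbers such as S05-12345
        PATTERNS.put("ACCESSION",
            Pattern.compile("\\b[A-Z]{1,3}\\d{2}-\\d{3,6}\\b"));
        // Proper names flagged by marker words such as Dr.
        PATTERNS.put("PHYSICIAN",
            Pattern.compile("\\bDr\\.?\\s+[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)?"));
        // Institutions flagged by common endings such as Hospital, Medical Center, Clinic
        PATTERNS.put("INSTITUTION",
            Pattern.compile("\\b[A-Z][A-Za-z]+(?:\\s+[A-Z][A-Za-z]+)*\\s+(?:Hospital|Medical Center|Clinic)\\b"));
    }

    // Replace every match with X's plus a note recording which pattern fired.
    public static String scrub(String text) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher m = entry.getValue().matcher(text);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                String mask = m.group().replaceAll("\\S", "X") + " [" + entry.getKey() + "]";
                m.appendReplacement(sb, Matcher.quoteReplacement(mask));
            }
            m.appendTail(sb);
            text = sb.toString();
        }
        return text;
    }
}

For example, scrub("Seen by Dr. Smith at General Hospital on 3/6/2006") would return the sentence with the date, the physician name, and the institution name each replaced by a run of X's tagged with the pattern that matched.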

Following this pattern matching step, the program then compares each word in the file to a database of personal and geographic place names. These names are stored in a MySQL database. The entries were derived from publicly available census lists of frequently occurring names from the 1990 census [17]. The lists contain over 90,000 unique first and last names, which represent about 90% of the last names encountered in the 1990 census. In addition, the US Census Bureau provides a gazetteer file [18] which contains US place names (cities, towns, etc.). This file contains over 16,000 unique place names. These files were combined to form one table with over 101,000 unique entries. At one institution (Department A), this composite name table was augmented with the names of pathologists who were active during the period from which the reports were drawn. No institution-specific pathologist or patient names were added at the other two institutions.
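A minimal sketch of this dictionary lookup is shown below. The class, table, and column names (NameLookup, name_place, entry) are hypothetical placeholders for the combined census and gazetteer table; the distributed code may organize the database differently.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Sketch of the word-by-word dictionary check against the combined census name
// and gazetteer place table. Table and column names are hypothetical.
public class NameLookup {

    private final Connection conn;

    public NameLookup(String jdbcUrl, String user, String password) throws SQLException {
        this.conn = DriverManager.getConnection(jdbcUrl, user, password);
    }

    // True if the word appears in the combined name/place table.
    public boolean isNameOrPlace(String word) throws SQLException {
        String sql = "SELECT 1 FROM name_place WHERE entry = ? LIMIT 1";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, word.toUpperCase());
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }

    // Replace each word found in the table with X's; leave everything else intact.
    public String scrubWords(String text) throws SQLException {
        StringBuilder out = new StringBuilder();
        for (String token : text.split("(?<=\\s)|(?=\\s)")) { // split but keep whitespace
            String word = token.replaceAll("[^A-Za-z]", "");
            if (!word.isEmpty() && isNameOrPlace(word)) {
                out.append(token.replaceAll("[A-Za-z]", "X"));
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }
}

In this sketch, a word such as "Boston" would be masked because it appears in the gazetteer-derived entries, while ordinary report vocabulary would pass through unchanged.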



A corpus of 1800 surgical pathology reports that had not been used previously for development or testing was extracted from the anatomic pathology laboratory information systems of three hospitals, with 600 cases drawn from each institution. The reports were extracted in batches of 200 sequential case reports from three different time periods between 1999 and 2004. The reports from two hospitals included external consult cases; at the third hospital these are separately accessioned with a unique set of numbers, so the extracted series contained no consultation reports. Since pathology departments do not usually retain tissue on external consult cases, there is little point in including consult reports in the SPIN network. The scrubber was therefore designed to segregate consult cases into a separate folder before removing the identifiers. Our analysis includes these consult reports, but when preparing reports for submission to the SPIN network we discard them.

Once the series of cases was extracted, custom programs unique to each institution were used to transform the raw extracts from the native laboratory information system format into XML files consistent with the schema used by the SPIN project. The SPIN schema allows for many different levels of detail to be encoded, but the major textual portions of the pathology report are stored intact in four elements (clinical information, gross description, diagnosis text and full text report). There are also separate elements for addendum text, electron microscopy findings, etc. Only surgical pathology case reports were used, but these reports included the results of special studies such as immunohistochemistry, flow cytometry, and immunofluorescence studies. No cytology or autopsy reports were included.

The scrubber software was installed on a personal computer running Microsoft Windows® at each institution. The 600 individual XML files were processed by the scrubber in a batch. The scrubber searches through each text field and replaces any suspect words or text strings with a series of X's and a notation as to what type of information was presumably removed. The scrubbed file is saved under a new name so that the original file is not altered. In addition, the scrubber produces a log file listing each character string that is removed from each file, along with a reference to the reason that the string was removed (e.g. the type of the particular pattern match that led to the removal).

Following deidentification, the scrubbed files were manually reviewed by one of the pathologists (BB, UB, FK) in order to find and classify identifiers that were not removed by the scrubber. In addition, each original report was manually reviewed to count the total number of identifiers present in the text portions and to clarify any questions regarding whether a removed phrase was an identifier. The types of information that were counted as identifiers included the HIPAA-enumerated list of identifiers (see Table 1) as well as the following information that is not required to be removed in a de-identified data set under HIPAA: all ages (not just those over 89 years), all dates including the year, states and countries, and names of health care providers and health care organizations such as hospitals, medical laboratories, etc. An identifier was considered removed if a large enough portion of it was removed to make it very unlikely to be identifiable from the remaining fragment of text. For example, if "Jones Medical Associates" was present in the report and what remained after scrubbing was only "Medical Associates", this was considered a successful removal. Misspelled proper names (e.g. Smiht) were counted as identifiers since the correct spelling could usually be guessed (Smith). The logical portions of addresses (street number and name, city, state, country) were counted as separate identifiers for purposes of our analysis. Only the identifiers that were not removed by the software were classified as being either a HIPAA identifier or a non-HIPAA identifier.

Results

Figure 2: Average number of unique identifiers present per report, by department (A, B, C) and case type (in-house, consult, all cases). Note that Department B did not have any consult reports included in the sample.

The 1800 cases included 1559 (86.6%) in-house surgical pathology case reports and 241 (13.4%) external consultation case reports. 1254 (69.7%) reports contained at least one identifier. There were a total of 4515 identifiers present, including 3499 that were unique within a given report. The average number of identifiers was 2.5 per case, with 1.9 unique identifiers per case on average. The frequency of identifiers varied markedly by case type: all 241 (100%) external consults contained at least one identifier, compared with only 1013 of 1559 (65%) in-house case reports (p < 0.001, Chi-square). The percentage of reports containing identifiers also varied between pathology departments, with Departments A, B and C having 69.2%, 39.8% and 100% of cases with identifiers, respectively (p < 0.001, Chi-square). Additionally, the number of identifiers present in a report varied by both type of report and source institution (see Figure 2).

Since the presence of multiple instances of the same identifier in the same case report is unlikely to provide any additional information, the remainder of the analysis focuses solely on the identifiers that were unique within a given report. For example, if the name "Jones" appeared twice in the text of three different reports in our test set, we ignored the second occurrence in each report and counted "Jones" as one identifier present in each report, for a total of three identifiers, not six.

The deidentification software successfully removed the vast majority of identifiers. Of the 3499 unique identifiers present, 3439 (98.3%) were removed (see Table 2). Of the 60 identifiers that were not removed, 19 were HIPAA-specified identifiers. Table 3 provides a breakdown of missed identifiers by type. Comparing performance by source institution, the percentage of unique identifiers removed varied between 94.7% and 99.0% (see Table 2); this difference was statistically significant (p < 0.001, Chi-square).

Another way to look at the data is to consider in-house and external consult cases separately. The 241 consult cases contained slightly more than half of the identifiers (1809 of 3499, 51.7%). As Table 3 shows, the majority of missed identifiers, 36 of 60 (60%), were in consult case reports, which is not surprising since these types of reports often contain listings of slide labels, accession numbers, outside report numbers, and patient identifiers that are dictated directly into the text of the reports. Consult reports also contained the majority of the HIPAA-specified identifiers that were missed, 12 of 19 (63%).
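As an illustration of how the department comparison can be checked from the counts reported in Table 2, the sketch below computes the 3 × 2 chi-square statistic for reports with and without identifiers. The class name is arbitrary and this check is not part of the scrubber itself.

// Illustrative verification of the department comparison (reports with vs.
// without identifiers) using the counts in Table 2.
public class ChiSquareCheck {

    public static double chiSquare(long[][] observed) {
        int rows = observed.length, cols = observed[0].length;
        long[] rowTotals = new long[rows];
        long[] colTotals = new long[cols];
        long grand = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                rowTotals[r] += observed[r][c];
                colTotals[c] += observed[r][c];
                grand += observed[r][c];
            }
        }
        double statistic = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double expected = (double) rowTotals[r] * colTotals[c] / grand;
                statistic += Math.pow(observed[r][c] - expected, 2) / expected;
            }
        }
        return statistic;
    }

    public static void main(String[] args) {
        // Rows: Departments A, B, C; columns: reports with / without identifiers.
        long[][] counts = { {415, 185}, {239, 361}, {600, 0} };
        // Prints roughly 514; with 2 degrees of freedom this far exceeds the
        // critical value of about 13.8 for p = 0.001, consistent with the text.
        System.out.println(chiSquare(counts));
    }
}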



Table 2: Performance summary of the deidentification software

                                                 Dept. A   Dept. B   Dept. C     Total
Reports                                              600       600       600      1800
Reports with any identifier                          415       239       600      1254
Unique identifiers                                  1079       338      2082      3499
Unique identifiers per report                        1.8       0.6       3.5       1.9
Unique identifiers removed                          1057       320      2062      3439
Unique identifiers remaining, total                   22        18        20        60
Unique HIPAA identifiers remaining                    11         1         7        19
% Unique identifiers removed                       98.0%     94.7%     99.0%     98.3%
Unique over-scrubs                                  1126       961      2584      4671
Unique over-scrubs per report                        1.9       1.6       4.3       2.6
% unique phrases removed that were identifiers     48.4%     25.0%     44.4%     42.4%

Names are one of the identifiers of greatest concern, since these provide a fairly high level of identification on their own without reference to any other source of information. Fortunately, in many institutions patient names are only rarely included in the body of the report. However, some institutions have gross dictation policies that include documenting the name that is present on specimen containers (e.g. "Received in formalin labeled 'Mary White, left fallopian tube' is a piece of pink tan soft tissue..."). When the name is spelled correctly and it matches either a name in the header information of the report or one of the words in the proper name database, it can be successfully removed. However, when the name is misspelled, we must rely upon other clues, such as inclusion in quotation marks, capitalization, proximity to a correctly spelled proper name, or proximity to a marker word such as "labeled". As evident in Table 3, all correctly spelled names were removed, but seven misspelled names were not. Four of these were first names and three were last names. All seven misspelled names came from the reports of one hospital, indicating that this is an issue related to the style of reporting at that institution.
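One way to express the quotation and marker-word clue described above is sketched below. The class name is hypothetical and the pattern is an illustration, not the expression used in the distributed scrubber; it deliberately masks the entire quoted label, which may over-scrub specimen descriptors that accompany the name.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the marker-word heuristic: text inside quotation marks
// following "labeled"/"labelled" is treated as a potential patient name, even
// when it is misspelled and therefore absent from the name database.
public class LabelHeuristic {

    private static final Pattern LABELED_QUOTE = Pattern.compile(
        "(labell?ed\\s+\")([^\"]+)(\")", Pattern.CASE_INSENSITIVE);

    public static String scrub(String text) {
        Matcher m = LABELED_QUOTE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Mask every non-space character of the quoted label.
            String mask = m.group(2).replaceAll("\\S", "X");
            m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + mask + m.group(3)));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}

With this sketch, a phrase such as 'Received in formalin labeled "Smiht, left ovary"' would have the quoted contents masked even though the misspelled name "Smiht" does not appear in the name table.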

Numbers are another source of concern, since some types of numbers can be uniquely identifying when cross-referenced, such as Social Security numbers or medical record numbers. Many such numeric identifiers are easily dealt with, since the length and form may be known in advance (e.g. Social Security numbers, zip codes and phone numbers). Medical record numbers can usually be removed easily, since they tend to be similar from institution to institution. In this trial, there was one medical record number missed, from an in-house case. The number was missed because it had an unusual spacing pattern dividing the seven-digit number into three groups of two and three digits. This particular type of error should be easy to rectify by altering the pattern matching algorithm to take account of common spacing characters such as dash, period, space, slash, backslash, etc.
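The suggested correction can be expressed as a separator-tolerant pattern, sketched below. The class and constant names are illustrative, not the production expression; the seven-digit length follows the example in the text.

import java.util.regex.Pattern;

// Illustrative pattern for a seven-digit medical record number that tolerates a
// common separator (dash, period, space, slash, or backslash) between digits,
// so that forms such as "12 34 567" or "12-34-567" are still recognized.
public class MrnPattern {

    public static final Pattern SEVEN_DIGIT_MRN =
        Pattern.compile("\\b\\d(?:[-./\\\\ ]?\\d){6}\\b");

    public static boolean looksLikeMrn(String token) {
        return SEVEN_DIGIT_MRN.matcher(token).find();
    }
}

Both "1234567" and "12 34 567" match this pattern, while an unbroken ten-digit number does not, because the trailing word boundary fails when another digit follows.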

Accession numbers provide a more complicated problem than medical record numbers. In our sample, there were 10 accession numbers that were not removed, all of which

Table 3: Summary of identifiers that were not removed.

Identifier
  Accession number
  Patient Name
    Misspelled
    Correctly spelled
  Medical record number
  Date
  HIPAA subtotal
  Institution address, partial
  Age
