Data Quality Challenges in Healthcare Claims Data

ORNL/TM-2014/147

Data Quality Challenges in Healthcare Claims Data: Experiences and Remedies

April 2014

Prepared by
Sreenivas R. Sukumar¹, Natarajan Ramachandran², Regina K. Ferrell¹

¹ Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN 37831
² Tennessee Technological University, Cookeville, TN 38505

DOCUMENT AVAILABILITY

Reports produced after January 1, 1996, are generally available free via the U.S. Department of Energy (DOE) Information Bridge.
  Web site: http://www.osti.gov/bridge

Reports produced before January 1, 1996, may be purchased by members of the public from the following source.
  National Technical Information Service
  5285 Port Royal Road
  Springfield, VA 22161
  Telephone: 703-605-6000 (1-800-553-6847)
  TDD: 703-487-4639
  Fax: 703-605-6900
  E-mail: [email protected]
  Web site: http://www.ntis.gov/support/ordernowabout.htm

Reports are available to DOE employees, DOE contractors, Energy Technology Data Exchange (ETDE) representatives, and International Nuclear Information System (INIS) representatives from the following source.
  Office of Scientific and Technical Information
  P.O. Box 62
  Oak Ridge, TN 37831
  Telephone: 865-576-8401
  Fax: 865-576-5728
  E-mail: [email protected]
  Web site: http://www.osti.gov/contact.html

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

ORNL/TM-2014/147

Data Quality Challenges in Healthcare Claims Data: Experiences and Remedies

Sreenivas R. Sukumar Natarajan Ramachandran Regina K. Ferrell

April 2014

Prepared by OAK RIDGE NATIONAL LABORATORY Oak Ridge, Tennessee 37831-6283 managed by UT-BATTELLE, LLC for the U.S. DEPARTMENT OF ENERGY under contract DE-AC05-00OR22725

CONTENTS

ABSTRACT
1. INTRODUCTION
2. SOURCES, CONSEQUENCES AND REMEDIES
   2.1 Relevance and Context
   2.2 Entry Errors (Manual, Man-Machine Interface, Software Tools)
   2.3 Diversity and Evolving Standards
   2.4 Data Staging
   2.5 Entity Resolution
3. CONCLUSIONS
4. REFERENCES


ABSTRACT

The current trend in Big Data Analytics, and in health information technology in particular, is toward building sophisticated models, methods and tools for business, operational and clinical intelligence; but the critical issue of the data quality required by these models is not getting the attention it deserves. The objective of this report is to highlight the issues of data quality in the context of Big Data healthcare analytics. The common sources of errors, the consequences of these errors and potential solutions to mitigate them are discussed in the healthcare context.

1. INTRODUCTION

Today, Big Data's key dimensions of volume, velocity and variety are driving the development of scalable storage infrastructure, algorithms, software tools and newer models for analytics. From enterprise practice, we are realizing that value and veracity are emerging dimensions of Big Data [1]. Value refers to the cost-benefit to the decision maker through the ability to take meaningful action based on insights derived from data. Veracity is defined by Webster [2] as "conformity with truth or fact". In the context of Big Data, veracity refers to any source that influences accuracy and/or introduces uncertainty into the inference from data, such as inconsistencies, missing data, ambiguities, deception, fraud, duplication, spam and latency. In that sense, data quality is subsumed under the definition of veracity.

The five V's of Big Data are all inter-related. The link between data veracity and value is direct and clear, i.e., GIGO (Garbage In, Garbage Out). With the other V's the relationships may be subtle and less obvious. For instance, volume can mask bad data quality; velocity can rapidly propagate poor quality; and variety can create data-context ambiguities.

From a financial perspective, the healthcare sector accounts for a significant share of the United States GDP, and there are tremendous opportunities for Big Data Analytics to impact the productivity and quality of the healthcare sector [3, 4]. To name a few, the expected roll-out of value-based purchasing, changes to business and clinical models, and realizing efficiencies through smarter delivery of care all rely on analytical insights gleaned from meaningful analysis of healthcare data. Researchers, analytics vendors and software developers who prepare and deploy sophisticated infrastructure and organization-specific intelligence tools for clinicians and healthcare policy-makers assume that data quality is assured by the data-providing organization. Data quality is often taken for granted. Unlike the manufacturing sector, where market expectations drive the quality of a product, market forces in the Big Data industry have not imposed a similar standard on data quality. This is particularly true for healthcare data.

We illustrate a typical lifecycle of health data and its use in the healthcare domain in Figure 1. In Section 2, we highlight the factors that contribute to low veracity and their consequences in the health-data lifecycle. The sources of errors we discuss are not statistical errors due to sampling; our focus is on non-sampling sources of errors. We recognize that non-sampling errors are difficult to quantify as +/- margin-of-error estimates, but the statistical sampling topic, important as it is, is not addressed in this report.


Fig. 1: The lifecycle of health data and the potential sources of data quality errors in that lifecycle. We dig deeper into these quality errors, with some real-world examples, in Section 2.

2. SOURCES, CONSEQUENCES AND REMEDIES

An assessment of data quality in healthcare has to (1) address problems arising from errors or inaccuracies in the data itself and (2) consider the source or sources of the data and how the purpose and business model of their collection impact the analytic processing and the knowledge expected to be extracted from them. A consideration of errors and inaccuracies in the data would include data entry errors, errors arising from transformations during the extract-and-transform process for analytics, missing data, etc. An examination of the source(s) of the data can reveal limitations and concerns as to its appropriateness for the type of analysis being performed (e.g., financial data used to evaluate treatments), variations due to data merged from two different business models, variations due to entity and identity disambiguation (or a lack thereof), and variations due to changing and merging business models.

In addition, data veracity issues can arise from attempts to preserve privacy and follow HIPAA guidelines, where obscuring data is intentional. Data veracity is also a function of how many sources contributed to the data collection process and their similarities and differences. Datasets integrated from multiple sources are often characterized by different levels of data quality, and the overall quality of the integrated data can degrade to the lowest level of quality among the contributing sources.

In this section of the report, we broadly classify the sources that contribute to questionable veracity in healthcare into five categories and describe, with examples, some common modes of quality corruption that occur in real healthcare data.


2.1 RELEVANCE AND CONTEXT

Analysts often fall prey to the uncritical acceptance of data without quantifying its accuracy and consistency. Failure to recognize the interplay between the settings in which data is collected and the settings to which it can be unambiguously applied is often a root cause of misleading insights. Ensuring that there is no misrepresentation with respect to the context in which insights are extracted is a major challenge in healthcare. As illustrated in Figure 1, the majority of the healthcare data we collect is for financial accounting and legal/federal regulation purposes. While health datasets may exceed expectations for the financial mission, they may not meet the stringent quality requirements of clinical or epidemiological research. The completeness and accuracy of key data fields depend on the purpose of the data collection effort. Data collected for financial reimbursement alone may not be the best data for inferring clinical diagnosis and care information. Conversely, data collected for a clinical trial may not be indicative of typical costs of treatment options. Although, for the sake of efficiency, analysts would like to leverage available datasets for newer applications, the resources needed to collect new data should be balanced against the risk of existing data being irrelevant or out of context.

One solution to the out-of-context problem is to encourage the creators and the users of the data to develop a deep understanding of the phenomenon and the context that give rise to the data, and to document that understanding as meta-data supporting the data for future use. In the age of Big Data and massive databases, it has become difficult for an individual analyst to develop, maintain and share a quality understanding of data from several sources. However, commercially available knowledge management tools [5, 6] that archive meta-data in machine-readable and human-searchable formats can enable analysts to overcome this hurdle.
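As a minimal sketch of what such machine-readable meta-data might look like, the snippet below records the collection context of a hypothetical claims extract as a JSON "context card" that travels with the data. The field names, file names and limitation figures are illustrative assumptions, not a standard.

```python
import json
from datetime import date

# Hypothetical "context card" stored alongside a claims extract so future
# analysts can judge whether the data fits their intended use.
dataset_context = {
    "dataset": "inpatient_claims_2013_extract",          # assumed table/file name
    "collected_for": "financial reimbursement",           # original business purpose
    "collection_period": {"start": "2013-01-01", "end": "2013-12-31"},
    "source_systems": ["hospital_A_billing", "hospital_B_billing"],
    "known_limitations": [
        "diagnosis codes entered for billing, not clinical research",
        "race field optional; a sizable share of records leave it blank",
    ],
    "appropriate_uses": ["cost analysis", "utilization reporting"],
    "questionable_uses": ["clinical outcome studies without chart review"],
    "documented_on": date.today().isoformat(),
}

# Persist the meta-data next to the data so it travels with the extract.
with open("inpatient_claims_2013_extract.context.json", "w") as f:
    json.dump(dataset_context, f, indent=2)
```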

2.2 ENTRY ERRORS (MANUAL, MAN-MACHINE INTERFACE, SOFTWARE TOOLS)

We would like to highlight a key difference between data entry mechanisms in Big Data generally and in Big Data for healthcare. The process of data creation in academic and commercial settings outside of healthcare is highly automated, such as the recording and ordering of key strokes for website access, financial transactions, etc. In the healthcare world, however, most of the collection, integration and organization of data is manual. The critical difference is that data is input by humans, who can both intentionally and unintentionally introduce systematic data errors. Incorrect entry of a name, address, or key identifier field such as a social security number or insurance id can lead to ambiguous data records that are attributed to the wrong person, or to multiple records for a single person. When the data provided is biased through omission or inaccurate entries, correlations may be found that are inaccurate, or important relationships may be missed [7].

Though data may be input via forms that attempt to limit data errors where possible, for example by providing drop-down menus wherever applicable, users can still accidentally pick the wrong selection. Forms may have selections that pre-populate parts of the form in ways that are not always accurate but that do not get corrected. Certain data fields may be regularly used and populated by some practitioners or clerks while the same fields are not routinely populated by others. Forms may also allow optional fields for entry. In such cases, the completeness aspect of data quality is not well defined.

A possible solution to data entry errors is automation and mistake-proofing. Simpler user interfaces for data entry and domain-specific rule engines for error checks have reduced data entry errors (a minimal rule-check sketch appears at the end of this subsection). Some commercial tools are SAS DataFlux [8], Informatica's data quality tools [9] and Stanford's Data Wrangler [10, 11]. Open source data quality check tools are also available [12].


On the other hand, we have observed automation creating data redundancies. While making entries into electronic health records, physicians often use templates or copy-paste commands to generate text that will comply with the guidelines and regulations of insurers such as Medicare. This practice can obscure variations across patient records that could be valuable for clinical discoveries. Data generated by automated software with auto-fill options, speech-to-text converters, and optical-character-recognition devices that digitize health data, all of which are common in practice, can produce systematic and random errors that vary from person to person and tool to tool and are hard to quantify and avoid. Furthermore, there can be ambiguity in the semantic description of diagnoses and procedures across physicians, and in how billing agents encode those semantic variations in claim forms. When reimbursements are at stake, there can even be an incentive to deliberately distort the data. For instance, a physician may note treatment for herpes, and the billing agent is highly likely to code it as treatment for herpes-2 rather than herpes-1, which may not be reimbursed.
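The sketch below illustrates the kind of domain-specific rule engine mentioned above: a handful of per-field checks applied at entry time so that implausible values are flagged before they enter the claims database. The field names, formats and thresholds are assumptions for illustration, not a standard claim layout.

```python
import re
from datetime import datetime

# Illustrative per-field rules; each returns True when the value looks plausible.
RULES = {
    "ssn":           lambda v: bool(re.fullmatch(r"\d{3}-\d{2}-\d{4}", v)),
    "gender_code":   lambda v: v in {"M", "F", "U"},
    "date_of_birth": lambda v: 1900 <= datetime.strptime(v, "%m/%d/%Y").year
                                    <= datetime.now().year,
    "billed_amount": lambda v: 0 < float(v) < 1_000_000,
}

def validate_entry(record: dict) -> list[str]:
    """Return a list of human-readable problems found in a single entry."""
    problems = []
    for field, rule in RULES.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"{field}: missing")
            continue
        try:
            if not rule(value):
                problems.append(f"{field}: failed rule check ({value!r})")
        except (ValueError, TypeError):
            problems.append(f"{field}: unparseable value ({value!r})")
    return problems

# Example: a record with a mistyped SSN and an implausible birth date.
print(validate_entry({
    "ssn": "123-45-678",            # one digit short
    "gender_code": "F",
    "date_of_birth": "06/30/1802",  # unrealistic, as discussed in Section 2.4
    "billed_amount": "250.00",
}))
```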

2.3 DIVERSITY AND EVOLVING STANDARDS

Healthcare data often relies on several referential codebooks: for example, race and gender codes in beneficiary datasets; medical practice taxonomy and specialty codes in provider datasets; diagnosis and procedure codes (CPT-HCPCS, ICD, etc.) in claim datasets; and NDC codes on prescription drug events. While some of these codebooks may be standardized, codebooks can also be specific to a data system. Medical datasets may use several standards at once, some of them 50 or more years old. Suppose a decision maker poses the question, "What is the distribution by race of patients undergoing heart surgery?" The corruption of the data because of the variety in codebooks and standards can make it impossible to produce a reliable answer. To illustrate, consider the following complicating issues for such an analysis: (1) Hospital A uses a codebook with nine race codes, while Hospital B uses only five; (2) Hospital C could be using ICD-9 while some clinics have transitioned to ICD-10; and (3) old software systems still use HCPCS codes for procedure claims although most insurers prefer the ICD system. In addition, some legacy systems may not have kept up with evolving standards, and even modern systems may not be flexible enough to incorporate them. A solution, albeit an expensive one, is commercially available data products and services that ensure computability and equivalence by mapping diverse codes to the extent possible [13]. The ideal solution to prevent this issue would be the capability to look up a date-indexed, centralized repository for every codebook in the healthcare universe.
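The sketch below shows, in a deliberately simplified way, what harmonizing source-specific codebooks can look like: local race codes from two hypothetical hospitals are mapped onto a shared label set, and the expected diagnosis standard is selected by date. The local code values, hospital names and the cutover date are invented assumptions; in practice the mappings would come from each source system's documented codebook and payer rules.

```python
# Hypothetical local race codebooks, mapped to a shared label set.
RACE_MAP = {
    "hospital_A": {"1": "White", "2": "Black", "3": "Asian", "4": "Native American",
                   "5": "Pacific Islander", "6": "Other", "7": "Multiracial",
                   "8": "Declined", "9": "Unknown"},
    "hospital_B": {"1": "White", "2": "Black", "3": "Asian", "4": "Other",
                   "5": "Unknown"},
}

def harmonize_race(source: str, local_code: str) -> str:
    """Map a source-specific race code to a shared label, flagging gaps."""
    return RACE_MAP.get(source, {}).get(local_code, "UNMAPPED")

def diagnosis_code_system(claim_date: str) -> str:
    """Pick the expected diagnosis standard by date (simplified; actual
    transition dates and per-payer rules vary)."""
    return "ICD-10" if claim_date >= "2015-10-01" else "ICD-9"

print(harmonize_race("hospital_B", "4"))     # 'Other'
print(harmonize_race("hospital_A", "4"))     # 'Native American' -- same code, different meaning
print(diagnosis_code_system("2014-04-01"))   # 'ICD-9'
```

The same code value ("4") meaning different things in different source systems is exactly the kind of ambiguity that makes a date-indexed, centralized codebook repository attractive.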

2.4 DATA STAGING ERRORS

Serious quality errors can occur in the pre-processing and staging of data for analysis. Data staging can involve data migration, integration, machine-to-machine translation and database-to-database conversion. Decisions made during the extract, transform, load (ETL) process, such as using metric versus British units, or allowing a key cost field to be left blank (which could be encoded in a legacy system as '88888' or '99999'), can have downstream ramifications in the analytics workflow. When data has to go through multiple ETL processes in the business workflow, relationships between entities (e.g., patient-claim, patient-provider and provider-claim) can be lost or corrupted.

The ETL process during data integration from multiple sources can also propagate errors. As electronic submissions are accepted and merged from different organizations, or even from different sources in the same organization, certain data cleaning and transformation operations are initiated to prepare data for storage and analysis. For example, if a field is supposed to hold a date, checks are made that the data supplied is of the proper size, value and format to translate to a valid date that conforms to the constraints of the new database. Dates can be observed that are unrealistic (e.g., 6/30/1802 for a date of birth), that are strings of characters that do not form a valid date, or that are simply expiration dates used to represent an open-ended time in the future, such as 12/31/2099. If the data does not translate to a valid date, the ETL process may force the field value to a predetermined value or leave it blank. In some situations, where the date is a key field, the entire record may be rejected and flagged. If an analyst is looking for data that occurred within a date range and that data is not available for a large number of records (lost during the ETL process), the impact on the analysis and its conclusions can be substantial.

Here is another situation that can occur during integration of claims data from two hospitals. Two hospitals handling two different payers (Medicare, Medicaid, BlueCross, etc.) that use the same standardized structure for filing claim forms may have different adjudication processes. Although the data and its organization may look similar post-integration, the system has to account for the fact that the adjudication processes are not; otherwise it cannot ascertain whether there are duplicate claims, which has financial implications.

Our recommendation, to avoid error propagation and to maintain and improve data quality throughout the analytical workflow, is to check the data for anomalies and outliers by computing summary statistics, maxima and minima at every ETL step. The quality inspection can be done using logical constraints derived from interaction with subject matter experts. Trained quality analysts can then use interactive graphics and exploratory data analysis (EDA) tools such as stem-and-leaf and box plots to look at the data in different ways, and flag outliers for further investigation and action. Effective inspection of data using these tools requires knowing what to look for and how to recognize anomalies; these skills relate back to familiarity with, and understanding of, the data generation process.
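A minimal sketch of such an ETL-step check is shown below: rather than silently coercing bad dates, it profiles one date column, counting legacy sentinel values, open-ended placeholder dates and unparseable or implausible values, and reports the observed minimum and maximum. The field name, date format and placeholder conventions are assumptions drawn from the examples above.

```python
from datetime import datetime

SENTINELS = {"88888", "99999", ""}     # legacy "blank" placeholders discussed above
OPEN_ENDED = {"12/31/2099"}            # "far future" expiration convention

def check_date_field(values, field_name="date_of_service"):
    """Profile one date column after an ETL step instead of silently coercing it."""
    report = {"valid": 0, "sentinel": 0, "open_ended": 0, "invalid": 0,
              "min": None, "max": None}
    parsed = []
    for v in values:
        v = (v or "").strip()
        if v in SENTINELS:
            report["sentinel"] += 1
        elif v in OPEN_ENDED:
            report["open_ended"] += 1
        else:
            try:
                d = datetime.strptime(v, "%m/%d/%Y")
                if d.year < 1900 or d > datetime.now():
                    report["invalid"] += 1       # e.g., 6/30/1802 or future service dates
                else:
                    report["valid"] += 1
                    parsed.append(d)
            except ValueError:
                report["invalid"] += 1
    if parsed:
        report["min"], report["max"] = min(parsed).date(), max(parsed).date()
    print(f"{field_name}: {report}")
    return report

# Example run over a small illustrative batch.
check_date_field(["06/30/1802", "01/15/2013", "99999", "12/31/2099", "02/30/2013"])
```

A summary like this, produced at every ETL step, gives the analyst a record of how many values were dropped or altered before conclusions are drawn from the staged data.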


2.5 ENTITY RESOLUTION

Healthcare data involves a complex web of entities such as providers, patients, payers and their policies. There is a critical need to know and track every entity within the system with a high degree of confidence. Often referred to as the identity disambiguation problem, it is one of the major, if not the toughest, data quality challenges in healthcare. Accurate association of the healthcare episodes of a patient who may be visiting multiple healthcare providers is absolutely essential to documenting and retrieving a complete history of health-related events. For example, if a patient happens to have two near-identical records in the system, as John Doe and John H. Doe, and different care episodes get assigned to each record, neither identity will provide a complete record of the patient's health history. This can have serious consequences from a care perspective: a majority of medical errors and lawsuits center around complications arising from incomplete patient history. Another consequence of low data veracity with respect to identity disambiguation is allowing fraudulent providers to hide within the system using multiple identities. Fraud detection software will not be able to find such providers because the suspicious activity can be masked as multiple instances of normal activity, while in reality it is the work of one greedy individual.
Entity resolution products such as IBM's Initiate [14] and Informatica's Master Data Management platform [15] can identify records from different sources that represent the same real-world entity. These commercially available packages perform a probabilistic match on an entity's key data (e.g., date of birth, social security number, address, phone, etc.) to discover rules that define when two records can be linked or merged with confidence. Maintaining an active master data management solution to track patients, providers and changing health insurance coverage can resolve entity resolution ambiguities.
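As a simplified illustration of this idea, and not of any commercial product's actual algorithm, the sketch below scores candidate duplicates by a weighted similarity over a few key identity fields. The weights, fields and threshold are invented for illustration; production systems use far richer probabilistic models tuned on labeled pairs.

```python
from difflib import SequenceMatcher

# Illustrative weights over key identity fields (assumptions, not tuned values).
WEIGHTS = {"name": 0.4, "date_of_birth": 0.35, "address": 0.25}

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted similarity over the key identity fields."""
    return sum(w * similarity(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in WEIGHTS.items())

a = {"name": "John Doe",    "date_of_birth": "03/04/1961", "address": "12 Oak St, Oak Ridge TN"}
b = {"name": "John H. Doe", "date_of_birth": "03/04/1961", "address": "12 Oak Street, Oak Ridge, TN"}

score = match_score(a, b)
print(f"match score = {score:.2f}")
if score > 0.85:                      # threshold would be tuned against reviewed pairs
    print("candidate duplicate: route for merge review")
```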


3. CONCLUSIONS

We have discussed major sources of data quality errors in healthcare, presented the potential consequences of quality issues for the insights drawn from analytics, and proposed best practices and tools that address these issues. We note that the sources of errors discussed in this report are not exhaustive; other potential sources of data quality problems that are more technical and domain-specific were considered beyond the scope of this report. Our conclusions are based purely on our experience working with health data analysts and handling health data to build analytical applications.

The value derived from the use of analytics should dictate the requirements for data quality. Based on this premise, healthcare enterprises embracing Big Data should have a roadmap for organizing their efforts toward a systematic approach to data quality.



Today, data quality issues are diagnosed and addressed in a piecemeal fashion. We recommend a data lifecycle approach, which is more appropriate to the dimensions of Big Data and fits the different stages of the analytical workflow.



It may be noted that commercial tools for data quality assessment and management can be expensive, but open source alternatives are available. The open source alternatives take longer and need dedicated resources for deployment.



Automation in the form of data handling, storage, entry and processing technologies should be viewed as a double-edged sword. At one level, automation can be a good solution, while on the other hand it can create a different set of data quality issues.



Healthcare data quality problems can be so specific that organizations might have to build their own custom software or data quality rule engines. Commercial software tools may still need use-case-specific work.



Analytical insights in healthcare should always be probed for data quality problems. We emphasize this because most analytical tools assume that the data is of very high quality; we do not yet have analytical algorithms that are robust to uncertainty in the data.

This report documents our experience with poor data quality in practice on real healthcare claims data. We touched upon several problem areas an analyst conducting data analysis should be wary of, while also identifying potential opportunities for research in data quality assessment. Our future efforts will target these opportunities.


4. REFERENCES

1. Manyika, James, et al., "Big Data: The next frontier for innovation, competition, and productivity," McKinsey Global Institute, May 2011.
2. Merriam-Webster, The Merriam-Webster Dictionary, Merriam-Webster, Incorporated, 2006.
3. Groves, Peter, et al., "The 'Big Data' revolution in healthcare," McKinsey Quarterly, 2013.
4. Murdoch, Travis B., and Allan S. Detsky, "The inevitable application of Big Data to health care," JAMA 309.13 (2013): 1351-1352.
5. Copperman, Max, et al., "System and method for implementing a knowledge management system," U.S. Patent No. 7,401,087, 15 Jul. 2008.
6. http://www-03.ibm.com/software/products/en/category/enterprise-content-management (Accessed April 1, 2014)
7. Adler-Milstein, Julia, et al., "Healthcare's 'Big Data' Challenge," The American Journal of Managed Care 19.7 (2013): 537-538.
8. SAS Institute, DataFlux in SAS Data Integration Studio 4.3: User's Guide, SAS Publishing, SAS Institute, 2011.
9. http://www.informatica.com/us/products/data-quality/ (Accessed April 1, 2014)
10. Kandel, Sean, et al., "Wrangler: Interactive visual specification of data transformation scripts," in Proceedings of the 2011 ACM Conference on Human Factors in Computing Systems (CHI), 2011.
11. http://www.trifacta.com/ (Accessed April 1, 2014)
12. Barateiro, José, and Helena Galhardas, "A Survey of Data Quality Tools," Datenbank-Spektrum 14 (2005): 15-21.
13. http://www.findacode.com/search/search.php (Accessed April 1, 2014)
14. IBM Initiate Workbench User Guide, 2013. http://pic.dhe.ibm.com/infocenter/initiate/v9r5/topic/com.ibm.initiatepdfs.doc/topics/i46wecug.pdf (Accessed April 1, 2014)
15. http://www.informatica.com/us/products/master-data-management/mdm/ (Accessed April 1, 2014)


INTERNAL DISTRIBUTION: through the publication tracking system.
EXTERNAL DISTRIBUTION: none.