
Scaling a Data Retrieval and Mining Application To the Enterprise-Wide Level

Daniel J. Nigrin, M.D., M.S., Isaac S. Kohane, M.D., Ph.D.

Informatics Program, Children's Hospital, Boston

Most medical institutions have had difficulty in adopting practices that use stored clinical and administrative data effectively. This stems in part from the lack of available tools to easily and accurately retrieve datasets of interest. In this work, we describe the development of a data retrieval and mining application, Goldminer, which allows authorized personnel at our institution to query clinical and demographic data stores through a graphical, non-programmer interface. It builds upon DXtractor, our previously described tool that retrieves data from a smaller, more specialized dataset. We discuss the difficulties encountered in scaling this application to the enterprise-wide level, and our solutions.

Background

Data retrieval and analysis applications geared toward medical institutions are needed. Although most facilities routinely collect electronic data from patient encounters, far fewer of them routinely use this data to its fullest. When the data is analyzed, it is most often for financial reasons: to minimize costs, enhance payments, or maximize revenue. In the commercial sector, these techniques have been responsible for huge financial gains; corporations there allocate a large investment of both time and resources to these efforts. The medical community has had a more difficult time in adopting practices that routinely access and review this data to improve care and to support research efforts. Much of the difficulty lies with the lack of available tools required to easily and accurately retrieve datasets of interest. Most data retrieval and analysis tasks require the assistance of a highly trained database specialist familiar with an institution's data storage schema.

Introduction

This paper describes the development of a new tool, Goldminer, that allows non-programming clinicians, researchers, and administrators to more effectively access and "mine" both clinical and administrative data stored by our institution. It builds upon our previous work with the DXtractor application, which retrieves data from a smaller, more specialized dataset. A number of both technical and usability issues arose in this scaling process. We will discuss these issues and describe our solutions in detail. Specifically, we will address the difficulties involved in maintaining and querying a much larger underlying database, as well as operational issues involved in deploying the application to a larger user base.

Nevertheless, there have been several previous efforts in providing hospital personnel with enterprise-wide data analyses and reports based on electronically stored data. Providers at Intermountain Health Care can retrieve reports of key measures of performance in their specific clinical areas [2]. These indicators are generated automatically from data stored in the enterprise's database systems. Such measures are also used by department leaders and institution administrators to continuously measure quality and performance. Infection control personnel at Beth Israel Hospital use stored clinical and demographic data to routinely monitor for bacterial resistance and outbreaks within the institution [3]. PSNWeb [4], developed by Boston's Caregroup, is a managed care resource that allows for intranet-based review of utilization and performance measures across providers and enterprise-specific groupings. Several other clinical data extraction approaches have been described [5-7].

Although not a classical data mining application insofar as it does not perform statistical or knowledge discovery analyses of stored data, Goldminer nonetheless is an integral part of the data mining process. By generating data sets specific to an area of interest, it performs a critical data prospecting step. When dealing with large, enterprise-wide repositories, isolating the population and/or data of interest is often the most difficult task in the process.

1091-8280/99/$5.00 © 1999 AMIA, Inc.


In many of these instances, however, clinicians and administrators cannot themselves generate novel queries of the database. They are limited to queries that have already been created and that are merely invoked when the user executes them. This limits the clinical usefulness of such efforts to the questions those existing queries answer. The task of creating a new query lies with the application programmers; this leads us away from our goal of allowing non-programming clinical, research, and management personnel to query the data sources in an ad hoc fashion.

subpopulations. This is done using logical Boolean set operations (AND, OR, and NOT, including parentheses to any degree of nesting), as well as temporal set operations. These operations allow time-based relationships within or between subpopulations to be expressed, using a set of simple temporal operators we have defined (EARLIEST, LATEST, BEFORE, AFTER, EQUALS, WITHIN, and BY). The ability to combine sets in this way lets Goldminer generate complex overall queries from relatively simple individual data requests.
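The set combinations described above can be illustrated with a minimal Python sketch. It is not Goldminer's implementation: the representation of a subpopulation as a mapping from patient ID to event dates, the function names, the sample data, and in particular the reading of BEFORE as "earliest event precedes earliest event" are all assumptions made for illustration.

```python
from datetime import date

# Assumed representation: patient ID -> dates of the events that
# placed the patient in the subpopulation. All data is hypothetical.
elevated_lead = {
    1: [date(1998, 3, 1)],
    2: [date(1997, 5, 10), date(1998, 1, 5)],
    3: [date(1996, 7, 7)],
}
hematology_visit = {
    2: [date(1998, 2, 1)],
    3: [date(1996, 1, 1)],
    4: [date(1998, 6, 6)],
}


def set_and(a, b):
    """Boolean AND: patients present in both subpopulations."""
    return {pid: sorted(a[pid] + b[pid]) for pid in a.keys() & b.keys()}


def set_or(a, b):
    """Boolean OR: patients present in either subpopulation."""
    return {
        pid: sorted(a.get(pid, []) + b.get(pid, []))
        for pid in a.keys() | b.keys()
    }


def set_and_not(a, b):
    """a AND NOT b: patients in a who are absent from b."""
    return {pid: a[pid] for pid in a.keys() - b.keys()}


def before(a, b):
    """Assumed BEFORE semantics: the patient's earliest event in a
    precedes the patient's earliest event in b."""
    return {
        pid: a[pid]
        for pid in a.keys() & b.keys()
        if min(a[pid]) < min(b[pid])
    }


# Nesting function calls plays the role of parentheses in a query:
# (elevated lead AND hematology visit), restricted to lead BEFORE visit.
nested = set_and(before(elevated_lead, hematology_visit), hematology_visit)
```

Because every operation returns the same kind of mapping it accepts, results can be combined again to any depth, which is the property that lets simple individual requests build up a complex overall query.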

Many commercially supplied OLAP (Online Analytical Processing) tools are currently available to help with analysis of large, complex databases [8]. These applications help to visualize and explore data stored in these repositories, and are able to deliver extremely fast performance through data optimization and pre-computation. Although able to provide results of static, previously configured queries as in the cases above, these applications also allow for rapid exploration of data through novel user queries. Unfortunately, users still need to understand the data model of the underlying databases to create such custom queries. Since this is beyond the level of knowledge of the majority of medical personnel, the utility of the stored data is limited to the analysis of pre-configured queries. Therefore, it is highly unlikely that OLAP applications alone would be readily usable by medical or administrative staff without a significant amount of prior training.

In order not to burden the performance of production clinical systems, we duplicated the contents of relevant data elements from the production databases to a separate data repository. This repository is maintained in an Oracle 8 database. The contents of this database are kept updated by routinely run SQL (Structured Query Language) scripts, which copy new or modified data from the production system to the Goldminer repository.
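As a rough illustration of this kind of incremental copy, the sketch below uses Python's built-in sqlite3 module as a stand-in for the Oracle 8 repository. The `lab_result` table, its columns, and the use of a `modified_at` timestamp to detect new or modified rows are assumptions; the paper does not describe the actual scripts or schema.

```python
import sqlite3

# Hypothetical table layout shared by both databases.
SCHEMA = """
CREATE TABLE lab_result (
    id INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL,
    value REAL,
    modified_at TEXT NOT NULL
)
"""


def sync_new_rows(production, repository, last_sync):
    """Copy rows created or changed after last_sync into the repository.

    Timestamps are ISO 8601 strings, so string comparison matches
    chronological order. Returns the number of rows copied.
    """
    rows = production.execute(
        "SELECT id, patient_id, value, modified_at "
        "FROM lab_result WHERE modified_at > ?",
        (last_sync,),
    ).fetchall()
    # INSERT OR REPLACE overwrites the stale copy of a modified row.
    repository.executemany(
        "INSERT OR REPLACE INTO lab_result "
        "(id, patient_id, value, modified_at) VALUES (?, ?, ?, ?)",
        rows,
    )
    repository.commit()
    return len(rows)


def demo():
    """Run two sync passes against in-memory databases."""
    production = sqlite3.connect(":memory:")
    repository = sqlite3.connect(":memory:")
    production.execute(SCHEMA)
    repository.execute(SCHEMA)
    production.executemany(
        "INSERT INTO lab_result VALUES (?, ?, ?, ?)",
        [
            (1, 100, 5.4, "1999-01-01T08:00:00"),
            (2, 101, 7.9, "1999-01-02T09:30:00"),
        ],
    )
    first = sync_new_rows(production, repository, "1998-12-31T00:00:00")
    # A later correction in production changes one existing row.
    production.execute(
        "UPDATE lab_result SET value = 8.1, "
        "modified_at = '1999-01-03T10:00:00' WHERE id = 2"
    )
    second = sync_new_rows(production, repository, "1999-01-02T09:30:00")
    copied = repository.execute(
        "SELECT id, value FROM lab_result ORDER BY id"
    ).fetchall()
    return first, second, copied
```

Only rows touched since the previous run are read from production, which keeps the load on the clinical system small; the trade-off is that the repository lags production by up to one sync interval.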

Currently, the stored data includes labs, demographics, inpatient and outpatient visits, and text document descriptions for all patients (inpatient and outpatient) seen at our institution since 1993. It comprises approximately 14 gigabytes of information, representing data on over 1 million patients, and containing over 30 million laboratory studies. Data for inpatient medications and charges will soon be added to the database.

Functional Specification

Scaling Issues

Goldminer is a web-based Java applet, which is deployed within the Children's Hospital firewall-protected Intranet. It resides on a password-protected web page, allowing access only to authorized users. It is based on the querying and set combination principles developed in DXtractor, which have been described previously [1]. Briefly, the fundamental query that Goldminer performs is a population query; interactive "wizards" guide the user through a variety of parameter specifications to retrieve a particular group of patients. Only clinical terminology and concepts are presented to the user; the abstraction barrier between clinical and database terminology is maintained throughout the application (Figures 1 and 2).
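One way to picture the abstraction barrier is a lookup layer that translates the clinical terms chosen in a wizard into database details the user never sees. The Python sketch below is purely illustrative: the term list, table and column names, test codes, and the function `build_population_query` are hypothetical and are not taken from Goldminer, which is a Java applet.

```python
# Hypothetical mapping from clinical vocabulary to storage details.
# None of these table, column, or code names come from the paper.
CLINICAL_TERMS = {
    "Blood lead level": {"table": "lab_result", "code": "LEAD"},
    "Hemoglobin A1C": {"table": "lab_result", "code": "HBA1C"},
}

# Wizard wording for comparisons, mapped to SQL operators.
COMPARISONS = {"above": ">", "below": "<", "equal to": "="}


def build_population_query(term, comparison, threshold):
    """Translate wizard selections into a parameterized population query.

    Returns the SQL text and its bind parameters. Only whitelisted
    terms and comparisons reach the SQL string; user-supplied values
    are passed as bind parameters.
    """
    if term not in CLINICAL_TERMS:
        raise ValueError(f"unknown clinical term: {term!r}")
    if comparison not in COMPARISONS:
        raise ValueError(f"unknown comparison: {comparison!r}")
    entry = CLINICAL_TERMS[term]
    sql = (
        f"SELECT DISTINCT patient_id FROM {entry['table']} "
        f"WHERE test_code = ? AND value {COMPARISONS[comparison]} ?"
    )
    return sql, (entry["code"], threshold)


# A user asking for "patients with a blood lead level above 10"
# never sees the table name or test code.
example_sql, example_params = build_population_query(
    "Blood lead level", "above", 10.0
)
```

Keeping the mapping in one place means the underlying schema can change without altering what the user sees in the wizard.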

Database Size

The significantly larger size of the database to be queried called for techniques to specifically ensure adequate application performance. First, the database required careful optimization; this was primarily accomplished through invocation of cost-based rather than rule-based query optimization. In addition, adequate buffer space and temporary table size were also allocated.

Second, a new Oracle 8 feature known as table "partitioning" was used. This allows tables to be transparently segmented into smaller tables, based on the value of a particular attribute or attributes. For example, the main laboratory result table is over 30 million rows long, as mentioned above. To reduce the quantity of data scanned for each query, we partitioned this table based on the date of the laboratory test. This created several smaller table partitions, each containing data for only

Once a patient subpopulation is queried for and retrieved (e.g. the group of patients seen in the outpatient Hematology program in the past year, or patients with a history of an elevated blood lead level), it may be combined with other patient


[Figure: Goldminer query screen]