MEDINFO 2001 V. Patel et al. (Eds) Amsterdam: IOS Press © 2001 IMIA. All rights reserved
XML-based Visual Data Mining in Medicine
Martin Dugas (a), Ellen Hoffmann (b), Sabine Janko (b), Silke Hahnewald (b), Tomas Matis (b), Karl Überla (a)
(a) Department of Medical Informatics, University of Munich, Germany
(b) Department of Cardiology, University of Munich, Germany
medicine this problem is relevant, because biomedical data structures are characterized by a high level of complexity. The term "database" is applied by different authors in quite different settings: Sometimes a flat file consisting of five columns is called a database; but there are also systems with over 100 tables covering thousands of patients, e.g. in cancer registries.
Abstract
Medical databases in general are characterized by a high degree of complexity in terms of quantity of items, number of parameter values and data types (free text, categorical, numerical and other). Substantial domain knowledge is required for adequate formalization of medical entities. In this context we developed medical database plot (mdplot), a data mining tool that visualizes both structure and quality of data in medical databases in order to identify items suitable for evaluation.
Clinical databases are predominantly free-text oriented, while scientific datasets contain mostly categorized and numerical data. Given these heterogeneous data structures, how can items suitable for statistical evaluation be identified?
Data models are provided in XML format. Missing data is identified to enable targeted efforts to improve data quality prior to analysis. Database items are classified as 1:1 related to the patient (i.e. variables are collected once per patient) and 1:n related.
Assessment of data quality is crucial, because on the one hand results should be provided as soon as possible and on the other hand biased or imprecise results must be avoided. Although measuring data quality in general is quite difficult, completeness of data is a prerequisite for a high-quality dataset and can be quantified. In the clinical context, there is a relationship between database design and quality of data: if the data structure is too complicated or too simple, the system will not be adopted by the medical staff and the resulting data quality will be unsatisfactory.
mdplot provides a list of all classes contained in a database, the number of records in each, and a condensed bar chart for semi-quantitative description of completeness according to four types of items: categorical, numerical, text and other. All items in a category are grouped from left to right; the height of each bar represents the proportion of non-missing values with respect to the total number of records in the class. Thus the amount of content in a specific class is visualized.
The number and complexity of medical databases are growing continuously; for this reason methods are required to get an overview of a medical data structure and to identify questions which can be answered with a given dataset:
By selection of a specific class, a detailed description of it is provided including mean completeness in each item category as well as number of values per item. The new methodology was applied to a cardiological research database consisting of 619 items on 88 patients.
Keywords: Data mining, XML, data modelling, data quality, visualization, medical datasets
- What kind of items does it contain?
- How complete is the documentation?
- Is it suitable for a cross-sectional study (i.e. most items are collected once per patient) or a longitudinal study (i.e. there are many follow-up items)?
In the following section an XML-based visualization technique for medical databases is presented which can be used to quantitatively describe the structure and quality of medical datasets.
Introduction
In 1994 M. Kahn described the "Garbage in - Gospel out" phenomenon in the field of medical informatics [1], which relates to the danger of misinterpreting databases containing incomplete or imprecise data. Especially in the field of
Chapter 16: Data Systems
Materials and Methods
number of values for all items. Figure 1 shows a sample XML description including aggregated information about number of records and item completeness. The complete Document Type Definition (DTD) is available via the mdplot website [10].
XML
XML, "the universal format for structured documents and data on the Web" [2], was applied to describe the medical datasets, because it is platform-independent and applicable to complex, hierarchical structures. For the implementation of the data mining system standard XML tools were applied: XML-Parser [3] and XML-DOM (Document Object Model) [4][5].
XML descriptions were generated by a dedicated web-based documentation system [6][7][8][9] which provides rapid prototyping of data entry masks for scientific documentation systems. During several iteration cycles the data model was refined until the physicians and nurses were satisfied. The data structures were exported in XML format.
Concept of mdplot
The basic idea is to visualize structure and quality of data - in terms of completeness - at the same time, in order to identify items suitable for statistical evaluation. Missing records and missing items should be identified to enable specific efforts to improve data quality prior to statistical analysis, avoiding biased and imprecise results. In medical research the distinction between longitudinal and cross-sectional studies is important; thus items which are collected at different time points per patient (1:n relation) must be distinguished from variables which are collected once per patient (1:1 relation).
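The distinction between 1:1 and 1:n relations can be illustrated with a minimal Python sketch. It is an assumption for illustration only: mdplot takes the relation from the exported data model, whereas this sketch infers it from the records themselves. The function name classify_relation and the sample patient IDs are hypothetical.

```python
from collections import Counter

def classify_relation(patient_ids):
    """Return '1:n' if any patient has more than one record in the class, else '1:1'."""
    records_per_patient = Counter(patient_ids)
    if any(n > 1 for n in records_per_patient.values()):
        return "1:n"
    return "1:1"

# Hypothetical classes: each list holds the patient ID of every record in the class.
demographics = ["p1", "p2", "p3"]
follow_up = ["p1", "p1", "p2", "p3", "p3", "p3"]

print(classify_relation(demographics))
print(classify_relation(follow_up))
```

A class with mostly 1:n items points towards longitudinal evaluation, while 1:1 classes fit cross-sectional questions.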
Figure 1: Sample XML description (small section) of the AF database. The root node must be followed by at least one class node. The attributes of a class node contain an ID (classid), the overall number of records in this class (count) and the relation of items with respect to the patient (1:1 or 1:n). A grouping node partitions items by their semantics. The attributes of an item node specify category (numerical, text, categorical, other) and overall number of non-missing entries in the class (count attribute).
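How such an aggregated description could be read is sketched below in Python with the standard library. Only the attribute names classid and count come from the caption; every element name, the relation, name and category attributes, and all sample values are hypothetical, and the original system was implemented in PERL rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical description: element names and sample values are assumptions.
SAMPLE = """
<database>
  <class classid="followup" count="200" relation="1:n">
    <group name="pacing">
      <item name="rate" category="numerical" count="180"/>
      <item name="mode" category="categorical" count="200"/>
      <item name="comment" category="text" count="50"/>
    </group>
  </class>
</database>
"""

def item_completeness(xml_text):
    """Map (classid, item name) to the proportion of non-missing values."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for cls in root.iter("class"):
        total = int(cls.get("count"))
        for item in cls.iter("item"):
            filled = int(item.get("count"))
            key = (cls.get("classid"), item.get("name"))
            result[key] = filled / total if total else 0.0
    return result

completeness = item_completeness(SAMPLE)
for key, value in completeness.items():
    print(key, f"{value:.2f}")
```

Because the description carries only counts, completeness can be derived without access to the patient data itself.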
Real-world medical datasets are characterized by high complexity - the database presented in the next section consists of 619 items per patient. mdplot provides a list of all classes contained in a database, the number of records in each, and a bar chart for semi-quantitative description of completeness per item. With respect to the evaluation it is important to distinguish between numerical, categorical, text and other items, because different methods can be applied: statistical procedures like mean and standard deviation for numerical values or frequency tables for categorical variables. Text items require natural language processing or manual recoding. The category 'other' includes date, time, complex and blob (binary large object, e.g. multimedia) items.
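The category-specific evaluation can be sketched as follows. This is a minimal Python illustration, not part of mdplot: the function summarize, the use of None for missing values and the sample data are all assumptions.

```python
from collections import Counter
from statistics import mean, stdev

def summarize(values, category):
    """Summarize one item; None marks a missing value."""
    present = [v for v in values if v is not None]
    summary = {"n": len(present), "missing": len(values) - len(present)}
    if category == "numerical":
        summary["mean"] = mean(present) if present else None
        # The sample standard deviation needs at least two values.
        summary["sd"] = stdev(present) if len(present) > 1 else None
    elif category == "categorical":
        summary["frequencies"] = dict(Counter(present))
    # Text and other items get counts only; they need recoding first.
    return summary

# Hypothetical item values.
heart_rate = [60, 70, None, 80]
nyha_class = ["II", "III", "II", None]

print(summarize(heart_rate, "numerical"))
print(summarize(nyha_class, "categorical"))
```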
Server concept
A condensed bar chart for each class and each category of items is provided. The x-axis corresponds to the sequence of items, the y-axis represents the proportion of non-missing values with respect to the total number of records in the class.
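A text-only approximation of such a condensed bar chart is sketched below in Python. The rendering with block characters, the function names and the sample counts are assumptions; the original tool generated HTML output.

```python
LEVELS = " ▁▂▃▄▅▆▇█"  # nine glyphs: empty plus eight bar heights

def completeness_proportions(non_missing_counts, total_records):
    """Proportion of non-missing values per item within one class."""
    if not total_records:
        return [0.0 for _ in non_missing_counts]
    return [count / total_records for count in non_missing_counts]

def condensed_bar(proportions):
    """Render one character per item; bar height encodes completeness."""
    chars = []
    for p in proportions:
        p = min(max(p, 0.0), 1.0)  # clamp to the range 0..1
        chars.append(LEVELS[round(p * (len(LEVELS) - 1))])
    return "".join(chars)

# Hypothetical class with 88 records and four numerical items.
numerical_counts = [88, 44, 0, 66]
print(condensed_bar(completeness_proportions(numerical_counts, 88)))
```

Items appear left to right in their original sequence, so gaps in the documentation show up as dips in the row of bars.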
Atrial Fibrillation (AF) database
The technical concept is based on established Internet tools: An Apache [11] webserver (version 1.4.2) on a Linux [12] machine (Distribution SuSE 6.3) runs PERL [13] (version 5.005) programs generating HTML output.
Results
The mdplot visual data mining approach was applied to a cardiologic research database. Atrial Fibrillation is the most common cardiac arrhythmia and is associated with major complications. Ongoing research is focused on new pacing devices for alternative treatment of this disease. The objective of an AF registry is to prospectively store all relevant data, covering clinical information, quality of life and device parameters, and thereby provide a platform for long-term follow-up.
The goal is to visualize both quantity and type of available contents (e.g. are there many numerical items in a specific class and is the proportion of missing values low?). By selecting a specific class a detailed description is provided including mean completeness in each item category and the
Figure 2: Section from the follow-up form of the Atrial Fibrillation database (in German). Data on demographics, general and diagnostic programming are presented as well as counters. All device data is transferred by means of a specific interface directly from a floppy disk into the research database to avoid typing errors.
Figure 3: Visualization of the Atrial Fibrillation Database. 7 out of 8 classes are located on the right-hand side of the plot, i.e. most classes contain items that are collected at several time points per patient. For this reason, this data structure is predominantly longitudinal. The condensed bar charts indicate how much content is contained in each item category (numerical, text, categorical and other items). It can be seen, for example, that follow-up and quality of life data are quite complete in contrast to medical history.
Discussion
The AF database currently consists of eight tables with an overall number of 619 items indicating a high level of granularity which is required for this complex research topic. At present detailed information on 88 patients is recorded.
Focus on data quality
Data quality - both in terms of completeness and accuracy - is crucial for any statistical analysis in medical research. It is important to get an overview of the structure and quality of data before a statistical evaluation is carried out. The overall goal of mdplot is to identify which items are suitable for further analysis and which are not, or not yet. If a relevant proportion of an important outcome variable is missing, strong efforts in data monitoring should be made to complete the dataset before the evaluation is performed, in order to avoid selection bias. With an appropriate data mining tool, data quality problems can be detected at a very early stage and appropriate measures can be taken - both organizational and technical, e.g. specific reports on missing or wrong data. The later shortcomings are detected, the more effort is required to fill the gaps. Correctness cannot be measured in a domain-independent manner; therefore completeness was chosen as a quantifiable measure for data quality.
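A report on missing data of the kind mentioned above could look like the following Python sketch. The 80% threshold, the function name monitoring_report and the item counts are assumptions chosen for illustration; the text does not prescribe a cut-off.

```python
def monitoring_report(item_counts, total_records, threshold=0.8):
    """List items whose completeness is below threshold, least complete first.

    Each row is (item name, completeness, number of missing values).
    """
    report = []
    for name, non_missing in item_counts.items():
        completeness = non_missing / total_records if total_records else 0.0
        if completeness < threshold:
            report.append((name, completeness, total_records - non_missing))
    return sorted(report, key=lambda row: row[1])

# Hypothetical counts of non-missing values for a class with 88 records.
counts = {"outcome": 60, "age": 88, "history": 30}
for name, completeness, missing in monitoring_report(counts, 88):
    print(f"{name}: {completeness:.0%} complete, {missing} values missing")
```

Sorting by completeness puts the largest gaps first, which is where monitoring effort pays off most before an evaluation.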
The documentation is divided into several sections which are subdivided into item groups. For example, medical history consists of AF history, symptoms, induction and termination of AF. Most of the documentation is structured to enable statistical evaluation; in addition, supplementary free text items (e.g. comments) are provided. Another table provides a systematic documentation of drug therapy. For each medication, dosage and time intervals are recorded. Clinical reports can be generated directly from the research database, so the family physician immediately receives detailed information about the course of the cardiological therapy. Subjective perceptions of the patient and quality of life are also included. By means of an atrial fibrillation diary, frequency and duration of arrhythmic episodes are recorded. An intranet-based SF-36 form [14] for assessment of quality of life is collected during each follow-up.
By analysis of the structure of a given database it can be determined what kind of analyses are reasonable. Longitudinal-oriented systems are characterized by many follow-up items and are typical for databases on patients with chronic diseases. There is a relation between structure and quality of data: if the model is too complicated or too simple, the system will not be adopted by the medical staff and the resulting data quality will be unsatisfactory.
Figure 2 presents a section from the follow-up form of the AF registry which is dedicated primarily to management of pacing parameters. Besides demographic data, item groups for general and diagnostic programming are recorded, as well as counter data. All device parameters are transferred by means of a specific interface directly from a floppy disk of the pacing system into the research database to avoid typing errors.
Visual Data Mining
For a limited number of atrial fibrillation episodes the pacing device provides detailed information such as day and time of onset, episode duration etc. Using specific software provided by the manufacturer of the pacemaker, each onset is classified by the physician as associated with certain trigger mechanisms (e.g. PAC = premature atrial contraction). The goal is to get a better understanding of how these cardiac arrhythmia episodes are initiated.
There is a great variety of visualization techniques in data mining; a comprehensive overview is provided by Keim [16]. Current research in data mining is applying interactive techniques, e.g. for visualization of decision trees [17]. By this means the user can complement the computer analysis with his domain knowledge. Medicine is characterized by a high proportion of non-formalized knowledge and therefore interactive approaches can help to avoid "data dredging" [18].
mdplot of AF database
Figure 3 presents the visualization of the AF database. It is evident that 7 out of 8 classes contain items that are collected at several time points per patient (1:n relation), i.e. there are many follow-up items. It can be seen that the number of records per class and the number of items per class vary considerably: in the AF diary class there are only 4 records, while the pacing parameter class contains 163300 records. The medication class consists of 229 items, while the pacing parameter class comprises only 2 items. The condensed bar charts indicate how much content is contained in each item category (numerical, text, categorical and other).
Observational studies based on data from clinical systems are gaining importance. In a recent publication in the New England Journal of Medicine, Benson and Hartz [19] concluded: "Our results suggest that observational studies usually do provide valid information. They could be used to exploit the many recently developed, clinically rich databases." In contrast to datasets for clinical studies, hospital databases are characterized by a wealth of variables and a high quantity of records; thus sophisticated data mining tools are required to provide reliable analyses.
Impact of XML
The separation of data structures from specific programs is a major step in the field of biomedical informatics. Due to the rapid evolution of the Internet, the number of XML-based applications is growing rapidly; 481 industrial members in the World Wide Web Consortium [20] (as of November 2000), including all major software companies, provide evidence for a strong industrial commitment to this new technology.
In the field of medical informatics, there are many ongoing activities concerning XML. Dolin [21] presented an XML-based Patient Record Architecture (PRA) to 'enable pooling of content from documents created on systems of widely varying characteristics'. Many scientific problems in medicine might be resolved faster and better if pooling of data between institutions were easier. But aggregation of content requires common minimal data sets.
There are two main obstacles. First, there is no ultimate data model for a specific medical topic; it is an iterative process with a continuing, intensive dialogue between physicians and computer scientists.
Secondly, medical data models are complex. To cover a single cardiologic disease entity, far more than 100 items were appropriate. But there are thousands of different diseases. This example might not be representative, but structural complexity - in terms of number of items - is a major problem that requires inter-institutional cooperation.
Although there is no general solution to these problems in sight, open data structures could facilitate the dialogue between institutions and would be an important step on the long road to interoperability. To keep track of complicated data models a tool like mdplot could be useful. In this context efforts should be undertaken to motivate researchers as well as commercial companies to publish relevant parts of their data models in XML format.
Conclusion
XML-based Visual Data Mining can facilitate data monitoring and help to provide valid statistical evaluations of complex medical datasets.
References
[1] Kahn MG: The Desktop Database Dilemma. In: J.H. van Bemmel (ed.) IMIA Yearbook of Medical Informatics. Schattauer Verlag Stuttgart (1994), 218-221
[2] Extensible Markup Language (XML). URL: http://www.w3.org/XML. Accessed 2000 Nov 24.
[3] XML-Parser for PERL. URL: http://www.perl.com/CPAN-local/authors/id/C/CO/COOPERCL/XML-Parser-2.29.tar.gz. Accessed 2000 Nov 24.
[4] XML-DOM for PERL. URL: http://www.perl.com/CPAN-local/modules/by-module/XML/XML-DOM-1.25.tar.gz. Accessed 2000 Nov 24.
[5] Document Object Model (DOM). URL: http://www.w3.org/DOM/. Accessed 2000 Nov 24.
[6] Dugas M, Bosch R, Paulus R, Lenz T: Intranet-based multi-purpose medical records in Orthopedics. Medical Informatics 1999 24(4), 269-275
[7] Dugas M, Scheichenzuber J, Hornung H: An Intranet-Based System for Quality Assurance in Surgery. J Med Syst. 1999 Feb;23(1):13-9
[8] Dugas M, Überla K. Intranet Based Clinical Workstations. In: Medical Informatics, Biostatistics and Epidemiology for Efficient Health Care and Medical Research. Victor N, eds. Munich, Germany: Urban und Vogel; 1999: 235-238.
[9] Dugas M, Überla K. Intranet-Based Clinical Data Entry. AMIA Proc 1999:1051
[10] Dugas M: mdplot website. URL: http://mdplot.ibe.med.uni-muenchen.de. Accessed 2000 Nov 24.
[11] The Apache Software Foundation Home Page. URL: http://www.apache.org. Accessed 2000 Jul 1.
[12] SuSE Linux Home Page. URL: http://www.suse.de. Accessed 2000 Jul 1.
[13] Wall L, Schwartz RL. Programming PERL. Sebastopol, CA, USA: O'Reilly & Associates, Inc.; 1992.
[14] SF-36. URL: http://www.sf-36.com/. Accessed 2000 Nov 27.
[15] World Wide Web Consortium. URL: http://www.w3.org/. Accessed 2000 Aug 15.
[16] Keim DA: Visual Database Exploration Techniques. Int. Conference on Knowledge Discovery & Data mining, Newport Beach, CA, 1997. URL: http://www.informatik.uni-halle.de/~keim/PS/KDD97.pdf. Accessed 2000 Aug 15.
[17] Ankerst M, Ester M, Kriegel HP. Towards an Effective Cooperation of the User and the Computer for Classification. ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD'2000), Boston, MA, 2000
[18] Stead WW, Miller RA, Musen MA and Hersh WR: Integration and Beyond: Linking Information from Disparate Sources and into Workflow. J Am Med Inform Assoc 2000; 7: 135-145.
[19] Benson K, Hartz AJ: A comparison of observational studies and randomized, controlled trials. N Engl J Med 2000; 342: 1878-86
[20] World Wide Web Consortium (W3C). URL: http://www.w3.org/. Accessed 2000 Nov 24.
[21] Dolin RH et al: HL7 Document Patient Record Architecture: An XML Document Architecture Based on a Shared Information Model. Proc AMIA 1999:52-56
Address for correspondence
Dr. med. Dipl.-Inform. Martin Dugas, University of Munich, IBE, Marchioninistr. 15, D-81377 Munich
[email protected]