Integrated Approach for Designing Medical Decision Support Systems with Knowledge Extracted from Clinical Databases by Statistical Methods
Ewa Krusinska 1,3, Ankica Babic 2,3, Shamsul Chowdhury 3, Ove Wigertz 3, Goran Bodemar 4 and Ulrik Mathiesen 4
1 Institute of Computer Science, University of Wroclaw, Poland
2 Faculty of Electrical and Computer Engineering, University of Ljubljana, Yugoslavia
3 Department of Medical Informatics, University of Linköping, Sweden
4 Department of Internal Medicine, Linköping University Hospital and Oskarshamn Hospital, Sweden
In clinical research, data are often studied by a particular method without a previous analysis of their quality or semantic contents, which could link the clinical database and the data-analytical (e.g. statistical) procedures. In order to avoid the bias caused by this situation, we propose that the analysis of medical data should be divided into two main steps. In the first one we concentrate on conducting the quality, semantic and structure analyses. In the second step our aim is to build an appropriate dictionary of data analysis methods for further knowledge extraction. Methods like robust statistical techniques, procedures for mixed continuous and discrete data, the fuzzy linguistic approach, machine learning and neural networks can be included. The results may be evaluated both by using test samples and by applying other relevant data-analytical techniques to the particular problem under study.
0195-4210/91/$5.00 © 1992 AMIA, Inc.
Introduction
Data collected in clinical databases can be interpreted in different ways and can also reflect different realities of the underlying medical phenomena. In clinical research, they are usually analysed by one or several methods which are suitable for the problem currently studied. Commonly, no special attention is paid to preliminary quality or semantic analyses of the database contents. The lack of these analyses biases the choice of data-analytical methods and can lead to results which have no commonsense interpretation [1]. In order to avoid the bias caused by such non-systematic data analysis, we propose an integrated approach which can be understood as:
- quality and semantic analysis,
- analysis of the data structure in the statistical sense, which can guide an appropriate choice of methods from an available dictionary.
Thus, the main research, when dealing with medical data, can be divided into two steps. In the first one, we concentrate on the quality, semantic and structural analyses of the data contained in the database. In the second one, our aim is to build an appropriate dictionary of data analysis methods for further knowledge extraction. It is possible to take into account multivariate statistical procedures, AI techniques and neural networks. This is presented in Fig. 1.
Fig. 1. Linking clinical database to data-analytical methods through data quality, semantic and structure analyses.
Our approach integrates two areas of clinical research and medical decision-making which are commonly treated quite separately. A large part of the research devoted to computer applications in medicine deals with efficient redescription and reformulation of information in clinical databases [2]. On the other hand, various statistical techniques are applied to knowledge extraction in medicine. But even when the knowledge extracted by a particular procedure is incorporated into a user-friendly support system [3, 4], no detailed link is studied between the data structure and the variety of methods which could be implemented to reach equivalent results.
A proposal for a more integrated analysis is given in [5], but it does not deal with statistical structure analysis which, from our point of view, is indispensable for the proper choice of methods for further knowledge extraction. Our approach is especially important for retrospective studies with data already collected in databases, since prospective studies allow more planning.
Steps of Analysis
Data Quality Analysis
The aspects of data quality that need to be analysed are correctness, consistency and completeness.
Correctness of medical data can be checked by comparing the database records with the information in traditional medical records, which is usually done on a sample of records [6]. Records from both sources should be exactly the same. All the data measured for the patients under study and reported in traditional medical records should appear in the clinical database. The format and form of the data, as well as misprints and outlying observations in relation to medical standards and norms, should be checked. Additionally, the question whether the available data are suitable for solving the problem under study should be answered.
Completeness of medical data is a property that cannot easily be satisfied, for many reported reasons such as malfunction of measurement equipment or (deliberate) omissions when recording or entering data into a real database [6]. In order to prepare data for statistical knowledge extraction, different procedures are available for handling missing values, such as the 'delete method', the 'subspace method' and the 'estimate and replace method' [7].
Consistency should be checked whenever we input new medical data or whenever we deal with pooled databases which are the result of fusing smaller databases from several independent studies. 'Although pooling increases the overall size, data quality tends to suffer' [6]. Data collected for the same findings in different centres should be consistent in definition, measurement units and format. When different parts of a database are fused for further statistical analysis, it should be checked that there is no statistically significant difference between the variables collected, especially when the measurements are made in different laboratories.
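Two of the missing-value procedures named above can be illustrated with a minimal Python sketch. It is not the procedure of [7]: the record layout, the field names and the use of the field mean as the replacement estimate are assumptions made only for illustration.

```python
from statistics import mean

def delete_method(records, fields):
    """'Delete method': keep only records with no missing value (None) in `fields`."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]

def estimate_and_replace(records, fields):
    """'Estimate and replace': fill each missing value with the mean of the
    observed values of that field (the mean is an assumed, simple estimator)."""
    estimates = {}
    for f in fields:
        observed = [r[f] for r in records if r.get(f) is not None]
        estimates[f] = mean(observed) if observed else None
    filled = []
    for r in records:
        new = dict(r)  # copy, so the original records stay untouched
        for f in fields:
            if new.get(f) is None:
                new[f] = estimates[f]
        filled.append(new)
    return filled

# Hypothetical toy records; names and values are illustrative only.
records = [
    {"id": 1, "alt": 1.0, "ast": 0.5},
    {"id": 2, "alt": None, "ast": 0.7},
    {"id": 3, "alt": 2.0, "ast": None},
]
complete = delete_method(records, ["alt", "ast"])
filled = estimate_and_replace(records, ["alt", "ast"])
```

The delete method shrinks the sample, whereas replacement keeps every patient at the cost of introducing estimated values, so the choice depends on how many values are missing.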
Semantic Analysis
Having completed the data quality analysis, our intention is to proceed with information analysis, formulating a relational data structure and transforming the data into a relational database [8]. The relational database schema may be further modified by performing semantic analysis [9, 10]. Techniques such as correspondence analysis and cluster analysis of objects, relational parameter values and retrieval functions, for instance, have been found useful for discovering hidden semantics of a database [10]. Both quality and semantic analysis can provide data without 'contamination' for further processing.
Data Structure Analysis
Clinical databases often contain non-typical values such as outliers (observations distant from the centre of the sample) or leverage points (observations with large Cook distances), as well as mixtures of continuous and discrete features (coded history and physical examination data mixed with measurable laboratory findings). Our intention is to pay special attention to this problem [11]. Therefore, we propose that the analysis of statistical data structure, following the quality and semantic analyses, should precede statistical modelling and knowledge extraction. Traditionally, if done at all, it is understood as a part of statistical analysis. From our point of view, it is supplementary to both previous analyses and helps to ensure a proper choice of data-analytical methods.
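As an illustration of how observations distant from the centre of the sample can be detected, the following Python sketch computes squared Mahalanobis distances for two variables. It is a sketch under stated assumptions, not the authors' software: the function name and the toy data are hypothetical, and only the bivariate case is handled so that the covariance matrix can be inverted in closed form.

```python
from statistics import mean

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample centre,
    using the sample covariance matrix (denominator n - 1)."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = (dx, dy) S^{-1} (dx, dy)^T with the closed-form 2x2 inverse
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Hypothetical toy sample: five similar observations and one distant one.
sample = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.2), (1.0, 1.9), (5.0, 0.5)]
d2 = mahalanobis_sq_2d(sample)
most_distant = max(range(len(sample)), key=lambda i: d2[i])
```

In practice the distances would be compared with a reference distribution or with robust counterparts; here the observations are simply ranked.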
Statistical Modelling of Clinical Data
The analyses discussed can be understood as a link to efficient multivariate modelling of data. Our intention is to build a dictionary of data-analytical methods which are useful for analysing clinical databases. The general approach to applied data analysis should include the choice of method according to the nature of the problem studied and the structure of the data. For example, the problem of outliers and leverage points in the data can be solved with multivariate robust methods. Examples of robust modelling applied to the assistance of medical diagnosis in respiratory diseases, estimation and prediction can be found in [11-13]. Other modern statistical methods deal with mixtures of continuous and discrete data [e.g. 14]. This concerns regression, discrimination and cluster analysis as well. Additionally, some procedures from the fuzzy sets approach can be applied to define linguistic variables as transformations of groups of discrete features describing the same underlying (medical) phenomenon. This technique has been used in discriminant and cluster analysis [15, 16]. In all the methods mentioned, the reduction of the number of features to the most informative ones is very important. The procedures suitable for this purpose should be included in the methods dictionary, too.
Classically, statistical analysis is performed by a data analyst, very often using statistical packages. A few of them provide database management facilities, but for flat files only. Besides this problem, almost no automation of statistical analysis, in the sense of expert systems or support systems, is done at present. These problems are discussed extensively in [17].
Evaluation of Methods Used
There are two main directions we want to discuss. First, the approach developed should be tested using new data, denoted as the test set. The clinical database can be divided from the beginning into two subsets: a learning set and a test set. The other possibility is to take new incoming data, or data from similar databases with the necessary adaptation. Secondly, the methods can be evaluated by applying other relevant procedures to the same data. The procedures used must provide analogous possibilities. An AI approach called inductive or machine learning and probabilistic neural networks can, for instance, be applied to evaluate the results of statistical discriminant analysis.
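Both directions of evaluation described above can be sketched in a few lines of Python. The function names, the 30% test fraction, the fixed seed and the toy decision rules are assumptions for illustration only; they are not taken from the study.

```python
import random

def split_learning_test(records, test_fraction=0.3, seed=0):
    """Divide the records once, at random, into a learning set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def misclassification_rate(rule, labelled_cases):
    """Fraction of (features, label) cases for which the rule gives a wrong label."""
    errors = sum(1 for features, label in labelled_cases if rule(features) != label)
    return errors / len(labelled_cases)

def agreement(rule_a, rule_b, labelled_cases):
    """Fraction of cases on which two different procedures give the same answer."""
    same = sum(1 for features, _ in labelled_cases if rule_a(features) == rule_b(features))
    return same / len(labelled_cases)

# Hypothetical toy data: one feature, label is True when the feature exceeds 5.
records = [((x,), x > 5) for x in range(10)]
learning, test = split_learning_test(records)

rule = lambda features: features[0] > 5        # stands in for a discriminant rule
other_rule = lambda features: features[0] > 4  # stands in for a second method

rate = misclassification_rate(rule, test)
agree = agreement(rule, other_rule, records)
```

The first function corresponds to testing on a held-out set, the last one to comparing two procedures on the same data.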
Application
Materials
Data for patients with abnormal liver function tests but with no symptoms or physical signs of liver disease have been collected during visits to specialists and during laboratory, ultrasound and biopsy examinations. Each of these was accompanied by a corresponding questionnaire and by traditional medical records. The same data have been recorded in flat files, again corresponding to the examinations listed above. Two medical centres (Linköping, Oskarshamn) contributed to the sample describing a group of 159 patients. Some of them were found to have no disease. The diagnoses confirmed by the biopsy result were: steatosis; steatosis and inflammation; steatosis and fibrosis; steatosis, fibrosis and inflammation; inflammation; inflammation and fibrosis; fibrosis; cirrhosis; cirrhosis and inflammation.
Quality Analysis
Correctness of the data files could be checked easily, since traditional medical records and questionnaires were available. The contents of the data files were compared with the corresponding entries in the medical records. Except for a few cases, the data were correct.
Completeness of the data was also quite satisfactory. Missing values occurred mostly in the data describing laboratory findings. Due to a high number of missing values, 13 out of 59 laboratory findings were excluded from further analysis, as were those measured only for patients from one of the medical centres. For the remaining findings, no patient had many missing values. One of the questions of interest was the relation between some laboratory tests and alcohol consumption. For this problem 12 patients were additionally excluded from the analysis, because no answers on alcohol consumption habits had been collected for them.
Consistency of the data has strongly improved after the quality analysis. Though the same approaches to medical examinations had been used in both centres, the data were not completely consistent. Some of the laboratory findings had been measured differently, and therefore recalculation to the same measurement units was needed. This also concerns a change of measurement method during the investigation time for one of the findings recorded. In such a case discretization according to normal limits is necessary. The problem of consistency was also discussed for the terms describing biopsy and ultrasound examinations, which can differ depending on medical understanding. Attention was also paid to making the understanding of 'diagnosis', 'disease' and 'pathology' consistent. For example, consistency in the occurrence and degree of portal and periportal inflammation has been checked. The inconsistencies found have been studied and corrected. In the case of the alcohol consumption analysis, the consistency between consumption habits (yes, no) and the reported amount consumed per month has been checked. Several inconsistent cases have been corrected.
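The consistency check between consumption habits and the reported monthly amount can be sketched as follows. The field names `consumes` and `amount_per_month` and the toy records are hypothetical; how the inconsistent cases were actually resolved is not described in the text and is not modelled here.

```python
def inconsistent_alcohol_records(records):
    """Return the ids of records whose yes/no habit contradicts the reported amount."""
    flagged = []
    for r in records:
        habit, amount = r["consumes"], r["amount_per_month"]
        if habit is None or amount is None:
            continue  # missing answers are a completeness issue, not a consistency one
        if (habit and amount == 0) or (not habit and amount > 0):
            flagged.append(r["id"])
    return flagged

# Hypothetical toy records; values are illustrative only.
patients = [
    {"id": 1, "consumes": True, "amount_per_month": 40},
    {"id": 2, "consumes": False, "amount_per_month": 0},
    {"id": 3, "consumes": False, "amount_per_month": 12},  # contradiction
    {"id": 4, "consumes": True, "amount_per_month": 0},    # contradiction
    {"id": 5, "consumes": None, "amount_per_month": None}, # not answered
]
to_review = inconsistent_alcohol_records(patients)
```

Flagged cases would then be traced back to the questionnaires and medical records before any correction is made.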
Reformulation to Relational Database and Semantic Analysis
The flat file database was converted to a relational database, which resulted in the model given in Fig. 2.
Fig. 2. Entity model of liver diseases database.
Entity PATIENT consists of attributes such as age, sex and body-mass index. It is related to the entities HISTORY, EXAMINATION-TYPE and DIAGNOSIS. The different data concerning medical history, which are collected in questionnaires, form the entity HISTORY. The attributes of DIAGNOSIS have been collected during clinical practice both as disease and as diagnosis. Finally, diagnoses discovered during examination are related to the latter through the entity EXAMINATION-DIAGNOSIS.
Fig. 3. Semantic analysis schema for queries of interest.
To check whether the database structure is suitable for queries of medical interest, i.e. for queries on diagnosis assistance and on alcohol consumption, semantic analysis has been done for both the data structure and data contents (entity model) and for the different possible queries (action diagram). Fig. 3 illustrates the position and role of semantic analysis. In our analysis, we have concentrated on forming clusters of entities used in the same query. The best model for organizing the entities, as described in Fig. 2, has been found. The reformulation to a relational database and the semantic analysis enable easy access to the data and make it possible to reach the subfiles needed for statistical analysis efficiently, using a database query language.
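How a subfile for statistical analysis might be reached through a query language can be illustrated with Python's built-in `sqlite3` module. Only the entity names come from the text; every column name, key and data value below is an assumption, since the attributes shown in Fig. 2 are not reproduced here, and SQLite merely stands in for whatever database system was used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    age INTEGER, sex INTEGER, bmi REAL
);
CREATE TABLE history (               -- questionnaire answers
    patient_id INTEGER REFERENCES patient(patient_id),
    item TEXT, answer TEXT
);
CREATE TABLE examination_type (
    exam_id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patient(patient_id),
    kind TEXT
);
CREATE TABLE diagnosis (
    diagnosis_id INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patient(patient_id),
    name TEXT
);
CREATE TABLE examination_diagnosis ( -- links examinations to diagnoses
    exam_id INTEGER REFERENCES examination_type(exam_id),
    diagnosis_id INTEGER REFERENCES diagnosis(diagnosis_id)
);
""")

# Hypothetical toy rows, for illustration only.
conn.executemany("INSERT INTO patient VALUES (?, ?, ?, ?)",
                 [(1, 52, 1, 27.4), (2, 47, 0, 24.1)])
conn.executemany("INSERT INTO history VALUES (?, ?, ?)",
                 [(1, "alcohol", "yes"), (2, "alcohol", "no")])

# Subfile for an alcohol consumption analysis: one row per patient.
subfile = conn.execute("""
    SELECT p.patient_id, p.age, p.sex, h.answer
    FROM patient AS p
    JOIN history AS h ON h.patient_id = p.patient_id
    WHERE h.item = 'alcohol'
    ORDER BY p.patient_id
""").fetchall()
```

The resulting rows can be written to an ASCII file or passed directly to an analysis routine.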
Link to Statistical Analysis
To perform the statistical analysis of the problems stated, it was necessary to check the data structure: theoretical assumptions (e.g. normality), outliers and leverage points. The SPSSx package has been used for this purpose. A part of the statistical modelling under consideration has been done with original logistic regression software, which offers both the classical and the robust version of the method. This made it necessary to prepare the data set in a form suitable for the package as well as for programs which accept ASCII data files. In the near future we want to automate this link by preparing special software tools for fast data transformations.
The statistical structure analysis preceded the statistical modelling of interest. First, the univariate structure has been checked to find univariate outliers and to exclude possible misprints. Next, the Mahalanobis as well as Cook distances were calculated. Some outliers and leverage points have been found. Additionally, correlations and interdependences between variables have been studied. For missing values the 'delete method' as well as estimation techniques [7] have been used.
Relationship between Alcohol Consumption and Some Laboratory Tests
The variable set under consideration consisted of 10 features. These were: age in years; sex, coded 0 - woman, 1 - man; alanine aminotransferase (ALT) in µkat/l; aspartate aminotransferase (AST) in µkat/l; gamma-glutamyl transferase (γ-GT) in µkat/l; mean corpuscular volume of red blood cells (MCV) in fl; iron (Fe) in µmol/l; triglycerides (TG) in mmol/l; immunoglobulin A (IgA) in g/l; and smoking, coded 0 - no, 1 - yes. In the sample of 147 patients, 121 had no missing values. Statistical modelling has been done in steps referring to the different variable sets used:
1. ALT, AST, γ-GT only;
2. these findings together with SEX and AGE;
3. the full set of laboratory findings;
4. the set with laboratory findings and their first-order interactions with AGE and SEX;
5. the latter set enriched by the second-order interactions and SMOKING.
It appeared that the laboratory findings by themselves were not able to predict alcohol consumption. The best model has been obtained for the set consisting of AGE, AGE*SEX, ALT*AGE*SEX, AST*AGE*SEX and SMOKING*AGE*SEX. With classical logistic regression it resulted in 17/46 misclassifications in the non-consumer group and 8/75 misclassifications in the consumer group, i.e. 25 of 121 cases overall. This result was better than for classical linear discrimination. Using robust logistic regression [12], the outcome improved slightly (18/46 and 6/75 misclassifications, respectively, i.e. 24 of 121 overall). This can be explained by the fact that the sample is not much affected by large outliers or influential observations (leverage points). Additionally, a machine-learning technique is planned to be used for the evaluation of the statistical discriminant procedures.
Diagnosis Support
The diagnosis support under consideration concerns predicting the biopsy result in terms of fibrosis, steatosis and inflammation, their coexistence, type and degree. First, 20 variables (laboratory and ultrasound findings as well as alcohol consumption habits and information about blood transfusion) were used for this purpose. Their predictive value has been checked in a univariate way. This is presently being followed by multivariate discriminant analysis and regression analysis (prediction of the degree of 'pathology'). The schema of the entire analysis is presented in Fig. 4.
Fig. 4. Knowledge extraction and its updating based on the dictionary of methods.
The extraction of decision rules and the design of the decision algorithm is a special task because different kinds of pathology coexist. Thus, it has been decided to perform it according to a decision tree structure supported by multivariate discriminant rules (of different kinds, depending on the data structure). First, control cases (without pathology detected by biopsy) and non-control cases are differentiated. As the second step, pure steatosis is diagnosed against steatosis accompanied by fibrosis and inflammation. This is because steatosis is the most common and relatively the least dangerous defect. Then, inflammation and fibrosis (+cirrhosis) are diagnosed. When inflammation is recognized, it is possible to check whether it is of the intralobular kind or whether only portal and periportal inflammations are present. Fig. 4 also presents the idea for the evaluation of decision rules, their 'correction' with information from new incoming patients, and the database update. Additionally, the entire set of laboratory, ultrasound and history findings is currently being tested to check whether some of them (not taken into account previously) possess any discriminatory ability in the recognition of liver defects.
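The decision tree structure described for diagnosis support can be sketched as a cascade of rules. This is only an outline under stated assumptions: the step names, the output labels and the placeholder rules are hypothetical, and in the real system each step would be a multivariate discriminant rule estimated from the data.

```python
def classify(case, rules):
    """Walk the cascade; `rules` maps a step name to a function case -> bool."""
    # Step 1: control versus non-control cases.
    if not rules["pathology"](case):
        return ["control"]
    # Step 2: pure steatosis versus steatosis with fibrosis and/or inflammation.
    if rules["pure_steatosis"](case):
        return ["steatosis"]
    # Step 3: inflammation and fibrosis (+cirrhosis), which may coexist.
    labels = []
    if rules["inflammation"](case):
        # Step 4: kind of inflammation.
        kind = "intralobular" if rules["intralobular"](case) else "portal/periportal"
        labels.append(f"inflammation ({kind})")
    if rules["fibrosis_or_cirrhosis"](case):
        labels.append("fibrosis/cirrhosis")
    return labels or ["unresolved"]

# Placeholder rules: each simply reads a precomputed yes/no flag from the case.
# A real rule would evaluate a discriminant function on the patient's findings.
step_names = ("pathology", "pure_steatosis", "inflammation",
              "intralobular", "fibrosis_or_cirrhosis")
rules = {name: (lambda case, key=name: bool(case.get(key, False)))
         for name in step_names}

control = classify({}, rules)
steatosis = classify({"pathology": True, "pure_steatosis": True}, rules)
mixed = classify({"pathology": True, "inflammation": True,
                  "fibrosis_or_cirrhosis": True}, rules)
```

Keeping every step as a separate rule makes it possible to replace or re-estimate a single rule when new patients are added, which matches the updating idea of Fig. 4.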
Conclusions and Further Research
Our integrated approach makes it possible to avoid errors in information processing that can occur when the analysis is done in separate steps without linking the database and the data-analytical dictionary by rudimentary quality, semantic and structure analyses. In the future, we want to build more formal links between the steps of the analysis in order to support them with efficient, user-friendly software and to automate them. The problem of automation (or semi-automation) also concerns the part of the system devoted to statistical data analysis. In our research, we would like to support it with software enabling the preliminary interpretation of results.
Acknowledgements
This research has been partly supported by the Swedish Institute and the DAGMAR50 project to Ewa Krusinska, as well as by the Research Committee of Slovenia to Ankica Babic, during their stay at the Department of Medical Informatics, University of Linköping, Sweden.
References
[1] Campbell M.J., Machin D., Medical Statistics: A Commonsense Approach, Wiley, 1990.
[2] Clyman J.I., Miller P.L., An Environment for Building and Testing Advice-Giving Systems in Medicine, in: Miller R.A. (Ed.), SCAMC 90, pp. 584-588.
[3] Polaschek J.X., Lenart L.A., Garber A.M., A Computer Program for Statistically-based Decision Analysis, in: Miller R.A. (Ed.), SCAMC 90, pp. 795-799.
[4] Haug P., Hoak S., Veristat: A Support Tool for Knowledge, in: Miller R.A. (Ed.), SCAMC 90, pp. 650-654.
[5] Rossi-Mori A., Pisanelli D.M., Ricci F.L., Evaluation Stages and Design Steps for Knowledge-based Systems in Medicine, Medical Informatics 15, 1990, pp. 191-204.
[6] Shortliffe E.H., Perreault L.E., Wiederhold G., Fagan L.M. (Eds), Medical Informatics: Computer Applications in Health Care, Addison-Wesley Publishing Company, 1990.
[7] Chowdhury S., Bodemar G., Haug P., Babic A., Wigertz O., Methods for Knowledge Extraction from a Clinical Database on Liver Diseases, Comp. Biomed. Research, 1991 (to appear).
[8] Ullman J.D., Principles of Database Systems, Sec. Ed., Pitman, London, 1983.
[9] Barsalou T., Wiederhold G., Applying Semantic Model to an Immunology Database, in: Stead W.W. (Ed.), SCAMC 87, pp. 871-877.
[10] Missaoui R., Applying Data Analysis Technique to Acquire Knowledge about Database Use, in: Schader M., Gaul W. (Eds), Data and Computer-Assisted Decisions, NATO ASI Series, Vol. F 61, Springer-Verlag, Berlin, 1990, pp. 349-360.
[11] Krusinska E., Liebhart J., Robust Multivariate Methods in Laboratory Techniques and in Assisting Medical Diagnosis, Medical Informatics 15, 1990, pp. 133-139.
[12] Krusinska E., Liebhart J., Robust Logistic Discriminant Functions in Diagnosing Chronic Obstructive Airways Disease, Comp. Biol. Med. 20, 1990, pp. 351-359.
[13] Krusinska E., Liebhart J., Robust Selection of the Most Discriminative Variables in the Dichotomous Problem with Application to Some Respiratory Disease Data, Biometrical Journal 30, 1988, pp. 295-303.
[14] Krusinska E., Variable Selection in Location Model for Mixed-Variable Discrimination - a Comparative Study, in: Diday E. (Ed.), Data Analysis and Informatics, V, North-Holland, 1988, pp. 57-67.
[15] Krusinska E., Liebhart J., A Note on the Usefulness of Linguistic Variables for Differentiating Between Some Respiratory Diseases, Fuzzy Sets and Systems 18, 1986, pp. 131-142.
[16] Boryslawski Z.R., Krusinska E., Fuzzy Linguistics Concept in Redescription of Vegetation Data, Coenoses 3, 1989, pp. 169-173.
[17] Chowdhury S.I., Computer-based Support for Knowledge Extraction from Clinical Databases, Ph.D. Dissertation No. 240, Linköping University, Linköping, 1990.