Mining data from a knowledge management perspective: an application to outcome prediction in patients with resectable hepatocellular carcinoma

Riccardo Bellazzi 1, Ivano Azzini 1, Gianna Toffolo 2, Stefano Bacchetti 3 and Mario Lise 3

1 Dipartimento di Informatica e Sistemistica, Università di Pavia, Pavia, Italy, {ric,ivano}@aim.unipv.it
2 Dipartimento di Ingegneria Elettronica e Informatica, Università di Padova, Padova, Italy, [email protected]
3 Dipartimento di Scienze Oncologiche e Chirurgiche, Sez. Clinica Chirurgica, Università di Padova, Padova, Italy, [email protected]

Abstract. This paper presents the use of data mining tools to derive a prognostic model of the outcome of resectable hepatocellular carcinoma. The main goal of the study was to summarize the experience gained over more than 20 years by a surgical team. To this end, two decision trees have been induced from data: a model M1 that contains a full set of prognostic rules derived from the data on the basis of the 20 available factors, and a model M2 that considers only the two most relevant factors. M1 will be used to make explicit the knowledge embedded in the data (externalization), while M2 will be used to extract operational rules (socialization). The performance of the models has been compared with that of a Naive Bayes classifier, and the models have been validated by the expert physicians. The paper concludes that a knowledge management perspective improves the validity of data mining techniques in the presence of small data sets coming from severe pathologies with relatively low incidence. In these cases, the quality of the extracted knowledge is more crucial than the predictive accuracy gained.

1 Introduction

In almost all clinical institutions there is a growing interest in summarizing the experience collected over the years on the diagnosis, treatment and prognosis of relevant diseases. Such interest is particularly high in the case of severe pathologies with relatively low incidence, since these problems are usually managed by the same experienced medical team over the years. A summary may be useful to the team for performing self-assessment, or to the institution for preserving its intellectual assets when the team changes. The application of statistical or machine learning methods is a way to extract knowledge from the available data, or, in other words, to make the implicit knowledge that resides in the data available within the hospital institution (socialization) or to transform it into explicit knowledge (externalization). Therefore, in

this context, the goal of the application of modern data analysis methods is basically institutional knowledge management (KM) [1]. This fact is related to three important points concerning the exploitation of the results obtained:
1. The data are typically observational and collected retrospectively. They do not come from trials and should not be used for evidence-based purposes.
2. The data are the result of a complex local process. Their validity and the outcomes they show, particularly in the presence of small data sets, may be limited to the institutional conditions under which they were collected.
3. The goal of the data analysis may also be a better comprehension of the information contained in the data, thus highlighting cases that do not confirm well-established knowledge, or problems in the data collection procedures.
Unfortunately, this crucial point is usually neglected in the available literature. The medical literature often shows extreme attitudes with respect to the exploitation of this kind of data: the data may be disregarded altogether, since they are not collected from evidence-based studies, or they may be claimed to support results of general validity. This is often the case when the problem under study does not allow for clinical trials, such as in low-incidence surgically treatable cancers. Rather curiously, the machine learning community also seems to have underestimated the relationship with KM issues when arguing about the usefulness and comprehensibility of the representation of the extracted knowledge. Recently, a commentary by Pazzani [2] discussed the controversial opinions about the usefulness and understandability of representation formalisms, highlighting the need for cognitive psychology to support knowledge discovery.
The consequence of this statement is that even the choice of the most effective Data Mining (DM) method for a certain problem (given similar performance) may depend on the particular KM issues to be faced. Differences in the background of the physicians and in the KM goal (externalization or socialization) can lead to different choices. In this light, we have dealt with the problem of outcome prediction in patients with resectable hepatocellular carcinoma: this problem is fairly representative of a large class of prognostic problems that could be properly studied from a KM perspective. The work presented herein lies in the area of machine learning for outcome analysis, which represents one of the most interesting directions for the application of DM in medicine [3].

2 The medical problem

Hepatocellular Carcinoma (HCC) is a severe disease with a yearly incidence of 3-5% in cirrhotic patients. HCC may be treated by liver resection; such a clinical decision depends both on the prognostic factors derived from the literature and on the physicians' experience [4]. In the Institute of Clinica Chirurgica 2 of the University of Padova, more than one hundred patients underwent liver resection in the last twenty years. After this long time span, the need to quantitatively summarize the experience gained was apparent. Such a synthesis is also related to the identification of clinical or epidemiological factors that are able to predict the prognosis of resectable patients. The externalization of such knowledge may help to improve the indications for liver resection

and the overall cost-effectiveness of the surgical procedures. The socialization of the knowledge may give some "day by day" prognostic rules that may be applied and progressively refined by a new clinical team. The overall data analysis procedure is useful to understand the quality of the information contained in the data themselves.

Table I. Prognostic Factors

Prognostic Factor (Abbreviation)             Categories                      Availability   # data
Gender (Sex)                                 M vs F                          Pre            77
Age (Age)                                    Continuous                      Pre            77
Cirrhosis (Tum)                              0, 1, 2                         Pre            77
Preoperative level of Albumin (Alb)          Continuous                      Pre            75
Preoperative level of γ-GT (GaGT)            Continuous                      Pre            70
Preoperative level of GOT (GOT)              Continuous                      Pre            74
Preoperative level of GPT (GPT)              Continuous                      Pre            74
Preoperative level of LDH (LDH)              Continuous                      Pre            66
Preoperative level of PT (PT)                Continuous                      Pre            72
Preoperative level of α-fetoprotein (AFP)    Continuous                      Pre            71
Child's class (Child)                        A (1) vs B (0)                  Pre            76
Number of nodules (Nod)                      Integer                         Pre            77
Diameter of largest nodule (Diam)            Continuous                      Pre            77
Preoperative chemoembolization (Chemo)       Yes (1) vs No (0)               Pre            76
Intraoperative blood loss (Perd)             Continuous                      Intra          73
Experience of the team (Exp)                 Before (0) vs after '88 (1)     Intra          77
Type of hepatic resection (Res)              Anatom (0) vs non Anatom (1)    Intra          77
T classification (TNM)                       T1-2 (0) vs T3-4 (1)            Post           72
Grading (Grad)                               G1-2 (0) vs G3-4 (1)            Post           67

2.1 Data

From 1977 to 1999, 117 patients with HCC underwent liver resection at the Institute of Clinica Chirurgica 2 of the Università di Padova. Indications for surgery were made on the basis of clinical assessment and of a number of clinical exams and laboratory tests which are recognised to be prognostic factors for HCC. The patients' follow-up included α-fetoprotein (AFP) dosage every 3 months, abdominal US or CT scan of the liver every 6 months and chest X-ray every 12 months. Tumor recurrence was defined as a lesion in the liver or in other organs, confirmed by percutaneous or open biopsy. From the clinical records, it was possible to derive a database of 90 patients that contained relevant clinical, surgical and pathological data. The outcomes were evaluated on the basis of a minimum follow-up of 18 months, which reduced the number of cases to 77. The twenty prognostic factors evaluated in the analysis were selected on the basis of the literature and of the physicians' experience. Fifteen factors are pre-operative, while 5 are collected intra- or post-operatively. The list of factors is reported in Table I. Recurrences occurred in 64 patients. Forty-seven of these died of disease while 17 were alive with disease at the time of analysis. On the basis of these results, two categories of outcome were considered: patients with early recurrence (within 18 months) and patients with late or no recurrence. The cut-off of 18 months was chosen since it represents the mean time to recurrence. The number of early recurrence patients was 39 and the number of late/no recurrence patients was 38.
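The dichotomization described above can be sketched as follows; the variable names and toy records are illustrative, not taken from the study database:

```python
# Label patients by the 18-month cut-off described in the text:
# class 1 = early recurrence (< 18 months), class 0 = late or no recurrence.
CUTOFF_MONTHS = 18

def outcome_class(months_to_recurrence, recurred):
    """Return 1 for early recurrence (< 18 months), 0 otherwise."""
    if recurred and months_to_recurrence < CUTOFF_MONTHS:
        return 1
    return 0

# Toy records: (months to recurrence or last follow-up, recurred?)
patients = [(6, True), (30, True), (24, False), (12, True)]
labels = [outcome_class(m, r) for m, r in patients]
print(labels)  # [1, 0, 0, 1]
```

Note that a patient who never recurred is grouped with late recurrences, which matches the class counts reported above (39 + 38 = 77 while only 64 patients recurred).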

3 Data analysis

3.1 Prognostic models for liver resection in HCC

In a related work, some of the authors of this paper studied the general problem of deriving a prognostic model for liver resection in HCC [5]. The approach followed was first to test classical survival models, and in particular to extract the most relevant features through a multivariate score model based on Cox's multivariate analysis for disease-free survival. This approach turned out to be unsuccessful for the prediction of early and late recurrent cases. Therefore, the data analysis was devoted to the application of methods capable of managing non-linearity and variable interactions, such as the Naive Bayes classifier (NBc), Decision Trees, Feed-forward Neural Networks, Probabilistic Neural Networks and Support Vector Machines. In this analysis, the NBc implementation described in [6] outperformed all the other methods. The NBc model was based on 8 factors only. Such factors were selected on the basis of a step-wise strategy and by exploiting the physicians' experience. In the following we will refer to this model as the reference model, or model M0.
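For readers unfamiliar with the reference classifier, the following is a minimal Gaussian Naive Bayes sketch. The actual NBc of the paper follows the implementation in [6] and uses 8 clinical factors; the two features and the data below are made up for illustration only:

```python
import math

def fit(X, y):
    """Per-class feature means/variances plus class priors."""
    model = {}
    for c in set(y):
        rows = [x for x, yi in zip(X, y) if yi == c]
        cols = list(zip(*rows))
        means = [sum(col) / len(rows) for col in cols]
        # small constant avoids zero variance on tiny samples
        vars_ = [sum((v - mu) ** 2 for v in col) / len(rows) + 1e-9
                 for col, mu in zip(cols, means)]
        model[c] = (means, vars_, len(rows) / len(y))
    return model

def predict(model, x):
    """Pick the class maximizing the Gaussian log posterior."""
    best_c, best_lp = None, float("-inf")
    for c, (means, vars_, prior) in model.items():
        lp = math.log(prior)
        for v, mu, s2 in zip(x, means, vars_):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - mu) ** 2 / (2 * s2)
        if lp > best_lp:
            best_c, best_lp = c, lp
    return best_c

# Toy data: two hypothetical continuous factors per patient
X = [[20, 3], [25, 4], [60, 8], [70, 9]]
y = [0, 0, 1, 1]
m = fit(X, y)
print(predict(m, [22, 3.5]), predict(m, [65, 8.5]))  # 0 1
```

The conditional-independence assumption is what makes the model tractable on small data sets, but, as discussed next, it offers little insight into factor importance or critical variable values.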

3.2 Building models with a KM perspective

Although the selected NBc showed a relatively low generalization error, it does not fulfill the main purposes of the analysis performed: it provides no information on the relative importance of the different factors, and it is not able to derive any information on critical variable values. Therefore, we revised the results obtained by better analyzing the problem characteristics and the physicians' needs.
• As reported in Table I, the data set was not complete. Twenty-five patients had some missing data, and the factors with the highest percentages of missing data were LDH (14%) and Grad (13%). Moreover, a specific need of the physicians was the possibility of evaluating the quality of the derived knowledge (i.e. the prognostic model) also on other retrospective data, which might be made available by collecting information from paper-based clinical records. Such data, by nature, are prone to contain missing values. This forces the use of methods that are able to properly handle missing values.



• The first analysis performed by the clinicians was devoted to looking for a scoring model able first to select the most relevant factors and second to predict the outcome. Such capability should be preserved also by other methods.
• It is important, for KM purposes, to derive models useful both for externalization and for socialization. To this end, it might be crucial to provide users with both a relatively deep and understandable model for increasing explicit knowledge, and a simpler but relatively accurate model for supporting socialization.
• It is important to derive models whose predictions are justified by declarative statements. Since the data set is very small with respect to the dimensionality of the problem, it is crucial to be able to validate the extracted classifier and the knowledge contained in it.
For the reasons listed above, we devoted our study to revising and improving the results of decision tree (DT) induction, since DTs seemed to accommodate the three specific needs of the problem at hand. Moreover, the availability of easy-to-use software solutions makes the analysis easy to perform and to reproduce.

3.3 Model selection

The basic goal of the DT induction process was to derive two different kinds of models:
• Model 1 (M1): a DT able to express structured knowledge on the entire factor set. M1 should be used to reason about factors and data, and to accomplish the task of externalization.
• Model 2 (M2): a DT with a minimum number of factors, to be used to give a hint about a patient's prognosis during day-by-day clinical activity, i.e. for socialization.
The first direction explored was the improvement of feature selection. This step in model building is known to be critical, since factors that are partially or completely irrelevant or redundant with respect to the target concept can affect the learning process. Moreover, some factors that are known to weakly affect the outcome may have an effect that is not measurable in small data sets.
All these problems may be dramatically true for our classification problem, in which the outcome of the pathology is difficult to forecast. In particular, we have used a simple and integrated method based on the Information Gain (IG), described in [7]. In detail, we ran the DT induction algorithm C5 with all 20 available factors: as is well known, C5 induces trees by selecting the splits that maximize IG. After training and pruning, a subset of factors is retained; such factors are considered to be the optimal ones with respect to IG. A new tree is then induced with only the selected factors. After this procedure, only 8 factors were kept. The DT obtained was assumed to be our model M1. It is interesting to note that the derived feature set was only partially overlapping with the one used in M0. To obtain model M2 we applied the same procedure described above to the factor set used in M0, since we wanted to derive a model based on factors already accepted by the physicians. We obtained a model with two features only. The selected factors for the different models are reported in Table II.

Table II. Selected prognostic factors

Model   Factors
M0      Pre: GPT, Sex, LDH, Diam, Tum; Intra: Res, Perd; Post: Grad
M1      Pre: GPT, GOT, AFP, Alb, Chemo; Intra: Res, Perd; Post: TNM
M2      Pre: Diam; Intra: Perd
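The IG criterion underlying the selection procedure of Sect. 3.3 can be sketched as follows; the toy binary factors and their names are illustrative, not the study data:

```python
import math

def entropy(labels):
    """Shannon entropy H(Y) of a label list, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(values, labels):
    """IG of a discrete factor: H(Y) - sum_v p(v) * H(Y | v)."""
    gain = entropy(labels)
    for v in set(values):
        sub = [y for x, y in zip(values, labels) if x == v]
        gain -= len(sub) / len(labels) * entropy(sub)
    return gain

# Toy data: 'chemo' separates the classes perfectly, 'sex' not at all.
y     = [1, 1, 1, 0, 0, 0]
chemo = [1, 1, 1, 0, 0, 0]
sex   = [1, 0, 1, 1, 0, 1]
print(round(info_gain(chemo, y), 3))  # 1.0
print(round(info_gain(sex, y), 3))    # 0.0
```

In the two-pass procedure, a first tree is induced on all factors with splits chosen by maximal IG; after pruning, only the factors actually used in the tree are kept and a second tree is induced on that reduced set.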

It is interesting to note that model M2 is grounded in the observation that the factor Diam is a statistical summary of the factors of model M1 (without Perd). In fact, we were able to derive a DT that discriminates between the classes Diam ≥ 6 and Diam < 6 … 3.2 and Chemo = 0 and TNM = 0 then class 0
R4: (4) GPT … 34 then class 1
Default class: 0

The model M2 is translated into the simple rule set shown in Table V. Model M2 has also been tested on a validation set of 9 cases. For such cases it was only possible to derive from the clinical records the factor set used in M0. The total accuracy was 7/9, the sensitivity was 3/5 and the specificity was 4/4. Rather interestingly, the same results were obtained with model M0.

Table V. Rule set for model M2
R1: (2) Diam > 14 then class 0
R2: (50/16) Diam … 400 then class 0
R3: (15/1) Diam > 6 and Diam … 7 then class 1
R2: (14/3) Perd > 1600 then class 1
R3: (42/10) Perd
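Each rule above is annotated with its performance "(n/m)". Assuming the usual C5 reading, in which n is the number of training cases covered by the rule and m the number among them that are misclassified (an assumption, since the original table legend is not fully legible here), the empirical accuracy and a common Laplace-smoothed estimate of rule quality can be computed as:

```python
def rule_accuracy(n, m):
    """Empirical accuracy of a rule covering n cases with m errors."""
    return (n - m) / n

def laplace_accuracy(n, m):
    """Laplace-smoothed accuracy (n - m + 1) / (n + 2), a common
    rule-quality estimate that penalizes low-coverage rules."""
    return (n - m + 1) / (n + 2)

# e.g. the rule marked (50/16) above:
print(round(rule_accuracy(50, 16), 3))     # 0.68
print(round(laplace_accuracy(50, 16), 3))  # 0.673
```

The smoothed estimate matters for rules such as R1 (2 covered cases, no errors), whose raw accuracy of 1.0 would otherwise look better supported than it is.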