Combining Data Mining and Case-based Reasoning for Intelligent Decision Support for Pathology Ordering by General Practitioners
Zoe Y. Zhuang a, Leonid Churilov a,*, Frada Burstein b, Ken Sikaris c
a Department of Accounting and Finance, Faculty of Business and Economics, Monash University, Australia
b Caulfield School of IT, Faculty of Information Technology, Monash University, Australia
c Melbourne University Clinical School - St Vincent's & Geelong Hospitals, The University of Melbourne, Australia
(* contact author)
Abstract
Pathology ordering by General Practitioners (GPs) is a significant contributor to rising health care costs both in Australia and worldwide. A thorough understanding of the nature and patterns of pathology utilization is an essential requirement for effective decision support for pathology ordering. In this paper a novel methodology for integrating data mining and case-based reasoning for decision support for pathology ordering is proposed. It is demonstrated how this methodology can facilitate intelligent decision support that is both patient-oriented and deeply rooted in practical peer-group evidence. Comprehensive data collected by professional pathology companies provide a system-wide profile of patient-specific pathology requests by various GPs, as opposed to a profile limited to an individual GP practice. Using real data provided by XYZ Pathology Company in Australia that contain more than 1.5 million records of pathology requests by GPs, we illustrate how knowledge extracted from these data through data mining with Kohonen’s Self-Organizing Maps constitutes the base that, with further assistance of modern data visualization tools and on-line processing interfaces, can provide “peer-group consensus” evidence support for solving new cases of the pathology test ordering problem. The conclusion is that a formal methodology that integrates case-based reasoning principles, which are inherently close to GPs’ daily practice, with data-driven, computationally intensive knowledge discovery mechanisms, which can be applied to the massive amounts of pathology request data routinely available at professional pathology companies, can facilitate more informed evidential decision making by doctors in the area of pathology ordering.
Key words: Decision Support, Data Mining, Case Based Reasoning, Data Clustering, Kohonen’s Self-Organizing Maps, Health Care Systems
1. Introduction
Increasing use of clinical pathology services has long been recognized as a worldwide phenomenon in countries with different healthcare systems, and has attracted the attention of researchers, practitioners and governments all over the world. In Australia, general
practitioners (GPs) order and manage most of the pathology requests (Cohen et al., 1998). According to the Bettering the Evaluation And Care of Health (BEACH) study, a nationwide survey and ongoing program on general practice activity in Australia (Britt et al., 2004), there has been a significant increase in the number of pathology tests ordered per 100 consultations, from 19.7 in 2000-01 to 35.2 in 2003-04, representing an increase of almost 20% over the recent 4 years of the BEACH program. There are various recognized systemic factors influencing the growth of GP pathology utilization (Guibert et al., 2001); one of these factors is that the appropriateness of doctors’ decision making when ordering pathology services is not assured (Vinning and Mara, 1998; van Walraven and Naylor, 1998; Lundberg, 1998; Smellie, 2003). Stuart et al. (2002) argue that the wide variation in test ordering, particularly when tests are used for diagnostic purposes, means that some tests may be unnecessary or ordered inappropriately. Smellie et al. (2002) extend this argument to suggest that the large differences observed in general practice pathology requesting are accounted for by individual clinical practice and are therefore potentially amenable to change through more consistent and better informed decision making by GPs. In order to influence doctors’ decision making as far as test ordering is concerned, a number of specific interventions have been attempted over the past two decades. Some of these interventions aim to support GPs at the point of care (mainly through implementation of various guidelines and protocols), while others stress the continuing effects on ordering behavior (educational programs, utilization audits, feedbacks, incentives, etc.) (Axt-Adam et al., 1993; Solomon et al., 1998; Isouard, 1999; Stuart et al., 2002; Verstappen et al., 2003).
However, a recent systematic analysis conducted by Rao et al. (2003) demonstrates that the effects of many interventions are frequently short-lived, owing to a combination of human factors and the inappropriate or incomplete nature of the clinical evidence-based tools available to support doctors’ decision making in pathology ordering. From the point of view of evidence-based medical practice, the level of evidence may vary from relatively reliable sources generated by a committee of experts (such as
guidelines and protocols) to less reliable information based on individual practice experiences (such as utilization audits and peer review). Young and Ward (1999) demonstrate that despite positive attitudes towards evidence-based medicine (EBM), general practitioners report low levels of use of either printed or electronic summaries of evidence, even among the doctors who were aware of these resources. The major barriers GPs perceived to using the evidence provided include time constraints, information overload, and the lack of relevance and usefulness of the evidence (van der Weyden, 1999; Weekley et al., 2000). In other words, too often the available evidence was not specific, relevant, or appropriate enough to support doctors’ decision making activities. In the area of pathology ordering, according to Smellie et al. (2005), there is a wealth of guidance, consensus documents, national policy statements, and related documents that seek to provide decision support for laboratory testing. However, available guidelines are mostly focused on a particular disease and provide (sometimes very) limited advice for specific patient-centric interpretation. The often highly non-trivial task of interpreting the guidelines and matching a particular patient case to the situations specified by the guidelines is frequently left to the individual doctor’s clinical judgment with otherwise very limited decision support. Patient-specific information based on wider professional practice is usually not reflected in guidelines. Overall, current evidence bases in this area are often rather limited and rigid for the purposes of decision support for daily pathology ordering activities dealing with specific patients. Thus, the current inability to routinely generate specific, situationally relevant, and clinically appropriate evidence to support GPs’ daily test ordering activities becomes a major obstacle in achieving effective test ordering decision support.
This, in turn, hinders the achievement of long-lasting effect of interventions aimed at appropriateness of doctors’ pathology services ordering behavior. Thus, the existence of effective and robust methodology for generating the required evidence becomes a necessary precondition for its use by GPs for decision making and, by implication, is one of the important drivers for more appropriate pathology test ordering.
The objective of this paper is to demonstrate how integrated use of intelligent case base classification and case retrieval methodologies can generate patient-oriented, situationally relevant, and peer-group based evidence in order to facilitate interactive decision support for pathology ordering by GPs. Specifically, the formal methodologies discussed in this paper are data mining and case-based reasoning (CBR). Data mining is used for discovering and understanding hidden information from complex and large datasets to come up with meaningful patterns (Han and Kamber, 2001). One of the most frequently encountered instances of data mining is data clustering. Clustering involves the process of grouping the data into classes or clusters so that objects within a cluster have high similarity while objects from different clusters are dissimilar. The formal method used in this paper for clustering is Kohonen’s Self-Organising (Feature) Maps (SOFM or SOM) (Kohonen, 1982; 1990; 1997). SOM belongs to the class of neural network based tools for unsupervised learning and can be successfully used for data clustering and visualization. Case-based reasoning can be utilized to solve a new problem by remembering a previous similar situation and by reusing information and knowledge of that situation (Aamodt and Plaza, 1994). Instead of relying on general knowledge of a problem domain, or making associations along generalized relationships between problem descriptors and conclusions, CBR is able to utilize the specific knowledge of previously experienced, concrete problem situations (cases). In medical domains, CBR has mainly been applied for diagnostic and partly for therapeutic tasks (Schmidt et al., 2001). Data mining techniques have previously been combined with CBR systems for automated case generation (Clerkin et al., 2002), efficient case retrieval and case-base maintenance (Yang and Wu, 2000), and case-based classification (Aha and Bankert, 1994; Arshadi and Jurisica, 2005).
The novelty of this paper is in combining SOM-based data clustering and CBR to facilitate the evidence based, situationally relevant, interactive, and flexible
decision support for pathology ordering activities by GPs. As the topic of the special issue is “Formal Methodologies and Tools for DSS”, the main focus of this paper is on the methodological aspects of the intelligent case base classification and case retrieval for decision support, while the issues of actual systems implementation and performance monitoring are discussed only briefly and are effectively treated as being outside the scope of the paper. The remainder of this paper is organized as follows: Section 2 presents the decision making context by concentrating on both the description of the decision problem and the existing tools for decision support currently available in the domain; the proposed approach for integrating data mining and CBR methodologies is discussed in Section 3; Section 4 is dedicated to the discussion of tools and techniques for clustering and cluster quality assessment; step-by-step implementation of the proposed approach is described in Section 5; Section 6 concludes the discussion by providing a brief discussion and formulating future research directions. The content of this paper is partially based on the results reported in Zhuang et al. (2006) and Zhuang et al. (2006a).
2. Decision-making context: decision problem and existing decision support tools
Arguably the main feature of clinical pathology ordering as a decision-making context is its reliance on the principles of evidence-based medicine. The essence of EBM is that decision making in healthcare should be influenced by the best available evidence and clinical experience, and the practice of EBM is to integrate individual clinical expertise with the best external evidence from systematic research (van der Weyden, 1999). According to Bichindaritz et al. (1998), the quality of evidence can be characterized by two main dimensions of the knowledge upon which it is grounded: reliability and certainty. In terms of reliability, the most reliable evidence is the one generated by a world-wide committee of experts (practice principles), then by a committee of experts
(practice guidelines), then by a group of experts (protocols), and then by one expert (judgment). As far as certainty is concerned, the quality of evidence can be stratified from the highest level of evidence obtained from randomized controlled trials to lower levels of consensus, expert opinions and individual’s experience. GPs make decisions on pathology ordering in specific clinical situations. Below the context of the decision-making process for pathology ordering is presented using the framework suggested by Davis and Cosenza (1993) that describes decision making as a multi-step process consisting of problem recognition, information search, problem analysis, alternative evaluation, and choice.
2.1. Decision problem
In a pathology ordering context, the decision-making process begins with the recognition of a need to order certain tests for a particular patient with specific characteristics (such as demographics and past clinical history including past pathology ordering history) and existing symptoms or diseases. Information is then gathered to support the decision making. Available information for pathology ordering may vary from standardized guidelines or protocols to subjective personal knowledge or experience. Using relevant information in the form of guidelines or similar scenarios from past experience, the GP comes up with a meaningful list of plausible tests to be ordered. Finally, the GP makes the choice regarding types of tests to order in a given patient situation. This choice can be influenced by many factors (Wertman et al., 1980; Lundberg, 1983; 1998) including internal factors (such as clinical need for screening, diagnosis, disease monitoring or prognosis), and external factors (such as acting under patient pressure, exercising defensive behavior, following the guidelines, and system-wide economic and cost considerations). Figure 1 summarizes the decision context by depicting the decision-making steps and factors to be considered in each step:
[Figure 1. The decision context. The diagram links five decision-making steps and the factors considered in each:
- Problem recognition: patient characteristics, symptoms
- Information search: guidelines, personal experience
- Problem analysis: guideline matching, case-based reasoning
- Alternative evaluation: a list of plausible tests
- Choice: internal influences, external influences]
To further illustrate the decision context, consider a hypothetical clinical case as an example. A 65-year-old woman with well-managed chronic diabetes comes to see the doctor. Patient records indicate that during the past year she had 5 pathology requests with 24 various tests ordered. She is looking forward to seeing that her next set of test results remains good. In this case, the problem is to decide what tests (if any) to order for an elderly female patient with diabetes who had a relatively high pathology ordering profile over the previous year. To find evidence to support the decision making, the doctor can search for relevant guidelines for diabetes such as the “guidelines for the management and care of diabetes in the elderly”. If the guideline information is not sufficient to address the specific situation at hand, the doctor may recall a similar diabetes patient case he/she managed recently and may come up with a list of tests (such as full blood examination, lipid profile, HbA1c, etc.) that may be appropriate for this situation. Finally, the doctor may choose to order a combination of tests for diabetes monitoring for the particular patient. This decision can be influenced by several factors. For instance, tests ordered for disease monitoring differ considerably from those ordered for screening, diagnostic, or prognostic purposes. External factors of a non-clinical nature can also influence the doctor’s decision making. In this case, the patient is anxious to make sure that her next set of results remains good, and the doctor may order tests under patient pressure to ensure her psychological comfort.
2.2. Existing Tools for Decision Support
For the purposes of this analysis, we adopt the broad definition of decision support activities (Marakas, 2003) as the set of activities within an unstructured or semi-structured decision context that aim to support rather than replace the decision maker (DM), facilitate learning on the DM’s behalf, and use underlying data and models to focus on the effectiveness of the decision making process. At present, the main information sources available for decision support for pathology ordering include clinical guidelines, general feedbacks from expert pathologists, and the knowledge and experience of the GPs. Most clinical guidelines have not been developed in a format that allows for straightforward incorporation into computerized clinical decision support systems (Kidd and Mazza, 2000). In practice, clinical guidelines are commonly presented to doctors in paper format or, even when presented on a computer, only as pages of text. As a means for decision support, these guidelines are of very limited use in most clinical encounters where GPs are struggling with information overload and time constraints. In some medical areas such as drug prescribing (Ahearn and Kerr, 2003), guidelines are more likely to be presented to GPs as a series of brief prompts targeted to manage the individual patient. In pathology ordering, however, flexible and interactive guidelines suitable for decision support are yet to become a reality. General feedbacks provided to GPs from the pathology company usually contain general and brief information on the overall volume of test ordering by the GP during a period of time. These feedbacks are not designed to include detailed and specific information, such as tests ordered for a particular group of patients with a particular kind of disease.
This means that the decision support provided by general feedbacks does not take patient characteristics into consideration and thus cannot support the GP by offering patient-specific and situationally relevant evidence.
In cases where no immediately relevant or accessible information is available, GPs are likely to base their decision on individual clinical experience or judgment. This kind of decision often exhibits the signs of low level of evidence as far as both reliability and certainty are concerned. As pointed out by Cox (2001), clinician’s everyday behavior is based on “sloppy” numerical thinking, and clinical memory can be selective and biased. In summary, current guidelines for pathology ordering are mostly disease-focused and only contain lower levels of evidence. More generally, according to Smellie et al (2000), in the field of pathology ordering it is difficult to envision the gold standard of randomized controlled trials being applied to all of the possible situations involving pathology tests. The evidence base of guidelines is limited to lower categories based on non-randomized trials, or on consensus opinion. Even when a high degree of evidence is available in a disease setting, information about the application of laboratory investigations in specific patient situations is often limited. As suggested by van der Weyden (1999), general practice centers on the individual patient-doctor relationship and may require more “circumstantial” evidence rather than the “watertight” evidence accrued by randomized controlled trials. In daily practice, it is still a norm for the GP to rely on personal experience and judgment which is based on no external evidence to make a decision on pathology ordering. Thus, in the decision making context of GP pathology ordering, current decision support provided by clinical guidelines, general feedbacks and individual experience can be considered as being of very limited effectiveness as far as interactivity, flexibility, situational relevance, and evidence base are concerned. 
In the next section, we propose an intelligent decision support methodology to better fulfill the contextual requirements of evidence based, patient situation relevant, flexible, and interactive decision-making at the point-of-care test ordering.
3. Approach and Methodology
With the massive amount of test request data available in pathology laboratories, data mining techniques can be used to discover the requesting patterns in the large repositories held by pathology companies. This can provide GPs with peer-group evidence of test ordering by other doctors within the system as a comparison to their own ordering behavior. The case-based reasoning approach, on the other hand, is useful in leveraging knowledge encapsulated in previously experienced and resolved ordering cases to support making a new test order.
3.1. Data mining and Case Based Reasoning
Data mining is the extraction of implicit, previously unknown, and potentially useful information from complex and large datasets. Berry et al. (2000) identify several stages of a generic data mining project. Firstly, the objectives and requirements of the project are specified and an understanding of the data is developed. Then the data is prepared through cleaning and transformation for modeling. With the problem understood and data prepared, model building or pattern identification is undertaken through the application of algorithms appropriate to the problem. The models are evaluated for technical accuracy and suitability. Findings are then applied to the business settings. In the setting of this study, the objective of data mining is to discover hidden patterns in past pathology requesting data. Case Based Reasoning provides the decision maker with an ability to utilize the specific knowledge of previously experienced, concrete problem situations, or specific patient cases (Kolodner, 1993). Cases may be kept as concrete experiences, or a set of similar cases may form a generalized case. According to Aamodt and Plaza (1994), central tasks of CBR methods are to identify the current problem situation, find a past case similar to the new one, use that case to suggest a solution to the current problem, evaluate the proposed solution, and update the system by learning from this experience. A general CBR cycle may be described by four processes: retrieve the most similar case or cases;
reuse the information and knowledge in that case; revise the proposed solution; and retain the experience for future problem solving. The retrieval process involves the tasks of situation assessment, initial match and final selection. According to a recent overview of medical CBR systems and system development by Nilsson and Sollenborn (2004), CBR systems have been used in the medical domain for the purposes of diagnosis, classification, tutoring, and planning (such as therapy support). In the medical field, attempts to apply the complete CBR cycle are rather exceptional (Schmidt et al., 2001). The most challenging task for the CBR method is that of adaptation. In medical applications it is almost impossible to generate adaptation rules that consider all possible important differences between current and former similar cases. Therefore, some adaptation solutions have been developed that are rather typical for medical domains. One of the solutions is to focus only on the retrieval of similar cases and present them as information to the user. The motivation for abandoning the adaptation task is two-fold: in health-related application domains it is too complicated or even impossible to acquire sufficient adaptation knowledge; and physicians tend to be interested in getting information about former similar cases, but prefer to reason about the current situation themselves. In this paper, we focus on case retrieval and providing relevant evidence for decision support.
3.2. Integrated approach
The separate application of either data mining or CBR principles cannot fully achieve the aims for evidence based, patient situation relevant, flexible, and interactive intelligent decision support for pathology ordering. Table 1 compares these two approaches in terms of how well they can meet these decision support requirements. As discussed earlier in this section, while data mining is dealing with discovering knowledge from data, CBR is concerned with how to use that knowledge to solve a new problem. The natural proposition is then that these two approaches can complement each
other to better meet the evidence based, situational relevance, flexibility, and interactivity requirements for intelligent decision support at the point of care.

Table 1. Comparison between data mining and CBR approaches
Criterion | Data mining | CBR
Evidence base | Evidence based on past pathology ordering by a large group of peer GPs | Judgment based on individual GP’s knowledge and experience
Situational relevance | Data is prepared from a patient-oriented perspective | Caters well to a particular patient situation
Flexibility | Provides general information on patterns with possible drill-down capabilities | Specific case dependent: the GP collects information from similar individual cases
Interactivity | Interactively presents knowledge discovered through mining via visual maps | Full cycles of past knowledge retrieval to support the current problem
The underlying philosophy of the proposed integrated approach is to establish a patient-centric “peer-group based” knowledge repository of past experiences to support problem solving in new cases. In particular, once relevant knowledge is extracted from past data through data mining techniques, it can be retrieved and reused by the CBR process cycle. The suggested two-stage framework for implementing this approach is demonstrated in Figure 2. In the setting of this study, past pathology requesting data from one or more pathology companies can be explored and prepared with a patient-focused perspective for modeling. We stress the importance of the patient (as opposed to the disease) perspective as a focus for the activities, because in real clinical scenarios, it typically is a particular patient (not a disease group) who presents the primary context for the doctor’s decision making activities. Although many different clustering algorithms can be used for the purposes of the data mining stage (Kantardzic, 2003), the procedure employed for data clustering in this study is based on Kohonen’s Self-Organizing Maps (SOM) (Kohonen, 1982; 1990; 1997). SOM architecture is an unsupervised neural network approach for data clustering and visualization. A number of grouping models can then be identified through clustering and further evaluated against each other using both quantitative (Bolshakova and Azuaje, 2003) and qualitative criteria (Siew et al., 2002). The most informative model can then be
chosen and analyzed for case-based reasoning to support GP decision making in pathology ordering.
[Figure 2. Framework for the integrated approach. The diagram shows two linked stages that share a common data/case base.
Data mining stage:
- Data preparation: the original data are cleaned and preprocessed for the purpose of this research.
- Sample generation: ten independent samples are randomly drawn from the study population.
- SOM application: a clustering model is generated independently for each sample.
- Model assessment: DBIs are calculated to assess the quality of each clustering model.
- Cross-examination: the best models are cross-examined to ensure consistency.
- Model selection: the representative model is selected based on quantitative and qualitative criteria.
- Detailed analysis: the representative clustering model is analyzed with respect to demographics and test consumption patterns.
CBR stage (starting from a new case):
- Case retrieval, situation assessment: the new case is assessed according to patient characteristics and past pathology ordering.
- Case retrieval, initial match: the new case is classified into a patient type/cluster.
- Case retrieval, final selection: information on past cases is selected at the cluster, disease, or “best match” level.
- Case reuse: the GP uses the information of past cases to make an ordering decision for the new case.
- Case revision: the pathologist gives an evaluation/comment on the GP’s ordering for the new case.
- Case retaining: the pathologist’s evaluation or comment is integrated into the new case (as expected for an ideal system), and the new case, together with the evaluation, is stored in the pathology request database.]
Once the structure of patient clusters is specified, this information can be used to support the CBR processes. In the retrieval process, a patient cluster with specific characteristics
similar to the current case can easily be retrieved for further selection. Current capabilities of data mining and visualization software often allow for automatic assignment of a new patient record to the best matching cluster within a pre-defined grouping model based on a clearly formulated but fully customizable notion of similarity (Deboeck and Kohonen, 1998; Eudaptics Software GMBH, 1999). The best matching cluster then can serve as the starting base that the doctor can further interactively narrow down to the most similar cases worthy of intensive consideration. Once the best “match” is found, the information and statistics of the selected case or subset of cases can then be used for resolving the actual problem in a standard CBR fashion. Note that in the true spirit of decision support (Gorry and Scott Morton, 1989), this approach is not intended to replace the decision maker, since, as discussed earlier in this section, clinical judgment of the doctor is required to re-live the logic and rationale behind the solutions to the past case (or groups of cases) and translate this previous ordering experience into a solution for current ordering, i.e. beyond the retrieval stage. Since the proposed approach focuses on supporting the GP for pathology ordering at the point of care, the immediate task ends after the GP uses the information of retrieved past cases and makes the decision of test ordering for the new case. At the same time, the whole CBR process may, and indeed should, continue until the new ordering case is evaluated by a pathologist and all the information concerning the case is stored in the pathology request database. Although it is beyond the scope of this paper, this progression opens up the opportunity for future research on the integrated CBR system that incorporates the workflows of both GP test ordering and pathology company.
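The two-step retrieval described above (assignment of a new case to the best matching cluster, followed by narrowing down to the most similar cases) can be illustrated with a minimal sketch. It is not the procedure implemented in the software used in this study: the function name `retrieve_similar_cases`, the feature vectors (for example scaled age, number of requests, and number of tests in the previous year), and the choice of Euclidean distance as the similarity notion are all assumptions made purely for illustration.

```python
import math


def retrieve_similar_cases(new_case, clusters, k=3):
    """Return (best_cluster_id, k most similar past cases in that cluster).

    clusters maps a cluster id to a list of (case_id, feature_vector) pairs.
    Euclidean distance is an assumed similarity measure; the text stresses
    that the notion of similarity is customizable.
    """
    def centroid(cases):
        vectors = [vec for _, vec in cases]
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    # Initial match: pick the cluster whose centroid is closest to the new case.
    best = min(clusters,
               key=lambda cid: math.dist(new_case, centroid(clusters[cid])))

    # Final selection: rank past cases within that cluster by similarity.
    ranked = sorted(clusters[best],
                    key=lambda case: math.dist(new_case, case[1]))
    return best, ranked[:k]
```

In this sketch the doctor would receive the identifier of the matching patient group together with a short list of its closest past cases; interpreting those cases and deciding what to order remains with the doctor.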
4. Tools and Techniques
In this paper the technique of clustering is used for the purpose of data mining. Clustering involves grouping of similar data items. The aim of clustering in the context of this study is to identify homogeneous patient groups based on their demographic information and pathology consumption patterns. The clustering technique used in this research is based on Kohonen’s Self-Organizing Map (SOM) (Kohonen, 1982; 1990; 1997).
4.1. Kohonen’s Self-Organizing Maps
The self-organizing map is a type of neural network which classifies data and discovers relationships within the data set without any guidance during learning (Smith, 1999). The objective of the SOM method is to find groups of records with minimal intra-group diversity and maximal inter-group separation. The basic principle of discovering these groups is to identify which input patterns are similar and should be grouped together (or clustered). The similarity of two input patterns is determined by the distance between these inputs in the (multidimensional) input space. The SOM method is driven by a non-parametric algorithm and relies on data, rather than domain-specific expertise. SOM generally employs large data sets, works well with many input variables and produces arbitrarily complex models unlimited by human comprehension (Kennedy et al, 1998). SOMs provide a visual understanding of patterns in data through a two-dimensional representation of all variables. The SOM algorithm repeatedly repositions records in the map until a classification error function is minimized. Records that have similar characteristics are adjacent in the map, and dissimilar records are situated at a distance determined by their degree of dissimilarity.
In particular, a SOM consists of a layer of input vectors and a two-dimensional grid of output nodes. Each output node is connected to the inputs through a vector of weights. When an input vector is presented, the output node whose weights most closely match it (the most similar node) is identified as the winning node. The input vector is thus mapped to the location of the winning node. The weights of the winning node and of its neighbourhood are then moved closer to the input vector. This process repeats until the weights stabilize and all input vectors are mapped onto the output array. In this way, input vectors with similar data patterns are located in adjacent regions, while dissimilar vectors are situated at a distance in the output map.
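The update cycle described above (present an input, find the winning node, pull the winner and its grid neighbours towards the input) can be illustrated with a minimal sketch. This is the basic incremental SOM rule in plain Python, not the Batch-SOM variant implemented in Viscovery SOMine; the function names, the linear decay schedules, the Gaussian neighbourhood and all parameter values are assumptions chosen for illustration only.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weights are closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small rows x cols SOM with the incremental update rule.

    Hypothetical settings: linear decay of learning rate and neighbourhood
    radius, Gaussian neighbourhood, weights initialised from random records.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = {
        (r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)
    }
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = data[:]
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate shrinks over time
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks too
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    # move the node towards the input, scaled by grid proximity
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights
```

After training, calling `best_matching_unit(weights, record)` maps any record to a grid position, so records with similar attribute values land on the same or neighbouring nodes.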
Viscovery SOMine (Deboeck and Kohonen, 1998), the software tool used in this analysis, employs a variant of Kohonen’s Batch-SOM (Kohonen, 1995) guided by Ward’s classic Hierarchical Agglomeration algorithm (Ward, 1963) to determine the optimal number of clusters. It has advanced data visualization capabilities; in particular, it projects data onto two-dimensional maps, which makes the results easier to analyze and understand. Features of the data and the dependencies between the variables can be identified and evaluated from the map (Eudaptics Software GMBH, 1999).
4.2. Cluster Quality Assessment Tools
When assessing the quality of a clustering model for validation purposes, both qualitative and quantitative criteria can be used. The Davies-Bouldin index (DBI) is used to quantitatively assess the quality of cluster separation. It aims to identify sets of clusters that are compact and well separated. This validation framework has been successfully tested on clustering techniques such as the Kohonen Self-Organising Map algorithm (Bolshakova & Azuaje, 2003). The Davies-Bouldin validation index is defined as (Bolshakova & Azuaje, 2003):
$$DB(U) = \frac{1}{c} \sum_{i=1}^{c} \max_{j \neq i} \left\{ \frac{\Delta(X_i) + \Delta(X_j)}{\delta(X_i, X_j)} \right\} \tag{1}$$
where ∆(Xk) represents the intra-cluster distances (“diameter”) of cluster Xk ; δ(Xi, Xj) defines the inter-cluster distance between clusters Xi and Xj ; and c is the number of clusters of partition U. Naturally, small index values correspond to “good” clusters, i.e. the clusters that are compact and their centres are far away from each other. In conjunction with the quantitative methods, the following qualitative criteria are used to select the representative model (Siew et al., 2002):
• Representability: The variables of each cluster should be distinct and carry some information of their own. When each cluster is analyzed, its profile should be unique and meaningful.
• Explainability: The clusters themselves are distinct in terms of top tests ordered and top presenting clinical problems. If two groups could independently come up with different testing and clinical problem patterns, then these two groups are considered different from one another.
• Level of sophistication: The total size of each cluster should be closely monitored. If the cluster is too large then it is possible that more distinct groups could hide in the cluster. If it is too small, then there is high probability that the cluster is artificial.
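As an illustration of the quantitative criterion, the Davies-Bouldin index of equation (1) can be computed as in the sketch below. The paper does not state which intra-cluster and inter-cluster distances were used, so the choices here are assumptions: ∆(X) is taken as the mean Euclidean distance of a cluster's members to its centroid, and δ(Xi, Xj) as the Euclidean distance between centroids. The function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a partition given as a list of point lists.

    Assumed definitions: Delta = mean distance to the cluster centroid,
    delta = distance between cluster centroids.
    """
    cents = [centroid(c) for c in clusters]
    scatter = [
        sum(euclidean(p, m) for p in pts) / len(pts)
        for pts, m in zip(clusters, cents)
    ]
    c = len(clusters)
    total = 0.0
    for i in range(c):
        # worst-case (largest) similarity ratio between cluster i and any other
        total += max(
            (scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in range(c)
            if j != i
        )
    return total / c
```

For two clusters of scatter 1 whose centroids are 10 units apart, the ratio is (1 + 1) / 10, so the index is 0.2; tighter or more distant clusters give smaller values.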
5. Implementation
In accordance with the integrated approach described in Section 3, the aim of the data mining stage is to discover homogeneous patient clusters from massive pathology ordering data and to extract knowledge within the clusters for use by CBR at a later stage. The data provided by XYZ Pathology Company in Australia contain 1,548,122 records of pathology requests by General Practitioners (GPs) within the period from 01 May 2003 to 30 April 2004. Each record represents an individual request for one or a group of pathology tests on behalf of a patient.
5.1. Data Preparation
Every individual pathology request record in the original data set contains 15 fields relating to four broad categories: patient-related information, doctor-related information, request information, and billing information. Due to the scope of this study, only relevant patient-related and request information is selected for further analysis. The selected fields include: Patient ID, Patient Date of Birth, Patient Gender, Request ID, Date of Service, Referral Date, Tests Ordered (presented as codes), and Clinical Notes.
Clinical Notes are doctors’ notes in free text format providing information regarding patients’ health status. Although about one third of the total requests do not include clinical notes, these requests are still included in the analysis with a missing value for clinical notes. Records with empty values for key attributes such as Patient Date of Birth, Patient Gender, Referral Date, and Tests Ordered are deleted, as are records containing obvious error values. As a result, about 2% of the original records are deleted and 1,511,889 records of pathology requests are included in the analysis. The following pre-processing is performed:
• Patient age, number of tests per request, and service lag per request are calculated for each record. Service lag is defined as the time interval between the date of referral and the date of service for a pathology request. Patient gender is recoded as a binary variable (0 = male; 1 = female).
• As any given patient can potentially have more than one pathology request ordered per year, the original request-based data are aggregated into patient-based records with an additional attribute generated for each patient: the number of orders per year. Requests for individual patients are aggregated into 764,470 patient-specific records so that each patient record includes the number of orders per year, the total number of tests ordered per year, and the average service lag for that year. (Tables 2 and 3 provide a comparison between the original request-based record structure and that based on the individually aggregated patient records.)
• Clinical notes are transformed into problem codes (each representing a disease) by keyword extraction from the free text.
Table 2. Request level data set fragment
Patient ID   Age   Gender   Number of tests per order   Service lag
396          19    0        1                           0
400          59    1        5                           100
400          59    1        6                           13
400          60    1        7                           41
402          55    0        5                           42
402          55    0        7                           3
404          80    1        1                           1
404          80    1        5                           0
404          81    1        7                           8
Table 3. Patient level data set fragment
Patient ID   Age   Gender   Number of orders per year   Number of tests per year   Service lag
396          19    0        1                           1                          0
400          59    1        3                           18                         51
402          55    0        2                           12                         23
404          80    1        3                           13                         3
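The aggregation from request-level records (Table 2) to patient-level records (Table 3) can be sketched as follows. The rows are those of the Table 2 fragment. Two details are assumptions inferred from the tables rather than stated in the text: the patient age is taken as the youngest age recorded during the year, and the average service lag is rounded half up to whole days. The function name is hypothetical.

```python
# Request-level rows from Table 2:
# (patient_id, age, gender, tests_in_order, service_lag_days)
REQUESTS = [
    (396, 19, 0, 1, 0),
    (400, 59, 1, 5, 100),
    (400, 59, 1, 6, 13),
    (400, 60, 1, 7, 41),
    (402, 55, 0, 5, 42),
    (402, 55, 0, 7, 3),
    (404, 80, 1, 1, 1),
    (404, 80, 1, 5, 0),
    (404, 81, 1, 7, 8),
]


def aggregate_by_patient(requests):
    """Collapse request-level rows into one row per patient.

    Output row: (patient_id, age, gender, orders_per_year,
                 tests_per_year, mean_service_lag_days).
    """
    grouped = {}
    for pid, age, gender, n_tests, lag in requests:
        rec = grouped.setdefault(
            pid, {"age": age, "gender": gender, "orders": 0, "tests": 0, "lag_sum": 0}
        )
        rec["age"] = min(rec["age"], age)  # assumption: youngest recorded age
        rec["orders"] += 1
        rec["tests"] += n_tests
        rec["lag_sum"] += lag
    rows = []
    for pid, rec in grouped.items():
        mean_lag = rec["lag_sum"] / rec["orders"]
        # assumption: average lag rounded half up to whole days
        rows.append(
            (pid, rec["age"], rec["gender"], rec["orders"], rec["tests"], int(mean_lag + 0.5))
        )
    return rows
```

Applied to `REQUESTS`, the function yields the four patient rows shown in Table 3; for example, patient 400 has 3 orders, 18 tests and an average lag of (100 + 13 + 41) / 3, or about 51 days.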
Table 4. Pre-processed data summary
Variable                   Min   Max     Mean    St. Dev
Patient Age (full years)   0     107     47.93   20.48
Patient Gender             0     1       0.66    NA
Number of orders/year      1     154     1.98    3.17
Number of tests/year       1     356     7.98    8.45
Service Lag (days)         0     1,774   6.05    22.01
Table 4 presents the statistical summary of the pre-processed population data. Patient age is approximately normally distributed, with most patients aged from 40 to 60. Patients with an age of “0” are those less than 12 months old. Females make up two thirds of all the patients. On average, a patient has two pathology orders with eight tests ordered during the year. About 65% of patients have only one pathology order. Almost 50% of patients have five or fewer tests ordered. In principle, a patient may have an extremely large number of tests ordered (such as 356 tests in our sample) because of a critical illness that requires frequent monitoring. About 54% of the patients have the tests done on the day of referral. A patient may also have an extremely long service lag if he/she gets a request for a number of repeats of the same test that should occur over a long period of time. Tables 5 and 6 present the statistics on the ten most commonly ordered tests and the ten most frequently encountered clinical problems respectively.
Table 5. Ten most frequently ordered tests
Rank   Pathology test                               Number of tests
1      Full blood count                             470,227 (7.71%)
2      Lipids                                       394,103 (6.46%)
3      Glucose                                      302,378 (4.96%)
4      Liver function                               299,769 (4.92%)
5      Electrolytes, urea and creatinine (EUC)      295,525 (4.85%)
6      Erythrocyte sedimentation rate (ESR)         277,192 (4.55%)
7      INR for warfarin                             274,631 (4.50%)
8      Thyroid function                             218,813 (3.59%)
9      PAP smear                                    175,818 (2.88%)
10     Urine MC&S                                   169,423 (2.78%)

Table 6. Ten most frequently occurring clinical problems
Rank   Clinical problem            Number of encounters
1      Diabetes                    56,100 (6.34%)
2      Lipid disorder              55,777 (6.31%)
3      Urinary tract infection     40,399 (4.57%)
4      Hypertension                33,768 (3.82%)
5      Fatigue                     25,982 (2.94%)
6      Thyroid problem             25,788 (2.92%)
7      Pregnancy                   17,697 (2.00%)
8      Anaemia                     10,684 (1.21%)
9      Abdominal pain              10,031 (1.13%)
10     Chest pain                  9,758 (1.10%)
5.2. Sample generation and SOM application
At the next stage, ten independent random samples are drawn from the pre-processed data, each sample accounting for 6% of the population (around 46,000 patients per sample). The limitation on the size of the samples is imposed by the available version of the Viscovery SOMine software. Each sample is processed by the software independently and consistently (with identical learning settings), and a corresponding clustering model is generated. In the context of this research, it is expected that patients with similar characteristics (such as age, gender, number of tests consumed, service lags, etc.) should appear in the same cluster, while dissimilar patients are classified into different clusters. During the experiments, clinical variables such as the most frequently ordered tests and the most frequent clinical problems are intentionally not included as inputs to form the clusters. The aim is to determine whether each cluster would exhibit different clinical characteristics. The input variables (numerically encoded) for clustering are therefore: patient age, patient gender, number of orders per year, number of tests per year, and service lag. Clinical variables are superimposed once the clusters are formed and the representative clustering structure is chosen.
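The sampling step can be sketched as below. This is only an illustration under stated assumptions: each sample is drawn without replacement from the full patient list and independently of the other samples, with a fixed size of 6% of the population. The function name, the seed and the fixed sample size are hypothetical; the sample sizes reported in Table 7 vary slightly, so the authors' actual procedure may differ.

```python
import random


def draw_independent_samples(patient_ids, n_samples=10, fraction=0.06, seed=2004):
    """Draw n_samples independent random samples, each a fixed fraction of the population.

    Sampling is without replacement inside a sample, but the same patient
    may appear in more than one sample.
    """
    rng = random.Random(seed)
    size = round(len(patient_ids) * fraction)
    return [rng.sample(patient_ids, size) for _ in range(n_samples)]
```

With 764,470 patient records, `fraction=0.06` gives samples of about 45,868 patients, in line with the "around 46,000" quoted in the text.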
For the resulting ten clustering models, the basic statistics of each cluster provide the material for investigating cluster features and for comparing clustering quality among the ten samples. The number of clusters generated for each sample is shown in Table 7. For convenience, each cluster is given a semantically meaningful name according to the values of its main attributes. When compared across the ten samples, the coarsest clustering structure consists of three clusters: female users, male users, and slow high users. The structure with four clusters separates the female users cluster into young female users and old female users. The structure with six clusters further separates slow high users into high users, slow users and old frequent slow high users. The seven cluster structure separates male users into young male users and old male users. The most sophisticated structure, with eight clusters, separates old female users into old female low users and old female high users.
5.3. Cluster quality assessment, cross-examination, and model selection
The DBIs are calculated for the ten clustering models based on cluster statistics retrieved from the previous SOM application stage. The resulting DBIs are shown in Table 7. Note that DBI values vary according to the number of clusters in a given partition. The models with the largest DBIs are those with the structure of four clusters (samples 2, 9 and 10). The models with the smallest DBIs are sample 5 (seven clusters) and sample 6 (six clusters) with DBI values of 1.1023 and 1.0924 respectively. These two models were selected for further cross examination.

Table 7. Clustering statistics for ten independent samples
Sample   No of patients   No of clusters   DBI
1        46,043           7                1.1228
2        46,111           4                1.4971
3        46,246           3                1.2187
4        45,964           3                1.2576
5        46,027           7                1.1023
6        46,029           6                1.0924
7        45,818           3                1.1962
8        45,923           8                1.2108
9        45,780           4                1.4650
10       46,003           4                1.4527
During the cross examination stage, clustering model A is applied to the data set that was originally used to generate model B. If most patients from a given cluster in model A are classified into the semantically corresponding cluster of model B, one may conclude that the two models are consistent in structure. The cross examination of the two clustering models reveals that, subject to the differences in granularity between the six-cluster and seven-cluster structures, most of the patients fall into the corresponding clusters of the reciprocal model. This validates the clustering models and demonstrates that the two models are consistent in grouping patients into distinctive clusters. Tables 8 and 9 summarize the results of the cross examination stage.
Table 8. Re-allocating patients from seven clusters to six clusters
Corresponding clusters                                          % of patients
Young female into Young female                                  97.69
Old female into Old female                                      71.35
Young male into Male                                            99.08
Old male into Male                                              70.36
High user into High user                                        70.27
Slow user into Slow user                                        82.60
Old frequent slow high user into Old frequent slow high user    89.86

Table 9. Re-allocating patients from six clusters into seven clusters
Corresponding clusters                                          % of patients
Young female into Young female                                  86.74
Old female into Old female                                      97.13
Male into Young male                                            53.19
Male into Old male                                              43.86
High user into High user                                        52.45
Slow user into Slow user                                        90.10
Old frequent slow high user into Old frequent slow high user    39.94
Old frequent slow high user into High user                      56.55
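The percentages in Tables 8 and 9 are re-allocation rates: for each cluster of one model, the share of its patients that the other model places in each of its own clusters. A minimal sketch of that calculation is given below, assuming that both models have already assigned a cluster label to the same list of patients; the function name and the toy labels in the usage note are hypothetical.

```python
from collections import Counter, defaultdict


def reallocation_table(labels_a, labels_b):
    """Percentage of each model-A cluster that falls into each model-B cluster.

    labels_a[i] and labels_b[i] are the cluster names given to patient i
    by model A and model B respectively.
    """
    counts = defaultdict(Counter)
    for a, b in zip(labels_a, labels_b):
        counts[a][b] += 1
    table = {}
    for a, row in counts.items():
        total = sum(row.values())
        table[a] = {b: 100.0 * n / total for b, n in row.items()}
    return table
```

For example, if model A labels three patients "Young male" and model B labels all three "Male", then `reallocation_table(...)["Young male"]["Male"]` is 100.0. A cluster that splits, such as "Male" into "Young male" and "Old male" in Table 9, shows up as two entries whose percentages are read together.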
Based on the qualitative criteria used to select the representative model (Siew et al., 2002) discussed in Section 4, the clustering model identified for sample 5 (seven clusters) is selected as the representative model for further investigations. Its structure is superior when compared to the six clusters model as far as the balance between explainability and sophistication is concerned. To ensure that the patient sample used to generate the selected clustering model is in fact representative of the overall population as far as the key variables of interest are concerned, respective distributions of these variables are analyzed for sample 5 and the population. Table 10 presents the side-by-side comparison of these distributions. Observe that for all the variables of interest the population mean lies within the 95% confidence interval generated using the sample mean.

Table 10. Comparison of distributions for population and sample 5
                         Min                   Max                   Mean                                  St. Dev
Variable                 Population   Sample   Population   Sample   Population   Sample (95% CI)          Population   Sample
Patient Age (years)      0            0        107          104      47.93        47.83 (47.64; 48.01)     20.48        20.56
Patient Gender           0            0        1            1        0.66         0.66                     NA           NA
Number of orders/year    1            1        154          107      1.98         2.00 (1.97; 2.02)        3.17         3.25
Number of tests/year     1            1        356          149      7.98         8.00 (7.92; 8.07)        8.45         8.51
Service Lag (days)       0            0        1,774        1,242    6.05         6.08 (5.89; 6.28)        22.01        23.03
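The representativeness check behind Table 10 can be reproduced approximately with a normal-approximation confidence interval for the sample mean. The sketch below uses the sample 5 size (46,027 patients) and the rounded means and standard deviations from Table 10; the z value of 1.96 and the function name are assumptions, and because the inputs are rounded the interval limits may differ from the published ones in the second decimal place.

```python
import math

SAMPLE_N = 46027  # number of patients in sample 5 (Table 7)

# variable: (population mean, sample mean, sample standard deviation), from Table 10
TABLE_10 = {
    "Patient Age (years)": (47.93, 47.83, 20.56),
    "Number of orders/year": (1.98, 2.00, 3.25),
    "Number of tests/year": (7.98, 8.00, 8.51),
    "Service Lag (days)": (6.05, 6.08, 23.03),
}


def mean_ci95(mean, sd, n, z=1.96):
    """Normal-approximation 95% confidence interval for a sample mean."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


def population_mean_covered(table, n):
    """Map each variable to True if the population mean lies inside the sample CI."""
    result = {}
    for name, (pop_mean, sample_mean, sample_sd) in table.items():
        low, high = mean_ci95(sample_mean, sample_sd, n)
        result[name] = low <= pop_mean <= high
    return result
```

For patient age the half-width is 1.96 × 20.56 / √46,027, about 0.19 years, so the interval is roughly (47.64; 48.02) and contains the population mean of 47.93.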
5.4. Detailed analysis of patient groups and ordering patterns
Table 11 describes the basic statistics on the resulting clustering structure. Having obtained the clustering structure, the variables of interest such as most frequently ordered individual tests and frequently encountered clinical problems are super-imposed on this structure. The resulting description combines the elements of knowledge generated in the course of the data mining exercise. Figure 3 illustrates the average and total number of pathology tests consumed by each cluster.

Table 11. Basic statistics on the clustering structure
(C1 Young female; C2 Old female; C3 Old male; C4 Young male; C5 High user; C6 Slow user; C7 Old frequent slow high user)
                         C1      C2      C3      C4      C5      C6      C7
Matching records         14650   9727    7048    6284    4531    2959    828
Matching records (%)     31.83   21.13   15.31   13.65   9.84    6.43    1.8
Patient Age (years)      30.60   63.16   65.07   32.18   56.89   54.92   69.84
Patient Gender           1       1       0       0       0.86    0.63    0.51
Number of orders/year    1.20    1.45    1.69    1.22    4.36    1.84    18.58
Number of tests/year     4.66    5.06    9.18    5.74    21.32   9.34    31.19
Service Lag (days)       0.96    1.34    2.84    0.77    2.31    58.59   53.23
[Figure 3: chart with two vertical axes (total number of tests; average number of tests) plotted for the seven clusters: Young female, Old female, Old male, Young male, High user, Slow user, Old frequent slow high user.]
Figure 3. Mean and total number of tests for each cluster
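The two quantities plotted in Figure 3 can be related through Table 11: the total number of tests consumed by a cluster is approximately its number of matching records multiplied by its mean number of tests per year. The sketch below performs that arithmetic on the Table 11 values; because the means are rounded to two decimals, the totals are approximations, and the function name is hypothetical.

```python
# (cluster name, matching records, mean number of tests per year), from Table 11
CLUSTERS = [
    ("Young female", 14650, 4.66),
    ("Old female", 9727, 5.06),
    ("Old male", 7048, 9.18),
    ("Young male", 6284, 5.74),
    ("High user", 4531, 21.32),
    ("Slow user", 2959, 9.34),
    ("Old frequent slow high user", 828, 31.19),
]


def total_tests_per_cluster(clusters):
    """Approximate total tests per cluster as records x mean tests per patient."""
    return {name: records * mean_tests for name, records, mean_tests in clusters}
```

On these figures the High user cluster, although it holds under 10% of the patients, accounts for the largest total (about 96,600 tests), ahead of the much larger Young female cluster (about 68,300 tests).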
Cluster 1 (31.83% by patient volume): Young female cluster consists of female patients with an average age of 30.6 years. This cluster is the least frequent pathology user (average 1.20 orders per year) and consumes the lowest volume of tests (~4.66 tests per year). Tests ordered for this cluster are usually performed on the date of referral or the next day (~0.96 days delay). Patients from this cluster have a high average number of PAP smear tests ordered, but have almost no INR for warfarin tests ordered. The clinical problems and conditions most common in this cluster are urinary tract infections and pregnancies.
Cluster 2 (21.13%): Old female cluster consists of older female patients with an average age of 63.16 years. Patients in this cluster consume slightly more tests (~5.06 per year) and the tests are done slightly more slowly (within ~1.34 days). These patients also have a high average number of PAP smears ordered (but not as high as patients in the young female cluster), and have a relatively small average number of INR for warfarin tests ordered. Most frequently occurring clinical problems include urinary tract infections and lipid disorders.
Cluster 3 (15.31%): Old male cluster includes male patients with an average age of 65.07 years. Patients in this cluster, although not frequent users (~1.69 orders per year), require a moderately high volume of testing (~9.18 tests per year). Most of the tests are done within a reasonably short timeframe (~2.84 days). Old male patients have a relatively high volume of PSA (Prostate Specific Antigen) tests ordered. Diabetes and lipid disorders appear as the most frequent clinical problems in this cluster.
Cluster 4 (13.65%): Young male cluster is formed by younger male patients with an average age of 32.18 years. Like the young female cluster, this cluster includes infrequent (~1.22 orders), low (~5.74 tests), and quick (~0.77 days) users of the pathology services. No specific test orders are dominant in the young male cluster. Clinical problems related to lipid disorders and fatigue occur most frequently in this cluster.
Cluster 5 (9.84%): High user cluster is predominantly female (~86%) with an average age of 56.89 years. This cluster includes frequent users (~4.36 orders) and high volume test consumers (~21.32 tests) with a reasonably short service lag (~2.31 days). Patients in this cluster consume a high volume of Full blood count, Lipids, Glucose, Liver function, EUC (electrolytes, urea and creatinine), ESR (erythrocyte sedimentation rate), INR for warfarin, and Thyroid function tests. The most typical clinical problems include diabetes, lipid disorders, thyroid problems and urinary tract infections.
Cluster 6 (6.43%): Slow user cluster includes slightly more female (~63%) than male patients, with an average age of 54.92 years. The patients in this cluster are not frequent users (~1.84 orders), but consume a moderately high number of tests (~9.34 tests). The distinctive feature of this cluster is that the average service lag is very long (~58.59 days), indicating that the patients usually wait about two months to get the tests done. This situation is quite common for patients with chronic diseases that should be constantly monitored. As a result, these patients receive GP requests for a number of repeats of the same test that should occur over a relatively long period of time. This is confirmed by the fact that slow users have a high average number of tests for lipids and a slightly elevated average number of thyroid function tests ordered. Lipid disorders and diabetes are the most dominant problems in this cluster.
Cluster 7 (1.8%): Old frequent slow high user cluster consists of approximately equal numbers of female and male patients. Despite being the smallest in terms of number of patients, this cluster is very important as it contains patients that are old (~69.84 years), have tests ordered very frequently (~18.58 orders), consume the highest volume of tests (~31.19 tests), and have the tests done very slowly (~53.23 days service lag). A closer look at this cluster reveals that most of the patients in this cluster have INR for warfarin tests ordered. The dominant problems in this cluster include diabetes, lipid disorders and thyroid problems.
5.5. Case retrieval
With the clusters of similar cases identified and knowledge of each cluster extracted from the data mining stage, the CBR stage is concerned with the utilization of information from similar cases for decision support. Recall that a case is defined as a clinical problem situation where a patient with particular characteristics (including age, gender and past ordering pattern) and dominant symptoms or diseases is presented and the solution about what types of tests to order is provided. A cluster of past similar cases contains information such as the frequency distribution of each attribute, the most dominant clinical problems and the most frequently ordered tests for that cluster. As discussed in previous sections, the proposed CBR method focuses on the retrieval of similar past cases and on providing information about the retrieved cases to support GPs’ test ordering decisions. The process of similar case retrieval and relevant information collection can be described using the hypothetical clinical scenario presented earlier in this paper. In this example, a new test ordering case is presented when an elderly female diabetes patient comes to see the doctor for diabetes management. Over the past year this patient had relatively frequent orders (5) and a high volume of tests (24).
Within the scope of the proposed integrated approach, retrieval of similar cases can be achieved at (at least) three levels. Firstly, at the cluster level, the doctor may want to find a cluster of patients who are similar to the presenting patient in terms of age, gender, and past pathology ordering patterns. This, for example, can be achieved by using the “recall” functionality in Viscovery SOMine software: for the scenario under consideration the patient is automatically assigned to the High user cluster. The relevant cluster statistics show that the most dominant diseases in this cluster include diabetes, lipid disorder, thyroid problem, UTI, etc. The top tests ordered by this cluster include full blood examination, urea electrolytes creatinine, ESR, LFT’s, plasma glucose, TSH, etc. Secondly, at the disease level within the cluster, the doctor may be more interested in all the diabetes patients in the High user cluster. Tests most often requested by this group of patients are shown in Table 12.
Table 12. Tests most often requested by frequently reviewed high user diabetic patients
Rank   Top 10 tests ordered by these patients   % of patients who had at least one of the tests ordered
1      Lipid profile                            90.9
2      HbA1c                                    86.5
3      Full blood examination                   84
4      Plasma glucose                           70.9
5      Urea electrolytes creatinine             68.3
6      LFT's                                    67
7      ESR                                      61.7
8      TSH                                      54.1
9      Multiple Biochem Analysis                35
10     Urine MC&S                               32.6
Finally, at the individual patient level, the identified subset can be further drilled down to the “best match”: the most similar patient(s) to the current one. This can be done either visually using the corresponding component temperature maps in Viscovery SOMine, or by setting constraints on attributes such as age, gender, number of orders, and number of tests requested per year. Once the best match is found, detailed information recorded on each pathology request for the patient(s) can then be retrieved and used for supporting the new ordering decision. Figures 4 and 5 visually present the corresponding clustering structure that includes the “high user” cluster, as well as simple statistical summaries for the cluster level retrieval strategy and the “best match” neighborhood retrieval strategy respectively.
Figure 4. Cases retrieval at cluster level
Figure 5. The “best match” cases retrieval
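Cluster-level retrieval for the scenario patient can be approximated, outside Viscovery SOMine, by a nearest-centroid assignment. The sketch below is not the software's "recall" function: it simply compares the patient with the cluster means of Table 11, scaling each attribute by the population standard deviation from Table 4. The standard deviation for gender (not reported in Table 4) is assumed to be that of a binary variable with mean 0.66; the patient's age (70) and service lag (2 days) are hypothetical values added to complete the scenario, which specifies only an elderly female patient with 5 orders and 24 tests.

```python
import math

# Cluster means from Table 11: (age, gender, orders/year, tests/year, service lag)
CENTROIDS = {
    "Young female": (30.60, 1.00, 1.20, 4.66, 0.96),
    "Old female": (63.16, 1.00, 1.45, 5.06, 1.34),
    "Old male": (65.07, 0.00, 1.69, 9.18, 2.84),
    "Young male": (32.18, 0.00, 1.22, 5.74, 0.77),
    "High user": (56.89, 0.86, 4.36, 21.32, 2.31),
    "Slow user": (54.92, 0.63, 1.84, 9.34, 58.59),
    "Old frequent slow high user": (69.84, 0.51, 18.58, 31.19, 53.23),
}

# Population standard deviations from Table 4; the gender value is an assumption
# (standard deviation of a 0/1 variable with mean 0.66).
SCALE = (20.48, math.sqrt(0.66 * 0.34), 3.17, 8.45, 22.01)


def rank_clusters(patient):
    """Return cluster names ordered from most to least similar to the patient."""

    def scaled_distance(center):
        return math.sqrt(
            sum(((p - c) / s) ** 2 for p, c, s in zip(patient, center, SCALE))
        )

    return sorted(CENTROIDS, key=lambda name: scaled_distance(CENTROIDS[name]))


# Scenario patient: elderly female, 5 orders and 24 tests in the past year.
# Age 70 and a 2-day service lag are hypothetical values for illustration.
scenario_patient = (70, 1, 5, 24, 2)
```

Under these assumptions `rank_clusters(scenario_patient)[0]` is the High user cluster, which agrees with the assignment reported in the text. Disease-level and best-match retrieval would then filter the members of that cluster by problem code and by constraints on the individual attributes.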
6. Discussion and Conclusions
In this paper we propose a formal approach that integrates data mining and CBR methodologies to provide intelligent decision support for test ordering by GPs. The rationale for integrating data mining and CBR methodologies is to discover knowledge from past data using data mining, and to retrieve and enable the use of this knowledge through CBR for the purposes of decision support. Table 13 highlights that, as far as practical aspects of decision support are concerned, the integrated approach has an advantage over either data mining or CBR used alone: it combines the strengths and offsets the weaknesses of the individual approaches.
Table 13. Evaluation of the proposed model
CRITERIA                Combination of data mining and CBR
Evidence base           Evidence based on past pathology ordering by large group of peer GPs
Situational relevance   Patient-oriented perspective catering to particular patient situations
Flexibility             GP can retrieve cases and match information at cluster, disease, and individual levels
Interactivity           Information is retrieved through continuous interaction with the GP and is interactively matched to relevant groups of cases via visual maps
At the methodological level, the potential advantage of the proposed integrated approach is in its ability to use the available information at either cluster level or disease level within the cluster in order to form a generalized case based on a set of similar cases. This presents a new perspective on the use of prototypes through case aggregation – one of the current trends of medical CBR systems according to a recent overview of medical CBR systems and system development by Nilsson and Sollenborn (2004). Adopting such a perspective better equips the designers of the decision support systems to address the most challenging task for the CBR method - that of adaptation. As discussed earlier in the paper, in medical applications it is almost impossible to generate adaptation rules to consider all possible important differences between current and former similar cases. Therefore, some adaptation solutions have been developed that are rather typical for medical domains. As one problem for adaptation is the extreme specificity of single cases, one of the proposed solutions is to generalize from single cases into abstracted prototypes or classes – something that should be effectively achieved at either cluster level or disease level within the cluster. These issues, as well as the issues of systems implementation and usability, constitute interesting and promising directions for future research in this area. The novel combination of SOM-based data clustering and case-based reasoning to facilitate the evidence based, situationally relevant, interactive, and flexible decision support for pathology ordering activities by GPs discussed in this paper presents a possible methodological foundation for these and other similar developments. 
The existence of such an effective and robust methodology for generating the required evidence forms a necessary precondition for its use by GPs for decision making and, by implication, is one of the important drivers for more appropriate pathology test ordering.
References:
1. Aamodt A, Plaza E. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Communications. IOS press, Vol. 7:1, 1994, p. 39-59.
2. Aha DW, Bankert RL. Feature selection for case-based classification of cloud types: an empirical comparison. Proceedings of AAAI-94 Workshop Case-Based Reasoning, 1994, p. 106-112.
3. Ahearn MD, Kerr SJ. General practitioners’ perceptions of the pharmaceutical decision-support tools in their prescribing software. Medical Journal of Australia 2003; 179:34-37.
4. Arshadi N, Jurisica I. Data mining for case-based reasoning in high-dimensional biological domains. IEEE Transactions on Knowledge and Data Engineering, Vol. 17:8, 2005, p. 1127-1136.
5. Axt-Adam P, van der Wouden JC, van der Does E. Influencing behavior of physicians ordering laboratory tests: a literature study. Medical Care 1993; 31:784-94.
6. Berry MJA, Linoff GS. Mastering data mining, New York, John Wiley and Sons 2000.
7. Bichindaritz I, Kansu E, Sullivan KM. Case-based reasoning in CARE-PARTNER: gathering evidence for evidence-based medical practice. Proceedings of 4th European Workshop on CBR, Springer, Berlin, 1998, p. 334-345.
8. Bolshakova N, Azuaje F. Improving expression data mining through cluster validation. Proceedings of the 4th Annual IEEE Conference on Information Technology Applications in Biomedicine, 2003, p. 19-22.
9. Britt H, Miller GC, Charles J, Knox S, Valenti L, Henderson J, Pan Y, Bayram C, Harrison C. General practice activity in Australia 2002-03. Australian Institute of Health and Welfare (General Practice Series No. 14), 2004 p. 81.
10. Clerkin P, Hayes C, Cunningham P. Automated case generation for recommender systems using knowledge discovery techniques. Trinity College Dublin Computer Science Department Technical Report, April, 2002.
11. Cohen J, Piterman L, McCall L, Segal L. Near-patient testing for serum cholesterol: attitudes of general practitioners and patients, appropriateness, and costs. Medical Journal of Australia 1998; 168:605-610.
12. Cox K. Evidence-based medicine and everyday reality. Medical Journal of Australia 2001; 175:382-383.
13. Davis D, Cosenza RM. Business research for decision making. Belmont, Calif., Wadsworth 1993.
14. Deboeck G, Kohonen T. Visual Explorations in Finance with Self-Organizing Maps. London: Springer-Verlag, 1998.
15. Gorry GA, Scott Morton MS. A framework for management information systems. Sloan Management Review, 1989.
16. Guibert R, Wicker S, Horrocks M. Background Reading for QUP-GP workshop. The development of a research proposal to address the appropriate use of pathology to general practice – stage 1. The Royal Australian College of General Practitioners Research and Practice Support Directorate, prepared by 31 August – 1 September 2001.
17. Han J, Kamber M. Data mining: concepts and techniques. Morgan Kaufmann, 2001.
18. Isouard G. A quality management intervention to improve clinical laboratory use in acute myocardial infarction. Medical Journal of Australia 1999; 170:11-14.
19. Kantardzic M. Data mining: concepts, models, methods, and algorithms. IEEE Press; Wiley Interscience, 2003.
20. Kennedy R, Lee Y, van Roy B, Reed CD, Lippman RP. Solving Data Mining Problems Through Pattern Recognition. Prentice-Hall: Englewood Cliffs, NJ, 1998.
21. Kidd MR, Mazza D. Clinical practice guidelines and the computer on your desk. Medical Journal of Australia 2000; 173:373-375.
22. Kohonen T. Self-Organized formation of topologically correct feature maps. Biological Cybernetics, 1982, 43, 59-69.
23. Kohonen T. The Self Organizing Map. IEEE Proceedings 1990, 78(9), 1464-1480.
24. Kohonen T. Self-Organizing maps (Second edn) Berlin: Springer, 1997.
25. Kolodner JL. Case-based reasoning. San Mateo, CA: Morgan Kaufmann Publishers, 1993.
26. Lundberg GD. Perseveration of laboratory test ordering: a syndrome affecting clinicians. Journal of the American Medical Association 1983; 249:639.
27. Lundberg GD. The need for an outcomes research agenda for clinical laboratory testing. Journal of the American Medical Association 1998; 280:565-566.
28. Marakas GM. Decision Support Systems in the 21st Century. Prentice Hall, 2003.
29. Nilsson M, Sollenborn M. Advancements and trends in medical case-based reasoning: an overview of systems and system development. Proceedings of the 17th International FLAIRS Conference, Special Track on Case-Based Reasoning, American Association for Artificial Intelligence, Miami, USA, 2004, p. 178-183.
30. Rao GG, Crook M, Tillyer ML. Pathology tests: is the time for demand management ripe at last? Journal of Clinical Pathology 2003; 56:243-248.
31. Schmidt R, Montani S, Bellazzi R, Portinale L, Gierl L. Case-based reasoning for medical knowledge-based systems. International Journal of Medical Informatics 2001; 64:355-367.
32. Siew E-G, Smith K, Churilov L, Ibrahim M. A neural clustering approach for Iso-Resource grouping for acute healthcare in Australia. Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS-35). IEEE Computer Society, Hawaii, USA, 2002.
33. Smellie WSA, Galloway MJ, Chinn D. Benchmarking general practice use of pathology services: a model for monitoring change. Journal of Clinical Pathology 2000; 53:476-480.
34. Smellie WSA, Galloway MJ, Chinn D, Gedling P. Is clinical practice variability the major reason for differences in pathology requesting patterns in general practice? Journal of Clinical Pathology 2002; 55:312-314.
35. Smellie WSA. Appropriateness of test use in pathology: a new era or reinventing the wheel? The Association of Clinical Biochemists 2003; 40:585-592.
36. Smellie WSA, Finnigan DI, Wilson D, Freedman D, McNulty CAM, Clark G. Methodology for constructing guidance. Journal of Clinical Pathology 2005; 58:249-253.
37. Smith KA. Introduction to Neural Networks and Data Mining for Business Applications. Emerald, Vic.: Eruditions Publishing, 1999.
38. Solomon DH, Hashimoto H, Daltroy L, Liang MH. Techniques to improve physicians’ use of diagnostic tests. Journal of the American Medical Association 1998; 280(23):2020-2027.
39. Stuart PJ, Crooks S, Porton M. An interventional program for diagnostic testing in the emergency department. Medical Journal of Australia 2002; 177:131-134.
40. Van Der Weyden MB. Databases and evidence-based medicine in general practice. Medical Journal of Australia 1999; 170:52-53.
41. Van Walraven C, Naylor CD. Do we know what inappropriate laboratory utilization is? A systematic review of laboratory clinical audits. Journal of the American Medical Association 1998; 280:550-8.
42. Verstappen WH, van der Weijden T, Sijbrandij J, Smeele I, Hermsen J, Grimshaw J. Effect of a practice-based strategy on test ordering performance of primary care physicians: a randomized trial. Journal of the American Medical Association 2003; 289:2407-12.
43. Vining RF, Mara P. General practitioners and pathology testing. Medical Journal of Australia 1998; 168:591-592.
44. Eudaptics Software: Viscovery SOMine Standard Edition 3.0. Eudaptics Software GmbH, Wien, 1999.
45. Ward J. Hierarchical grouping to optimise an objective function. Journal of the American Statistical Association 1963; 58:236-244.
46. Weekley JS, Smith BJ, Pradhan M. The intersection of health informatics and evidence-based medicine: computer-based systems to assist clinicians. Medical Journal of Australia 2000; 173:376-378.
47. Wertman BG, Sostrin SV, Pavlova Z, Lundberg GD. Why do physicians order laboratory tests? A study of laboratory test request and use patterns. Journal of the American Medical Association 1980; 243:2080-2082.
48. Yang Q, Wu J. Keep it simple: a case-base maintenance policy based on clustering and information theory. Proceedings of the Canadian AI Conference, 2000, p. 102-114.
49. Young JM, Ward JE. General practitioners’ use of evidence databases. Medical Journal of Australia 1999; 170:56-58.
50. Zhuang ZY, Churilov L, Sikaris K. Uncovering the patterns in pathology ordering by Australian general practitioners: a data mining perspective. In R. H. Sprague (Ed.), Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS-39), Big Island, HI, USA, 3-6 January, 2006.
51. Zhuang ZY, Churilov L, Burstein F, Sikaris K. Combining data mining and case-based reasoning for intelligent decision support for pathology ordering by general practitioners in Australia. In Proceedings of the International Conference on Creativity and Innovation in Decision Making and Decision Support (CIDMDS 2006), London, UK, 28 June - 1 July, 2006a.