Building a Case-Based Reasoner for Clinical Decision Support
Anna Wills and Ian Watson
Department of Computer Science, University of Auckland, New Zealand
[email protected], [email protected]
www.cs.auckland.ac.nz/~ian/
Abstract. Orion Systems International Limited has recognised the need in the healthcare industry for an application that provides robust clinical decision support. One possible approach is to develop a case-based reasoner to support decisions made in the disease management process. We have undertaken a project to investigate the validity of using case-based reasoning for this task, focusing specifically on managing the treatment of diabetes patients. An application that uses case-based reasoning has been developed and tested. This paper describes the pre-processing of cases, the development of a case representation and similarity metrics, and the evaluation. Results show that case-based reasoning could be a valid approach, but more investigation is needed.
1 Introduction
Orion Systems International Limited intends to enter a new market area with a clinical decision support system. A number of Orion's system modules already rely on some form of decision support; however, this is minimal. The future system will support decisions made throughout the disease management process, ensuring the delivery of the right care, in the right quantity, at the right place and at the right time. This project investigates the use of a case-based reasoning (CBR) method for implementing such a system. The CBR-Works tool was used to create an application, and a diabetes dataset from the UCI Machine Learning Repository was used to test it. Similarity functions were created through both knowledge elicitation and experimentation. CBR-Works is a state-of-the-art commercial CBR application development tool that provides support for a variety of retrieval algorithms and an excellent range of customizable similarity metrics (Schulz 1999). The following sections present some background, describe the application and its evaluation, discuss the results and draw some conclusions from the project's findings.
C. Zhang, H.W. Guesgen, W.K. Yeap (Eds.): PRICAI 2004, LNAI 3157, pp. 554–562, 2004. © Springer-Verlag Berlin Heidelberg 2004
2 Background
Why Build a System? The cost of diabetes treatment is enormous: only 3.1% of the U.S. population is affected, yet diabetes accounts for 11.9% of U.S. healthcare expenditure. Tighter control over the blood glucose levels of diabetes patients
through more intensive management may incur higher up-front costs (labour, medication and supplies), but these would be overshadowed by a significant reduction in expenditure relating to the development, progression and frequency of complications of diabetes. The reduction of diabetes complications will also improve quality of life for many patients (AACE 2002).
Why Use CBR? The activities of the CBR cycle closely match the process requirements of a knowledge management system, making CBR a good contender for any decision support system (Watson 2002). Looking specifically at implementing a clinical decision support system, the most obvious competition for CBR would be a rule-based approach. However, rules for such a system would undoubtedly become very complex and difficult to understand. "Cognitive Science research shows that experts use experience when reasoning about a problem, rather than first principles" (Sterling 2001). Storing experience lends itself well to a case-based approach, whereas encoding experience knowledge in rules is not as intuitive. CBR systems also adapt automatically with experience, while rule-based approaches may need more user interaction and understanding. CBR has been used in other medical decision support systems. An integration of CBR and rule-based reasoning was used in systems for planning the ongoing care of Alzheimer's patients (Marling and Whitehouse 2001) and for the management of diabetes patients (Bellazi et al. 1999).
Diabetes. Patients with IDDM (Insulin Dependent Diabetes Mellitus) are insulin deficient. Once they are being treated with insulin, they are at risk of hypoglycemia (low blood glucose) and hyperglycemia (high blood glucose). The aim of therapy for IDDM is to keep the average blood glucose level as close to the normal range as possible. The blood glucose measurement of a patient is affected by diet, exercise and exogenous insulin treatments. Three types of insulin formulation are administered to patients in our dataset (Regular, NPH and Ultralente), each with a different duration of action.
3 The Application
3.1 Case Acquisition
Acquiring data proved to be a very difficult task, with obstacles ranging from privacy issues to the lack of a consistent medical record structure. We finally settled on using the "Diabetes Data" dataset from the UCI Machine Learning Repository to allow us to develop a proof of concept. This dataset consists of 70 files corresponding to 70 different patients, with each file containing a number of record entries for blood glucose readings, insulin doses or other special events. The patient files vary in size, number of entries and timeframe of entries. Each entry consists of an entry code, a value and a timestamp. The entry code specifies what treatment the entry relates to, for example, a pre-breakfast blood glucose measurement, a regular insulin dose, or an above-average exercise session. The value is then the corresponding measurement, for example, the blood glucose
measurement in mg/dl or the units of regular insulin administered. (The value field is not relevant for the special events; an entry with such an event code simply represents the occurrence of that event.) The timestamp is very fine-grained, detailing the date and time of treatment for the specific record entry.
3.2 Case Representation
The data was pre-processed to delete invalid values that were not consistent with the type checking in CBR-Works; to change entry codes to their corresponding string references (e.g. 33 to 'Regular Insulin Dose'), for ease of understanding while testing; and to add patientID numbers. An important question that needed to be answered at this point was 'what is a case?' The progress of diabetes patients is usually monitored by analysing patterns. The HbA1c test is an example of this: it provides an average blood glucose reading over a period of six to twelve weeks by measuring the number of glucose molecules attached to hemoglobin (CCHS October 2002, Geocities October 2002). Therefore, it makes sense for the final system also to analyse and compare patterns of events, rather than just single events. However, we started with a simple case representation of a single entry with attributes: patientID (integer), eventType (taxonomy), value (integer) and timestamp (timestamp). We then queried the casebase with single events to find similar cases. For this we set up the data in a Microsoft Access database and imported it into CBR-Works. Only a small amount of manual adjustment was needed once cases were imported. Once this basic retrieval system was working, cases were extended to represent all the events for a specific patient in a single day. Each day has attributes: DayID (integer), Date (date), PatientID (integer) and EntryRefs (set of references to entry concepts).
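The two case representations can be pictured as simple record types. The following Python sketch is illustrative only: the class names `Entry` and `Day`, the use of dataclasses and the helper `group_by_day` are assumptions made for this example and are not the CBR-Works concept definitions used in the project; only the attribute names follow the text.

```python
import datetime as dt
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entry:
    """Basic case: a single record entry for one patient."""
    patient_id: int
    event_type: str        # e.g. 'Regular Insulin Dose' (a taxonomy in the paper)
    value: int             # mg/dl for glucose readings, units for insulin doses
    timestamp: dt.datetime


@dataclass
class Day:
    """Extended case: all entries recorded for one patient on one day."""
    day_id: int
    date: dt.date
    patient_id: int
    entry_refs: List[Entry] = field(default_factory=list)  # variable-length set


def group_by_day(entries: List[Entry]) -> List[Day]:
    """Hypothetical helper: build day cases from single-entry cases."""
    days: Dict[Tuple[int, dt.date], Day] = {}
    for e in sorted(entries, key=lambda e: (e.patient_id, e.timestamp)):
        key = (e.patient_id, e.timestamp.date())
        if key not in days:
            # Day IDs are assigned sequentially here; this numbering is an assumption.
            days[key] = Day(len(days) + 1, key[1], e.patient_id)
        days[key].entry_refs.append(e)
    return list(days.values())
```

The variable-length `entry_refs` list is the part that has no direct counterpart in a flat relational table.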
This case structure required the cases to be entered manually, because the variable-length set structure (EntryRefs) is not recognised by simple relational databases and therefore could not be imported into CBR-Works in the same way as the previous case representation.
3.3 Similarity
Similarity measures comprise three elements: local similarity functions, attribute weights and an amalgamation function (Stahl 2002). This section describes the measurement methods used for each of these. The similarity measurement methods were constructed using a hybrid bottom-up and top-down approach (Stahl 2002). Measurement methods were defined as far as possible using the domain knowledge available (bottom-up) and refined using a top-down approach, by analysing the output and, in some cases, adjusting the method to produce a more desirable output.
Local Similarity Functions and Attribute Weights. The attributes involved in the similarity calculation for comparing entries are eventType and value, and these have equal weightings. Only the attribute containing the set of entry references is used for comparing days. Similarity functions for each of these 'discriminant' attributes are described below.
'EventType' Local Similarity Function. This only compares 'like with like' for events, so an EventType is 100% similar to an identical EventType and 0% similar to any other.
'Value' Local Similarity Function. For the basic case representation, which compared a single event with another single event, rules were written to change the similarity function for the value attribute depending on the EventType. For blood glucose measurements a symmetric polynomial function on [case value - query value] with a 'gradient manipulator' of seven was used (see Fig. 1). For insulin doses and other events a symmetric step function at zero was used (see Fig. 1), because these must be identical to be comparable. However, for the extended case, CBR-Works was too limiting to use the rules, so a more general similarity function was applied to all EventType values: a symmetric smooth-step function at one (see Fig. 1). When comparing insulin doses and other events with this function, a very low similarity measurement serves as a warning that the values are not identical, while non-identical blood glucose measurement comparisons still show as being slightly similar.
'EntryRefs' Local Similarity Function. A similarity function for EntryRefs was programmed, which finds the average similarity over the set. Each set member in the
[Figure 1: three panels, each plotting Similarity (0% to 100%) against [case value - query value]: polynomial function with 'gradient manipulator' of seven; step function at zero; smooth step function at one. The bold line shows the similarity measurement used, depending on the difference between the case and query values.]
Fig. 1. Similarity functions used for the value attribute
query is compared to each set member in the case and the maximum similarity is found for each query member. These similarity values are then averaged. Amalgamation Function. To calculate the similarity of days, an average function was used, with all discriminant attributes contributing equally to the similarity. A minimum function was used for entries, so the total similarity is the lowest of all discriminant attribute similarity values. This means that when the eventTypes are identical, the similarity of the value attribute will dictate the total similarity and when they are not the same the total similarity will be zero (i.e. the similarity of the eventTypes).
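The way the local similarity functions and the two amalgamation functions fit together can be sketched in Python. This is a minimal illustration under stated assumptions, not the CBR-Works configuration used in the project: entries are reduced to `(event_type, value)` tuples, all function names are hypothetical, empty sets are given a similarity of zero, and `linear_falloff` is a simple stand-in for a graded value function rather than the polynomial or smooth-step functions of Fig. 1, whose exact formulas are not given here.

```python
from typing import Callable, List, Tuple

# Assumed simplification: an entry is just (event_type, value).
EntryT = Tuple[str, float]
ValueSim = Callable[[float, float], float]


def event_type_sim(a: str, b: str) -> float:
    """'Like with like': identical event types score 1.0, all others 0.0."""
    return 1.0 if a == b else 0.0


def step_at_zero(case_value: float, query_value: float) -> float:
    """Symmetric step function at zero: values must be identical to match."""
    return 1.0 if case_value == query_value else 0.0


def linear_falloff(max_diff: float) -> ValueSim:
    """Stand-in graded function (NOT the paper's polynomial or smooth step)."""
    return lambda c, q: max(0.0, 1.0 - abs(c - q) / max_diff)


def entry_sim(query: EntryT, case: EntryT,
              value_sim: ValueSim = step_at_zero) -> float:
    """Minimum amalgamation over the two discriminant attributes."""
    return min(event_type_sim(query[0], case[0]),
               value_sim(case[1], query[1]))


def entry_set_sim(query_set: List[EntryT], case_set: List[EntryT],
                  value_sim: ValueSim = step_at_zero) -> float:
    """Average, over query members, of the best match in the case set."""
    if not query_set or not case_set:
        return 0.0  # assumption: the text does not cover empty sets
    best = [max(entry_sim(q, c, value_sim) for c in case_set)
            for q in query_set]
    return sum(best) / len(best)
```

Because the set of entry references is the only discriminant attribute of a day, the average amalgamation for days reduces to `entry_set_sim` in this sketch. For example, a query day with a glucose reading of 150 and a regular insulin dose of 6, compared against a case day with the same reading and a dose of 8, scores 0.5 under `step_at_zero`: the reading matches fully and the dose not at all.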
4 Evaluation
Extensive informal testing was conducted during construction of the application, to assist with the development of suitable similarity functions. Two formal testing sessions were performed, the first after completing the basic event comparison application and the second after extending a case to represent the events in a whole day.
4.1 Basic Application Testing
The purpose of the basic application was to design and test the local similarity functions. Its main functional use was to suggest what type and dosage of insulin to administer after observing a certain blood glucose level in a patient, and when to administer it. For testing, this translated to the following steps:
1. Tester: Enter the query details - the blood glucose level, type of reading (e.g. pre-breakfast or post-lunch) and time of reading;
2. System: Find a set of similar blood glucose reading entries in the casebase;
3. Tester: For each of the returned cases, note the details of the case (patient, timestamp and similarity value in particular) and use these to find the next recorded insulin dosage(s) for that patient (some patients were administered two different types of insulin straight after one another);
4. Tester: For each event pair, compile the result as the event type and value of the insulin dose(s), the similarity value of the blood glucose reading and the time difference between the two events.
In our study the tester was human; however, a front-end application could be written to perform the 'tester' tasks. Three case-bases of different sizes were tested: five patients with 2604 entries, nine patients with 3378 entries and fourteen patients with 4691 entries. Each case-base was tested with a set of queries. The query set consisted of a 'low', 'medium' and 'high' value reading for each of the blood glucose measurement eventTypes. In all the tests, the results were grouped by patient.
For example, the 'low pre-supper blood glucose measurement' query on the five patient case-base returned ten 'regular insulin dose' predictions with dosage values of 6, 6, 7 and 7 (from patient 1), 8 and 8 (from patient 2), and 10, 10, 10 and 10 (from patient 3). This shows that the type of
[Figure 2: scatterplot of Similarity against Patient Number; legend: returned case patient and query case patient are the same / returned case patient and query case patient are different.]
Fig. 2. Similarity values for retrieved cases for 50 cases from the ten patient casebase
insulin and the dosage value depend on which patient they are to be administered to, as well as on the blood glucose level observed. The smallest case-base, with entries from five patients, returned good results, with an average similarity of 0.973 over all the test cases (for all ten cases retrieved in every test). This increased slightly to 0.981 for nine patients and 0.984 for fourteen patients. More interestingly, however, the diversity in the returned set (both in dosage values and dosage types) increased with the larger case-bases. Again, this highlights the differences in treatment patterns between patients. Some of the dataset records were taken from paper records with "logical time" slots (breakfast, lunch, dinner, bedtime). These were assigned fictitious timestamps in the dataset of 08:00, 12:00, 16:00 and 22:00 respectively. This did not pose a problem for the testing (as had been imagined), because all the time differences between blood glucose readings and dosage administrations for 'correct' times were three minutes or less anyway, the majority being zero minutes (i.e. immediately). Consequently, the 'time lapse before administering the dosage' value was not important.
The accuracy of the system was also evaluated. The accuracy measurement used was the Magnitude of Relative Error (MRE) (Mendes et al. 2002), defined as:
\[ \mathrm{MRE} = \frac{|\text{Actual Value} - \text{Predicted Value}|}{\text{Actual Value}} \tag{1} \]
MRE was calculated for a randomly chosen set of 20 cases for each casebase size, using the leave-one-out testing method. All the test cases were for blood glucose measurement events, and for each test the case(s) with the highest similarity was (were) found, along with the consequential insulin dose(s). These predicted insulin doses were compared with the actual insulin doses to find the MRE for each test. When the incorrect type of insulin was administered (e.g. NPH instead of Regular),
the case was flagged with an arbitrarily high MRE value. A Mean Magnitude of Relative Error (MMRE) was then calculated for each test by averaging the MREs obtained for that test. No trends in the MMRE values were discovered when plotted against similarity, the main reason being that the similarity values were all around 100%. However, the MMRE values in general were not good, which suggests that more information may be needed to make more accurate predictions.
4.2 Extended Application Testing
Testing was more difficult and time-consuming with the extended application. Because of the high manual component involved in the formulation of cases, the casebases tested were fairly small: one patient, 50 days, 345 entries; one patient, 100 days, 729 entries; three patients, 139 days, 1035 entries; and ten patients, 20 days each (200 in total), 1482 entries. The one patient, 50 days casebase was tested using a leave-one-out testing method on 12 randomly selected days. The average similarity for the most similar cases returned was 60%. The same 12 days were tested on the three patient, 139 days casebase; exactly the same most similar cases were returned, and therefore the same average similarity was observed. This suggests that possibly only information recorded about the current patient (i.e. that patient's history) is useful. Patients have different insulin metabolism rates and insulin tolerance levels, which influence the decision on the type and amount of insulin to be administered. It is quite likely that there are other factors not included in the dataset which should also be considered for these decisions. This observation was investigated further by testing the ten patient case-base. This case-base was made up of 20 days of entries for each of the ten patients. A leave-one-out testing method was used on five randomly chosen cases for each patient (50 in total).
Figure 2 shows these results displayed in a scatterplot. Fifteen of the 50 most similar cases returned were from a patient other than the one in the query (for patients three to seven only). The average similarity of these 15 cases was 69% (with two 100% values) and the average similarity of the other 35 cases was 54%. These results are more promising for the real-life situation of comparing a newly diagnosed patient with data from previous patients; however, in this dataset the results show that there is only a small group of patients for whom this would be effective (patients three to seven). In a large dataset there are likely to be many different groups of similar patients. Therefore, a method for finding the correct group for a newly diagnosed patient is another issue for further investigation; more background patient data would probably be needed. The one patient, 50 days casebase and the one patient, 100 days casebase were tested using a leave-one-out testing method on all 50 days from the first case-base. Only four of the 50 cases returned more similar cases with the larger case-base. Three of these four were in the last ten days, suggesting that the similarity of two days may also be related to the time difference between them.
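The leave-one-out procedure used in these tests can be sketched generically. The following Python is a minimal illustration, not the procedure as run in CBR-Works: the function name `leave_one_out`, the index-based interface and the tie-breaking behaviour (the first best match wins) are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Case = TypeVar("Case")


def leave_one_out(
    casebase: Sequence[Case],
    test_indices: Sequence[int],
    similarity: Callable[[Case, Case], float],
) -> List[Tuple[int, int, float]]:
    """For each test index, retrieve the most similar *other* case.

    Returns (query_index, best_case_index, similarity) triples.
    The casebase must hold at least two cases.
    """
    results = []
    for qi in test_indices:
        # Score the query against every case except itself.
        candidates = (
            (ci, similarity(casebase[qi], casebase[ci]))
            for ci in range(len(casebase))
            if ci != qi
        )
        best_index, best_score = max(candidates, key=lambda pair: pair[1])
        results.append((qi, best_index, best_score))
    return results
```

With day cases carrying a patient identifier, comparing the patient of `casebase[qi]` with that of `casebase[best_index]` would give the same-patient versus different-patient split reported for Fig. 2.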
[Figure 3: plot with axes Similarity and MMRE/Sum; series: MMRE and Sum, with trendlines Expon. (MMRE) and Linear (Sum).]
Fig. 3. Accuracy (MMRE and the sum of MREs) versus similarity for 30 leave-one-out tests on the ten patient casebase
The extended application was also tested for accuracy using the MRE and MMRE tests on the ten patient case-base. A leave-one-out test was performed on 30 randomly chosen days. MRE was calculated for each entry in the query case and retrieved case. MMRE and the sum of all MREs for each of the 30 tests were calculated and plotted against the similarity of the two cases. Figure 3 displays the results from these tests. The downward-sloping trendlines represent the relationship between similarity and accuracy: as one decreases, so does the other. This is a desirable outcome for our application. The testing results for both applications are promising, but would be more reliable if tested on larger datasets. With larger datasets, we would expect more patients with similar treatment patterns, which should produce more convincing results. We would also expect patients with more diverse treatment patterns, covering more patient types. The specialized similarity measurements that were implemented in the basic comparison application could be used in the extended application if a different development tool were used. This may also produce more interesting and conclusive results.
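The MRE and MMRE measures used in both evaluations are simple to compute. The Python sketch below is illustrative only: the function names are hypothetical, a wrong insulin type is signalled here by passing `None` as the prediction, and the penalty of 10.0 is a placeholder, since the arbitrary high value actually used is not stated.

```python
from typing import Iterable, Optional, Tuple

# Placeholder for the "arbitrary high MRE value"; the real value is not given.
WRONG_TYPE_PENALTY = 10.0


def mre(actual: float, predicted: Optional[float],
        penalty: float = WRONG_TYPE_PENALTY) -> float:
    """Magnitude of Relative Error: |actual - predicted| / actual.

    `predicted=None` marks a prediction of the wrong insulin type.
    """
    if predicted is None:
        return penalty
    if actual == 0:
        raise ValueError("MRE is undefined when the actual value is zero")
    return abs(actual - predicted) / actual


def mmre(pairs: Iterable[Tuple[float, Optional[float]]]) -> float:
    """Mean MRE over (actual, predicted) pairs for one test."""
    errors = [mre(actual, predicted) for actual, predicted in pairs]
    if not errors:
        raise ValueError("MMRE needs at least one (actual, predicted) pair")
    return sum(errors) / len(errors)
```

For instance, an actual dose of 8 units against a predicted dose of 6 gives an MRE of 0.25; averaged with an exact prediction (MRE of 0), the MMRE for that test is 0.125.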
5 Conclusion
We are unable to draw precise conclusions as to the appropriateness of CBR for Orion's Clinical Decision Support System. However, this research produced some promising results that are being investigated further using a dataset with more patient information, such as age, weight, medical history and insulin tolerance. One of the problems with case acquisition mentioned earlier was the lack of a consistent medical record structure. This is important when deciding what information is needed by the system, especially if it is to be used by different healthcare providers with a variety of medical record structures. The idea of progressively extending the representation of a case should be continued, comparing patterns over longer time periods such as weeks, fortnights and months.
Hybrid techniques should also be investigated for use in the Clinical Decision Support System. Combinations of CBR with more general knowledge-based methods, such as rule-based and model-based systems, can be effective. The integration of rule-based and case-based reasoning has already been shown to be a viable option (Bellazi et al. 1999, Marling and Whitehouse 2001).
References
AACE 2002. American Association of Clinical Endocrinologists and American College of Endocrinology, 2002. The American Association of Clinical Endocrinologists Medical Guidelines for the Management of Diabetes Mellitus: The AACE System of Intensive Self-Management - 2002 Update. Endocrine Practice Vol. 8 (Suppl. 1):40-82.
Bellazi, R., Montani, S., Portinale, L. and Riva, A., 1999. Integrating Rule-Based and Case-Based Decision Making in Diabetic Patient Management. In ICCBR-99, LNAI 1650, 386-400. Berlin: Springer.
CCHS (Cleveland Clinic Health System), http://www.cchs.net/hinfo/. Last visited: October 2002.
Geocities, http://www.geocities.com/diabeteschart/hba1ctest.html. Last visited: October 2002.
Marling, C. and Whitehouse, P., 2001. Case-Based Reasoning in the Care of Alzheimer's Disease Patients. In ICCBR 2001, LNAI 2080, 702-715. Berlin: Springer.
Mendes, E., Watson, I., Triggs, C., Mosley, N. and Counsell, S., 2002. A Comparison of Development Effort Estimation Techniques for Web Hypermedia Applications. In Proceedings IEEE Metrics Symposium, June, Ottawa, Canada.
Schulz, S., 1999. CBR-Works - A State-of-the-Art Shell for Case-Based Application Building. In Proceedings of the German Workshop on Case-Based Reasoning, GWCBR'99 (1999). Lecture Notes in Artificial Intelligence. Springer-Verlag.
Stahl, A., 2002. Defining Similarity Measures: Top-Down vs. Bottom-Up. In Advances in Case-Based Reasoning. 6th European Conference, ECCBR 2002, Aberdeen, Scotland, UK, September 2002 Proceedings, 406-420. New York: Springer.
Sterling, W., 2001. A Massive Repository for the National Medical Knowledge Bank. Teradata Development Division, NCR Corporation, CA.
Watson, I., 2002. Applying Knowledge Management: Techniques for Building Organisational Memories. In Advances in Case-Based Reasoning. 6th European Conference, ECCBR 2002, Aberdeen, Scotland, UK, September 2002 Proceedings, 6-12. New York: Springer.