Statistical and Knowledge-based Approaches to Clinical Decision-support Systems, with an Application in Gastroenterology

BY

DAVID J. SPIEGELHALTER and ROBIN P. KNILL-JONES

Reprinted from

THE JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES A (GENERAL) Volume 147, Part 1, 1984 (pp. 35-77)

PRINTED FOR PRIVATE CIRCULATION 1984

J. R. Statist. Soc. A (1984), 147, Part 1, pp. 35-77

Statistical and Knowledge-based Approaches to Clinical Decision-support Systems, with an Application in Gastroenterology

By DAVID J. SPIEGELHALTER†

MRC Biostatistics Unit, Cambridge, UK

and

ROBIN P. KNILL-JONES

Diagnostic Methodology Research Unit, Southern General Hospital, Glasgow, and University Dept. of Community Medicine

[Read at a meeting organized jointly by the RSS and the Computer Committee of the Royal College of Physicians on Wednesday, November 23rd, 1983, the President of the RSS, Professor P. Armitage, in the Chair]

SUMMARY

Many attempts have been made to support clinical decisions by formal statistical reasoning, but the practical impact of these efforts has been limited. Developers of "expert systems", who use the techniques of artificial intelligence to represent clinicians' personal "knowledge", have suggested one reason for this lack of success may be that the probabilistic methodology itself is often inappropriate to the clinical problems or opaque to the user. We contrast the statistical and "knowledge-based" paradigms, with an emphasis on their different approaches to the manipulation and explanation of uncertainty. A statistical application to the diagnosis of "dyspepsia" is described, in which data are obtained by computer interview of the patient, and both diseases and symptoms form a hierarchy. We argue that the flexible use of "weights of evidence" overcomes many of the previous criticisms of statistical systems while retaining a valid probabilistic output. We conclude by discussing the complementary roles of deductive and probabilistic reasoning.

Keywords: PROBABILISTIC DIAGNOSIS; DECISION-AID; LOGISTIC DISCRIMINATION; EXPERT SYSTEM; WEIGHT OF EVIDENCE

1. INTRODUCTION

† Present address: Dr D. J. Spiegelhalter, MRC Biostatistics Unit, MRC Centre, Hills Road, Cambridge, CB2 2QH. © 1984 Royal Statistical Society

A quarter of a century ago, Ledley and Lusted (1959) discussed the reasoning processes of clinicians and proposed the use of computers to aid clinical decisions, while in Britain Card (1967) and others stressed the advantage of making clinical skills explicit and transferable. Since that time health care has become more complex and expensive, and yet expertise and resources still seem insufficient, or too inequitably distributed, to cope with demand. The falling cost of computing appears to justify continued efforts at improving the accuracy, consistency and possibly cost-effectiveness of clinical decision-making, either through direct implementation of formal decision-support systems or indirectly through their educational value in improving unaided decision-making. Ledley and Lusted identified three relevant mathematical disciplines, symbolic logic, probability and value theory, and each of these has led to separate, although interrelated, areas of study. Very broadly, the dominant approach until the early 1970s was based on probability theory, largely through the extensive use of Bayes theorem in diagnostic and prognostic exercises; in Section 2 we briefly review the progress of this "statistical" approach. In the past decade, however, much of the focus has shifted to the other two disciplines. Symbolic, also known as "categorical" or "qualitative", reasoning techniques have come from the field of artificial

intelligence (AI), and today some of the major showpieces of applied AI are medical consultation or "expert" systems (Duda and Shortliffe, 1983). Meanwhile, explicit value judgements have been introduced into a medical context and, in combination with probability assessments, are part of a highly developed area of applied decision theory, with a journal (Medical Decision Making, Birkhäuser: Boston), advanced computer programs (Pauker and Kassirer, 1981) and a number of educational books and articles for a medical audience (see, for example, Weinstein and Fineberg, 1980, Pauker and Kassirer, 1980, and particularly Wulff, 1981 for an overview of rational decision-making). Despite encouraging results in a research context, statistical systems have had limited practical impact. Friedman and Gustafson (1977) discuss the failure of such computer applications in clinical medicine, and characterize the systems as having been inflexible, with a poor interface with the user and of no readily perceivable benefit to doctors or patients. The relative autonomy of senior clinicians also militates against procedures that tend to impose a degree of publicness and consensus-seeking on clinical interpretation and practice. An extreme, but arguable, view is that only computer systems that provide simply instant information retrieval, such as PROMIS (Weed, 1971), are acceptable to clinicians; formal attempts to weigh the information and produce judgements are inevitably doomed to failure, given entrenched attitudes and beliefs. Of particular concern to statisticians, however, is the view commonly expressed in reports on AI-based systems that the probabilistic inference mechanism itself is often inherently unsuitable; this is cited as a major reason for the failure of many early projects.
Details of this argument are discussed in Section 3, but the general opinion appears to be that the statistical approach is often too simplistic for realistic clinical problems, inapplicable because there are insufficient data, and incomprehensible to the user. In partial response to these views we present in Section 4 an application of a probabilistic approach to a reasonably complex clinical problem, that of diagnosis of the causes of "dyspepsia", a term used in a wide sense to cover much of gastroenterology. In this system we have deliberately incorporated some aspects of the AI approach to complexity and explanation while at the same time attempting to generate reliable probabilistic output. Extensive use is made of "weights of evidence" (Good, 1950; Good and Card, 1971), and we show how this allows formal expression of concepts such as "ignorance" and "conflict of evidence". In this paper we are only concerned with systems that attempt to weigh up evidence for classifying a patient into a diagnostic or prognostic class, and we do not extend our discussion to the decision-theoretic problem of selecting a therapy or further test based on explicit value judgements about states of health, as reviewed by Krischer (1980) and Spiegelhalter and Smith (1981). Neither do we discuss purely deductive systems, which only contain "categorical" reasoning and manage without any quantitative balancing of evidence. Straightforward deductive systems may be called algorithms, and have been discussed in detail in Williams (1982). These range from flowcharts for dealing with a single medical problem such as dysphagia (Edwards, 1970), and books of flowcharts covering rural care in developing countries (Essex, 1980), to computerized treatment protocols (Wirtschafter et al., 1979).
A more complex deductive system is ONCOCIN (Shortliffe et al., 1981; Bischoff et al., 1983), which uses some of the formalism of the MYCIN program discussed in Section 3, and exploits advanced AI programming techniques to carry out a dialogue to help a clinician managing patients with cancer. Clinical judgements have been encoded in ONCOCIN, but both these and the treatment protocols can be broken into deductive rules. A similar AI application in terminal care is discussed in Fox (1983) and partly illustrated in Fox and Alvey (1983). We do not discuss other computer-based approaches to helping clinicians, such as interrogation of data bases or mathematical modelling of disease processes, which are described, for example, by Shortliffe et al. (1979). To anticipate our final discussion in Section 5, we believe that probabilistic reasoning, when handled carefully, is far more flexible and explainable to clinicians than has been apparent in previous applications, and, if a deductive formulation of a problem is not possible, then data should be collected and reliable predictive measures of uncertainty produced. We believe this should only aid, rather than hinder, acceptability.
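The "weights of evidence" mentioned above (Good, 1950; Good and Card, 1971) have a simple additive form: the weight carried by an indicant s towards a disease D is log{p(s | D)/p(s | not-D)}, and the weights are added to the prior log-odds of the disease. A minimal Python sketch, with invented probabilities rather than values from the dyspepsia system:

```python
from math import exp, log

# Weight of evidence for an indicant s towards disease D (Good, 1950):
# W = log{ p(s | D) / p(s | not-D) }.
# Weights add to the prior log-odds of D; all probabilities are invented.

def weight(p_s_given_d, p_s_given_not_d):
    return log(p_s_given_d / p_s_given_not_d)

def posterior(prior_prob, weights):
    log_odds = log(prior_prob / (1 - prior_prob)) + sum(weights)
    return 1 / (1 + exp(-log_odds))

w_pain    = weight(0.8, 0.2)   # indicant much commoner in D: positive weight
w_smoking = weight(0.3, 0.6)   # indicant commoner outside D: negative weight

p = posterior(0.5, [w_pain, w_smoking])   # prior odds 1; posterior 2/3
```

A weight of zero corresponds to an uninformative indicant, and summing weights makes the contribution of each item of evidence visible to the user, a point developed in Section 4.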


2. STATISTICAL SYSTEMS

Let D1, . . ., Dt denote a set of disease classes, where "disease" is used as a general term for identification which may include a prognostic class or a class of patients who respond to a particular treatment; it will be assumed initially that a patient belongs to one and only one disease class, although this restriction is relaxed in the example in Section 4. A set of indicants s = (s1, . . ., sp) is elicited from a patient, where "indicant" is used in the sense of Good and Card (1971) to mean any observed symptom, sign or test result. The indicants may be considered as realizations of random variables, or facets, S1, . . ., Sp. The aim of a statistical diagnosis system is to use some mathematical model to arrive at quantities p(Di | s), i = 1, . . ., t, which summarize the support given to each disease class by the available evidence, on which, if appropriate, a classification could be based. The quantity p(Di | s) may be intended to estimate the probability that the new patient is in Di, but in a number of procedures it is only a measure of evidence. A wide range of medical problems have been attacked by "computer-aided diagnosis", as it is usually known. Wagner et al. (1978) provide a bibliography of over 800 references up to 1977, although since then applications have been decreasing steadily: Index Medicus cited 30 papers under "Diagnosis: Computer-assisted" in 1982 compared with 83 in 1977. In this section we have only space to give some examples of the type of statistical models that have been found useful; for further reviews see Patrick et al. (1974), Wardle and Wardle (1978), Rogers et al. (1979), Shortliffe et al. (1979) and Titterington et al. (1981). The most popular procedures have been Fisher's linear discriminant, logistic regression, and "independence Bayes", in which independence of variables conditional on disease class is assumed.
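The "independence Bayes" model can be sketched in a few lines: the posterior for each disease class is proportional to its prior probability times the product of the conditional probabilities of the observed indicants. The diseases, indicants and probabilities below are invented for illustration:

```python
# "Independence Bayes": p(D | s) is proportional to
# p(D) * product of p(s_j | D), assuming the indicants are independent
# conditional on the disease class. All numbers are invented.

priors = {"D1": 0.3, "D2": 0.7}
p_present = {                      # p(indicant present | disease)
    "D1": {"pain": 0.8, "nausea": 0.4},
    "D2": {"pain": 0.2, "nausea": 0.6},
}

def independence_bayes(indicants, priors, p_present):
    scores = {}
    for d, prior in priors.items():
        score = prior
        for s, observed in indicants.items():
            p = p_present[d][s]
            score *= p if observed else (1 - p)
        scores[d] = score
    total = sum(scores.values())      # normalize over the mutually
    return {d: v / total for d, v in scores.items()}  # exclusive classes

post = independence_bayes({"pain": True, "nausea": False}, priors, p_present)
# post["D1"] = 0.72, post["D2"] = 0.28
```

The normalization step is what enforces the assumption of mutually exclusive and exhaustive disease classes criticized in Section 3.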
The most successful use of the independence model appears to be the acute abdominal pain program whose performance was first described in Horrocks et al. (1972) and de Dombal et al. (1972). Implementation led to improved clinical performance (de Dombal et al., 1974) even though the computer's predictions were not available to the doctors when they made their decisions; this suggests that the discipline of data collection and feedback of performance were the important factors. Subjective conditional probabilities were found to lead to lower accuracy than using those derived from the data-base (Leaper et al., 1972), but this may not be surprising in view of the need for the clinicians to estimate the probabilities of over 130 responses for each of 8 diseases; the possibilities of variable reduction have not been reported. An experimental version has been in routine use in a number of hospitals (Gunn, 1976) and adapted for US Navy submarines (Henderson et al., 1978). A multi-centre controlled trial has recently begun in British hospitals using a microcomputer version with an interface suitable for direct use by junior doctors in casualty departments. On a technical note, means of incorporating interactions into the independence model have been discussed by Titterington et al. (1981), and Pantin and Merrett (1982) provide a recent example using clusters of dependent symptoms. More complex techniques include the pattern recognition approach of Patrick et al. (1974), nearest-neighbour models in Coomans et al. (1983) and kernel density estimation in Habbema et al. (1978). In contrast to the complexity of these models, Goldman et al. (1982) recently used recursive partitioning to produce a simple branching tree structure for decision-making. Subjective estimates of probabilities were found to be successful by, for example, Gustafson et al. (1973) and Gorry et al. (1973), while du Boulay et al. (1977) used a Bayesian weighting of subjective opinions and data.
Largely subjective estimates of conditional probabilities are also used in the MEDAS system, which Ben-Bassat et al. (1980) describe as carrying out an interactive dialogue with the user with explanations of its conclusions in terms of the conditional probabilities. MEDAS allows multiple diagnosis by assuming variables are conditionally independent in the disease and its complement, thus updating the probability of each disease separately. This is also a common technique in AI systems, although the formal justification has been questioned by Pednault et al. (1981) and Szolovits and Pauker (1978). Croft (1972) found little to choose between different statistical models and concluded that future efforts should be devoted to developing clinical data-bases and improving
acceptance of decision-aids. In spite of his advice, comparative studies such as Titterington et al. (1981) and Coomans et al. (1983) have continued, with the general conclusion that, provided commonsense is used in variable definition and selection, the simple models are adequate for discrimination, although the independence model will tend to give probabilities that are unreliable, or badly "calibrated", in the sense that of patients given, say, 90 per cent probability of a disease D, greater or less than 90 per cent will eventually be confirmed as having D. Calibration is only rarely mentioned in the literature (see, for example, Knill-Jones et al., 1973) and has even been explicitly rejected as unnecessary (Ben-Bassat et al., 1980). In conclusion, statisticians will probably be unimpressed by the general quality of modelling, analysis and evaluation in this area. Although there is now a general awareness of the bias in testing a procedure on the patients on which it was developed, and the need for variable selection has been acknowledged to a lesser extent, the problems of bias due to selection of patients (Dawid, 1976) appear to have gone largely unnoticed. However, it could be argued that methodological purity is unimportant compared with the reaction of clinicians and, unfortunately, it appears that these formal techniques have had little influence on routine clinical practice. We feel a major reason for this is the fact that Croft's advice has not been taken up, and little attention has been paid to turning the science into acceptable technology (Healy, 1978), applicable in areas of health care where a real need is perceived.

3. KNOWLEDGE-BASED SYSTEMS

3.1.
Introduction

Gorry (1973) has explained his reasons for changing from a statistical and decision-analytic approach (Gorry et al., 1973) to one based on artificial intelligence, arguing that if a program were to operate as a "consultant" outside a very restricted domain, it would have to incorporate the "knowledge" of the expert clinicians, as distinguished from the structured accumulation of data (Shortliffe et al., 1979). He saw this knowledge as consisting of heuristic "rules" which allow rapid exploration of a presenting illness, the rules being hedged with uncertainty and being "triggered" in reaction to specific problems encountered. He stressed the need for formal representation of medical concepts, and for a dialogue which would include explanations in terms that a doctor could understand. These aims coincided with those of the new direction in AI work, away from general problem-solving programs and towards systems that organized and made available knowledge about a specific subject (Duda and Shortliffe, 1983). The term "knowledge-based" will be used to describe a system which encodes expert judgements in a structure intended to bear some resemblance to human cognition; the phrase "expert system" is often used when such a program is designed to act as a consultant. Reviews of the intensive and obviously exciting work in this area are given by Duda and Shortliffe (1983), Buchanan (1982) and in the recent handbook edited by Barr and Feigenbaum (1981-82), who discuss, for example, successful applications in chemical analysis and in the configuration of computer systems. Medical applications have formed a significant part of the research work and are summarised in the papers in Szolovits (1982), while Shortliffe et al. (1979) provide an overall review of medical decision-aids from an AI perspective.
In the remainder of this section we shall consider various aspects of medical expert systems, under the headings of knowledge representation, knowledge acquisition, inexact reasoning, control structure, explanation, performance and transferability. For each aspect we consider the possible failings of the statistical approach and contrast these with knowledge-based models, and in particular illustrate the latter using the MYCIN system (Shortliffe, 1976). This early program was concerned with the diagnosis and treatment of bacterial infections and is still widely discussed because of its incorporation of many sophisticated AI techniques. A "shell" program EMYCIN (E for Essential) was developed from an early version of MYCIN, where the shell consists of the general control structure into which rules covering any application could be incorporated. MYCIN was then rewritten in EMYCIN, which has also been used as an initial structure for many other problems (van Melle et al., 1981; Aikins et al., 1983; Fox, 1983). Many of the comments in this

section concern EMYCIN, and those that specifically relate to MYCIN will be in the past tense. Criticism of the AI perspective will be largely left until the discussion at the end of this section.

3.2. Knowledge Representation

Statistical modelling of conditional distributions of variables given diseases has been criticized for its over-simplicity by Barr and Feigenbaum (1982, p. 179), Shortliffe et al. (1979) and Szolovits and Pauker (1978), particularly for the frequent assumption of conditional independence of variables and the restriction to mutually exclusive and exhaustive diseases (assumptions misleadingly summarized in the phrase "symptoms and diseases should be independent of each other" by Feinstein (1977), in an entertainingly vitriolic paper). The rich physiological knowledge and judgemental experience of clinicians is largely ignored in statistical summaries of data and, in contrast, AI techniques attempt to use various models of human cognition to derive structures to interrelate hypotheses (diseases) to findings (indicants) (Barr and Feigenbaum, 1981). For example, each hypothesis or finding may be represented as a node in a network, with links between them representing different relationships. CASNET (Weiss et al., 1978) is a causal model of the disease glaucoma, with a three-level hierarchy: observations, which are associated with causally related but unobservable pathophysiological states, which are further classified into diseases. INTERNIST (Miller et al., 1982), which has developed into CADUCEUS (Pople, 1982), is a large diagnostic program covering 75 per cent of diagnoses in internal medicine, with about 500 diseases and over 3500 "manifestations" linked in a complex hierarchical network.
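A layered network of the CASNET kind can be sketched as a small linked structure; the observation, state and disease names below are invented for illustration, not taken from the actual system:

```python
# Sketch of a CASNET-style three-level hierarchy (invented names):
# observations -> unobservable pathophysiological states -> diseases.

observations_to_states = {
    "raised-intraocular-pressure": ["angle-closure"],
    "visual-field-loss":           ["angle-closure", "optic-nerve-cupping"],
}
states_to_diseases = {
    "angle-closure":       ["acute-glaucoma"],
    "optic-nerve-cupping": ["chronic-glaucoma"],
}

def diseases_suggested(observation):
    """Follow the links from an observation through its states to diseases."""
    found = set()
    for state in observations_to_states.get(observation, []):
        found.update(states_to_diseases[state])
    return sorted(found)

# diseases_suggested("visual-field-loss")
#   -> ["acute-glaucoma", "chronic-glaucoma"]
```

The intermediate layer of states is what distinguishes such causal networks from a flat table of disease-by-symptom probabilities.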
A second structure creates a frame around each hypothesis consisting of the prototypical presentation (see, for example, Pauker et al., 1976; Aikins, 1983), with links such as "may be caused by" or "may be complicated by" expressing relations with other hypotheses and findings. Finally, the production rule representation of knowledge has been increasingly used as a model for human cognition (Young, 1979), and provides a highly modular structure in which each "parcel" of knowledge is a rule of the type "IF (premise) THEN (action)". In EMYCIN a premise is the conjunction of a number of clauses comprised of specific patient data: for example, Barr and Feigenbaum (1982, p. 187) provide MYCIN's Rule 050:

"IF 1) the infection is primary-bacteremia, and 2) the site of the culture is one of the sterile sites, and 3) the suspected portal of entry of the organism is the gastro-intestinal tract, THEN there is suggestive evidence (.7) that the identity of the organism is bacteroides."

(The quantity .7 will be discussed in Section 3.5.) According to Shortliffe and Buchanan (1975), production rules allow the coding of general clinical knowledge, specific knowledge about rare cases, a modular structure for easy modification, automatic consistency checking, easy explanation of reasoning and a capacity for instruction. About 500 such rules were used in MYCIN; additional control of a consultation was provided by a "context tree" which related, for example, cultures to the relevant infection, and multiple infections to the patient. This structure was adopted into EMYCIN, but has been considered to be too rigid for some applications by Cendrowska and Bramer (1983), while Aikins (1983) found the addition of frames useful for improving control of the consultation.

3.3. Knowledge Acquisition

The appetite of statistical systems for extensive data is clearly a problem if a wide range of diseases and indicants are being considered, and reliable subjective probability estimates may be difficult to obtain and possibly prone to the systematic biases that can occur in eliciting probability assessments from untrained subjects (Kahneman et al., 1982). AI systems, on the other hand, may develop a prototype system fairly quickly in consultation with experts, and then steadily increase the knowledge base in response to practical problems that arise. Duda and Shortliffe (1983) discuss the time-consuming problem of obtaining expert judgements that clinicians

often find difficult to formulate, and this is one reason behind the development of automatic rule acquisition programs (see, for example, Michalski and Chilausky, 1980) that infer rules from a series of actual decisions: such programs form part of the general study of "machine-learning". As MYCIN's rule base steadily grew, a built-in program checked the logical consistency of new rules with the existing set. An additional program, called TEIRESIAS (Davis, 1979), was intended to enable clinicians to track the line of reasoning and, if an inappropriate step had been taken, suggest rules that would prevent recurrence of the error.

3.4. Control Structure

Statistical programs normally require a block of data to be entered initially. Those that elicit data sequentially may use either an information or utility theory selection procedure, each of which may, unless special precautions are taken, lead to "jumping around" in a manner alien to clinicians. AI systems attempt to model a clinician's behaviour when he is exploring a problem, and a number of heuristic strategies for controlling the consultation with the system have been adopted. In INTERNIST, for example, as one disease increases its lead in "points" over other contenders, the strategy changes from one of "discrimination", which asks questions designed to maximise the overall spread in points, to one of "pursuing", which asks questions designed to confirm the current favourite hypothesis (Miller et al., 1982). MYCIN's strategy for searching its knowledge base was broadly "goal-driven". For example, suppose it had set itself the goal of finding the identity of an organism, and was considering Rule 050 mentioned in Section 3.2.
If one of the clauses of the premise were unknown, say the identity of the infection, this became the new sub-goal and MYCIN would "backward-chain" to rules that would help it prove the premise; ultimately the user would be asked for the relevant piece of information about the patient. When finally a premise was fulfilled, the rule would be triggered and the conclusion added to the current data about the patient. There are many additional sophistications to this exhaustive strategy incorporated in EMYCIN, which are described in van Melle et al. (1981) and Cendrowska and Bramer (1983). An alternative control structure for production systems involves "forward-chaining", in which the search is made for rules whose premises are currently fulfilled. This "data-driven" strategy is considered by some to be a better model of human reasoning (Fox et al., 1980), but is more dependent on initial input of data: mixed strategies are now being used, for example, in ONCOCIN.

3.5. Inexact Reasoning

Statistical approaches have been criticized for placing all shades of "inexactness" within a single probabilistic framework. For example, an observer may wish to express a degree of doubt about whether a finding is present or not, a system may wish to allow for within- or between-observer variation in eliciting symptoms, the quantifications in the program may be imprecise, a set of findings may give a degree of confirmation to an hypothesis, there may be ignorance when little evidence is available, and finally there is a frequency with which findings occur in an hypothesis. Szolovits and Pauker (1978) review a number of numerical techniques adopted in medical AI systems to handle "inexact reasoning". Miller et al.
(1982) describe INTERNIST's strategy in detail, in which the "evoking strength" of a finding towards a hypothesis (roughly corresponding to a degree of confirmation) is assessed separately from the frequency with which a hypothesis implies a finding, and an ad hoc formalism propagates "scores" through the network. Fuzzy set theory (Zadeh, 1965) appears to be popular on mainland Europe (Adlassnig, 1980; Wechsler, 1976; Smets, 1981), while the Shafer/Dempster theory of belief functions (Shafer, 1976, 1982), which allows a formal mechanism for handling doubt, ignorance and conflicting evidence, is receiving increasing attention in the AI literature (see, for example, Barnett, 1981, Wesley, 1983 and other papers in the respective conference proceedings). Fox et al. (1980) argue for purely semantic reasoning, involving no numerical manipulations. Almost all systems consider the "belief" in each hypothesis in isolation from others, and so evidence is considered either for or

against an hypothesis in the manner of the MEDAS system in Section 2; Pople (1982) contrasts this to a differential-diagnosis in which diagnosis by exclusion is a possible strategy. The "confirmation theory" approach originally implemented in MYCIN is descended from the distinction by Carnap (1950) between two types of probability: "degree of confirmation" and "relative frequency". Although it has undergone revisions as the program has developed, as first described in Shortliffe and Buchanan (1975) the degree of certainty (such as .7 in Rule 050 given above) was elicited from the expert by a question: "On a scale of 1 to 10, how much certainty do you affix to this conclusion?"; the number was interpreted as an increased measure of belief (MB) in the conclusion (h) given the premise (e1), and was defined to be

MB1(h) = {p(h | e1) - p(h)}/{1 - p(h)},    (3.1)

where p(h) and p(h | e1) are, respectively, the prior and posterior probabilities of h. If a second rule could be triggered with premise e2 and the same conclusion h qualified by MB2(h), the combined increased measure of belief MB1,2(h) was given by the rule of combination

1 - MB1,2(h) = {1 - MB1(h)}{1 - MB2(h)}.    (3.2)

Any evidence for h̄, the complement of h, was treated in an identical way and an increased measure of belief for h̄ (or equivalently the measure of disbelief in h) simultaneously propagated through the consultation. The overall certainty factor (CF) was defined to be the difference MB(h) - MB(h̄), where -1 ≤ CF ≤ 1; -1 represents disconfirmation of the hypothesis and 1 represents confirmation. The certainty factors were then used to rank hypotheses: since a rule with conclusion h provides no evidence for h̄, the degrees of certainty in the rules may also be interpreted as certainty factors. This formalism therefore takes account of doubt when MBs are assigned by the user to findings, ignorance when both MB(h) and MB(h̄) are low, and conflict of evidence when both MB(h) and MB(h̄) are high. However, Adams (1976) showed that (3.1) may be rewritten

1 - MB1(h) = p(h̄ | e1)/p(h̄) = p(e1 | h̄)/p(e1),

and hence the rule of combination (3.2) would be that obtained by assuming items of evidence are conditionally independent both in the hypothesis and in the complement of the hypothesis, and unconditionally independent as well: Shortliffe and Buchanan (1975) were aware they were making strong independence assumptions, and advised that "dependent pieces of evidence be grouped within single rather than multiple rules". Moreover, Adams showed that according to definition (3.1) two hypotheses could have certainty factors in reverse ranking to their posterior probabilities, the same argument being used by Popper (1959-80, p. 390) in his rebuttal of Carnap's theory that degree of confirmation could be interpreted as a probability. When EMYCIN was developed this formalism was changed. The separation of evidence for and against an hypothesis has apparently been dropped and for each "fact" only a single certainty factor, lying between -1 and 1, is carried throughout the consultation.
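The original MYCIN belief calculus just described can be sketched directly; the prior and posterior probabilities below are invented, and only the "increased belief" case p(h | e) > p(h) is handled:

```python
# Sketch of the original MYCIN measures of belief (Section 3.5).
# mb implements definition (3.1); combine implements rule (3.2);
# the certainty factor is MB(h) - MB(not-h). Probabilities are invented.

def mb(prior, posterior):
    # increased measure of belief; valid when posterior >= prior
    return (posterior - prior) / (1 - prior)

def combine(mb1, mb2):
    # 1 - MB12 = (1 - MB1)(1 - MB2)
    return 1 - (1 - mb1) * (1 - mb2)

def certainty_factor(mb_h, mb_not_h):
    return mb_h - mb_not_h

mb1 = mb(0.2, 0.6)                # rule 1 raises p(h) from 0.2 to 0.6: MB = 0.5
mb2 = mb(0.2, 0.4)                # rule 2: MB = 0.25
mb_h = combine(mb1, mb2)          # 0.625
cf = certainty_factor(mb_h, 0.1)  # with some disbelief in h: CF = 0.525
```

Note how the combination rule only ever increases belief, which is why evidence for the complement must be accumulated separately and differenced at the end.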
The formal interpretation in terms of probabilities has disappeared, and now they are interpreted as "single numbers combining subjective probabilities and utilities. As such they represent the importance of the fact" (van Melle et al., 1981). Clauses of a premise can also be assigned certainty factors, either because they have been implied by a previous rule, or because the user wanted to express doubt in the finding when providing the data. The total certainty factor of a premise is the minimum of the CFs of the clauses; if one of the clauses is itself a disjunction, the CF of the clause is the maximum CF of the elements in the disjunction. This is the standard fuzzy logic operator of minimizing for AND, and maximizing for OR (Zadeh, 1965). The certainty factor of the whole premise scales down the certainty factor of the rule: for example, if the clauses of Rule 050 have CFs 0.8, 0.6 and 1.0 attached to them, the certainty factor carried over would be 0.7 times min{0.8, 0.6, 1.0} = 0.42. Currently a premise is considered to be fulfilled and the rule triggered if the overall CF exceeds 0.2. Given two rules with the same conclusion and certainty factors CF1 and CF2 respectively, a

combined certainty factor CF1,2 is given by

1 - CF1,2 = (1 - CF1)(1 - CF2),   CF1, CF2 > 0,

CF1,2 = (CF1 + CF2)/(1 - min{| CF1 |, | CF2 |}),   CF1 CF2 < 0,

1 + CF1,2 = (1 + CF1)(1 + CF2),   CF1, CF2 < 0.

Were the minimization in the denominator to be replaced by a product, it can be shown that this rule of combination would occur were a belief function interpretation to be given to the measures of belief used in the original MYCIN, and Dempster's rule of combination used before calculating the certainty factor.

3.6. Explanation

It has been said that for statistical systems "there is an unavoidable loss of comprehensibility to the physician using them", and that "when the list of symptoms is long, it may not be clear how each of them (or some combination of them) contributed to the conclusion" (Davis, 1982), and even that probabilistic conclusions are "often anathema to doctors" (Fox, 1982). This is said to be a major reason behind the slow progress of statistical systems (Fox et al., 1980; Fox, 1982), and a survey of attitudes by Teach and Shortliffe (1981) led them to conclude that clinicians would reject a system that gave insufficient explanation, even if it had good diagnostic accuracy. A major aim in using models of human cognition in AI is, therefore, to facilitate explanation. MYCIN gave a "trace" of its reasoning chain in response to a request, and TEIRESIAS was intended to provide a rather broader overview of the consultation strategy using "meta-rules". Impressive examples of dialogues with the system are provided in Davis (1979) and Barr and Feigenbaum (1982, p. 87), although Clancey (1983) and Aikins (1983) both discuss the problems in describing overall strategies with such a highly modular knowledge-base of production rules.

3.7. Performance

Titterington et al. (1981) illustrate the approach to measuring the performance of probabilistic predictions.
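As an aside, the EMYCIN certainty-factor operations of Section 3.5 (the minimum over a conjunctive premise's clause CFs, scaling by the rule's CF, the 0.2 triggering threshold, and the three-case combining function) can be collected into a short sketch; the clause values repeat the Rule 050 illustration:

```python
# Sketch of the EMYCIN certainty-factor operations (Section 3.5).

def premise_cf(clause_cfs):
    return min(clause_cfs)                  # fuzzy AND over the clauses

def triggered_cf(rule_cf, clause_cfs, threshold=0.2):
    # the premise CF scales down the rule's CF; the rule is triggered
    # only if the premise CF exceeds the threshold
    p = premise_cf(clause_cfs)
    return rule_cf * p if p > threshold else 0.0

def combine(cf1, cf2):
    # combining two CFs for the same conclusion (three-case rule)
    if cf1 > 0 and cf2 > 0:
        return cf1 + cf2 - cf1 * cf2        # 1 - CF12 = (1 - CF1)(1 - CF2)
    if cf1 < 0 and cf2 < 0:
        return cf1 + cf2 + cf1 * cf2        # 1 + CF12 = (1 + CF1)(1 + CF2)
    return (cf1 + cf2) / (1 - min(abs(cf1), abs(cf2)))  # opposite signs

cf = triggered_cf(0.7, [0.8, 0.6, 1.0])     # 0.7 * min(0.8, 0.6, 1.0) = 0.42
# combine(0.42, -0.2) = 0.22 / 0.8 = 0.275
```

The opposite-signs case is where conflicting evidence is resolved, and is the branch affected by the minimization discussed above.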
Similar large-scale exercises have not been reported for AI systems; this might reflect the complexity of the clinical problems tackled, or the explicit aim of mimicking expert judgement especially in areas where objective accuracy is hard to define. However, Fox et al. (1980) have shown that an independence Bayes model and a forward-chaining production system have equivalent performance in a particular diagnostic setting, while INTERNIST was recently evaluated on a series of 19 published case conferences as being of similar accuracy to hospital clinicians, but inferior to case discussants (Miller et al., 1982). Yu et al. (1979a, b) present two evaluations of the acceptability to sets of expert judges of MYCIN's treatment recommendations on 15 cases of bacteraemia and 10 cases of meningitis, in which MYCIN's therapy selections were judged to be "unacceptable" in 27 per cent and 35 per cent of expert/patient assessments, respectively. Although this may appear unimpressive, it should be noted that in the latter study MYCIN was shown to have marginally better performance than any of eight clinicians of varying seniority, whose recommendations were considered even less acceptable by judges blinded as to which were the computer decisions. This illustrates the problem in deciding a "correct" therapy and assessing "objective" accuracy.

3.8. Transferability
A number of technical difficulties may arise when trying to move statistical systems away from their place of development. Firstly, different clinical data may be routinely collected; hence only commonly available variables should be used for discrimination. Secondly, if the observer variation in eliciting indicants differs in the new setting this may bias the probabilistic predictions

(Lindberg, 1981; Spiegelhalter, 1982b). Thirdly, the prevalence of diseases may vary from place to place, either due to genuine geographical variation (de Dombal et al., 1981) or to different reasons for referral to the clinical centre. Finally, the presentation of the disease may vary, due to a genuinely different disease process, to different definition of indicants or to the type of selection bias discussed by Dawid (1976); such variations have been shown to affect considerably a system based on a conditional independence model (de Dombal et al., 1981). These problems appear to apply equally to knowledge-based systems, though with the additional difficulty of the kind of local biases in expert opinion shown by Yu et al. (1979b).

3.9. Discussion
The brief points made above are only a superficial view of the expanding field of expert systems in medicine, and we hope we have not given the impression that all AI workers reject statistical methodology completely. Indeed, Szolovits and Pauker (1978) and Shortliffe et al. (1979) emphasize that statistical systems are invaluable in restricted problem domains if data are available and good performance can be shown, and particularly if given an "acceptable" interface with the clinicians. Some of the comments given above may appear to statisticians to be somewhat unfair: for example, explicit independence assumptions have been criticized as being unrealistic, but are often replaced by even stronger implicit independence assumptions with unknown properties, to the extent that Adams (1976) suggested MYCIN's strategy might only work if the chains of reasoning are kept short. Underlying such points appear to be two fundamental issues which may lead statisticians to be wary of aspects of the AI paradigm.

The first issue is largely technical and concerns the attitude to complexity of modelling.
Statistical theory and practice emphasize the need for parsimony in models whose primary purpose is prediction, since over-fitting to past data substantially decreases predictive accuracy on new cases. The contrasting AI view is well stated by Szolovits and Pauker (1979): "If the program's domain is logically consistent, then, in principle, the expert can correct the system's knowledge to achieve complete agreement of every case so far considered. The AI methodology emphasizes the refinement of the underlying model to account for all observed phenomena, whereas the statistical methods tend to acquiesce to simpler models and accept errors as consistent with expected variability." Furthermore, Buchanan (1982) states that "the basic mechanism we have for coping with uncertainty in expert systems is to exploit redundancy", and recommends the inclusion of redundant rules to allow many reasoning paths to a conclusion. The major question seems to be whether a complex entity about which one has incomplete knowledge, whether the judgemental process of a clinician or the disease process itself, is best modelled by a system that is increasingly large but essentially deterministic, or that is parsimonious and probabilistic. We believe that, if the aim of the system is good prediction of future cases, the latter is appropriate. The second fundamental issue concerns the use of ad hoc quantification in modelling an expert's judgement, and the undefined measures of support for hypotheses that result from the consultation. In MYCIN and INTERNIST neither the numerical inputs nor outputs appear to have any verifiable interpretation, and Duda and Shortliffe (1983) acknowledge that "the operational meaning of the numbers is not always clear". 
This is also a problem with statistical systems that may have well-defined numerical input but pay no attention to calibration; this allows Fox and Alvey (1983) to say that it is unclear whether the probabilistic output is an estimated frequency or a measure of evidence. We note that the kind of variation in clinical practice described in Section 3.7 is typical of much of medicine (see, for example, Thomas et al., 1980) and believe that if systems are to be of value they must attempt to transcend this disagreement. We therefore see little purpose in modelling a clinician's opinion using uninterpretable numbers, and instead believe that any "inexactness" in a logical proposition should be an estimate of a well-defined quantity, where the estimate may be subjective, based on data, or preferably a Bayesian combination of the two.

In spite of the misgivings expressed above, we do consider that many of the criticisms of statistical systems have some validity, and in the next section we describe a system that attempts to take account of many of the points raised by the artificial intelligence community. To help comparison the discussion follows the structure of this section, and we enclose the headings in quotes when they are terms that specifically apply to AI systems.

4. THE GLASGOW DYSPEPSIA SYSTEM
4.1. Background
An editorial in the British Medical Journal (Anon., 1978) has argued that the large number of inappropriate investigations for dyspepsia could be reduced by making accurate diagnoses on the initial presentation of the patient. With this aim in mind, the Diagnostic Methodology Research Unit of the Southern General Hospital in Glasgow has been collecting data on patients with dyspepsia who have been referred by their general practitioner to a specialist gastrointestinal clinic, where "dyspepsia" has been defined as "episodic, recurrent or persistent abdominal pain or discomfort or any other symptom referable to the alimentary tract excepting rectal bleeding or jaundice". The data from the specialists' initial interview were available on a proforma which included agreed definitions of all symptoms and signs. Full results of sigmoidoscopy, endoscopy, radiology and psychological tests were usually also available, giving up to 400 items of information on a data base of 1200 cases at the time of analysis. One main aim of the project is to produce a decision-support system for general practitioners with the hope of reducing referrals to specialist gastrointestinal clinics and improving other outcome measures without there being any drop in the standard of care delivered to the patient. The system consists of four main stages.

Firstly, clinical symptoms are collected from the patient by computer interview (Lucas et al., 1976) using a special keyboard fitted to a standard microcomputer. This has been shown to be highly acceptable to patients and at least as accurate as clinicians at eliciting information (Lucas, 1977; Card and Lucas, 1981). The data obtained consist of variables describing aspects of pain, bowel habits, heartburn and other symptoms, together with background demographic information and details of drinking and smoking behaviour. The progress of the interview, which usually takes about 30 minutes, is governed by the responses to key questions which trigger further sets of questions. In the second stage of the system a probabilistic diagnosis of the possible causes of the dyspepsia is obtained using the techniques described in this section. Thirdly, a recommendation for the management of the patient is derived from the probabilistic diagnosis; for example, to give medical treatment for duodenal ulcer, to refer to hospital for investigation of the oesophagus, or to give advice on alcoholism. Lastly, the important findings at interview, the possible diagnoses, and the management suggestions are produced for the GP in a clear format as a computer-generated report, with the hope that this will allow the limited time of contact with the patient to be spent in deeper exploration of his or her problem. In this section we present a brief overview of the structure of the diagnostic part of the system, without going into a detailed statistical discussion.
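The branch-triggered questioning can be illustrated with a small sketch; the question wording and tree structure here are hypothetical and are not the actual GLADYS interview script:

```python
# Illustrative sketch only: a "yes" to a key question triggers a further
# set of questions, while a "no" prunes the whole branch. The questions
# below are hypothetical examples, not the real interview.
FOLLOW_UPS = {
    "Do you have abdominal pain?": [
        "Does the pain come in episodes?",
        "Is the pain relieved by food?",
    ],
    "Does the pain come in episodes?": [
        "Are the episodes worse in winter?",
    ],
}

def interview(answers, question="Do you have abdominal pain?"):
    """Depth-first walk of the question tree, returning the questions asked."""
    asked = [question]
    if answers.get(question):  # follow-ups are asked only after a "yes"
        for follow_up in FOLLOW_UPS.get(question, []):
            asked.extend(interview(answers, follow_up))
    return asked
```

A patient reporting episodic pain is asked about winter episodes; a pain-free patient is asked only the key question, so the interview length adapts to the responses.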


4.2. "Knowledge Representation"
This takes the overall form of a tree structure for both diseases and indicants, where each indicant has associated with it one or more "weights of evidence" for or against a disease. The disease classes form a three-level hierarchy. At the highest level, there are seven "generic" classes which constitute the major types of disease. The oesophageal, peptic ulcer and bowel classes may then each be divided into diseases that present in a fairly similar manner but for which the appropriate investigations and treatment are quite different. This provides 10 "treatment" classes that are fairly homogeneous at the level of treatment provided by the GP. These 10 classes may then be further divided into the 27 specific diseases, a sample of which is shown in Table 1. Noting the observation of Szolovits and Pauker (1978) that "the diagnosis needs to be only

TABLE 1
Hierarchy of disease classes

Seven "Generic" classes   10 "Treatment" classes              Examples of 27 "Specific" classes

Oesophageal disease        Simple oesophageal disease   (45)
                           Severe oesophageal disease   (63)

Peptic ulcer              *Duodenal ulcer disease      (330)  Duodenitis                  (11)
                                                              Duodenal ulcer             (220)
                                                              Scar of duodenal ulcer      (74)
                                                              Pyloric/Antral ulcer        (17)
                                                              Gastric outlet obstruction   (8)
                           Gastric ulcer disease        (74)  Gastric ulcer               (68)
                                                              Scar of gastric ulcer        (6)

Bowel disease             *Irritable bowel syndrome    (177)
                           Organic bowel disease        (63)

Gallstones                 Gallstones                   (50)
Alcohol-caused             Alcohol-caused               (48)
Gastric cancer             Gastric cancer               (32)
Non-organic               *Non-organic                 (294)

(Main treatment diagnoses of 1176 patients shown in brackets.)
(* Indicates disease could be initially managed from general practice)
as precise as is required by the next decision to be taken by the doctor", our current program only considers the 10 treatment classes. The problem of multiple diagnoses has led us to adopt a "binary-task formulation" in the manner of many AI systems: evidence is described as being either for or against each of the seven "generic" diseases, and within each of the oesophageal, ulcer and bowel classes, evidence may be for one or the other member. These final three discrimination tasks, in which a group of patients requiring early hospital referral is to be identified, are of particular clinical interest. The indicants also form a hierarchy, reflected in the branching structure of the interviewing program. For example, everyone is asked if they have abdominal pain or not, and positive responders answer a series of further questions, including whether the pain comes in episodes. If episodic, further details are elicited; for example, whether episodes are worse in winter. The weights of evidence or "scores" relate the indicant tree to the disease tree: for example, "pain occurs every day" currently gives a score of +191 to gastric cancer (relative to not having cancer) and +101 to gastric ulcer (relative to having duodenal ulcer). The derivation of these scores is described in the next section.

4.3. "Knowledge Acquisition"
The weights of evidence are derived from analysis of the extensive data-base. About 150 symptoms have been elicited from 1200 patients by clinical interview using definitions which have now been transferred to the computer interview. About 2 per cent of the data is missing, due to some early changes in form design or incomplete patient interviews. After at least 6 months' follow-up each patient was assigned, using "certain", "probable" and "possible" as qualifiers, to up to 3 of the 27 specific disease classes, and approximately 30 per cent fell into more than 1 of the 10 treatment classes.
The numbers of patients with each of the treatment classes as their primary diagnosis are provided in Table 1, excluding 24 patients labelled "undiagnosable" or lost to adequate follow-up. Counting up the less severe disease totals we find that for 894/1200 = 75 per cent of patients visiting the clinic, the result of specialist investigation was essentially to rule out serious disease. We shall illustrate the derivation of the weights of evidence using the disease group "gallstones". The data-base is first split into the two relevant groups of patients: 57 patients with gallstones,

denoted D, as one of their diagnoses are contrasted with 1119 patients who are confirmed not to have the disease, denoted D̄. We acknowledge that the latter group may be somewhat heterogeneous and dominated by the common diagnoses. "Possible" and "probable" gallstones (6 out of the 57 cases) show similar symptoms to the "certain" cases and are included in the analysis. Initial examination of two-way tables reveals a shortlist of variables which are candidates for inclusion, and these are shown in Table 2 with the frequency of occurrence of each indicant s in D and D̄. The "no" responses for the questions concerning type of pain include the few "logical-no" patients who did not have any pain, and there is a branch of questions that only concern those who reported pain in "attacks".

The frequency with which the indicant s occurs in D may be viewed as an estimate of the conditional likelihood p(s | D), and the "independence Bayes" model for discrimination is equivalent to multiplying the likelihood ratios p(s | D)/p(s | D̄) for the indicants that occur. Taking natural logarithms to give an additive scale, and for convenience multiplying by 100 and rounding to integer values, leads to the equivalent discrimination procedure of adding "weights of evidence" 100 ln{p(s | D)/p(s | D̄)}, denoted by W(D:s). This definition of weight of evidence was used by Good (1950) who in a recent review paper (Good, 1983) traces the application of the concept primarily to the cryptanalytic work in World War II by Alan Turing and others. In general, given a hypothesis H, evidence E and existing information G, the weight of evidence is the logarithm of the "Bayes factor" p(E | H, G)/p(E | H̄, G), by which one multiplies the prior odds on H to give the posterior odds on H having observed E. Good (1960) argues that weight of evidence is the natural expression satisfying a slightly amended version of the desiderata listed by Popper (1959-80, pp. 387-419) for a measure of degree of corroboration, and Good and Card (1971) and Card and Good (1974) strongly recommend the use of this concept in clinical medicine. In our circumstances, H is a disease, E is a symptom and G may vary: for the question on length of history in Table 2, G is the fact that a patient has presented at a clinic with dyspepsia; within the "branch" questions, G is strengthened to include the fact that the patient has complained of attacks of pain; in the discrimination task between DU and GU, G would represent a restriction to the peptic ulcer class.

Let r_D, r_D̄ represent the numbers of patients with an indicant s within the classes D and D̄, comprising n_D and n_D̄ patients respectively. We have estimated W(D:s) by

    100 ln [{(r_D + ½)/(n_D + 1)} / {(r_D̄ + ½)/(n_D̄ + 1)}];

the addition of ½ is intended to help remove bias (Cox, 1970, p. 33). The estimated standard errors are given by 100 {(r_D + 1)^(-1) + (r_D̄ + 1)^(-1)}^(1/2). The use of weights of evidence appears to be relatively straightforward to explain in clinical terms. For example, the finding "attacks of pain" would be said to have "sensitivity" (proportion of diseased patients with positive finding) 83 per cent and "specificity" (proportion of non-diseased patients with negative finding) 92 per cent (Galen and Gambino, 1975), so the weight of evidence when a symptom is present may be written using these terms as 100 ln{sensitivity/(1 - specificity)} and when absent as 100 ln{(1 - sensitivity)/specificity}. An indicant such as being aged over 25 has high sensitivity but low specificity and hence only conveys evidence when absent. In contrast, pain radiation to the shoulder, with low sensitivity but high specificity, conveys a lot of evidence when present but little when it does not occur. This quantifies the idea that findings that describe a disease are not necessarily good for the differential diagnosis of the disease, except in ruling out the disease by their absence.
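As a minimal sketch (function and variable names our own), the weight-of-evidence estimator, its standard error and the equivalent sensitivity/specificity form can be written directly from the four counts:

```python
import math

def weight_of_evidence(r_d, n_d, r_dbar, n_dbar):
    """Estimate W(D:s) = 100 ln{p(s|D)/p(s|not-D)} with the half-count
    correction to reduce small-sample bias, plus its standard error."""
    w = 100 * math.log(((r_d + 0.5) / (n_d + 1)) / ((r_dbar + 0.5) / (n_dbar + 1)))
    se = 100 * math.sqrt(1 / (r_d + 1) + 1 / (r_dbar + 1))
    return w, se

def weights_from_sens_spec(sens, spec):
    """Equivalent clinical form: scores for a symptom present and absent."""
    present = 100 * math.log(sens / (1 - spec))
    absent = 100 * math.log((1 - sens) / spec)
    return present, absent
```

For "attacks of pain" (sensitivity 83 per cent, specificity 92 per cent) this gives a score of about +234 when the symptom is present and about -169 when it is absent, illustrating how a specific finding carries most of its evidence when positive.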
Good (1950) also emphasizes the value of the expected weight of evidence E[W(D:S) | D] as a means of selecting appropriate questions. Let w_j denote the weight of evidence W(D:s_j), and w_0 denote 100 ln{p(D)/p(D̄)}. Then "independence Bayes" is equivalent to assuming

that the posterior log odds are additive in the weights:

    100 ln{p(D | s)/p(D̄ | s)} = w_0 + Σ_j w_j,

the sum being over the responses obtained.

... D_1, ..., D_10 are present or that there is insufficient evidence for confident diagnosis. In fact, making the admittedly somewhat unrealistic assumption that diseases occur independently, we may formally calculate the probability that "none of the diseases is present" as

    p("none") = p(D̄_1 & D̄_2 & ... & D̄_10 | s) = Π_{i=1}^{10} {1 - p(D_i | s)}.

This provides a single measure of the degree to which the indicants match none of the diseases under consideration, and thus acts as a type of outlier rejection procedure. It has been found that many patients with non-organic or "nervous" dyspepsia obtain a high value of p("none") since they present with a set of symptoms from which no consistent pattern emerges. Having shown how the essential predictive uncertainty is estimated using the accumulated data, we consider the other aspects of "inexactness" mentioned in Section 3.5. Firstly, we currently make no provision for doubt expressed about a symptom, although in fact this information is routinely collected since the interviewee has the choice of qualifying their answer by "possibly", "probably" or "certainly" as part of the special keyboard. Secondly, there is

no provision for changes in observer variation since we have the great advantage of an entirely consistent questioning instrument. Thirdly, we can, if necessary, quantify the imprecision in our "knowledge" by quoting approximate standard errors around our predictions, using the estimated covariance matrices obtained from the regression package but without allowance for selection effects. The total score from a main regression, (a_0; a)'(1; w), has variance (1; w)' Σ (1; w), where Σ is the covariance matrix of (a_0; a). If a "within branch" discrimination has been carried out, from the argument in Section 4.3 the within-branch total score will be approximately a_J' w_J with variance w_J' Σ_J w_J, where Σ_J is the estimated covariance matrix of the within-branch regression coefficients a_J. For the patient described above, the two independent contributions to the variance come to 1377 + 381 ≈ 42², giving a rough two standard error interval of (-40, 128) for the total score, which transforms to an interval of (0.40, 0.78) for the probability of gallstones. The width of this interval reflects the somewhat inadequate sample size, and more cases of gallstones are currently being collected.

The fourth type of "inexactness" concerned a formal expression for the degree of ignorance at any stage. One way this might be implemented would be to give, after a limited amount of evidence is available, not only the current score and associated probability but also the predictive distribution of probabilities that could occur when the rest of the data is available. "Ignorance" would be expressed by the spread of this distribution of potential results, which would narrow as more evidence became available. In this context, however, where the clinicians do not examine the data sequentially, we have used an approach based on the original MYCIN strategy of keeping positive and negative evidence separate.

Suppose that at any stage the positive score towards a disease D is +P, and the negative score -N. Then the balance of evidence is P - N, which, when combined with the starting score, provides the probability of D, while P + N is a measure of the total evidence obtained. The three situations of ignorance, consistent evidence and conflicting evidence have the following interpretation: when one has obtained little evidence, P + N is low, so we are in a state of ignorance. When mainly consistent information is available, P + N will only be slightly larger than |P - N|. When conflicting information is obtained, |P - N| may be small but P + N large; we are currently experimenting with using (P + N)/|P - N| as a rough measure of conflict. For example, suppose a male presents suffering from nausea before breakfast, vomiting and painless diarrhoea. However, he steadfastly claims to be only a light drinker. Our current scoring system for alcohol-induced dyspepsia would give him a positive score of 410 and negative score of -226, with a "conflict ratio" of (410 + 226)/(410 - 226) = 3.5, warning us that the symptomatology may be suspicious. This should be contrasted with the gallstone patient described above, for whom P = 425, N = 81 and so (P + N)/|P - N| = 1.5, indicating reasonably consistent evidence. A high measure of conflict may also act as a guide to where our predictive model may well be inaccurate.
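A minimal sketch (function names our own) of the score-to-probability conversion, the conflict ratio and the "none of the diseases" calculation described above:

```python
import math

def probability(total_score):
    """Convert a total score (100 x natural-log odds) to a probability."""
    return 1 / (1 + math.exp(-total_score / 100))

def conflict_ratio(p, n):
    """(P + N)/|P - N|: near 1 for consistent evidence, large when strong
    positive and negative evidence nearly cancel."""
    return (p + n) / abs(p - n)

def p_none(disease_probs):
    """Probability that none of the diseases is present, under the
    admittedly unrealistic assumption that diseases occur independently."""
    result = 1.0
    for p in disease_probs:
        result *= 1 - p
    return result
```

With the figures quoted in the text, the alcohol example (P = 410, N = 226) gives a conflict ratio of about 3.5, the gallstone patient (P = 425, N = 81) about 1.5, and the two-standard-error score interval (-40, 128) transforms via `probability` to roughly (0.40, 0.78).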

4.6. Explanation
The doctor receives a report giving the major findings at the interview and a list of the final disease scores and probabilities. For plausible diseases we also provide a "balance of evidence" account, and an example of a recent patient is given in Fig. 1. This allows him to see the factors that have contributed to the conclusion and by how much; major conflicts may be identified and it is straightforward to find the effects of changing symptoms which he feels have been wrongly elicited by the interview. For the patient in Fig. 1, a pattern of ulceration builds up in steady stages. However, the short length of history is a source of conflict and the clinician may want to explore this further. We are currently experimenting with the format of the output, and the level of detail to provide has not yet been decided.

4.7. Performance
The full strategy for evaluating this system is described in Spiegelhalter (1983) and detailed results will be reported after further development. For many patients in the data base, the clinicians had made a probability judgement for each of the 10 treatment classes at the time of


Evidence FOR Peptic Ulcer                     Evidence AGAINST Peptic Ulcer

Abdominal pain                      (+9)      Length of history less than 1 year  (-75)
Episodic                           (+19)      No previous operation for ulcer      (-5)
Relieved by food                   (+44)      No seasonal effect on pain           (-9)
Occasionally woken at night                   No waterbrash                       (-29)
  and relieved by snack            (+25)                                          -----
Epigastric                         (+28)                                           -118
Point at site of pain with fingers (+19)
Family history of ulcer            (+39)
Smoker                             (+41)
Vomits, then eats within 3 hours   (+54)
                                   -----
                                   +278

Balance of evidence   +160   (Total evidence 396: conflict ratio = 2.5)
Initial score          -84   (corresponding to prevalence of 30%)
Final score            +76   = 68% chance of peptic ulcer

Fig. 1. Preliminary form of summary table.

the initial interview, and we can therefore compare both the discrimination and the reliability of the judgements made by the doctor and the computer system. Of course, the clinician does have the advantage of an examination of the patient. We give only a sketch of how this comparison might take place and refer to Titterington et al. (1981) for a full discussion of the assessment of probabilistic predictions.

First of all, one can assess retrospective discriminatory performance. For example, the gallstones scoring system has been applied back to 37 cases of gallstones and 545 cases without gallstones on whom clinicians' judgements were available. Using a probability cut-off of 10 per cent to indicate further investigation, the system detected 28 at a cost of 40 false positives, compared to the clinicians' detection of 28 at a cost of 26 false positives. Although this seems reasonable, retrospective analysis favours the system and there are insufficient cases with the disease to estimate the reliability of the judgements. In a preliminary prospective study, a scoring system for peptic ulcer developed on 923 patients was applied to a further series of 199 patients on whom clinicians' opinions were available. A "receiver operating characteristic" (ROC) curve is presented in Fig. 2, which shows the false-positive (proportion of D̄ wrongly detected) and true-positive (proportion of D detected) rates for different probability thresholds. It is apparent that the "easy" 45 per cent of peptic ulcers are detected equally efficiently by clinicians and the system, but above this figure the system approximately doubles the false-positive rate. Table 3 gives a simple summary of the reliability of the predictions, showing the system is reasonably well-calibrated but still shows the over-confidence described by Copas (1983). The clinicians do not exhibit the over-confident behaviour found in other studies of probability assessors (Lichtenstein et al., 1977), possibly reflecting their training on the previous patients in the data base.

TABLE 3

Reliability of predictions: numbers of peptic ulcer (PU) and non-PU patients, and predictive accuracy, within bands of predicted probability of peptic ulcer ("0 to < 0.1" upwards); computer predictions, with clinicians' predictions in brackets. (Patients with probability > 0.50 are allocated to the "peptic ulcer" class.)
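The retrospective gallstones comparison in Section 4.7 reduces to simple rates; this sketch (names our own) reproduces the counts quoted in the text:

```python
# Detection and false-positive rates at the 10 per cent probability
# threshold quoted in the text: 28 of 37 gallstone cases detected, with
# 40 (system) or 26 (clinicians) false positives among 545 non-cases.
def rates(detected, n_disease, false_pos, n_no_disease):
    """Return (true-positive rate, false-positive rate) as percentages."""
    return 100 * detected / n_disease, 100 * false_pos / n_no_disease

system_tp, system_fp = rates(28, 37, 40, 545)
clinician_tp, clinician_fp = rates(28, 37, 26, 545)
```

Both detect about 76 per cent of cases, but the system's false-positive rate (about 7 per cent) is higher than the clinicians' (about 5 per cent), one operating point on the ROC curve of Fig. 2.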

4.8. Transferability
In Section 3.8 we identified four problems that may occur in using a discrimination technique in an environment different from the research centre in which it was developed. The computer interview helps with problems in reliability and bias in eliciting information, since identical wording may be used throughout the country. (Some Glaswegian dialect is contained in the original version of the interview, but this can easily be removed or changed.) There may be a variation in numbers of consultations for a particular disease; for example, in the 1971-72 morbidity survey of general practice, a rate of 6.8 consultations for peptic ulcer per 1000 population in the North of England contrasts with 3.1 per 1000 in the South (OPCS, 1979). Such changes in prior probability only entail the "starting score" being adjusted; let P_B(D) denote the prevalence, expressed as a probability, of disease D in the data-base, and P_N(D) denote the prevalence in the study population; then the starting score shall have a factor 100 ln{P_N(D) P_B(D̄)/(P_N(D̄) P_B(D))} added. This adjustment will have to be made in any case, since the numbers seen in an outpatient clinic (Table 1) will not mirror the incidence in general practice. The use of the diagnostic paradigm may help to avoid the selection bias that might occur due to local variation in referral patterns, provided our diagnostic indicants include those which contribute to the referral decision (Dawid, 1976). There remains the problem of genuine variations in disease presentation, which would affect the scores given to indicants in the discriminant function for a disease. For example, it is predominantly males who suffer from alcohol-induced dyspepsia in Glasgow, while this may not be the case in areas of the South of England. We hope that such "variable by disease by region" interactions will be rare, but will become apparent if the performance of the system drops when implemented elsewhere.
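The starting-score adjustment just described can be sketched as follows (function name our own; the prevalences used in the example are purely illustrative):

```python
import math

def starting_score_adjustment(p_new, p_base):
    """Additive correction to the starting score when moving from the
    data-base prevalence p_base to a new population prevalence p_new:
    100 ln of the ratio of the two prior odds."""
    new_odds = p_new / (1 - p_new)
    base_odds = p_base / (1 - p_base)
    return 100 * math.log(new_odds / base_odds)
```

For instance, moving from an illustrative clinic prevalence of 30 per cent to a hypothetical general-practice prevalence of 10 per cent lowers the starting score by about 135; no other weights need change.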

4.9. Discussion
The Glasgow dyspepsia system (GLADYS) bears some resemblance to MEDAS in its "knowledge-representation", and to early MYCIN and INTERNIST in its emphasis on the separation of positive and negative evidence. However, in contrast to these systems, our aim has been to provide output with external interpretation in the form of reliable probabilities. We feel the workings of our system are simple to explain and, we hope, intuitive to doctors. We have dealt with the problems of multiple diagnoses, missing data, dependent variables and outlying patients. However, we acknowledge that use of the "disease-not disease" discrimination strategy and our omission of the interaction terms in the regressions are purely pragmatic devices to achieve such a simple system. We are adopting the AI approach of introducing additional complexity only in response to errors in performance: if high-order interaction terms are required the resulting patterns of indicants with attached weights of evidence would begin to resemble a rule-based system, which indeed may be appropriate in some areas of medicine. We also feel that our current scoring system is too sensitive to changes in individual variables, and therefore intend to investigate the suggestions of Copas (1983) which would involve more variables with reduced weights.

The section of the system that provides recommendations is currently a simple list of thresholds. For example, if the probability of cancer exceeds 10 per cent, the recommendation "refer to G.I. clinic for endoscopy" is triggered, and added to the explanation of the evidence for and against gastric cancer. For some diseases both a "treatment" and an "investigation" threshold exist. If probability of duodenal ulcer is high, immediate drug treatment is recommended; if moderate, an X-ray is recommended; and if low, no action is considered.
The aim is to reduce unnecessary investigations at the same time as speeding appropriate treatment. If no thresholds are exceeded, a conclusion of "insufficient evidence" is drawn and the recommendation of "temporise" is made, the act of "creative indecision" as described by Szolovits and Pauker (1978). The thresholds are not set by the decision theoretic argument described in Pauker and Kassirer (1980), and instead are continually reviewed in a process by which the system is "tuned" until its performance is considered adequate in terms of a subjective trade-off between the different types of errors that can be made. This procedure of using "acceptable error rates" to set thresholds, and in turn to reveal implicit value judgements, has previously been applied to a problem in predicting outcome following severe head injury (Spiegelhalter, 1982a). Simulations on the data-base have suggested initial values for the thresholds, but both these, and the order in which the decision-rules are examined, are expected to change during the development and evaluation process currently taking place. Prospective studies on 140 cases using only data collected by computer interview have suggested that if a GP acted on the recommendation of the system, up to 40 per cent of referrals could be saved with no drop in the standard of care. Full details of these studies, and the final weights adopted for diagnosis, will be published in the medical literature. A prototype system is currently in experimental use in two outpatient clinics and a general practice, and the software is being adapted to record repeated interviews of patients and note changes over time. This also allows it to be used as an unbiased "observer" in clinical trials of symptom-relieving drugs. Future developments being considered are the introduction of digitized speech to accompany the interview, and the construction of a version for use in developing countries, for which Dr K. Jalan of the Kothari Centre for Gastroenterology, Calcutta, is currently collecting data.

5. GENERAL DISCUSSION
The preceding application is in a restricted domain, for which extensive data were available, and does not approach the complexity of some of the areas addressed by AI programs. However, we are unapologetic about the requirement for data; if uncertainty exists we feel this should be reliably quantified and justified to clinicians on the basis of extensive recorded experience,

although, if strictly necessary, subjective weights of evidence could be adopted and updated as information becomes available. Clinicians are becoming increasingly familiar with prognostic probabilities and explicit evaluation of risk factors in cancer therapy (Peto et al., 1977) and heart disease (WHO, 1974), and we hope this awareness can be extended to diagnostic problems. We consider it essential that every attempt is made to use probabilistic rather than ad hoc quantitative methods, although our justification is purely pragmatic compared to the theoretical argument of Lindley (1982).

It remains to suggest the appropriate roles, if any, for AI-based and probabilistic systems, bearing in mind that we are concerned only with areas where a deductive or "categorical" approach is insufficient. Sterling et al. (1966), in an early discussion of the problems facing computer-assisted decision-making, identify three possible areas of application: initial structuring of a clinical problem with little prior information, differential diagnosis within a restricted clinical problem area, and automatic interpretation of test results such as EEGs and ECGs. The first area, considered by Sterling to be unsuitable for computerized help, seems precisely appropriate for AI systems such as INTERNIST/CADUCEUS (Pople, 1982), involving heuristic search techniques of a comprehensive knowledge-base, in order to progress towards a differential diagnosis (Blois, 1980). From a practical point of view, however, it does not seem clear who would be motivated to conduct lengthy dialogues with such a system other than in an educational setting. The second area, when a problem has been constrained into a familiar domain, appears appropriate for strict probabilistic reasoning.
Szolovits and Pauker (1978) use the phrase "categorical proposes, probabilistic disposes" to distinguish between the heuristic process of hypothesis formulation and the subsequent weighing of evidence for and against each hypothesis. This structure has a parallel with the initial use of "exploratory data analysis" in a statistical investigation, in which a range of somewhat ad hoc procedures may be applied in order to manipulate the problem into a familiar structure, to which "confirmatory" techniques with known properties may be applied. Nevertheless, we still consider the interface with the clinician to be a great problem, and expect applications as screening systems to be used by paramedical staff or junior doctors.

Finally, the interpretation of data that are not obtained directly by the clinician appears to be the area with the most potential and the one in which the demarcation between the two approaches is least clear. We regard our system as fitting into this category, as does a program known as PUFF, apparently the only AI medical consultation system in routine use (Duda and Shortliffe, 1983). PUFF (Aikins et al., 1983) interprets computerized lung function tests and was originally developed using EMYCIN, but since there is no dialogue with the clinician many of the AI sophistications are not used. Other systems with the same purpose in routine use (Hoffer et al., 1973; Geddes et al., 1978) are purely algorithmic.

We conclude that a synthesis between the two approaches seems possible in many areas, with the logical medical knowledge organized using one of the AI representation structures, and any "inexactness" modelled using weights of evidence. One of us (DJS) is currently collaborating on the development of such a hybrid system for the interpretation of blood sample data in a leukaemia laboratory (Fox, 1983).
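The additive log-odds bookkeeping behind such weights of evidence can be sketched briefly. This is our own illustrative code with made-up numbers, assuming conditionally independent indicants; it is not the GLADYS implementation.

```python
import math

def weight_of_evidence(p_s_given_d, p_s_given_not_d):
    """W(D : s) = log of the likelihood ratio carried by indicant s."""
    return math.log(p_s_given_d / p_s_given_not_d)

def posterior_probability(prior, weights):
    """Add the weights of evidence to the prior log-odds, then
    convert the resulting log-odds back to a probability."""
    log_odds = math.log(prior / (1.0 - prior)) + sum(weights)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Two indicants with illustrative conditional probabilities:
w1 = weight_of_evidence(0.8, 0.2)   # evidence for the disease
w2 = weight_of_evidence(0.3, 0.6)   # evidence against
p = posterior_probability(0.2, [w1, w2])
```

With a prior of 0.2 these two weights partly cancel, giving a posterior of about one third; the point of the representation is that each indicant's contribution remains visible as a single additive score, which is what makes explanation to the clinician straightforward.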

ACKNOWLEDGEMENTS

The authors are indebted to the members of the Diagnostic Methodology Research Unit (Drs G. P. Crean, A. D. Beattie, R. J. Holden and R. W. Lucas) for their close collaboration. Valuable comments were provided by members of the Royal College of Physicians Computer Workshop, particularly Drs John Fox and Jack Dowie, and by Dr Sheila Gore, Dr Colin Begg and the referees. We are grateful to Sally Stephenson for manuscript preparation and Marc Coghlan for computing assistance. Above all, our primary source of inspiration throughout this work has been Professor Wilfrid Card.


REFERENCES

Adams, J. B. (1976) A probability model of medical reasoning and the MYCIN model. Math. Biosci., 32, 177-186.
Adlassnig, K. P. (1980) A fuzzy logical model of computer-assisted medical diagnosis. Meth. Inf. Med., 19, 141-148.
Aikins, J. S. (1983) Prototypical knowledge for expert systems. Artificial Intelligence, 20, 163-210.
Aikins, J. S., Kunz, J. C. and Shortliffe, E. H. (1983) PUFF: an expert system for interpretation of pulmonary function data. Comp. Biomed. Res., 16, 199-208.
Anon. (1978) Data base on dyspepsia (editorial). Brit. Med. J., 1, 1163-1164.
Baker, R. J. and Nelder, J. A. (1978) The GLIM System Release 3. Oxford: Numerical Algorithms Group.
Barnett, J. A. (1981) Computational methods for a mathematical theory of evidence. In Proceedings of 7th International Joint Conference on Artificial Intelligence, Vancouver, pp. 868-875.
Barr, A. and Feigenbaum, E. A. (1981-82) Handbook of Artificial Intelligence, Vols 1 and 2. Los Altos: Kaufmann.
Ben-Bassat, M., Carlson, W. C., Puri, V. K., Davenport, M. D., Schriver, J. A., Latif, M., Smith, R., Portigal, L. D., Lipnick, E. H. and Weil, M. H. (1980) Pattern-based interactive diagnosis of multiple disorders: the MEDAS system. IEEE Trans. Pattern Analysis Machine Intell., 2, 148-159.
Bischoff, M. B., Shortliffe, E. H., Scott, A. C., Carlson, R. W. and Jacobs, C. D. (1983) Integration of a computer-based consultant into the clinical setting. In Proceedings 7th Symposium on Computer Applications in Medical Care, Baltimore.
Blois, M. S. (1980) Clinical judgement and computers. N. Eng. J. Med., 303, 192-197.
Buchanan, B. G. (1982) New research on expert systems. In Machine Intelligence 10 (J. E. Hayes, D. Michie and Y-H. Pao, eds), pp. 269-299. Chichester: Ellis Horwood.
Card, W. I. (1967) Towards a calculus of medicine. Medical Annual, 9, 9-21.
Card, W. I. and Good, I. J. (1974) A logical analysis of medicine. In A Companion to Medical Studies, Vol. III, pp. 60.1-60.23. Oxford: Blackwell.
Card, W. I. and Lucas, R. W. (1981) Computer interrogation in medical practice. Int. J. Man-Machine Studies, 14, 49-57.
Carnap, R. (1950) Logical Foundations of Probability. Chicago: University Press.
Cendrowska, J. and Bramer, M. A. (1982) A rational reconstruction of the MYCIN consultation system. Technical Report, Mathematics Faculty, Open University, Milton Keynes.
Clancey, W. J. (1983) The epistemology of a rule-based expert system: a framework for explanation. Artificial Intelligence, 20, 215-251.
Coomans, D., Broeckaert, I., Jonckheer, M. and Massart, D. L. (1983) Comparison of multivariate discrimination techniques for clinical data: application to the thyroid functional state. Meth. Inf. Med., 22, 93-101.
Copas, J. B. (1983) Regression, prediction and shrinkage (with Discussion). J. R. Statist. Soc. B, 45, 311-358.
Cox, D. R. (1970) The Analysis of Binary Data. London: Methuen.
Croft, D. J. (1972) Is computerized diagnosis possible? Comp. Biomed. Res., 5, 351-367.
Davis, R. (1979) Interactive transfer of expertise: acquisition of new inference rules. Artificial Intelligence, 12, 121-157.
(1982) Consultation, knowledge acquisition and instruction: a case study. In Artificial Intelligence in Medicine (P. Szolovits, ed.), pp. 57-78. Colorado: Westview Press.
Dawid, A. P. (1976) Properties of diagnostic data distributions. Biometrics, 32, 647-658.
de Dombal, F. T., Leaper, D. J., Horrocks, J. C., Staniland, J. R. and McCann, A. P. (1974) Human and computer-aided diagnosis of abdominal pain: further report with emphasis on performance of clinicians. Brit. Med. J., 1, 376-380.
de Dombal, F. T., Leaper, D. J., Staniland, J. R., McCann, A. P. and Horrocks, J. C. (1972) Computer-aided diagnosis of acute abdominal pain. Brit. Med. J., 2, 9-13.
de Dombal, F. T., Staniland, J. R. and Clamp, S. E. (1981) Geographical variation in disease presentation. Med. Decis. Making, 1, 59-69.
du Boulay, G. H., Teather, D., Harling, D. and Clarke, G. (1977) Improvements in the computer assisted diagnosis of cerebral tumours. Brit. J. Radiology, 50, 849-854.
Duda, R. O. and Shortliffe, E. H. (1983) Expert systems research. Science, 220, 261-268.
Edwards, D. A. W. (1970) Flow charts, diagnostic keys, and algorithms in the diagnosis of dysphagia. Scot. Med. J., 15, 378-385.
Essex, B. J. (1980) Diagnostic Pathways in Clinical Medicine, 2nd ed. London: Churchill Livingstone.
Feinstein, A. R. (1977) Clinical biostatistics XXXIX. The haze of Bayes, the aerial palaces of decision analysis, and the computerised Ouija board. Clinical Pharmacology and Therapeutics, 21, 482-496.
Fox, J. (1982) Computers learn the bedside manner. New Scientist, 95, 311-313.
(1983) Formal and knowledge-based methods in decision technology. In Proceedings of 9th Research Conference on Subjective Probability, Utility and Decision-making, Groningen.
Fox, J. and Alvey, P. (1983) Computer assisted medical decision making. Brit. Med. J., 287, 742-746.

Fox, J., Barber, D. and Bardhan, K. D. (1980) Alternative to Bayes?: a quantitative comparison with rule-based diagnostic inference. Meth. Inf. Med., 19, 210-215.
Friedman, R. B. and Gustafson, D. H. (1977) Computers in clinical medicine, a critical review. Comp. Biomed. Res., 10, 199-204.
Galen, R. S. and Gambino, S. R. (1975) Beyond Normality: The Predictive Value and Efficiency of Medical Diagnoses. New York: Wiley.
Geddes, D. M., Green, M. and Emerson, P. A. (1978) Comparison of reports on lung function tests made by chest physicians with those made by a simple computer program. Thorax, 33, 257-260.
Goldman, L., Weinberg, M., Weisberg, M., Olshen, R. et al. (1982) A computer-derived protocol to aid in the diagnosis of emergency room patients with acute chest pain. N. Eng. J. Med., 307, 588-596.
Good, I. J. (1950) Probability and the Weighing of Evidence. London: Griffin.
(1960) Weight of evidence, corroboration, explanatory power, information and the utility of experiments. J. R. Statist. Soc. B, 22, 319-331.
(1983) Weight of evidence: a brief survey. Invited paper, 2nd International Meeting on Bayesian Statistics, Valencia.
Good, I. J. and Card, W. I. (1971) The diagnostic process with special reference to errors. Meth. Inf. Med., 10, 176-188.
Gorry, G. A. (1973) Computer-assisted clinical decision-making. Meth. Inf. Med., 12, 45-51.
Gorry, G. A., Kassirer, J. P., Essig, A. and Schwartz, W. B. (1973) Decision analysis as the basis for computer-aided management of acute renal failure. Amer. J. Med., 55, 473-484.
Gunn, A. A. (1976) The diagnosis of acute abdominal pain with computer analysis. J. Roy. Coll. Surg. Edin., 21, 170-172.
Gustafson, D. H., Kestly, J. J., Ludke, R. L. and Larson, F. (1973) Probabilistic information processing: implementation and evaluation of a semi-PIP diagnostic system. Comput. Biomed. Res., 6, 355-370.
Habbema, J. D. F. and Gelpke, G. J. (1981) A computer program for selection of variables in diagnostic and prognostic problems. Comp. Prog. Biomed., 13, 251-270.
Habbema, J. D. F., Hermans, J. and Remme, J. (1978) Variable kernel density estimation in discriminant analysis. In Compstat 1978 (L. C. A. Corsten and J. Hermans, eds), pp. 178-185. Vienna: Physica Verlag.
Healy, M. J. R. (1978) Is statistics a science? J. R. Statist. Soc. A, 141, 385-393.
Henderson, J. V., Moeller, G., Ryack, B. L. and Shumack, G. M. (1978) Adaptations of a computer-assisted diagnosis program for use by hospital corpsmen aboard nuclear submarines. In Proc. 2nd Symp. on Computer Applications in Medical Care (F. H. Orthner, ed.), pp. 587-594. New York: IEEE.
Hoffer, E. P., Kanarek, D., Kazemi, H. and Barnett, G. O. (1973) Computer interpretation of ventilatory studies. Comp. Biomed. Res., 6, 347-354.
Horrocks, J. C., McCann, A. P., Staniland, J. R., Leaper, D. J. and de Dombal, F. T. (1972) Computer-aided diagnosis: description of an adaptable system and operational experience with 2,034 cases. Brit. Med. J., 2, 5-9.
Kahneman, D., Slovic, P. and Tversky, A. (1982) Judgement under Uncertainty: Heuristics and Biases. New York: Cambridge University Press.
Knill-Jones, R. P., Stern, R. B., Girmes, D. H., Maxwell, J. D., Thompson, R. P. H. and Williams, R. (1973) Use of sequential Bayesian model in diagnosis of jaundice by computer. Brit. Med. J., 1, 530-533.
Krischer, J. P. (1980) An annotated bibliography of decision analytic applications to health care. Oper. Res., 28, 97-113.
Leaper, D. J., Horrocks, J. C., Staniland, J. R. and de Dombal, F. T. (1972) Computer-assisted diagnosis of abdominal pain using "estimates" provided by clinicians. Brit. Med. J., 4, 350-354.
Ledley, R. S. and Lusted, L. B. (1959) Reasoning foundations of medical diagnosis. Science, 130, 9-21.
Lichtenstein, S., Fischhoff, B. and Phillips, L. D. (1977) Calibration of probabilities: the state of the art. In Decision Making and Change in Human Affairs (H. Jungermann and G. de Zeeuw, eds), pp. 275-324. Amsterdam: D. Reidel.
Lindberg, G. (1981) Effects of observer variation on performance in probabilistic diagnosis of jaundice. Meth. Inf. Med., 20, 163-168.
Lindley, D. V. (1982) Scoring rules and the inevitability of probability. Int. Stat. Rev., 50, 1-26.
Lucas, R. W. (1977) A study of patients' attitudes to computer interrogation. Int. J. Man-Machine Studies, 9, 69-86.
Lucas, R. W., Card, W. I., Knill-Jones, R. P., Watkinson, G. and Crean, G. P. (1976) Computer interrogation of patients. Brit. Med. J., 2, 623-625.
Michalski, R. S. and Chilausky, R. L. (1980) Knowledge acquisition by encoding expert rules versus computer induction from examples: a case study involving soy-bean pathology. Int. J. Man-Machine Studies, 12, 63-68.
Miller, R. A., Pople, H. E., Jr and Myers, J. D. (1982) INTERNIST-1, an experimental computer-based diagnostic consultant for general internal medicine. N. Eng. J. Med., 307, 468-476.
OPCS (1979) Studies on Medical and Population Subjects No. 36, Morbidity Statistics from General Practice, 1971-72. London: HMSO.


Pantin, C. F. A. and Merrett, T. G. (1982) Allergy screening using a microcomputer. Brit. Med. J., 2, 483-487.
Patrick, E. A., Stelmack, F. P. and Shen, L. Y. (1974) Review of pattern recognition in medical diagnosis and consulting relative to a new system model. IEEE Trans. Syst. Man Cyb., 4, 1-16.
Pauker, S. G., Gorry, G. A., Kassirer, J. P. and Schwartz, W. B. (1976) Towards the simulation of clinical cognition: taking a present illness by computer. Amer. J. Med., 60, 981-996.
Pauker, S. G. and Kassirer, J. P. (1980) The threshold approach to clinical decision making. N. Eng. J. Med., 302, 1108-1117.
(1981) Clinical decision analysis by personal computer. Arch. Intern. Med., 141, 1831-1837.
Pednault, E. P. D., Zucker, S. W. and Muresan, L. V. (1981) On the independence assumption underlying subjective Bayesian updating. Artificial Intelligence, 16, 213-222.
Peto, R., Pike, M. C., Armitage, P., Breslow, N. E., Cox, D. R., Howard, S. V., Mantel, N., McPherson, K., Peto, J. and Smith, P. G. (1977) Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. Analysis and examples. Brit. J. Cancer, 35, 1-39.
Pople, H. E. (1982) Heuristic methods for imposing structure on ill-structured problems: the structuring of medical diagnosis. In Artificial Intelligence in Medicine (P. Szolovits, ed.), pp. 119-185. Colorado: Westview Press.
Popper, K. R. (1959-80) The Logic of Scientific Discovery. London: Hutchinson.
Rogers, W., Ryack, B. and Moeller, G. (1979) Computer-aided medical diagnosis: literature review. Intern. J. Biomed. Comp., 10, 267-289.
Shafer, G. (1976) A Mathematical Theory of Evidence. Princeton: University Press.
(1982) Belief functions and parametric models (with Discussion). J. R. Statist. Soc. B, 44, 322-352.
Shortliffe, E. H. (1976) Computer-based Medical Consultations: MYCIN. New York: Elsevier.
Shortliffe, E. H. and Buchanan, B. G. (1975) A model of inexact reasoning in medicine. Math. Biosci., 23, 351-379.
Shortliffe, E. H., Buchanan, B. G. and Feigenbaum, E. A. (1979) Knowledge engineering for medical decision making: a review of computer-based clinical decision aids. Proceedings of IEEE, 67, 1207-1224.
Shortliffe, E. H., Scott, A. C., Bischoff, M. B., Campbell, A. B., van Melle, W. and Jacobs, C. D. (1981) ONCOCIN: an expert system for oncology protocol management. In Proceedings of the 7th International Joint Conference on Artificial Intelligence, Vancouver, pp. 876-881.
Smets, P. (1981) Medical diagnosis and degrees of belief. Fuzzy Sets and Systems, 5, 259-266.
Spiegelhalter, D. J. (1982a) Statistical aids in clinical decision-making. The Statistician, 31, 19-36.
(1982b) Comments on Lindberg's correction for the effects of observation variation on probabilistic diagnosis. Meth. Inf. Med., 21, 114-116.
(1983) Evaluation of clinical decision-aids, with an application to a system for dyspepsia. Statistics in Medicine, 2, 207-215.
Spiegelhalter, D. J. and Smith, A. F. M. (1981) Decision analysis and clinical decisions. In Perspectives in Medical Statistics (J. Bithell and R. Coppi, eds), pp. 103-131. London: Academic Press.
Sterling, T. D., Nickson, J. and Pollack, S. V. (1966) Is medical diagnosis a general computer problem? J. Amer. Med. Assoc., 198, 281-286.
Szolovits, P. (ed.) (1982) Artificial Intelligence in Medicine. Colorado: Westview Press.
Szolovits, P. and Pauker, S. G. (1978) Categorical and probabilistic reasoning in medical diagnosis. Artificial Intelligence, 11, 115-144.
(1979) Computers and clinical decision-making: whether, how and for whom? Proceedings of IEEE, 67, 1224-1226.
Teach, R. L. and Shortliffe, E. H. (1981) An analysis of physician attitudes regarding computer-based clinical consultation systems. Comp. Biomed. Res., 14, 542-558.
Thomas, G. E., Cotton, P. B., Clark, C. G. and Boulos, P. B. (1980) Survey of management in acute upper gastrointestinal haemorrhage. J. Roy. Soc. Med., 73, 90-95.
Titterington, D. M., Murray, G. D., Murray, L. S., Spiegelhalter, D. J., Skene, A. M., Habbema, J. D. and Gelpke, G. J. (1981) Comparison of discrimination techniques applied to a complex data set of head injured patients (with Discussion). J. R. Statist. Soc. A, 144, 145-175.
van Melle, W., Scott, A. C., Bennett, J. S. and Peairs, M. A. S. (1981) The EMYCIN manual. Rep. HPP-81-16, Computer Science Department, Stanford University, California.
Wagner, G., Tautu, P. and Wolber, U. (1978) Problems of medical diagnosis - a bibliography. Meth. Inform. Med., 17, 55-74.
Wardle, A. and Wardle, L. (1978) Computer-aided diagnosis - a review of research. Meth. Inform. Med., 17, 15-28.
Wechsler, H. (1976) A fuzzy approach to medical diagnosis. Int. J. Bio-med. Comp., 7, 191-203.
Weed, L. L. (1971) Medical Records, Medical Education and Patient Care. Chicago: Yearbook Medical.
Weinstein, M. C. and Fineberg, H. V. (1980) Clinical Decision Analysis. Philadelphia: Saunders.
Weiss, S. M., Kulikowski, C. A., Amarel, S. and Safir, A. (1978) A model-based method for computer-aided medical decision-making. Artificial Intelligence, 11, 145-172.

Wesley, L. P. (1983) Reasoning about control: the investigation of an evidential approach. In Proceedings of 8th International Joint Conference on Artificial Intelligence, Karlsruhe, pp. 203-206.
WHO European Collaborative Group (1974) An international controlled trial in the multifactorial prevention of coronary heart disease. Int. J. Epidemiol., 3, 219-224.
Williams, B. T. (1982) Computer Aids to Clinical Decisions, Vols I and II. Florida: CRC Press.
Wirtschafter, D. D., Carpenter, J. T. and Mesel, E. (1979) A consultant-extended system for breast cancer adjuvant chemotherapy. Ann. Intern. Med., 90, 396-401.
Wulff, H. R. (1981) Rational Diagnosis and Treatment, 2nd ed. Oxford: Blackwell.
Young, R. M. (1979) Production systems for modelling human cognition. In Expert Systems in the Micro-electronic Age (D. Michie, ed.), pp. 35-45. Edinburgh: University Press.
Yu, V. L., Buchanan, B. G., Shortliffe, E. H., Wraith, S. M., Davis, R., Scott, A. C. and Cohen, S. N. (1979a) Evaluating the performance of a computer-based consultant. Comp. Prog. Biomed., 9, 95-102.
Yu, V. L., Fagan, L. M., Wraith, S. M., Clancey, W. J., Scott, A. C., Hannigan, J., Blum, R. L., Buchanan, B. G. and Cohen, S. N. (1979b) Antimicrobial selection by a computer: a blinded evaluation by infectious disease experts. J. Amer. Med. Assoc., 242, 1279-1282.
Zadeh, L. A. (1965) Fuzzy sets. Inform. Control, 8, 338-353.


DISCUSSION OF THE PAPER BY DRS SPIEGELHALTER AND KNILL-JONES

Professor M. J. R. Healy (London School of Hygiene): This evening's paper is of interest for two rather separate reasons. It gives an account of a solution to a demanding statistical problem, that of classification or discrimination with multiple class membership. Most previous workers have avoided this problem, partly because it does not occur in many taxonomic applications, and partly perhaps because the usual spatial analogy of clusters with boundaries is only with difficulty applicable. But as well as this, it gives our Society an opportunity to discuss an area of research, that into so-called Artificial Intelligence or AI, which has grown up over the past 10-20 years with the minimum of contact with statisticians and which is now claiming to solve problems with a substantial statistical content. Several Fellows present tonight will have attended a whole-day conference on AI and statistics organized by the Society's Statistical Computing Study Group a few weeks ago.

The nature of the contrast between the AI and the statistical approaches is not easy to define with precision. One total misconception has become embodied in the title of tonight's paper. Diagnostic systems based on prior probabilities are every bit as "knowledge-based" as those relying on MYCIN-type rules; the knowledge is merely of a different nature and has to be handled in a different way. It is this distinction which may be worth pursuing a little further.

For many workers in AI, an important test-bed for proposed methodology is the game of chess, and in particular the solution of end-game problems. These problems have quite definite solutions (a given position must be a win for white, a win for black or a draw) and a good deal of expert knowledge is available - or rather, there are a lot of genuine experts about, which is not quite the same thing.
Given a particular end-game position, it should be possible to specify a set of defining features and a set of rules using them which would unfailingly lead to a correct solution. The difficulty is that the features may not be obvious, and a particular set of rules may be exceedingly complex. Moreover, human experts turn out to be bad at describing the rules which they employ largely subconsciously and which they may never consciously formulate in the normal course of events. On the other hand, human experts do two things particularly well; they are good at specifying defining features, and they are good at solving individual problems (this is why they are called "experts"). It is to me an exciting development that by using the features specified by an expert and by presenting him or her with a quite small number of problems to solve, a computer program can derive (I do not think "infer" is quite the right word) a set of rules of general applicability. Moreover, some refinement of the process, again guided by the human expert, produces not merely a set of rules which can be applied by a machine, but a set which are comprehensible by human beings.

Now, all this is to me immensely impressive, but its field of applicability seems to be quite closely defined. There is no place in the system for uncertainty; if two instances with identical features have different outcomes, the system demands that the feature-set be revised, probably by adding new items, until the ambiguity is resolved. This is quite unrealistic in fields such as that of medical diagnosis, where it is all too common for two patients with apparently identical

presentations to fall into different diagnostic categories. This is a misleading description, of course; identical presentations do not in fact occur, and what is meant is that the diagnostic expert simply cannot specify the complete list of features which would guarantee error-free diagnosis. (It is arguable that error-free diagnosis may never be possible in biological individuals such as human beings; by definition, no two biological individuals are identical, and a perfect fit to a specification is not to be expected. It is possible to draw an analogy here with the logical difficulty of significance testing, that most null hypotheses are known in advance to be false, and similarly with a weakness of the Popperian picture of science, that the data always disagree with the conjectured hypothesis to an extent which may or may not be explicable by some often ill-defined error term.)

The wish to extend rule-based systems to topics such as medical diagnosis was presumably what led to the introduction of uncertainty factors such as those described in tonight's paper. It is something of a scandal that this appears to have been done in such an amateurish fashion. As statisticians, we have a considerable knowledge-base (this seems to be the current jargon for "we know quite a lot") concerning the measurement of uncertainty in terms of odds or probabilities, the propagation of uncertainty along a chain of argument, and the elicitation of personal beliefs from domain experts in probability terms. Little of this knowledge appears to have been utilized by the leaders in the AI community. Instead, we have operationally meaningless "degrees of certainty" propagated by inappropriate formulae to outcomes whose ostensibly quantitative labels are justified only by loaded terminology. Our own community of statisticians has nothing to be proud of in allowing this situation to persist.
I do not want to claim too much for the statistical approach to classification problems. A Bayesian attack on a chess end-game problem might be interesting, and probability concepts may have to enter even this field as the increasing difficulty of the problems tackled begins to stretch the experts' knowledge to breaking point, but certainties are known to exist, and certain conclusions should be the aim. In addition, certainties often arise at the head of a chain of reasoning, and it seems artificial to cope with these by equating them with probabilities of zero or one. There is also the important issue of human comprehensibility, which lack of time forces me to leave on one side.

Is it not possible to get the best of both worlds? Tonight's authors believe that it is, and they have adopted some of the AI criteria in building their own system. I suspect that more could be done in this direction, though I am unable to confirm my suspicion directly. Looking at systems for medical diagnosis and other areas involving uncertainty, it seems to me that deductive and probabilistic reasoning are truly complementary, in that one comes logically and temporally before the other. There does exist a great deal of knowledge of anatomy, physiology and biochemistry, not to mention simple common sense, which can be used by deductive reasoning to fine down the diagnostic problem with effective certainty from the whole field of disease in general to a more or less narrow area. Evidence for this exists in the successful development of flow-charts and algorithms for medical purposes (Williams, 1982). Almost always in practice, however, the deductive rules leave open an area of uncertainty where clinical findings fail to determine unequivocal diagnoses, and at this point decent probabilistic reasoning should be able to take over.
The best example known to me (though it stops short of explicit probabilities) is the dysphagia system of Edwards (1970); published in a second-line journal and not involving the use of a computer, its influence over more than 10 years has been less than it deserves.

I refer to these views as unconfirmed suspicions because confirming them would need a more thorough knowledge of the application field - here, medical diagnosis and its allied sciences - than I possess. If progress is to be made, then the sort of collaboration between doctor and statistician exemplified by tonight's paper will be essential. It has been my privilege to hear the authors and other workers in the field presenting and discussing their findings at the workshops organised by the Royal College of Physicians' Computer Committee, the joint sponsor of tonight's meeting, and I am proud to know that doctors and statisticians can meet and work together on an equal footing, in such a way as to advance both of their professional disciplines to an extent which, with either side trying to go it alone, would not be possible. The paper we have heard is for me an example of applied statistics at its best, and I propose a hearty vote of thanks to the authors.

Professor D. M. Titterington (University of Glasgow): I have to start my contribution by echoing Professor Healy's enthusiasm both for the material in this paper and for the opportunity

it gives for wide discussion among exponents of the statistical and artificial intelligence approaches to this problem and among clinicians who may or may not support the usefulness of these techniques. I hope that this opportunity will be fully exploited.

In spite of a fairly long-term interest in discriminant analysis and its medical applications, I must confess that only recently did I become fully aware of the vast literature on the use of the computer and related methods in this field. I was struck by the comparative lack of acceptance of these methods by the medical profession. The statistical approach has been hard to put across, at least partly because it does not seem to represent the way clinicians approach diagnosis. Freemon (1972), for instance, conducted a survey among senior medical students and housemen in the USA to find out what sort of mental processes lay behind their diagnostic decisions. Out of 170 people, most used either a sequential process of elimination of possible diagnoses or a matching procedure or both. Only 2 used the sort of scoring method used in the present paper and only 1 person used a probabilistic process similar to "Independent Bayes". The recent introduction of AI techniques has been motivated partly in order to simulate the clinician's thought processes, on the understanding of course that the clinician is arguing in the right way in the first place.

I do, however, support the authors and Professor Healy in their mistrust of the rules and concepts underlying the treatment of uncertainty in, say, fuzzy analysis and confirmation theory, in contrast with the coherence of probability-based manipulations (Lindley, 1982). Does fuzzy algebra reflect the way clinicians should or even do process multivariate information on a given patient? Can certainty factors or fuzzy-set membership functions easily be elicited?
What do clinicians understand by, say, the quantity 0.7 quoted in Rule 050 in Section 3.2? It has to be admitted that interpretation of probabilities is not all that easy either, although translation into betting odds is bound to be helpful, and the facility for providing interval estimates, available from statistical theory, seems a crucial step in providing a full picture of the knowledge of a patient's condition. While strongly supporting the principle of a statistical approach, I have to admit that, from a pragmatic point of view, there may not be much to choose between philosophies, as far as results are concerned, at least whilst medical data-sets remain highly multivariate but small. For this reason variable selection is very important, and even then simple models, such as the independence model, have often proved adequate. Lachenbruch (1981) claims rarely to need more than 3 or 4 predictor variables in a discriminant analysis, and this is backed up by Titterington et al. (1981) and Section 2 of the present paper. Parsimony of this type seems to reflect good clinical practice, in which the best clinical investigators are characterized by asking only a few, but the right, questions (De Dombal, 1978). Comparative studies of statistical, AI and clinicians' attempts at diagnosis would certainly be of interest, although I do suspect that the results would often be very similar. Before closing, I would like to make a couple of technical remarks about the example in Section 4.

(i) Although much can be said in favour of the diagnostic paradigm, in which p(D | s) is modelled directly, it precludes the assessment of the atypicality of s itself, so that a patient belonging to an extraneous disease category might not easily be picked out as such. (This is another illustration of Box's (1980) point that a closed Bayesian analysis cannot criticize itself.) It seems that the measure of conflict introduced in Section 4.5 is of help in this respect.
(ii) I am a bit worried about the proposal of separate linear models for log {p(Di | s)/(1 − p(Di | s))} for each i. The usual multicategory logistic model would propose linear forms for, say, log {p(Di | s)/p(D1 | s)}, for each i > 1. The latter approach fits in nicely with familiar techniques such as linear discriminant analysis, whereas the method used here certainly does not. Has anything been done to check the model in this example? No doubt there are hardly enough data to do this.

My final thought is that the use of computer-based systems, statistical or not, is bound to increase in medical diagnosis. Undoubtedly this is enhanced by proper collaboration between statistician or expert-systems expert and clinician, and it will be easier to achieve if, as Miller (1983) remarks, the non-clinician does not presume too obviously to claim to know better than his or her medical colleague. I hope very much that there will be a wide and fruitful discussion of this paper and I am very pleased to second the vote of thanks to Drs Spiegelhalter and Knill-Jones for providing such a

flexible springboard.

The vote of thanks was passed by acclamation.

Professor A. F. M. Smith (University of Nottingham): The authors characterize a currently typical AI-based expert system as one which "encodes expert judgement in a structure intended to bear some resemblance to human cognition". But in developing an "expert system" surely we should examine more closely its ultimate purpose before deciding what constitutes an appropriate form of encoding procedure and whether we really do wish it to resemble a human cognitive process? Large areas of AI research are concerned with fundamental scientific questions about human cognition and make use of would-be descriptive computer-based modelling. This is fine, and very exciting. But it is not at all clear to me that concepts and techniques which are basic in this latter enterprise have any relevance or use in developing technological aids for human decision-making under conditions of uncertainty. This activity surely lends itself to a prescriptive approach (albeit subject to pragmatic approximation and trimming of purist doctrine). Human beings are demonstrably bad at probability and risk assessment: why on earth, when we have in mind technology rather than science, should we be interested in modelling and reproducing these human failings?

The AI criticisms of the "statistical" paradigm are basically the following:
(i) the logical structuring of a decision-problem is often far too open-ended and complicated to be constrained within the simple framework of a σ-field, based on an ultimate a priori partitioning into all possible outcomes;
(ii) the kinds and levels of preference judgements required to provide the necessary quantitative inputs are unpalatable, if not impossible, given inadequate data, etc.;
(iii) the mechanics of, and form of output from, the "statistical" inference process are unacceptable or incomprehensible to users.
I must confess that I think there is some substance in the first two of these criticisms, and that too little attention has been paid by statisticians to exploring the consequences of weakened preference axioms applied to weakened logical structures. However, the third criticism seems to me to be totally unacceptable, and amounts to saying that attempts at rational analysis should be abandoned if current aspects of the status quo (let us take the organizational and educational framework of clinical medicine as an example) have created an unfavourable environment for their acceptance. After a thorough examination of the pros and cons, the authors wisely conclude that what is required is a synthesis of the "knowledge-based" and "statistical" approaches. Perhaps they should have added, drawing on the history of the Royal College of Physicians Computer Workshop, that, to be successful, such collaboration must take place within a framework of mutual recognition and respect. The Alvey directorate has recently issued a report entitled "Intelligent Knowledge-Based Systems: A programme for action in the UK". Interest and funding in the general area of IKBS is growing fast. And yet this three-volume, two-inch thick, supposedly comprehensive report reads almost as if statistics and statisticians did not exist! It seems to me vitally important that statisticians become involved in appropriate aspects of these developments, both as creative innovators (with GLADYS as an inspiring example) and also as potential critics of some of the currently fashionable ad hoc quantitative mumbo-jumbo which may otherwise come to characterize AI-based decision-making systems. Let us, of course, be open to new ideas from our AI colleagues: but, in return, let us not be too timid about drawing their attention to the existence and relevance of statistical expertise.

Dr D. Teather (Leicester Polytechnic): The authors have provided an interesting and fair comparison of the statistical and AI approaches to the provision of aids for diagnosis. The comparison has, if anything, been too fair to the AI community. I would like to endorse their view on the need for diagnostic aids based on hard statistical data. This point has been stressed by De Dombal (1983) and in a forthcoming paper by Morton et al. (1984). A well-calibrated statistical model is essential if the diagnostic advice provided to the clinicians is to be of value, and the failure of many systems to be accepted by clinicians may be partly due to the lack of reliability of the probability statements produced. The naive Bayes approach can, if used without caution, produce numerical statements as meaningless as those of the "inference procedures" of the AI approach. Possibly more important, however, is the need to design an acceptable user interface for the clinician. This interface must be tailored to the particular medical application, and the

involvement of the clinical end-user in all stages of the design process is crucial. This approach, based on close collaboration between statistician, computer scientist and clinician, is being adopted at Leicester Polytechnic, in collaboration with the National Hospital for Nervous Diseases, London, to produce an operational system for the interpretation of CT scan images and the diagnosis of cerebral disease (Innocent, 1983). The final system will contain a battery of "help" and "explanation" facilities similar to those found in some expert systems. The diagnostic advice and explanations are, however, based on the statistical analysis of hard data, moderated by expert radiological opinion. The diagnostic advice presented to the radiologist is qualified by indicating the accuracy of previous computer predictions of the disease suggested by the statistical model. A system for diagnosis can ultimately only be classified as successful if it is both acceptable to the clinician and used by the clinician. I would therefore welcome further details from the authors concerning the acceptability of their system and, in particular, whether or not clinicians are able to interpret the summary output of Fig. 1.

Professor Wilfrid Card (Diagnostic Methodology Research Unit, Southern General Hospital, Glasgow G51 4TF): "Diagnosability". It is a common experience among clinicians that certain disease classes, certainly in their fully developed form, are easy to diagnose. Others are notoriously difficult, since they have few indicants which distinguish them from other disease classes. It is therefore reasonable to introduce the concept of "diagnosability" (Card, 1967) to express these differences, and to measure this as the mean total weight of evidence in favour of the disease among those patients with the disease.
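[Editorial sketch.] The measure just defined can be computed directly; the per-patient total weights of evidence below are invented for illustration, chosen so that the group means reproduce the figures Professor Card quotes:

```python
def diagnosability(total_weights):
    """Mean total weight of evidence among patients who have the disease."""
    return sum(total_weights) / len(total_weights)

# Invented per-patient total weights (not real data), chosen so the means
# match the 118 (peptic ulcer) and 269 (symptomatic gallstones) quoted below.
ulcer_weights = [90, 150, 80, 160, 110]
gallstone_weights = [240, 300, 267]

d_ulcer = diagnosability(ulcer_weights)            # 118.0
d_gallstones = diagnosability(gallstone_weights)   # 269.0
```

On this scale symptomatic gallstones come out more than twice as diagnosable as peptic ulcer, matching the comparison Card draws.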
This slide shows the distribution of total weights of evidence in 305 peptic ulcer patients and the distribution in 57 patients with symptomatic gallstones. The mean weight of evidence for the ulcer patients is 118 and for the patients with symptomatic gallstones is 269. In these terms we might say that the latter disease is more than twice as diagnosable. This finding is consistent with clinical experience. As an erstwhile "expert" in gastroenterology, though I would never have used the term, I should emphasize that the work of Dr Spiegelhalter in measuring the diagnostic power of various indicants yields entirely new knowledge. Not only do the "experts" not know the value of the symptoms and signs they elicit, but they cannot know them without measurement. This requires some form of prospective study, without which the diagnostic system of any "expert" will always be sub-optimal.

Professor J. W. Tukey (Princeton University): After only a short time to read the text, I have but three points to make. First, paper and discussion raise the question "Should statisticians be rational or empirical?" Since we must be empirically sound when we finish, this translates into "Is naive rationality the best step toward the empirical?", to which the answer has to be an empirical one. Thus I am really pleased by the authors' emphasis on calibration and evaluation. We can hope that some AI-based procedures will soon be examined to see how well their results can be calibrated in predictive probability terms. The use of "Idiot's Bayes" is a natural start, but surely no better than very naive rationality. The crucial test involves its functioning, not its suggested rationality. Second, I note that, in some circumstances, "missing" should not be scored zero, since there are cases where it reflects a clinical judgement. In view of the small fraction of missing data in GLADYS, however, it is not likely to be feasible to assess non-zero scores for many indicants.
Third, the main novel thrust of the paper seems to be on communication, a well-chosen emphasis. The authors' Fig. 1 approaches the level of a semigraphic display; it seems reasonable to me to consider and experiment with going on to a more fully graphic one, as illustrated in Fig. D1. Besides ordering the individual scores upward from − to +, subject to the segregation of abdominal pain, this proposal includes, at the very top, distributional information such as that illustrated by Professor Card in his remarks. [Added in writing.] On reflection, two further points deserve notice. Fourth, the authors may find helpful a forthcoming paper by Landwehr, Pregibon and Shoemaker on "Graphical methods for assessing logistic regression models", to appear in the Journal of the American Statistical Association. Finally, the whole area seems to be one in which "leave-out-one" cross-validation and "leave-out-two" estimation of cross-validation sampling errors (cf. Mosteller and Tukey, 1977, Chapter

[Fig. D1. Proposed fully graphic display: a probability scale (5%, 10, 20, 30, 50, 70, 80, 90, 95%) with the final score and the individual indicant scores plotted along it. Indicant labels legible in the original include, under "Abdominal pain": Present; Point at site; Episodic; Epigastric; Relieved by food; Occ. woken, and relieved by snack; and, separately: Vomits, then eats within 3 hours; Smoker; Family history of ulcer; No previous operation for ulcers; No seasonal effect on pain.]
