Criticizing Conditional Probabilities in Belief Networks

David J. Spiegelhalter
Medical Research Council Biostatistics Unit, Cambridge, UK

Nomi Harris
Laboratory for Computer Science, MIT, Cambridge, MA 02139

Kate Bull and Rodney C.G. Franklin
Great Ormond Street Hospital for Sick Children, London, UK
Abstract
In order to construct a Bayesian belief network for a medical domain, a large number of conditional probabilities must be obtained. We investigated the following issues regarding these probabilities: (1) How accurate are subjective probabilities provided by physicians? (2) How can we use imprecision in subjective probabilities to our advantage? (3) How can the probabilities be improved as we observe new cases of the disease being studied? (4) How important are the probabilities, as compared with the actual structure of the network? We conducted preliminary experiments in the domain of congenital heart disease to address these questions. We found that combining physicians' subjective probabilities with data from actual cases can improve predictive ability, but it is likely that the success of a diagnostic program based on belief networks is ultimately limited by the structure of the network.
Introduction
Many computer programs for medical diagnosis are based on Bayesian reasoning (e.g., [5], [10]). These programs tend to have several drawbacks. One common limitation is that simple two-level Bayesian networks are unable to capture the conditional dependencies between symptoms. Another problem is that it can be difficult to obtain the large numbers of conditional probabilities that are needed, and to verify the accuracy of these probabilities. This task can be difficult to separate from validating the structure of the model on which the program is based. We decided to separate the issue of structure from the issue of probabilities by using a fixed, overly simplified Bayesian network to represent the dependence of diseases and findings in congenital heart disease.
0195-4210/90/0000/0805$01.00 © 1990 SCAMC, Inc.
We asked several pediatric cardiologists to specify imprecise conditional probabilities, and used statistical scoring methods to check the accuracy of these probabilities. We then merged these subjective probabilities with data obtained from cases, and showed that the combined probabilities yielded more correct diagnoses than either set alone. The success of the diagnoses was, however, limited by the inherent inadequacy of the belief network that we used. Our intent in these experiments was not to produce a useful diagnostic tool, but rather to investigate methods of criticizing and improving the conditional probabilities required by a belief network model.
Congenital Heart Disease
Congenital heart disease is suspected in infants who exhibit cyanosis (blueness) or heart failure (breathlessness) shortly after birth. Infants with suspected congenital heart disease must be transported quickly to a specialized referral hospital, such as Great Ormond Street Hospital (GOS) in London, where they can be examined and treated by experienced pediatric cardiologists. The survival rate of the infants depends heavily on the immediate care they receive before being brought to GOS, so it is important for physicians at the referring hospital to make a quick diagnosis before transporting the infants. This diagnosis is usually accomplished by an over-the-telephone consultation. At GOS, 24 questions are asked, each of which has between two and five possible responses. Table 1 shows some of the questions and responses. Based on the responses to these questions, the cardiologist decides which of 27 diseases the infant is most likely to have. After the infant is brought to GOS, a definitive diagnosis is made with the aid of echocardiogram or cardiac catheterization. This diagnosis is used as the gold standard when assessing the performance of other diagnostic methods.
Table 2: Examples of ranges given by pediatric cardiologists for P(F | D)

Question / Response        Non-urgent heart disease    Hypoplastic left heart
Main problem:
  Cyanosis                 0-0%                        5-10%
  Heart failure            0-0%                        90-95%
  Asymptomatic murmur      100-100%                    1-2%
  Arrhythmia               0-0%                        0-0%
  Other                    0-0%                        0-0%
Grunting?
  Yes                      5-10%                       30-40%
  No                       90-95%                      60-70%
Table 2 shows some of the probability ranges supplied by the cardiologists. The cardiologists were sometimes very precise; for example, they were certain that the main problem in non-urgent heart disease is always asymptomatic murmur. Other subjective probabilities were assigned larger ranges, such as whether grunting will be exhibited in cases of hypoplastic left heart. Data for 200 cases were collected, enabling us to evaluate the accuracy of the cardiologists' estimates. Table 3 shows the portion of the data relevant to Table 2. In most cases, there was good agreement between the consultants' estimates and the observed cases. A notable exception is the main problem in non-urgent heart disease; the actual findings were much more diverse than the cardiologists' subjective probability assessments suggested. One possible explanation for this discrepancy is that referring doctors exaggerate the severity of the symptoms in order to increase the chance of passing responsibility for the child to the central hospital [9].
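As a rough illustration of this comparison, the sketch below checks whether the observed frequencies reported in Table 3 fall inside the ranges reported in Table 2. This is only a crude in-range check written for illustration, not the statistical scoring used in the experiments; the function name `within_range` and the data layout are assumptions.

```python
def within_range(count, total, low_pct, high_pct):
    """Return True if count/total lies inside [low_pct, high_pct] percent."""
    # Integer arithmetic avoids floating-point edge cases at the range ends.
    return low_pct * total <= 100 * count <= high_pct * total


# (disease, response): (observed count, cases recorded, low %, high %)
# Counts are taken from Table 3, ranges from Table 2.
main_problem = {
    ("N.U.H.D.", "Asymptomatic murmur"): (11, 21, 100, 100),
    ("H.L.H.", "Cyanosis"): (2, 20, 5, 10),
    ("H.L.H.", "Heart failure"): (18, 20, 90, 95),
}
grunting_yes = {
    "N.U.H.D.": (0, 21, 5, 10),
    "H.L.H.": (7, 19, 30, 40),
}

agreement = {key: within_range(*row) for key, row in main_problem.items()}
agreement_grunting = {key: within_range(*row) for key, row in grunting_yes.items()}

for key, ok in {**agreement, **agreement_grunting}.items():
    print(key, "inside range" if ok else "outside range")
```

With so few cases per disease, an observed frequency can fall outside a range by chance alone, which is one reason a proper scoring rule is preferable to a simple check of this kind.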
Table 1: Some questions used in over-the-phone diagnosis of congenital heart disease

1. Sex (M/F)
2. Cyanosis? (Y/N)
3. Heart failure? (Y/N)
4. Birth asphyxia? (Y/N)
5. Main problem (Cyanosis, Heart failure, Asymptomatic murmur, Arrhythmia, Other)
6. Grunting? (Y/N)
7. Heart rate (200)

The need for quick over-the-telephone diagnosis suggests that a computerized approach would be advantageous. Franklin et al. [4] designed a flow-chart algorithm for diagnosing congenital heart disease. The next stage of the project is to construct a diagnostic system based on a belief network. Toward this end, we performed several experiments exploring issues relating to the conditional probabilities required by a belief network. We collected these probabilities from domain experts, assessed their accuracy, and investigated a method for combining the experts' probabilities with case data.
Obtaining Subjective Conditional Probabilities
We asked three pediatric cardiologists to assess imprecise conditional probabilities for each of the findings given each of the diseases (which were assumed to be mutually exclusive and exhaustive). Rather than requesting specific point probabilities, we asked the consultants to specify a range for each conditional probability. They were encouraged to give ranges expressing their doubts, rather than being unrealistically precise, but were not told a specific interpretation for the probability ranges. The cardiologists who participated in this experiment came from a unit in which numerical assessment of risks is commonplace.
Figure 1: A simple Bayesian network for congenital heart disease
Table 3: Data recorded for 200 cases

Question / Response        N.U.H.D.      H.L.H.
Main problem:
  Cyanosis                 2 (10%)       2 (10%)
  Heart failure            5 (24%)       18 (90%)
  Asymptomatic murmur      11 (52%)      0 (0%)
  Arrhythmia               0 (0%)        0 (0%)
  Other                    3 (14%)       0 (0%)
Grunting?
  Yes                      0 (0%)        7 (37%)
  No                       21 (100%)     12 (63%)

Table 4: Implicit sample sizes (with k = 1)

Question / Response        N.U.H.D.      H.L.H.
Main problem:
  Cyanosis                 0             8
  Heart failure            0             99
  Asymp. murmur            999           1
  Arrhythmia               0             0
  Other                    0             0
Grunting?
  Yes                      8             31
  No                       582           59

A simple belief network for diagnosis
We constructed an overly simplified network, shown in Figure 1, representing the conditional dependencies between findings and diseases. The usual simplifying assumptions were made: the findings are assumed conditionally independent given the disease, and the patient is assumed to have exactly one of the possible diseases.
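Under these two assumptions, the posterior probability of each disease is proportional to its prior multiplied by the conditional probabilities of the observed findings. The following is a minimal sketch of that calculation, not the program used in the experiments. The function name `posterior`, the data layout, and the equal priors are hypothetical; the conditional probabilities are the midpoints of the grunting ranges in Table 2.

```python
def posterior(prior, cond_prob, findings):
    """Posterior P(D | findings) assuming exactly one disease is present
    and findings are conditionally independent given the disease."""
    unnormalized = {}
    for disease, p_disease in prior.items():
        weight = p_disease
        for question, response in findings.items():
            # Multiply in P(F | D) for each observed finding.
            weight *= cond_prob[disease][question][response]
        unnormalized[disease] = weight
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("findings have zero probability under every disease")
    return {d: w / total for d, w in unnormalized.items()}


# Hypothetical equal priors over two of the 27 diseases (illustration only).
prior = {"N.U.H.D.": 0.5, "H.L.H.": 0.5}
# Midpoints of the ranges given in Table 2 for the grunting question.
cond_prob = {
    "N.U.H.D.": {"Grunting": {"Yes": 0.075, "No": 0.925}},
    "H.L.H.": {"Grunting": {"Yes": 0.35, "No": 0.65}},
}

result = posterior(prior, cond_prob, {"Grunting": "Yes"})
print(result)
```

A conditional probability of exactly zero rules a disease out regardless of the other findings, which is one reason very extreme estimates deserve scrutiny.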
Combining Subjective Probabilities with Data
Many expert systems rely on domain experts for subjective probabilities. Other statistically-oriented systems (such as [10]) derive conditional probabilities by tabulating previously collected data. De Dombal and his colleagues [5] tried both approaches, but did not attempt to combine them. We felt that using both expert knowledge and statistical data would be superior to using either source alone. Using the physicians' assessments in conjunction with data is more reliable than using only the physicians' probabilities. If the physicians' estimates are incorrect, they will gradually be corrected as data are incorporated. Combining the two sources is more flexible than using only data, because diagnoses can be made even for rare diseases for which there is insufficient statistical data. Finally, our method of combining subjective probabilities with data allows us to take advantage of the imprecision in the probability estimates, rather than regarding this imprecision as a liability.
Converting probability ranges to implicit samples
In order to combine subjective probabilities with case data, we chose to unify these two sources of information by converting the imprecise probability ranges supplied by the consultants to implicit data samples. Once this conversion has been performed, the original subjective probabilities can be treated uniformly with any data that arrive, yielding combined conditional probabilities.

Intuitively, the smaller the probability range provided by a physician, the more certain the physician is about that particular conditional probability. This high confidence could result from the physician having seen enough cases of the disease in question to have a clear mental picture of the typical manifestations. Consequently, our formula for converting subjective probabilities to implicit data makes smaller probability ranges into larger implicit samples.

The conversion equation is derived by modeling the probability distribution as a Dirichlet random variable (which is a beta distribution if there are only two possible responses to a question). We assume that each probability range represents k standard deviations of this prior distribution [9]. If k = 1, we are essentially assuming that the odds that the true probability lies within the specified range are approximately 2 to 1. If a probability range is (a, b) with midpoint m, then the implicit sample size D is calculated as follows:

$$D = \frac{k^2 \, m(1 - m)}{\left(\frac{b - a}{2}\right)^2}$$

This produces intervals that span the midpoint m plus or minus k standard deviations. Table 4 shows the implicit sample sizes that would be calculated from the probability ranges provided by the cardiologists, setting k to be 1. To avoid dividing by zero, probabilities of zero were replaced by a small value ε.

The parameter k allows us to control how much weight is allocated to the consultants' probability assessments. If k is set very low, we are effectively ignoring the consultants' opinions and relying almost entirely on the data. If k is high, we regard the experts' probabilities as "carved in stone," and ignore any data. By setting k to an intermediate value, we can use both sources of information.
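The conversion from a probability range to an implicit sample can be sketched as follows. This is a minimal sketch under two stated assumptions: the range (a, b) is taken to span the midpoint plus or minus k standard deviations, and the variance of a proportion backed by D implicit observations is approximated by m(1 - m)/D. The function name `implicit_sample_size` is hypothetical, and zero-width ranges are simply rejected here rather than handled with a small ε.

```python
def implicit_sample_size(a, b, k=1.0):
    """Implicit sample size for a probability range (a, b).

    Assumes the half-width (b - a) / 2 equals k standard deviations and
    that the variance of the proportion is roughly m * (1 - m) / D.
    """
    if not 0.0 <= a < b <= 1.0:
        raise ValueError("need 0 <= a < b <= 1")
    if k <= 0:
        raise ValueError("k must be positive")
    m = (a + b) / 2.0            # midpoint of the range
    sd = (b - a) / (2.0 * k)     # implied standard deviation
    return m * (1.0 - m) / sd ** 2


# Grunting in hypoplastic left heart: "Yes" was given the range 30-40%.
d = implicit_sample_size(0.30, 0.40, k=1.0)
yes_count = d * 0.35             # implicit "Yes" observations
no_count = d * 0.65              # implicit "No" observations
print(round(d, 2), round(yes_count, 2), round(no_count, 2))
```

Under these assumptions the 30-40% range corresponds to roughly 91 implicit cases, split into about 32 "Yes" and 59 "No", which is close to the 31 and 59 shown in Table 4. Halving the width of the range, or doubling k, multiplies the implicit sample size by four.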
Table 5: Example of learning from data

Question / Response        Before        New case      After
Main problem:
  Cyanosis                 8 (7%)        observed      9 (8%)
  Heart failure            99 (92%)      -             99 (91%)
  Asymp. murmur            1 (1%)        -             1 (1%)
  Arrhythmia               0 (0%)        -             0 (0%)
  Other                    0 (0%)        -             0 (0%)
Grunting?
  Yes                      31 (35%)      -             31 (34%)
  No                       59 (65%)      observed      60 (66%)
Learning from data
Converting the consultants' probabilities to implicit data suggests a natural mechanism for "learning" from actual cases: each incoming case can be combined with the previously collected information to yield updated conditional probabilities. This updating is a basic application of Bayesian statistical theory [1]. After each case, we can use the updated probabilities to make a prediction about what will be seen in the next case; this is known as the prequential approach [2]. In this way, the conditional probabilities gradually evolve to reflect both sources of information. Table 5 shows how the probabilities would be updated if we saw a new case of hypoplastic left heart in which the main problem is cyanosis and grunting is not present.

We tested the effectiveness of three different "learning modes." These modes are selected by varying k, the number of standard deviations that the probability ranges are assumed to be equivalent to. If k → ∞, we have "no learning"; incoming data are ignored. If k = 0, we are "learning without priors," starting from a uniform probability distribution in which all responses are considered equally likely. If k = 1, we have "learning with priors," where the priors are derived from the cardiologists' subjective probability ranges. In order to compare these learning modes, we needed a way to evaluate the accuracy of subjective probabilities.
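The update illustrated in Table 5 amounts to adding one to the implicit count of the observed response and renormalizing. The sketch below reproduces the "Main problem" rows of Table 5 for hypoplastic left heart; the function names `update` and `predictive` are hypothetical, and the integer counts are the rounded values printed in the table rather than the exact implicit samples.

```python
def update(counts, observed):
    """Return new counts after observing one case with the given response."""
    new_counts = dict(counts)    # leave the caller's counts unchanged
    new_counts[observed] += 1
    return new_counts


def predictive(counts):
    """Predicted probability of each response for the next case."""
    total = sum(counts.values())
    return {response: c / total for response, c in counts.items()}


# Implicit sample for "Main problem" in hypoplastic left heart (Table 5).
before = {
    "Cyanosis": 8,
    "Heart failure": 99,
    "Asymp. murmur": 1,
    "Arrhythmia": 0,
    "Other": 0,
}
after = update(before, "Cyanosis")

p_before = predictive(before)
p_after = predictive(after)
print({r: round(100 * p) for r, p in p_before.items()})
print({r: round(100 * p) for r, p in p_after.items()})
```

Because the implicit sample already holds 108 cases, a single new observation shifts each probability by about one percentage point, which matches the small changes shown in Table 5.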
Criticizing Subjective Probabilities
There are a number of scoring methods that can be used to assess the quality of subjective probabilities (see [9] and [3] for a more complete discussion). The most frequently used scoring rule is the log score, which allows a subjective probability distribution to be assessed on the basis of a single observed case.
Figure 2: Learning with No Priors: average log score per question, averaged over every 7 consecutive patients (horizontal axis: number of cases)
Log scores punish predictions that assign low probabilities to events that occur. The lower the predicted probability of the event that occurred, the higher (i.e., worse) the log score will be. This scoring rule is strictly proper in the sense that the assessor's expected penalty is minimized by stating the probabilities that he or she believes to be true. Suppose a probability vector $p = (p_1, p_2, \ldots, p_k)$ has been assigned to a variable E which can take on one of k values (here k denotes the number of possible values, not the weighting parameter used earlier). The vector $e = (e_1, e_2, \ldots, e_k)$ specifies which response was observed. If response r is the value that is found to be true, then $e_r = 1$ and $e_i = 0$ for all $i \neq r$. The log score is then $-\log(p_r)$ [7].
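The log score for a single observed response can be computed as in the following minimal sketch. The function name `log_score` is hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated in the text.

```python
import math


def log_score(probabilities, observed):
    """Log score -log(p_r) for the response r that was observed.

    Lower is better; a response that was given probability zero
    receives an infinite penalty.
    """
    p = probabilities[observed]
    if p == 0:
        return math.inf
    return -math.log(p)      # natural log (assumed base)


# Predicted distribution for grunting in hypoplastic left heart (Table 5).
forecast = {"Yes": 0.35, "No": 0.65}
score_no = log_score(forecast, "No")      # the more probable response
score_yes = log_score(forecast, "Yes")    # the less probable response
print(round(score_no, 3), round(score_yes, 3))
```

The less probable response receives the larger penalty, and a forecast that was certain and correct would score zero.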
Results and Discussion
We compared the diagnostic performance of the three learning modes by testing them on data collected from 200 patients. As would be expected, Learning with No Priors exhibited the most rapid improvement (see Figure 2). The graph is not a straight line because when a rare disease is encountered, its score will be high, even if we have already seen many cases of more common diseases. If enough cases were seen, Learning with No Priors would eventually converge with Learning with Priors.

Learning with Priors performed slightly better than No Learning (Figure 3). The improvement was most significant for diseases that were encountered frequently. We found an improvement of approximately 3% for each 10 patients added. In order to make a more substantial impact on the priors, a larger number of cases than we had available would be required.

Figure 3: Ratio of score without learning over score with learning, plotted by number of cases of disease

Table 6 compares the diagnostic performance of various approaches. The simple Bayesian network along with Learning with Priors was as accurate as the junior physicians. The flowchart algorithm, however, did even better. The reason for this difference in performance, and for the relatively small difference in performance between Learning with Priors and No Learning, is most likely that the conditional probabilities are less important than the structure of a model.

Table 6: Comparison of diagnostic performance

Source of diagnosis                          % Correct
Junior physicians                            61%
Flowchart                                    72%
Naive network, Learning with No Priors       48%
Naive network, No Learning                   60%
Naive network, Learning with Priors          61%

Conclusions and Future Work
Our experiments have shown that cardiologists' estimates of P(F | D) tend to be fairly accurate, although they tend to be slightly too extreme. A naive Bayesian approach to diagnosis, known to have false assumptions, performed as well as junior physicians, which suggests that a more realistic model of the domain would lead to even better results. We were able, to some extent, to improve predictive ability by combining the cardiologists' probability estimates with actual case data; this allowed us to take advantage of the imprecision in these estimates. The performance of the program was most likely limited not by the inaccuracy of the probabilities, but by the unrealistic model.

Extensions in progress include the exploration and comparison of scoring methods, testing the learning method with more case data, and criticizing the structure of the network as well as the probabilities. Other future goals include the development of an "alarm" to immediately detect anomalous cases, and a program to suggest treatments as well as diagnoses.

References
[1] G. E. P. Box and G. Tiao. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, MA, 1973.
[2] A. P. Dawid. Statistical theory: The prequential approach. J. Royal Stat. Soc. 147:277-305, 1984.
[3] A. P. Dawid. Probability forecasting. From Kotz and Johnson, eds., Encyclopedia of Statistical Sciences, 7:210-218, J. Wiley, New York, 1986.
[4] R. C. G. Franklin, D. J. Spiegelhalter, F. Macartney, and K. Bull. Combining clinical judgements and statistical data in expert systems: Over the telephone management decisions for critical congenital heart disease in the first month of life. Intl. J. Clinical Monitoring and Computing, 6:157-166, 1989.
[5] D. J. Leaper, J. C. Horrocks, J. R. Staniland, and F. T. deDombal. Computer-assisted diagnosis of abdominal pain using "estimates" provided by clinicians. British Medical Journal, 4:350-354, 1972.
[6] S. L. Lauritzen and D. J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, B50, 1988.
[7] A. R. Shapiro. The evaluation of clinical predictions: A method and initial application. New England Journal of Medicine, 296:1509-1514, 1977.
[8] D. J. Spiegelhalter, R. C. G. Franklin, and K. Bull. Assessment, criticism and improvement of imprecise subjective probabilities for a medical expert system. Proceedings of the Workshop on Uncertainty in Artificial Intelligence, 335-342, Windsor, Ontario, 1989.
[9] D. J. Spiegelhalter, N. L. Harris, R. C. G. Franklin, and K. Bull. Empirical evaluation of prior beliefs about frequencies. (In preparation).
[10] H. R. Warner, A. F. Toronto, and L. G. Veasy. Experience with Baye's [sic] theorem for computer diagnosis of congenital heart disease. Annals N.Y. Acad. Sci, 115:558-567, 1964.