Expertise and Error in Diagnostic Reasoning

COGNITIVE SCIENCE 5, 235-283 (1981)

Expertise and Error in Diagnostic Reasoning*

PAUL E. JOHNSON, ALICIA S. DURAN, FRANK HASSEBROCK, JAMES MOLLER, MICHAEL PRIETULA
University of Minnesota

PAUL J. FELTOVICH
University of Pittsburgh

DAVID B. SWANSON
University of Rochester

An investigation is presented in which a computer simulation model (DIAGNOSER) is used to develop and test predictions for behavior of subjects in a task of medical diagnosis. The first experiment employed a process-tracing methodology in order to compare hypothesis generation and evaluation behavior of DIAGNOSER with individuals at different levels of expertise (students, trainees, experts). A second experiment performed with only DIAGNOSER identified conditions under which errors in reasoning in the first experiment could be related to interpretation of specific data items. Predictions derived from DIAGNOSER's performance were tested in a third experiment with a new sample of subjects. Data from the three experiments indicated that (1) form of diagnostic reasoning was similar for all subjects trained in medicine and for the simulation model, (2) substance of diagnostic reasoning employed by the simulation model was comparable with that of the more expert subjects, and (3) errors in subjects' reasoning were attributable to deficiencies in disease knowledge and the interpretation of specific patient data cues predicted by the simulation model.

*The research reported here has been funded by grants to the first author from (1) the Graduate School at the University of Minnesota, (2) the Center for Research in Human Learning at the University of Minnesota, (3) the National Institute for Child Health and Human Development (T36-HD-07151 and HD-01136), (4) the National Science Foundation (NSF/BNS-77-22075), (5) the University of Minnesota Consulting Group on Instructional Design, and (6) the Dwan Family Fund in the University of Minnesota Medical School. Portions of the research were presented to the Sixth Annual Workshop on Artificial Intelligence in Medicine, Stanford University, August, 1980.
We would like to express our appreciation to staff and students in the Department of Pediatrics in the University of Minnesota Medical School who have generously given their time and thoughts to this research.

The nature of expert-novice differences in problem solving has been of interest to psychologists and computer scientists alike. In areas as diverse as chess and medicine, the prevailing view seems to be that experts and novices are similar in the type and frequency of concepts employed in generating a problem solution (what we shall term "form of reasoning"), but differ in the appropriate use of these concepts as well as the success of the eventual solution (what we shall term "substance" or "content of reasoning") (e.g., Chase & Simon, 1973, in chess; Elstein, Shulman, & Sprafka, 1978, in medical problem solving). Recent analyses of expert problem solving have focused upon the role of memory in generating successful problem solutions (Greeno, 1980; Larkin, 1980; Newell, 1980; Simon, 1978, 1980b). Skilled problem solvers in a technical domain such as medicine are presumed to have (1) a rich vocabulary of knowledge for interpreting cues and other relevant problem solving information, (2) an organization of this knowledge that permits ready access at levels of detail appropriate for the data of the problem, and (3) extensive cross referencing should an initial access prove unproductive (Simon, 1976, 1980b).

The model of memory most often used to describe expert problem solving is one in which there is stored a large number of recurring patterns (schemata) or prototypical combinations of data (combinations of pieces in chess, combinations of patient signs and symptoms in medicine) that are familiar and can be recognized. The organization of these patterns is hierarchical, with the more general patterns at the top and the more specific ones at the bottom (Anderson, 1980; Mandler, 1979). Access to a particular pattern or prototype occurs at multiple levels in the hierarchy and can be cued by other patterns (pattern-to-pattern links) and data (data-to-pattern links) (Bobrow & Norman, 1975; Reed, 1978; Rumelhart & Ortony, 1977; Wortman & Greenberg, 1971).
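This memory model can be made concrete with a minimal, purely illustrative sketch of hierarchically organized schemata carrying data-to-pattern and pattern-to-pattern links. All class names and cues below are hypothetical; they are not drawn from the paper's materials or from any actual implementation:

```python
# Hypothetical sketch of a hierarchical schema memory: schemata are cued
# directly by data (data-to-pattern links) or by other activated schemata
# (pattern-to-pattern links), at any level of the hierarchy.

class Schema:
    def __init__(self, name, expected_cues):
        self.name = name
        self.expected_cues = set(expected_cues)  # data-to-pattern links
        self.children = []                       # more specific patterns below
        self.linked = []                         # pattern-to-pattern cross references

def walk(schema):
    """Yield a schema and, recursively, every schema below it."""
    yield schema
    for child in schema.children:
        yield from walk(child)

def activated(root, data_cues):
    """Collect every schema activated directly by matching data or
    indirectly through a cross reference from an activated schema."""
    frontier = [s for s in walk(root) if s.expected_cues & data_cues]
    hits = set()
    while frontier:
        schema = frontier.pop()
        if schema.name not in hits:
            hits.add(schema.name)
            frontier.extend(schema.linked)   # follow pattern-to-pattern links
    return hits
```

With a toy hierarchy in which an "ASD" schema cross-references "PAPVC", a single datum that matches ASD also brings PAPVC under consideration even though no datum points at PAPVC directly, which is the sense in which cross referencing rescues an unproductive initial access.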
According to this model, patterns are established in the form of expectations to be matched against external data, so that a particular configuration of data can be interpreted as an instance of a given pattern or schema. Unfamiliar experiences are interpreted by building new schemata, a process that occurs either through a data-driven form of processing in which commonalities are abstracted from recurring patterns of cues, or through a more conceptually driven form of processing in which higher-level rules are used to derive patterns that can be matched with unfamiliar data (Anderson, Kline, & Beasley, 1979; Rumelhart & Norman, 1977). In medicine, disease knowledge is represented in memory in the form of schemata or templates which specify, for a particular disease, the set of clinical manifestations that a patient with that disease should present clinically (Pople, 1977; Rubin, 1975). For the novice, these schemata are classically centered due to the cases employed in initial training material (e.g., Moller, 1973) as well as the probability distribution of diseases that appears in most teaching hospitals. While a focus upon the most likely or usual forms of a disease enables the novice to establish basic categories or anchors in memory for subsequently building a detailed vocabulary of disease knowledge (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976), it also leads to an emphasis in diagnostic reasoning upon thinking only of "classical cases." As a consequence of the limited variation in disease models presented in medical training and the relatively limited experience with patients having any given disease, the novice's memory schemata of individual diseases have an internal structure that is fairly imprecise. Such imprecision can lead to errors in diagnostic reasoning of three types. First, novices may overestimate the allowable range of variation for the findings of a given disease. The resultant error, when it occurs, takes the form of not recognizing that observed findings are at odds with those that should be true for the disease in question. This leads to a failure to reject inappropriate candidate diseases. The second type of error that may occur in diagnostic reasoning is a result of expectations for a given disease that are too specific. This error, when it occurs, takes the form of rejecting the correct disease as a possible diagnosis when in reality the findings are within the allowable limits of the disease. A third type of error made by novices is simply not to think of the correct diagnosis. This type of error can result from a lack of initial access links between various cues in patient data and the relevant hierarchy of disease models in memory. This type of error can also be a consequence of the lack of extensive cross referencing among different disease schemata in memory (Chi, Feltovich, & Glaser, 1980; Elstein, Loupe, & Erdmann, 1971), so that when one possibility is rejected, a more plausible one cannot be considered.

In contrast to the novice, disease knowledge in the expert is both precise and richly detailed. Through clinical experience, the internal structure of expert models of disease is "tuned" (Rumelhart & Norman, 1977) to the natural variation in findings. Such tuning generally allows the expert to properly interpret findings for a case that novices do not.
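The first two error types both reduce to the precision of a schema's allowable ranges, which can be illustrated with a toy consistency check. The finding, its value, and all of the ranges below are invented for illustration only:

```python
# Toy illustration of the two expectation-precision errors: a disease stays
# a candidate only if every observed finding falls inside the schema's
# allowable range for that finding. All numbers are invented.

def consistent(schema, findings):
    """True if no observed finding violates the schema's expected range."""
    return all(lo <= findings[f] <= hi
               for f, (lo, hi) in schema.items() if f in findings)

findings = {"oxygen_saturation": 78}             # a clearly abnormal value

# Error type 1: range too wide -> fails to reject an inappropriate disease.
novice_asd   = {"oxygen_saturation": (70, 100)}  # overestimated variation
expert_asd   = {"oxygen_saturation": (92, 100)}  # tuned to real variation

# Error type 2: range too narrow -> rejects the correct disease.
novice_tapvc = {"oxygen_saturation": (60, 75)}   # expectations too specific
expert_tapvc = {"oxygen_saturation": (60, 90)}
```

Under these toy ranges, the novice schema for the wrong disease survives the abnormal finding while the novice schema for the correct disease is rejected, reproducing both error patterns at once.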
Because of additional training as well as extensive experience, the expert also has a hierarchy of disease knowledge that is well organized and extensively differentiated into a number of disease variants which present themselves differently due to contrasts in underlying pathophysiology, severity, or patient age (Reed, 1978; Wortman, 1972). Expert error can be due to the fact that the patient data in a given case are not adequate to trigger the appropriate disease schemata, so that even though the correct disease model exists in memory, it is never considered. Expert error can also occur even though the correct disease has been generated, if none of the patient data items match the individual's expectations for the disease. This latter type of error, which is one of hypothesis evaluation, is sometimes overcome when a combination of several weaker cues is recognized as equivalent to a few stronger ones.

The representation of disease knowledge is facilitated in areas of medicine in which the major disease states can be described in terms of gross features of anatomy and physiology. Knowledge of congenital heart diseases, in particular, lends itself to empirical study by cognitive scientists because virtually all forms of such diseases can be represented in terms of anatomic or physiological abnormalities within the heart and cardiovascular system (e.g., holes in the heart septa, tight valves, or electrical conduction problems). These basic abnormalities alter the flow, pressure, or resistance patterns of the system and produce the patient manifestations (signs, symptoms, laboratory test results) that the physician must use in diagnosis.

The specific disease chosen for study in the present investigation was the cardiac anomaly, Total Anomalous Pulmonary Venous Connection (TAPVC). This disease can best be understood by reference to the major components of the normal heart. Figure 1 shows schematically the normal heart and other major components of the cardiovascular system. Starting on the right side of the heart, the right ventricle (RV) of the heart pumps blood across the pulmonary valve (PV), through the pulmonary artery (PA), and into the lungs where the blood receives oxygen. Blood then returns to the heart via the pulmonary veins (PVn) into the left atrium (LA). Oxygenated blood proceeds from the left atrium across the mitral valve (MV) into the left ventricle (LV), where it is pumped across the aortic valve (AV), through the aorta (Ao), and to the body. Oxygen is extracted in the body from the blood which then flows back to the right atrium (RA) of the heart via the vena cavae (VC). Deoxygenated blood from the right atrium flows across the tricuspid valve (TV) into the right ventricle and the cycle repeats. The "upper" chambers of the heart, the atria, are normally separated by the atrial septum, while the "lower" chambers, the ventricles, are normally separated by the ventricular septum. In cases of TAPVC, blood which ordinarily flows into the left atrium of the heart from the pulmonary veins flows instead into the right atrium because of a congenital defect in which the pulmonary veins are attached to the vena cavae (as shown in Figure 1) or directly to the right atrium.
Because of this anomaly, the oxygen-rich blood returning from the lungs must reach the left ventricle, and therefore the systemic circulation, by flowing first through an abnormal hole between the right and left atrium. The amount of oxygen in the systemic circulation is reduced because the oxygen-rich blood from the lungs mixes with oxygen-poor blood from the systemic return in the right atrium. As a result, patients with TAPVC manifest weak cyanosis, i.e., blueness of the extremities, trunk, and lips. The three most commonly confused alternatives to TAPVC, all of which have substantially overlapping physical findings, are (Moss, Adams, & Emmanouilides, 1977): (1) Partial Anomalous Pulmonary Venous Connection (PAPVC). One or two of the four pulmonary veins connect abnormally to the right atrium in PAPVC; the remainder connect normally to the left atrium (Figure 1). Although some oxygenated blood mixes with the oxygen-poor blood in the right atrium, enough oxygen-rich blood is delivered directly to the left atrium and hence to the body so that patients with this disease are not cyanotic. (2) Endocardial Cushion Defect (ECD). In ECD, an abnormal hole exists in the lower part of the septum which separates the right and left atrium (Figure 1). Some oxygenated blood is shunted from the left atrium to the right atrium through this defect.

Figure 1. Normal and abnormal conditions of the cardiovascular system. (Panels: Normal Heart, TAPVC, PAPVC, ASD/ECD. Legend: Ao = Aorta; AV = Aortic Valve; LA = Left Atrium; LV = Left Ventricle; MV = Mitral Valve; PA = Pulmonary Artery; PV = Pulmonary Valve; PVn = Pulmonary Veins; RA = Right Atrium; RV = Right Ventricle; TV = Tricuspid Valve; VC = Vena Cavae.)

However, enough oxygenated blood flows from the left atrium into the left ventricle and, hence, to the body so that the patient is acyanotic. (3) Atrial Septal Defect (ASD). The blood flow in this condition is similar to ECD since there is a single hole between the atria causing a left-to-right shunting of oxygenated blood. The defect is, however, in the upper portion of the septum connecting the two atria (the ostium secundum), while in cases of ECD, the hole is in the lower portion of the septum (the ostium primum), leading to possible deficiencies in both the tricuspid and mitral valves (Figure 1). Patients with ASD also appear acyanotic.

The investigation described in this paper was based upon a computer simulation model that represents the knowledge required to diagnose suspected cases of congenital heart disease. The simulation model (DIAGNOSER)¹ was developed to test a theoretical framework for the interpretation of expertise in diagnostic reasoning (Connelly & Johnson, 1980; Johnson, 1980; Johnson, Severance, & Feltovich, 1979; Swanson, 1978; Swanson, Feltovich, & Johnson, 1977).
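The cyanotic/acyanotic contrast among these competitors follows from simple flow-weighted mixing arithmetic. The sketch below uses invented flows and saturations purely for illustration; none of the numbers come from the paper or from clinical data:

```python
# Toy oxygen-mixing arithmetic: the saturation of blood leaving a chamber
# is the flow-weighted mean of the streams entering it. All numbers are
# illustrative assumptions, not clinical values.

def mixed_saturation(streams):
    """streams: iterable of (flow, saturation) pairs entering a chamber."""
    total_flow = sum(flow for flow, _ in streams)
    return sum(flow * sat for flow, sat in streams) / total_flow

PULM_SAT, SYST_SAT = 0.97, 0.65      # assumed pulmonary/systemic venous saturations

# TAPVC: all pulmonary venous return empties into the right atrium, so the
# only blood that can reach the body is the fully mixed right-atrial pool.
tapvc_systemic = mixed_saturation([(3.0, PULM_SAT), (1.0, SYST_SAT)])

# PAPVC/ECD/ASD: the left atrium still receives unmixed oxygen-rich blood,
# so systemic saturation remains essentially normal (acyanotic).
acyanotic_systemic = PULM_SAT
```

Under these toy numbers the TAPVC systemic saturation works out to 0.89, mildly reduced, consistent with the weak cyanosis described above, while the acyanotic competitors deliver blood at normal saturation.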
The DIAGNOSER model embodies three types of knowledge used in the diagnosis of congenital heart disease: (1) deductive knowledge (based on cardiac physiology, anatomy, and pathophysiology of the causal structures underlying congenital heart diseases), which is used to deduce information about blood flow, cardiac pressures, anatomic changes, and resultant clinical and laboratory findings for a variety of congenital cardiac anomalies; pathophysiological changes in the heart (e.g., enlarged right ventricle) are assembled in schematic structures forming a base of disease knowledge; (2) disease knowledge (representing the end-product of applying the deductive knowledge), realized as a hierarchy of disease schemata, where each schema consists of a structure of expectations for the patient data of that disease, as well as a structure of deductive semantics which explains the expectations for the disease in terms of underlying pathophysiology; and (3) heuristic knowledge of diagnostic reasoning in the form of schemata (with embedded, production-like control structures) which detect potentially significant data, link patient data to elements of disease knowledge, provide cross referencing within elements of disease knowledge, and control the specification of hypotheses (generation/acceptance/rejection) such that an increasingly good "fit" to the patient data is obtained.

DIAGNOSER's reasoning process is illustrated in Figure 2, which shows the flow of control through the model's knowledge base in response to specific items of patient data from a case of TAPVC. Two types of data are presented: (1) auscultation (S2-splitting: a wide, fixed splitting of the second heart sound; systolic: a systolic ejection murmur) and (2) x-ray (a vascular shadow). The S2-splitting is detected by a heuristic schema causing a "data trigger" (DT) to force consideration of a defect, atrial level shunt. The consideration of the shunt activates a second heuristic schema which unconditionally triggers three classic disease models via a "when triggered" (WT) mechanism. One of these diseases, APVC (anomalous pulmonary venous connection), can also be triggered via a vascular shadow found in the x-ray data, reflecting alternative means by which hypotheses may be generated. Hypothesizing a classic disease prototype generally activates heuristics to facilitate the evaluation of the triggering propriety and selection of a more "fitting" disease variant. The selection of a variant (SV) may be either unconditional or contingent upon obtained confirmatory evidence. When a hypothesized disease is presented with data in violation of its structure of expectations, heuristic schemata may be activated which either force a further specification (FS) to an alternate disease variant or, more radically, initiate a differential test procedure (DFT) for consideration of another disease family.

¹The simulation model is programmed in LISP 1.4 on the University of Minnesota time-sharing CDC Cyber 74 computer. It contains just over one million characters.
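The data-trigger and when-triggered mechanisms just described can be sketched as table-driven rules. The rule contents below are invented illustrations of the DT/WT idea, not DIAGNOSER's actual LISP structures:

```python
# Hypothetical sketch of hypothesis generation via data triggers (DT) and
# unconditional "when triggered" (WT) links. Rule contents are invented.

DATA_TRIGGERS = {                      # data-to-pattern links (DT)
    "S2 wide fixed split": "atrial level shunt",
    "vascular shadow":     "APVC",
}
WHEN_TRIGGERED = {                     # pattern-to-pattern links (WT)
    "atrial level shunt": ["ASD", "ECD", "APVC"],
}

def generate_hypotheses(data_items):
    """Fire DTs on each datum, then propagate WTs unconditionally."""
    active = []
    for item in data_items:
        hypothesis = DATA_TRIGGERS.get(item)
        if hypothesis and hypothesis not in active:
            active.append(hypothesis)
            for h in WHEN_TRIGGERED.get(hypothesis, []):
                if h not in active:
                    active.append(h)
    return active
```

Note that, as in the figure, APVC can enter the active set by two routes: unconditionally through the shunt's WT links, or directly through the x-ray data trigger, so the same hypothesis may be generated by alternative means.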

Figure 2. Portion of DIAGNOSER's knowledge base for TAPVC. (The figure traces the patient data items, S2-splitting, systolic murmur, and x-ray vascular shadow, through data triggers (DT) to the atrial level shunt schema and through "when triggered" (WT) links to the candidate disease models, including TAPVC.)


In the investigation reported here, the behavior of the DIAGNOSER model was compared with the reasoning employed by physician subjects in diagnosing a case of TAPVC susceptible to reasoning error (Experiment 1). The model was then given a new set of diagnostic tasks designed to derive propositions about the sources of error in diagnosing the TAPVC case (Experiment 2). Finally, the predictions from the second experiment were tested with a new sample of subjects and a new set of diagnostic tasks (Experiment 3).

EXPERIMENT 1

The question addressed in the first experiment was the extent to which subjects at different levels of training and experience would misdiagnose a given case of TAPVC due to confusion between the various competing alternatives (PAPVC, ECD, ASD). The actual case used in the experiment was chosen from the file of the University of Minnesota Heart Hospital. It involved a five-year-old child with diagnosed TAPVC. Data from the case file were summarized in 22 written statements, each presenting a particular aspect of the medical history, physical examination, or laboratory data. These statements were extracted from a patient chart, modified slightly with the help of a consulting cardiologist, and placed into standard patient data groupings of history, physical examination, chest x-ray findings, and electrocardiogram (EKG) readings (Moller, 1973).

Subjects
Four experts (two board-certified faculty in pediatric cardiology and two advanced fellows in pediatric cardiology), four trainees (two first year fellows and two third year residents), and four medical students (who had completed a six-week elective course in pediatric cardiology) served as subjects. All were volunteers.

Each of the 22 statements of patient data was read by the subjects in the order in which the data appeared in the patient chart (history, physical examination, x-ray, EKG). After reading each statement, subjects were asked to report aloud any of their thoughts regarding the diagnosis of the case. After reading all the statements, subjects were asked to make a final primary diagnosis. The complete problem solving session for each subject was tape recorded and transcribed. DIAGNOSER also received the patient data in order to reach a diagnosis for the case.

Results
Analysis of data obtained from subjects was based upon verbal protocols generated during the problem solving session. The procedures used to score these protocols are described briefly, followed by a discussion of results in two parts: form of diagnostic reasoning and content or substance of diagnostic reasoning.


Protocol Analysis. The data items of the TAPVC case and a complete protocol from one of the successful experts (E1) are contained in Technical Appendix A. Scoring of the verbal protocols generated by each subject during the problem solving session was based on the identification of hypotheses regarding possible patient states. These hypotheses were divided into two basic types: (a) diseases, which represent hypotheses based upon specific cardiac conditions (e.g., TAPVC), and (b) pathophysiological conclusions, which represent hypotheses based upon presumed cardiovascular states that are consequences of one or more of the basic anomalies (e.g., increased pulmonary blood flow). In addition to the two basic types of hypotheses, another form of hypothesis, termed "global hypothesis", was also identified (cf., Bruner, Goodnow, & Austin, 1956; Rubin, 1975). A global hypothesis was considered to be either a disease or pathophysiological hypothesis at the category level (e.g., congenital heart disease; left to right shunt). Three raters (research assistants), knowledgeable in the disease cluster of the case in question, were given the transcribed protocols and asked to mark all instances of disease, pathophysiological, and global hypotheses. Identification of hypotheses was aided by a hypothesis list developed beforehand with the aid of the consulting cardiologist. One rater scored all the protocols. Protocols from three subjects selected at random from the set of 12 were also scored by the other two raters. The proportion of agreement between the hypothesis lists developed by each rater for the three protocols rated in common was computed. Cohen's K (Cohen, 1960), an interrater reliability coefficient, was employed to adjust for agreement due to chance. The results of the analysis showed a high degree of interrater agreement (K values were .83, .74, and .71).
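Cohen's K corrects observed agreement for the agreement expected by chance from each rater's marginal label frequencies. A minimal computation, using invented rater label sequences for illustration, looks like:

```python
# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).
# Chance agreement comes from each rater's marginal label frequencies.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)
```

For example, two raters who agree on five of six hypothesis labels but have different marginal frequencies obtain a kappa noticeably below their raw 5/6 agreement, which is the chance correction at work.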
Questions of differences between raters' interpretation of individual protocols were resolved in discussion with the consulting cardiologist. Responses generated by DIAGNOSER were scored from the same hypothesis list used to score subjects' responses.

Analysis of the Form of Diagnostic Reasoning. The form of diagnostic reasoning employed by both subjects and the simulation model was compared first by examining the number and type of hypotheses generated in response to items of patient data. The number of disease hypotheses generated in response to each stage of patient data (history, physical examination, x-ray, EKG) was calculated and an analysis of variance was conducted with the level of expertise as a between-subjects factor and stage of patient data as a repeated-measures factor. Although there was not a significant effect of expertise, more disease hypotheses were generated in response to history and physical examination data than to x-ray and EKG data.² This finding is not surprising, given the fact that there were more statements of patient data in the history (9) and the physical exam (11) than in x-ray (1) and EKG (1); and the fact that toward the end of each set of data, subjects were spending time evaluating hypotheses previously generated. What is significant, however, is the fact that experts and non-experts reasoned much the same with regard to the number of disease hypotheses generated in response to each stage of patient data. The examination of the distribution of responses generated by DIAGNOSER revealed that it also generated more disease hypotheses to history and physical exam data than to x-ray and EKG data. The data in Figure 3 show the pattern of pathophysiological conclusions generated as hypotheses in response to each stage of patient data. The analysis of variance indicated that each subject group generated more pathophysiological conclusions in response to the physical exam data than to the other types of patient data. As can be seen in Figure 3, the pattern of responses given by DIAGNOSER followed the trend of the three subject groups.³ The analysis of variance conducted on the number of global hypotheses generated by the three subject groups (between-subjects factor) in response to each stage of patient data (repeated-measures factor) showed that all subject groups also gave more global hypotheses to history and physical examination data than to x-ray and EKG data.⁴ The decrease in the number of global hypotheses generated with each succeeding set of patient data was dramatic for all subjects. As before, the decrease in number of hypotheses generated with subsequent stages of patient data was partly a function of decreased opportunity to respond due to fewer patient data statements in the latter categories as well as increased focus in subject thinking. Once again, however, the important point is the lack of difference between experts and non-experts. The frequency of global hypotheses generated by the simulation followed the trend of the subject groups.

²The main effect for stage of patient data was significant, F(3, 27) = 21.75, p < .01. The mean numbers of disease hypotheses were 3.83, 3.83, 1.83, and 1.08, respectively, for each stage of patient data.
The form of diagnostic reasoning employed by the simulation model and each of the subject groups was further compared by examining the overlap in their distributions of hypotheses. The hypotheses generated by each subject that were in common with those generated by DIAGNOSER are shown in Table 1. As can be seen in Table 1, there was not complete agreement between DIAGNOSER and the subjects, or between subjects, with regard either to the set of hypotheses generated, or with respect to the order in which hypotheses were generated. Nevertheless, there are important similarities between the simulation model and subjects based upon the generation of members of the disease "competitor set" (i.e., TAPVC, PAPVC, ECD, and ASD). First, all subjects generated the disease alternative ASD, seven of the twelve subjects generated ECD, and ten of the twelve subjects generated PAPVC. DIAGNOSER generated each of these three disease alternatives. Second, four subjects did not generate the correct diagnosis (TAPVC) as a possible hypothesis, and four of the eight subjects who generated TAPVC failed to conclude it as their final diagnosis. Third, some hypotheses generated by the simulation were used only by the more experienced subjects (e.g., Ebstein's malformation). Fourth, although several subjects generated the same hypothesis at different points in their protocol, the order in which the hypotheses were generated by the simulation and subjects was fairly similar, especially for the more experienced subjects. One means of quantifying this latter relationship is to compute rank-order correlations, which appear in Table 2.

³There was an interaction between the level of expertise and the stage of patient data, F(6, 27) = 2.41, p < .06. Simple main effects analysis indicated that the expert group generated more pathophysiological hypotheses (6.5) in response to the physical exam data than did the trainees (3.0) and students (4.75), F(2, 35) = 7.21, p < .01.

⁴The main effect for stage of patient data was significant, F(3, 27) = 11.24, p < .01.

Figure 3. Mean frequency of pathophysiological hypotheses generated by each subject group and the simulation in response to stage of patient data (stages: History, Physical, X-ray, EKG; series: Simulation, Expert, Trainee, Student).

TABLE 1
Hypotheses Generated by DIAGNOSER and Subjects

TABLE 2
Rank-Order Correlations of Hypotheses Generated by DIAGNOSER and Subjects

Correlations of each subject's hypothesis order with DIAGNOSER's: E1 = .84**, E2 = .78**, E3 = .80**, E4 = .60, T1 = .83*, T2 = .48, T3 = .07, T4 = .01, S1 = .72*, S2 = .25, S3 = .72*, S4 = .48.

Note. Significance of coefficients varies depending upon the number of hypotheses in common between pairs of subjects. *p < .05. **p < .01.

The correlation coefficients in Table 2 between subjects and the simulation model are based upon the responses generated in common between the model and each subject. The question addressed by these correlations is the extent to which the sequence of responses given by the model also appeared in the distribution of responses given by each subject. The correlations between subjects are based upon responses generated in common between pairs of subjects. The question addressed by these correlations is the extent to which the responses in common between any two subjects occurred in the same order. The correlations in Table 2 show that the order of responses generated by the simulation was significantly correlated with the orders generated by three experts, one trainee, and two students. There were also more significant correlations among the patterns of responses generated by the different experts than there were among the patterns of responses generated among trainees or among students. Finally, there was more commonality between experts and students and between experts and trainees than there was between trainees and students. One of the two trainees (T1) who successfully diagnosed the case generated a pattern of responses like the simulation and the two successful expert subjects (E1 and E2), while the other successful trainee (T4) did not. One of the unsuccessful students (S1) not only generated a pattern of responses that agreed well with the simulation, but the pattern also agreed with the pattern generated by three of the four experts and one trainee.
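The rank-order correlations in Table 2 are computed over only those hypotheses a pair of solvers generated in common. A sketch of such a computation, using invented hypothesis orders rather than the actual protocol data, is:

```python
# Spearman rank-order correlation over the hypotheses two solvers share.
# Assumes each shared hypothesis appears once per order and that at least
# two hypotheses are shared (no tie correction).

def rank_order_correlation(order_a, order_b):
    common = [h for h in order_a if h in order_b]
    n = len(common)
    rank_a = {h: i for i, h in enumerate(common, start=1)}
    rank_b = {h: i for i, h in enumerate(
        (h for h in order_b if h in common), start=1)}
    d_squared = sum((rank_a[h] - rank_b[h]) ** 2 for h in common)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))
```

Restricting the computation to shared hypotheses is why, as the table note states, the significance of a given coefficient depends on how many hypotheses the pair generated in common.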


Regarding the form of diagnostic reasoning employed by subjects in the first experiment, we conclude that there were not appreciable differences between the simulation model and individuals in the three subject groups (experts, trainees, and students). Nor was there a substantial difference among individuals within each of the three subject groups. Subjects in each group (as well as DIAGNOSER) generated hypotheses of similar types in similar quantities at each stage of patient data. The only exception to this conclusion occurred with respect to the data from physical examination. Here, although the overall form of reasoning was similar among all subject groups and the model, the expert subjects and DIAGNOSER generated significantly more pathophysiological hypotheses. The physical examination data, particularly heart sounds, were crucial to a successful diagnosis of the TAPVC case (see Experiments 2 and 3).

Analysis of Substance of Diagnostic Reasoning. Analysis of the relationship between patient data of the case and the substance or content of responses given by DIAGNOSER and by subjects was based upon the correctness of the final diagnosis as well as the appropriateness of responses to specific items of patient data. This analysis was guided by consideration of the disease hypotheses generated from the competitor set (TAPVC, PAPVC, ECD, and ASD), and by the "line of reasoning" (a comprehensible series of steps; Feigenbaum, 1977; Shortliffe, Buchanan, & Feigenbaum, 1979) adopted by the subjects in attempting to reach a diagnostic conclusion. Four of the twelve subjects (the two most experienced experts and two trainees) correctly diagnosed the case. The simulation model also correctly diagnosed the case. The patient data items that elicited diseases from the competitor set, together with subjects' judgment of these items as being either consistent (+), inconsistent (−), or ambiguous (o) for a disease, are shown in Table 3. With respect to the four competing disease alternatives, there are six particularly important data items shown in Table 3. Data item 7, which reports that the patient is blue (i.e., cyanotic), represents strong disconfirmatory evidence for all members of the competitor set except TAPVC. Data items 17, 18, and 19 represent evidence for an increased volume of blood in the right side of the heart (a condition common to all members of the competitor set). It was expected that all subjects would generate at least ASD, the classic instance of this cardiac condition, by the time of these data points. Data item 21, which contains an x-ray finding of an "unusual vascular shadow in the right side," is evidence against most cases of ASD and simultaneously constitutes a classic cue for either PAPVC or TAPVC. In fact, one variant of PAPVC, "scimitar syndrome," derives its name from the presentation of such a finding on x-ray (Moss et al., 1977, p. 442).
The EKG reading, item 22, contains a finding of "right axis deviation" which is strong disconfirmatory evidence for ECD. Each of these six data items is compatible with the operative disease, TAPVC.
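The evidential relationships among these six data items and the four competitor diseases can be expressed compactly. The following sketch is a hypothetical illustration, not the authors' DIAGNOSER implementation; it encodes only the judgments stated in the text above, and cells the text leaves unspecified are marked ambiguous (o):

```python
# Hypothetical encoding of the six key data items' bearing on the competitor
# set, as described in the text: '+' = consistent, '-' = inconsistent,
# 'o' = ambiguous or not specified in the text.
COMPETITORS = ["TAPVC", "PAPVC", "ECD", "ASD"]

EVIDENCE = {
    7:  {"TAPVC": "+", "PAPVC": "-", "ECD": "-", "ASD": "-"},  # cyanosis
    17: {"TAPVC": "+", "PAPVC": "+", "ECD": "+", "ASD": "+"},  # right-heart
    18: {"TAPVC": "+", "PAPVC": "+", "ECD": "+", "ASD": "+"},  #   volume
    19: {"TAPVC": "+", "PAPVC": "+", "ECD": "+", "ASD": "+"},  #   overload
    21: {"TAPVC": "+", "PAPVC": "+", "ECD": "o", "ASD": "-"},  # x-ray shadow
    22: {"TAPVC": "+", "PAPVC": "o", "ECD": "-", "ASD": "o"},  # right axis dev.
}

def surviving_hypotheses(items):
    """Return competitors not disconfirmed ('-') by any of the given items."""
    return [d for d in COMPETITORS
            if all(EVIDENCE[i][d] != "-" for i in items)]

print(surviving_hypotheses([7, 17, 21, 22]))  # only TAPVC survives all cues
```

Under this encoding, items 17–19 alone leave the full competitor set active, which is why ASD (the classic instance) was expected from every subject by that point, while the six items together discriminate TAPVC uniquely.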

The two most experienced experts who correctly diagnosed the case, E1 and E2, pursued lines of reasoning that incorporated the full set of competing hypotheses. E1 generated ASD during the history data but considered three of the four competitors (ASD, PAPVC, TAPVC) at item 17, which is the first strong cue for ASD, but a finding also compatible with the other diseases (see Figure 4). The line of reasoning adopted by this subject might be described as a "breadth-first" (Nilsson, 1980) incorporation of disease alternatives. This line of reasoning is termed "precautionary" by Feltovich (1981) since, if any disease hypothesis encounters disconfirmatory evidence, alternative explanations to which the same evidence might apply are already under active consideration. E2, on the other hand, generated only ASD at item 17 (Figure 4) and considered this hypothesis until data item 21, which is strong evidence against ASD. At this point, E2 generated the remainder of the hypothesis set. The line of reasoning adopted by this subject was more "depth-first" (Nilsson, 1980), since it pursues one hypothesis until it is shown unsuccessful. This line of reasoning is termed "extraction" by Feltovich (1981) because its success depends heavily upon rejecting the target disease only when appropriate, which, in turn, depends upon precise expectations in the subject's disease model. One of the other successful subjects, T1 (a trainee), considered only ASD from the competitor set until reaching the EKG data, at which point T1 generated TAPVC and rejected PAPVC. This subject, like E2, adopted a depth-first line of reasoning, but without considering the complete competitor set. The other successful subject, T4 (also a trainee), generated both ECD and ASD in response to the physical exam data. ECD was generated in response to item 19; ASD was generated after viewing all of the physical exam data.
TAPVC was generated by this subject following the x-ray data and was maintained as the final diagnosis following the EKG data. The subject's line of reasoning was more like the breadth-first reasoning of E1. However, neither of the successful trainees generated the full set of competing disease alternatives.

Subjects who incorrectly diagnosed the case demonstrated informative types of errors. Student S3 diagnosed the case as Endocardial Cushion Defect (ECD). The strongest evidence against this disease is the finding of "right axis deviation" on the EKG (item 22). ECD uniformly presents with left axis deviation and, in fact, is one of very few congenital heart diseases that do so. Left axis deviation is, therefore, a nearly pathognomonic finding for ECD. S3's line of reasoning in the case was like that of E2, in that S3 generated ASD from the competitor set in the early stages of patient data. Unlike E2, however, S3 was not able to properly interpret the EKG data. Given the EKG data of the case, S3 not only evaluated the right axis deviation as positive evidence for ECD but, in addition, "triggered" or proposed ECD for the first time at this point. This is, simply, imprecision in the subject's disease model for ECD. It is as though the subject remembered that the EKG axis was important in ECD but could not remember the details.
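The contrast between E1's precautionary (breadth-first) and E2's extraction (depth-first) lines of reasoning can be sketched as two search strategies over the competitor set. This is a hypothetical illustration, not the paper's model: the cue stream and its disconfirmation lists are simplified from the case description (item 21 disconfirms ASD, item 22 disconfirms ECD):

```python
# Two hypothetical lines of reasoning over a stream of
# (data_item, diseases_disconfirmed_by_that_item) cues.

def breadth_first(cues, competitors):
    """'Precautionary': all alternatives are active from the start; any
    disconfirmed disease is dropped, and the survivors were already under
    active consideration."""
    active, trace = set(competitors), []
    for item, disconfirmed in cues:
        active -= set(disconfirmed)
        trace.append((item, sorted(active)))
    return trace

def depth_first(cues, competitors, favorite):
    """'Extraction': pursue one hypothesis until evidence rejects it, and
    only then generate the remaining alternatives from the competitor set."""
    active, trace = {favorite}, []
    for item, disconfirmed in cues:
        if favorite in disconfirmed and favorite in active:
            active = set(competitors) - set(disconfirmed)  # regenerate rest
        else:
            active -= set(disconfirmed)
        trace.append((item, sorted(active)))
    return trace

comp = ["TAPVC", "PAPVC", "ECD", "ASD"]
cues = [(17, []), (21, ["ASD"]), (22, ["ECD"])]
print(breadth_first(cues, comp))          # full set active from item 17 on
print(depth_first(cues, comp, "ASD"))     # only ASD active until item 21
```

Both strategies converge on the same survivors here; the difference, as in the protocols of E1 and E2, is which alternatives are under active consideration at each intermediate data item, and hence how much the depth-first reasoner depends on rejecting the favored hypothesis at exactly the right point.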

[Table 3: subjects' judgments of patient data items as consistent (+), inconsistent (−), or ambiguous (o) for the diseases in the competitor set (TAPVC, PAPVC, ECD, ASD). The table body is not recoverable from this copy.]