AUTOMATION BIAS AND SYSTEM DESIGN: A CASE STUDY IN A MEDICAL APPLICATION

E. Alberdi*, P. Ayton†, A. A. Povyakalo*, L. Strigini*

*Centre for Software Reliability, City University, London, UK
{e.alberdi, andrey, strigini}@csr.city.ac.uk

†Psychology Department, City University, London, UK
[email protected]

Keywords: decision support, computer aided decision making, alerting systems, human-machine diversity

Abstract

We discuss the results of an interdisciplinary study of computer-aided decision making in cancer screening. We found evidence of "automation bias" – roughly, overreliance on automation, which may degrade users' performance. While such effects are well known, they have not been previously reported for this application with expert clinicians, and are not generally considered in the literature. One lesson from our study was that these potentially important effects will easily go unnoticed with common assessment methods. Implications for designers of computer aided decision making include that: a) it may be necessary to calibrate tool design for a range of different levels of user skills; b) an "expert system" approach to computerised aid design – building a "better replica" of a human expert – may be counterproductive by making the aid weak in those very areas where humans need help; c) HCI design risks focusing on usability of the physical human-computer interface, but the critical issues in design concern the cognitive effects (e.g. changes of decision thresholds) a computer tool may have on users; d) users' subjective assessments of a computer aid may be misleading: people may judge a tool helpful when their decision-making performance is actually being hampered.

1 Introduction

This paper deals with computer aided decision making. In particular, we focus on decision systems formed by human operators working with automated aids that support human decisions with advice, filtered or enhanced information, alerts and prompts. From the viewpoint of design for dependability, computer aided decision making offers diverse redundancy [10, 24], with diverse components (people and machines) expected to detect and support the correction of errors of other components. Designing such human-computer systems involves attending to a number of different aspects of the system's components. Their dependability is affected by the design of the decision support tools, the procedures for their deployment and use, the training of operators and the definition of their roles. The intended role of automation is often purely auxiliary: to reduce the risk of humans missing some information (e.g. when fatigued or overloaded), or to make the process of human apprehension of all important information more efficient. The human user retains the authority and responsibility for decisions. The design intention is thus to exploit protective redundancy with diversity: the machine protects against some human errors, and vice versa.

We summarise here analyses and empirical work that we conducted on a particular application of computer aided decision making: Computer Aided Detection (CAD) in mammography, i.e., clinicians using computer support in reading X-ray images for breast cancer screening. The study, within the context of DIRC (the Interdisciplinary Research Collaboration on the Dependability of computer-based systems), used an interdisciplinary approach, combining insights and methods from reliability engineering, psychology, human factors and ethnography [4]. It was a follow-up to a previous controlled study of CAD conducted by other researchers for the UK Health and Technology Assessment (HTA) programme, on a specific CAD tool ("Study 1" in [26], hereon "the HTA study").

Our probabilistic modelling [23], inspired by previous experience on software systems with diverse redundancy [10], highlighted how variation and co-variation in the "difficulty" of input cases for the machine and human components of the system substantially affect the dependability of the overall system: focusing on the average probabilities of the failures of the components, or assuming statistical independence among their failures, can be misleading. We were granted access to raw data from the HTA study, on which we conducted various exploratory statistical analyses, focusing on the interactions between correctness of computer output, difficulty of individual decision problems and human skills. Additional empirical work by us and our DIRC colleagues included follow-up experiments (focusing on the effects of incorrect computer output on decisions) and ethnographic observation, both of which helped to elucidate aspects of the users' behaviour.

Our statistical results [3, 17], which we summarise here, suggested that the effects of computer assistance varied markedly across users and over the set of cases examined. In particular, the effects were shown to be potentially detrimental for some clinicians when dealing with some categories of cases. This finding was striking in this application context because it refuted an assumption often found (implicitly or explicitly) in discussions of the efficacy of CAD, i.e., that the computer aid can do no harm (cannot cause bad decisions); moreover, this assumption has implications for the methods to be used in estimating the tool's efficacy. However, the observation of varying effects of computer aids on human decisions is more generally important as a basis for conjecturing how these effects are generated, and thus informing better design of decision systems.

We discuss here possible explanations for the observed detrimental effects. It is tempting to categorise them all as effects of "complacency": human operators becoming less attentive or less wary of error since they can fall back on the computer for decisions. But this would be simplistic and unfair. When experts, giving every indication of being attentive and thorough, appear to be led into error, it is prudent to contemplate other possible causes of error besides complacency, and to examine their implications for design and assessment practices in all applications of computer-aided decision-making. Our aim is to discuss these general implications, not the effectiveness of CAD in general or of the specific tool in particular: both issues are widely debated in the medical literature [2], and CAD tools evolve rapidly, so that data obtained about one version may rapidly become obsolete.

In the rest of this paper, we briefly review some pertinent literature (Sec. 2) and introduce the application domain of our study (Sec. 3). Sec. 4 describes our exploratory statistical analyses of the raw data from the HTA study. In Sec. 5, we outline the results of our follow-up empirical studies, as well as evidence from questionnaires and ethnographic observations. Next we discuss potential explanations of the behavioural patterns observed (Sec. 6) and their implications for the design of decision systems (Sec. 7). Sec. 8 contains our conclusions.

2 Automation bias literature

The human factors literature has long recognised that human–computer "decision systems" do not always function ideally. As early as 1985, Sorkin & Woods [21], using signal detection analysis, concluded that optimising an automated aid's performance would not always optimise the performance of a human-computer team in a monitoring task. A human-machine system can only be optimised or improved as a whole: improving the automation alone or human training alone may be ineffective or wasteful.

More recently, in an influential paper, Parasuraman & Riley [16] discuss ways in which this kind of human-computer interaction can go wrong, via anecdotal evidence and results from various empirical studies. They talk about three aspects of ineffective human use of automation: disuse, i.e., underutilization of automation, where humans ignore automated warning signals; misuse, i.e., overreliance on automation, where humans are more likely to rely on computer advice (even if wrong) than on their own judgement; and abuse, when functions are automated "without due regard to the consequences for human [...] performance and the operator's authority" [16].

Skitka and colleagues introduced the term "automation bias" [20, 19] when studying the effects of incorrect computer advice on decisions taken by students, in laboratory settings simulating monitoring tasks in aviation. They found that students given automated advice that was unreliable (for some of the tasks) made more errors than those performing the same tasks without automated advice. They distinguished two types of error: i) of commission, where decision-makers follow automated advice even in the face of more valid or reliable indicators suggesting that the automated aid is wrong; ii) of omission, where decision-makers do not take appropriate action, despite evidence of problems, since the automated tool did not prompt them. In these experiments, the participating students had access to alternative (non-automated) sources of information. Those who also had automated advice were told that the automated tool was not completely reliable but all other instruments were 100% reliable. Still, many chose to follow the advice of the automated tool even when incorrect and contradicted by the other information. The authors interpret these errors as evidence of "automation bias", which they explain, for the errors of omission, in terms of complacency or reduced vigilance.

Others have discussed human reliance on automation in terms of trust. For example, Dzindolet and colleagues [6] found that subjective measures of the trust of human operators in a computer aid correlated with the operators' likelihood to rely on the aid regardless of its reliability. They also found that an understanding of why the aid failed influenced people's trust and subsequent reliance: those participants told the reasons for the aid's errors were more likely to trust it and follow its advice than those who were not aware of these reasons.
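Sorkin & Woods' point – that aid and human have to be optimised as a team – can be illustrated with a toy signal detection calculation. The sketch below is our own construction with invented parameters, not the model in [21]; it merely shows that the alert threshold that maximises the aid's stand-alone accuracy generally differs from the threshold that minimises the expected cost of decisions by the human-plus-aid team, once the human's different behaviour on alerted and non-alerted cases is taken into account.

```python
# Toy signal-detection illustration (our construction, not the model in [21]):
# the aid threshold that maximises the aid's own accuracy differs from the
# threshold that minimises the expected cost of the human + aid team.
import numpy as np
from scipy.stats import norm

d_prime = 1.5                      # separation of signal and noise distributions
p_signal = 0.1                     # assumed prior probability of a true event
cost_fn, cost_fp = 50.0, 1.0       # a missed event costs far more than a false alarm

# Assumed human behaviour: alerted cases are checked carefully, others less so.
hit_if_alerted, fa_if_alerted = 0.95, 0.10
hit_if_not_alerted, fa_if_not_alerted = 0.40, 0.02

thresholds = np.linspace(-2, 4, 601)
p_alert_signal = 1 - norm.cdf(thresholds, loc=d_prime)   # aid hit rate
p_alert_noise = 1 - norm.cdf(thresholds, loc=0.0)        # aid false-alarm rate

# Aid in isolation: proportion correct if the alert itself were the decision.
aid_accuracy = p_signal * p_alert_signal + (1 - p_signal) * (1 - p_alert_noise)

# Team: the human makes the final call, behaving differently when alerted.
team_hit = p_alert_signal * hit_if_alerted + (1 - p_alert_signal) * hit_if_not_alerted
team_fa = p_alert_noise * fa_if_alerted + (1 - p_alert_noise) * fa_if_not_alerted
team_cost = p_signal * (1 - team_hit) * cost_fn + (1 - p_signal) * team_fa * cost_fp

print(f"threshold maximising the aid's own accuracy:   {thresholds[np.argmax(aid_accuracy)]:+.2f}")
print(f"threshold minimising the team's expected cost: {thresholds[np.argmin(team_cost)]:+.2f}")
# The two optima differ markedly: tuning the aid in isolation need not optimise the team.
```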

3 Domain: CAD for mammography

In screening for breast cancer, expert clinicians ("readers") examine mammograms (sets of X-ray images of a woman's breasts), and decide whether the patient should be "recalled" for further tests because they suspect cancer. A CAD tool is designed to assist the interpretation of mammograms. It aims to recognise and mark ("prompt") regions of interest (ROI) on a digitized mammogram to prevent clinicians from overlooking them. CAD is not meant to be a diagnostic tool, in the sense that it only marks ROIs, which should be subsequently classified by the reader to reach the "recall/no recall" decision.

The CAD tool in our study was the R2 ImageChecker M1000 [1]. The prescribed procedure for using it is: the reader looks at a mammogram and interprets it as usual, then activates the tool and looks at a digitised image of the mammogram with the tool's prompts for ROIs, checks whether he/she has overlooked any features with diagnostic value and then revises his/her original assessment, if appropriate. Fig. 1 schematically shows the resulting "decision system".

Figure 1. The decision system formed by a human "reader" with the Computer Aided Detection tool.

The manufacturers claim that when this tool is used as recommended the potential for missing lesions is not increased over routine screening [1]. This decision system may produce two kinds of errors: false negative (FN: the reader issues a "no recall" decision on a cancer) or false positive (FP: recalling a non-cancer – a normal – case). In turn, we define a FN error by the CAD tool as failing to prompt areas of the mammogram where cancer is located, even when the tool places prompts on other areas (we also call a FN error "incorrect prompting" of a cancer), and a FP error as placing any prompt on the mammogram of a normal case.

4 Interactions between correctness of computer output, difficulty of cases and human skills

In the HTA study [26], 50 readers looked at mammograms for which the correct diagnosis was known (60 cancer and 120 normal cases) in two conditions: with and without computer support. Each reader examined every case, once in a session with computer aid and once in a session without it (in independent randomised orders), producing a "recall/no recall" decision. This study found no statistically significant effect of CAD prompts on the performance of these readers in detecting cancers. We re-analysed the raw data from this study, looking especially for effects that may not be visible in the average results. These supplementary analyses [4] indicated that:

1. correct computer prompting was likely to help readers in reaching a correct decision while incorrect prompting hindered them;

2. even without computer support, incorrect decisions were more likely for cases that the CAD tool also processed incorrectly: reader and computer errors are strongly correlated. Consequently, for those cases that are more difficult for humans to interpret, the computer is less likely to give useful output.

We define the difficulty of a case as the fraction of readers who produced an incorrect decision about that case. We call the difficulty of a case read with computer aid (prompting) dp and the difficulty without computer aid d. Statistical analyses indicated that the value d – dp was significantly different between correctly and incorrectly prompted cases (both for cancers and normal cases). We also used logistic regression (a form of interpolation on the raw data) to look for possible general patterns in the effect of CAD [17]. This showed that CAD tended to make cancers that were relatively easy (i.e. with low d) even easier (dp < d), while making the more difficult cancers even more difficult (dp > d).
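To make these quantities concrete, the following sketch shows how d, dp and a regression of this kind could be computed from a readers × cases matrix of decisions. It is purely illustrative: the data are simulated, and the choice of test (Mann–Whitney) and of regression covariates is ours, not necessarily that used in the analyses of [17].

```python
# Illustrative sketch (simulated data, not the HTA raw data): computing per-case
# difficulty d (without CAD) and dp (with CAD), comparing d - dp between correctly
# and incorrectly prompted cancers, and regressing reader errors on unaided difficulty.
import numpy as np
from scipy.stats import mannwhitneyu    # test chosen for illustration only
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_readers, n_cases = 50, 60             # e.g. the 60 cancer cases

# Hypothetical binary error matrices: entry [r, c] = 1 if reader r missed cancer c.
latent = rng.beta(1, 3, n_cases)                      # latent per-case difficulty
err_no_cad = rng.random((n_readers, n_cases)) < latent
err_cad = rng.random((n_readers, n_cases)) < latent
prompt_correct = rng.random(n_cases) < 0.7            # did CAD mark the cancer?

d = err_no_cad.mean(axis=0)     # fraction of readers wrong without CAD
dp = err_cad.mean(axis=0)       # fraction of readers wrong with CAD
delta = d - dp                  # positive = the case became "easier" with CAD

# Is d - dp different for correctly vs incorrectly prompted cancers?
stat, p = mannwhitneyu(delta[prompt_correct], delta[~prompt_correct])
print(f"d - dp, correct vs incorrect prompting: U = {stat:.1f}, p = {p:.3f}")

# Logistic regression of individual reader errors (with CAD) on the case's
# unaided difficulty d and on whether the case was correctly prompted.
y = err_cad.ravel().astype(int)
X = np.column_stack([np.repeat(d[None, :], n_readers, axis=0).ravel(),
                     np.repeat(prompt_correct[None, :], n_readers, axis=0).ravel()])
fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(fit.params)   # intercept, effect of d, effect of correct prompting
```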

Fig. 2 illustrates this effect. The regression analysis covered "non-obvious" cancer cases, that is, cancers missed by at least one reader in either condition. The horizontal axis represents their difficulty d without computer aid; the vertical axis shows the differences (dp − d). So, a point below 0 on the y-axis indicates a cancer for which CAD appears to reduce the rate of FN errors. Points marked 'w' and 'c' indicate the observed values of d and (dp − d) for non-obvious cancers, divided into those with correct CAD output ('c' symbols) and those with wrong CAD output ('w' symbols). The curves show the regression estimates for the mean value of (dp − d) as a function of the difficulty d of cases. The dashed curve is obtained from the data for the incorrectly prompted cancers, the dotted-dashed curve from the correctly prompted cancers and the solid curve from all cancers together.

Analyses of all readers' decisions without computer aid suggest that the more sensitive readers (those more likely to decide correctly on cancers) recognised more of the difficult cancers than other readers. Fig. 2 shows that, for difficult cases, incorrect decisions were more common with computer aid than without. Since less sensitive readers were very unlikely to recognise difficult cancers anyway, this increase in FN errors for difficult cases must be due to an adverse effect of the computer aid on the decisions of the more sensitive readers, plausibly caused by incorrect computer prompts (cf. Fig. 2). Similarly, for the "easy" cases, without computer aid the less sensitive readers made a larger number of incorrect decisions: they were thus more likely to benefit from the correct computer prompts on "easy" cases than on "difficult" ones, and on easy cases they were more likely to benefit than the more sensitive readers.

Thus, the use of computer support is likely to be more beneficial for less sensitive readers and more questionable for the more sensitive ones. This conjecture is supported by additional statistical analyses of these data [18].

[Figure 2: scatter plot for 46 non-obvious cancers and 50 readers. Horizontal axis: case difficulty without CAD (d), from 0.0 to 1.0; vertical axis: change in case difficulty with CAD (dp − d), from −0.2 to 0.2. Points labelled 'c' and 'w' with fitted regression curves; annotations: "Easy cases become easier with CAD (dp − d < 0)", "Difficult cases become more difficult with CAD (dp − d > 0)"; case X is marked. See caption below.]

Figure 2. Effect of correctness of CAD prompting on case difficulty (i.e., rate of FN errors): dp: difficulty of a case with computer aid; d: difficulty without computer aid; w: cancer incorrectly prompted by CAD; c: cancer correctly prompted by CAD; dashed curve: regression curve fitted for wrongly processed cancers; dashed-dotted curve: regression curve fitted for correctly processed cancers; solid curve: regression curve fitted for all cancers. [17]

Apart from evaluating the probable effect of CAD in future use [4, 22], these statistical patterns (and observed exceptions to them) can help identify possible mechanisms causing improved or worse decisions. Examples of exceptions can be seen in Figure 2: despite being prompted correctly (points highlighted as 'c'), there were a few "easy" cancer cases (with d ≤ 0.1) missed by more readers with computer support than without it. For instance, the cancer case X (see Fig. 2) was missed by 2 of 50 readers without computer aid (d = 0.04) and by 6 of 50 readers with computer aid (dp = 0.12). Such changes in decisions might be due simply to natural "noise" – random variation – in human decision making. But if the marked change observed here were a confirmed pattern, we would need to explore possible causes. For this particular cancer case, without CAD, 13 of 50 readers (26%) decided that this case required discussion before the final decision. This suggests that there was a relatively high level of uncertainty associated with this case, although the low number of FN errors in deciding about it defines it as comparatively "easy". As we report below, evidence from ethnographic work and from our follow-up studies supports the conjecture that when dealing with uncertain ("indeterminate") cases, CAD prompts can have unanticipated (and potentially detrimental) effects on readers' decisions.
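As a rough check of the "random variation" point for case X, one can ask whether 2 misses out of 50 readings without CAD versus 6 out of 50 with CAD could plausibly arise by chance. The sketch below is only indicative: it uses Fisher's exact test and treats the two conditions as independent samples, whereas the study design was actually paired (the same readers read the case twice).

```python
# Rough check: could the change for case X (2/50 misses without CAD, 6/50 with CAD)
# be chance? Illustrative only: treats the conditions as independent samples,
# ignoring the paired design of the HTA study.
from scipy.stats import fisher_exact

table = [[2, 48],   # without CAD: misses, correct decisions
         [6, 44]]   # with CAD:    misses, correct decisions
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2f}")
# A p-value well above 0.05 is consistent with the change being "noise", which is
# why a single case like X can only suggest, rather than establish, a pattern.
```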

5 Follow-up empirical studies: effects of incorrect computer output

We conducted two follow-up studies, investigating in more detail the effects of incorrect computer output on human decision making [3]. These were complemented by the collection of subjective data via questionnaires.

5.1 Follow-up studies

In our follow-up Study 1 we used a test set containing a large proportion of cancers that CAD had missed. We kept all other characteristics of the test set as similar as possible to the sets used in the HTA study, as we wanted the readers to perceive this study as a natural extension of the HTA study (all 20 readers in Study 1 had also participated in the HTA study) and to behave in a comparable way. We used essentially the same procedures used in the HTA study, except that the readers in Study 1 saw all the cases only once: always with the benefit of CAD. At that stage, we were not interested in comparing readers' performance with and without CAD; our goal was to estimate the probability of reader error when the CAD tool's output was incorrect. However, we obtained intriguing, unexpected results: the proportion of correct decisions was very low, and particularly so for cancer cases not marked by the tool. This suggested that CAD errors may have had a significant negative impact on readers' decisions.

To explore this issue further, we ran Study 2, where a new but equivalent set of readers saw the same test set without computer aid. We used the same procedure as in Study 1, except that readers did not see the computer prompts. Reader sensitivity in Study 2 (without CAD) was significantly higher than in Study 1 (with CAD); the measured difference increased if we looked at incorrectly prompted cases only, and even more so for cancer mammograms where the tool had placed no prompts at all. These findings strongly suggest that, at least for some categories of cases in our studies, incorrect prompting had a significant detrimental effect on human decisions. This conclusion reinforces that of a previous experiment [27], which used artificially inserted errors in computer prompts, as opposed to the actual errors used in our study.

5.2 Questionnaire results

At various stages during Study 1, the readers were asked to answer a series of questionnaires. Of particular interest was a final questionnaire which probed the perceived differences between the test sets used in Study 1 and in the HTA study, in terms of the behaviour of the CAD tool and the characteristics of the test cases. Since only 10 out of the 20 readers in Study 1 completed this questionnaire, our analysis is limited. However, we found some interesting patterns.

Some readers seemed unaware of the large proportion of cancers for which the CAD tool generated incorrect prompting in our test set. For example, most of the respondents answered that our test set had the same proportion of cancers as, or fewer than, the sets in the HTA study, when it actually contained a few more cancers. Similarly, a few respondents thought our test set had the same proportion of cancers missed by the CAD tool as in the HTA study; but, in fact, it contained many more. Also, half of the respondents thought that the CAD prompts were more useful in Study 1 than in the HTA study because, they said, our test set had fewer "distracting" prompts; yet, in reality, there were fewer correct prompts for cancers.

Interestingly, we found no association between a reader's responses to the questionnaires and his/her performance in Study 1. Readers whose answers indicated a realistic perception of the behaviour of the CAD tool and of the proportion of cancers in the test set were as likely to be affected by incorrect prompts as those who believed CAD was being more helpful in Study 1 than in the HTA study.

6 Does CAD bias readers' decision making?

The results of our analyses and empirical studies suggest that the CAD tool's incorrect output may have biased the decision making of at least some of the readers, for some categories of cases. Ethnographic work by DIRC colleagues has shown that readers: a) dismissed obvious incorrect prompts as spurious (indeed, most prompts on features that the reader had not considered – either missed or thought irrelevant – are false prompts); and b) recognised that the tool often fails to prompt obvious cancers [4, 9]. However, our results indicate that, for difficult cancers, incorrect prompting often led to incorrect human decisions. Using Skitka et al.'s [20, 19] terminology, introduced earlier, one could characterise these incorrect decisions as "errors of omission": readers interpreted absence of prompts on (an area of) a mammogram as an indication of "no cancer" and therefore failed to take appropriate action (i.e., they failed to recall the patient).

There are, however, important differences between Skitka et al.'s tasks and mammogram reading as investigated in the HTA study and our follow-up studies. In the former, the participants seemed to use the computer output to replace calculations they could perform otherwise by using alternative, highly reliable, non-computer-mediated indicators. In contrast, uncertainty in mammogram reading is greater. The readers did not have access to any sources of information other than the X-ray films (mammograms) and the output of the CAD tool on digitised versions of the films. It is important to remember that CAD prompts are designed merely as attention cues, not to replace other sources of information.

With CAD, over-reliance on automation could manifest itself in readers using prompts instead of thoroughly examining the mammograms, implying that readers become complacent, or less vigilant, when using automation – as hypothesised by Skitka's team and others to explain errors of omission [11, 14, 12, 13, 16, 15, 20, 19]. One could argue that, based on past experience with the tool, readers tended to assume that the absence of prompts was a strong indication that a case was normal. CAD generates many false positives, which readers claim to find distracting. As a result, the absence of prompts may be seen as more informative than their presence, so readers may have paid less attention than necessary to those mammograms that had no computer prompts.

An alternative (perhaps complementary) way of accounting for the association between incorrect prompting and reader errors contemplates how readers deal with "indeterminate" cases: cases where they have detected anomalies with unclear diagnosis, so that they are uncertain as to whether the cases should be recalled. For example, it is possible that the absence of prompts made readers revise their decisions for ambiguous abnormalities they had already detected. In other words, they may have used the absence of prompts as reassurance for a "no recall" decision when dealing with features they found difficult to interpret. Conceivably readers were using any available evidence to resolve uncertainty. The implication is that the CAD tool was being used not only as a detection aid but also as a classification or diagnostic aid – specifically not what the tool is designed for. Readers' use of CAD in such a way has been reported before for the CAD tool investigated here [9] and for other similar CAD tools [7]. The following transcript is an example of a reader's use of prompts to inform her/his decisions: "This is a case where without the prompt I'd probably let it go … but seeing the prompt I'll probably recall … it doesn't look like a mass but she's got quite difficult dense breasts … I'd probably recall." [9] In other instances, readers were observed using the absence of a prompt as evidence for 'no recall' [9]. Arguably, this use of CAD violates an explicit warning in the device labelling [1]: "...a user should not be dissuaded from working up a finding if the device fails to mark that site". However, such warnings could only be useful if readers were aware that they exploited prompts in this fashion.

An alternative conjecture about "indeterminate cases" posits that CAD may alter readers' decision thresholds. The conjecture is as follows: when seeing a certain "indeterminate case" without computer support, readers are likely to recall it for further investigation. However, with computer support, they know that supplementary information is available in the form of CAD prompts; therefore, their preliminary decisions (before checking the prompts) are more likely to be "no recall" than they would be without CAD (a similar point was previously raised about other studies [5, 25]). In practice, this makes the computer outputs a determinant for the "recall/no recall" decision, despite readers never changing a preliminary decision from "recall" to "no recall" based on absence of prompts, or being in violation of the prescribed procedures. For those indeterminate cases that are "difficult" cancers, CAD is likely to produce incorrect output, substantially reducing the chances of the case being recalled. Readers may not change any individual decisions because of absence of prompts, but they may change their decision thresholds for some cases due to the very presence of computer support. It may be very difficult (or even impossible) to prevent this by simple prescription.
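This threshold conjecture can be made concrete with a toy numerical model. The model and all its numbers are our own invention for illustration (they are not measurements from the studies): a reader recalls an indeterminate case when their suspicion exceeds a threshold, and the presence of CAD raises the preliminary threshold because prompts are expected to act as a safety net.

```python
# Toy model of the "decision threshold" conjecture (invented numbers only):
# with CAD available, a reader applies a stricter preliminary threshold to an
# indeterminate case, expecting the prompts to catch anything important.

def p_recall(p_suspicion_above_threshold, tool_sensitivity=None):
    """Probability that an indeterminate cancer ends up recalled.

    p_suspicion_above_threshold: chance the reader's own suspicion exceeds the
                                 threshold in use for the preliminary decision
    tool_sensitivity:            chance CAD prompts the cancer (None = unaided)
    """
    if tool_sensitivity is None:                 # reading without CAD
        return p_suspicion_above_threshold
    # With CAD: recall if the (stricter) threshold is exceeded anyway, or if a
    # prompt appears and triggers a recall on review.
    return (p_suspicion_above_threshold
            + (1 - p_suspicion_above_threshold) * tool_sensitivity)

unaided = p_recall(0.7)                                # lenient threshold, no CAD
cad_difficult = p_recall(0.3, tool_sensitivity=0.4)    # stricter threshold; CAD
                                                       # often misses difficult cancers
cad_easy = p_recall(0.3, tool_sensitivity=0.95)        # CAD rarely misses easy cancers

print(f"recall probability, unaided:             {unaided:.2f}")
print(f"recall probability, with CAD, difficult: {cad_difficult:.2f}")
print(f"recall probability, with CAD, easy:      {cad_easy:.2f}")
# Roughly 0.70, 0.58 and 0.96: no individual decision is "changed by a missing
# prompt", yet difficult cancers end up recalled less often once CAD is present.
```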

7 Implications for design

The decision system whose dependability we are considering is formed by the human reader plus an "alerting" tool (the CAD tool) – a common type of system in many medical and non-medical applications. The people involved in designing such systems include the designers of the computer tool proper, those responsible for the client-specific configuration of the tool, and those who devise procedures for its use and the training of the staff using and servicing it. What are the implications for these people of the possible subtle effects of computer use?

In this section we focus on the risk of FN failures: not raising an alarm when an alarm is required. For CAD, this means not recalling a patient with cancer. A false negative, in these decision systems, is usually a more severe failure than a false positive. Space precludes discussion of the major concern of achieving a low enough FN rate without an excessive FP rate – spurious alarms. However, there are many application scenarios where we can expect a degree of homeostasis in the FP rate: the human operator spontaneously limits the total rate of alarms to a level that is acceptable in terms of extra workload on the organisation. In these cases, success in reducing FN rates does not automatically translate into excessive FP rates.

It can be seen that the probability of incorrect decisions from a system like the one described is determined not only by the probability of failure of the CAD tool and of the reader separately, but also by how likely they are to fail together (for a mathematical treatment see [23]). In the experiments we described, there was indeed a high correlation between their failures. This may be due to two factors: the same mammograms that are hard to interpret correctly by readers are also hard for the tool to prompt correctly; or, incorrect prompting by the CAD tool makes readers more likely to err in their turn. The data from the HTA study suggest that both factors were present. Thus, multiple lines of attack are open to designers seeking to improve this alarm system. An obvious one is to improve either component – the tool or its user. But improvements in algorithms and/or in user training may be difficult and subject to a law of diminishing returns. They could also be comparatively ineffective, if for instance we improve a CAD tool's ability to prompt kinds of cancers that readers very seldom miss anyway. Other strategies, targeting the interaction and correlation between machine and human error, may be more attractive. We outline two general categories of such strategies, some of them already adopted by designers, others possibly more novel.
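Before describing these strategies, the correlated-failure point can be made concrete with a short simulation. This is our own sketch with invented numbers, not the model of [23]: it simply shows that when reader and tool failure probabilities both rise with case difficulty, joint failures occur noticeably more often than the product of the average failure rates would suggest.

```python
# Illustration (invented numbers, not the model of [23]): when reader and CAD
# failure probabilities are both driven by case difficulty, joint failures are
# more frequent than an assumption of independence would predict.
import numpy as np

rng = np.random.default_rng(1)
n_cases = 100_000

difficulty = rng.beta(2, 5, n_cases)           # latent per-case difficulty
p_reader_fn = 0.05 + 0.8 * difficulty**2       # reader FN probability grows with it
p_cad_fn = 0.02 + 0.9 * difficulty**2          # and so does the CAD FN probability

reader_fails = rng.random(n_cases) < p_reader_fn
cad_fails = rng.random(n_cases) < p_cad_fn

p_r, p_c = reader_fails.mean(), cad_fails.mean()
p_both = (reader_fails & cad_fails).mean()

print(f"P(reader FN)               = {p_r:.3f}")
print(f"P(CAD FN)                  = {p_c:.3f}")
print(f"P(both FN), observed       = {p_both:.3f}")
print(f"P(both FN), if independent = {p_r * p_c:.3f}")
# E[p_reader * p_cad] = E[p_reader] E[p_cad] + Cov(p_reader, p_cad): the covariance
# term is exactly what average failure rates alone cannot reveal.
```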

7.1 Make alerting tools more different from their users

The first strategy aims to reduce the covariance between the difficulty of a case for the alerting tool and for the unaided user: making the tool less likely to err in those situations in which the unaided user is more likely to err. This strategy is feasible, since the designers of an alerting tool have a degree of freedom in choosing the trade-off between its FP and FN rates. Especially if the tool uses a combination of algorithms for identifying situations that require an alert, tuning these multiple algorithms may allow some "targeting" of the peaks and troughs in FN and FP rates for different classes of "cases". An additional desirable effect of this strategy may be the scope for designers to reduce the overall FP rate of the tool: users find FPs annoying and they may increase cognitive load. For some applications, it may even be possible to tune the alerting tool for each individual user, tuning which could be under the control of the users themselves or achieved via automatic calibration routines.

The importance of the probability of common failure between human and alerting tool may have more general implications for the design philosophy for these tools. A design philosophy aiming at reproducing the outward behaviour of human experts may bring intrinsic limitations to the effectiveness of these tools, in that they will tend to help the user most reliably on those cases where the user needs less help. They will still be immune to fatigue and random lapses of attention, and thus serve the goal of making human performance more uniform, if these were important causes of human failure. But, even from this viewpoint, a tool design focused on helping users with cases where they are least effective (or most vulnerable to the effects of fatigue, for instance) could still be the more effective solution [23]. Even when designers lack information to identify these categories of cases, there may be some reward in building the tools on principles different from replicating human behaviour. The blind pursuit of diversity has proven effective, after all, in other areas of system design [24]. However, for humans to accept hints from such tools, it may be more important that the tool provides convincing explanations of its behaviour, lest users become accustomed to ignoring hints that they cannot make sense of (cf. the observation that CAD users feel the need to explain successes and errors of the CAD tool in terms of regular patterns of behaviour [8, 9]).

7.2 Reduce the effect of incorrect prompting on user error

The instructions for using the tool in our case study say that readers must not allow machine prompts (or lack thereof) to influence how they interpret features in the X-ray images. If readers comply, then incorrect prompting of a case could not produce a false negative, except on cases on which the reader failed to notice the symptoms to start with: computer support "could do no harm".

But can users actually comply with these guidelines? Much of the skill that experts utilise is at the level of non-explicit pattern recognition rules and heuristics. If experts slowly adapt to relying on the tool's prompts for advice, at least for some types of cases, they may not realise that they are doing so. At least two forms of protection can be pursued:

- make it easier for the user to obey the prescribed procedure, or more difficult to violate it. For instance, the user interface could allow users to mark all features they are going to consider in their decision, before showing the tool's prompts (a minimal sketch of such an interface follows this list). However, such measures may be cumbersome to enforce: CAD prompts are often relevant information that users may be under pressure to use in the way they find most efficient, especially if under heavy load;

- make the user more immune to being influenced in the wrong way by the tool. A form of protection is to remind users of the possibility of machine failure by giving frequent examples of incorrect prompting: in training (just as, e.g., pilot simulator training includes an unrealistically high rate of mechanical failure) or, even better, where socially and organisationally acceptable, in normal use.
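As a minimal sketch of the first measure, the hypothetical interface object below reveals CAD prompts only after the reader has committed their own findings. The class and method names are invented for illustration; nothing here is part of the R2 tool or any existing product.

```python
# Hypothetical sketch of an interface that enforces "commit your own findings
# before seeing the CAD prompts". Invented names; not an existing product's API.
class ReadingSession:
    def __init__(self, case_id, cad_prompts):
        self.case_id = case_id
        self._cad_prompts = cad_prompts      # regions marked by the CAD tool
        self.reader_marks = []               # regions marked by the reader
        self._committed = False

    def mark_feature(self, region):
        """Record a feature the reader considers diagnostically relevant."""
        if self._committed:
            raise RuntimeError("findings already committed; reopen the case to amend them")
        self.reader_marks.append(region)

    def commit_findings(self):
        """Freeze the reader's own interpretation of the mammogram."""
        self._committed = True

    def show_prompts(self):
        """CAD prompts become visible only after the reader has committed."""
        if not self._committed:
            raise RuntimeError("commit your own findings before viewing the CAD prompts")
        return list(self._cad_prompts)

# Usage: the order of the calls is enforced by the session object itself.
session = ReadingSession("case-042", cad_prompts=["region A, left CC view"])
session.mark_feature("calcification cluster, right MLO view")
session.commit_findings()
print(session.show_prompts())
```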

Several authors assume, both for cancer screening CAD and for advisory systems in general, that the phenomenon of incorrect prompting causing user error is limited to users who lack experience with the tool, and disappears with use. This seems a one-sided, over-optimistic statement, if not supported by decisive empirical evidence:

- experience will help users to recognise that the tool is fallible, but it will also teach them that, for instance, in many situations it is normally reliable. CAD tools are tuned to have a low FN rate at the cost of a high FP rate, so they are indeed quite reliable for the kinds of cancer signs they target;

- experience will teach them to recognise some situations in which the tool tends to fail, but those situations are likely to be ones in which the user is more reliable and so needs the tool's support less.

So, the issue is not whether experience or special training or other factors will have some mitigating effect on user error rates, but the sign and magnitude of the overall effect of all these factors, when combined with that of the detrimental factors. Another issue specific to our case study is whether it is right to insist that readers not be influenced by the tool not prompting a feature about which they were uncertain. After all, with highly sensitive alerting tools, absence of alerts is a good indicator that no alarm should be raised. As users appear to recognise and exploit this fact to reduce their workload, it might be appropriate to develop explicit guidelines about how to use this kind of indication.
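The claim that absence of alerts is informative can be quantified with a simple Bayes calculation. The sketch below is ours, and every figure in it (prevalence, tool sensitivity, probability of a prompt-free normal case) is invented for illustration rather than measured for any particular tool:

```python
# Bayes illustration (invented figures): how informative is the absence of a prompt?
def p_cancer_given_no_prompt(prevalence, tool_sensitivity, p_no_prompt_if_normal):
    p_no_prompt_if_cancer = 1 - tool_sensitivity
    numerator = prevalence * p_no_prompt_if_cancer
    denominator = numerator + (1 - prevalence) * p_no_prompt_if_normal
    return numerator / denominator

prior = 0.006          # assumed prevalence in routine screening
posterior = p_cancer_given_no_prompt(prevalence=prior,
                                     tool_sensitivity=0.90,       # assumed
                                     p_no_prompt_if_normal=0.30)  # assumed: most
                                                                  # normal cases get
                                                                  # some (false) prompt
print(f"P(cancer), prior                = {prior:.4f}")
print(f"P(cancer | no prompt)           = {posterior:.4f}")
print(f"likelihood ratio of 'no prompt' = {(1 - 0.90) / 0.30:.2f}")
# With a highly sensitive tool, 'no prompt' lowers the odds of cancer (here by a
# factor of about 3), which is why readers may be tempted to use it as evidence;
# the danger is that this evidence points the wrong way precisely on the cancers
# the tool misses.
```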

8 Conclusions

We have discussed the outcomes of a case study with a form of computer aided decision making, trying to elicit implications of general interest, in particular in the area of possible detrimental effects of a computer aid. One outcome is further evidence against the simplistic view that "computer support can do no harm". Some mechanisms that may make a computer aid harmful, like user complacency, are intuitively obvious. Others that we have hypothesised here are subtler. An important implication for designers is that they cannot ignore the possibility of such harmful effects; they should consider whether the overall effect of computer support can be made sufficiently positive. In cases of high uncertainty about these factors, they must consider whether effective precautions can be adopted in designing the whole decision system (not just the computer part of it), and how the effects of computer aids can be monitored in use.

Another known lesson that our findings support is that the problem in human-computer interaction goes well beyond designing the visible human-computer interfaces. Much codified knowledge about "human factors" is about the latter. But a computer may affect the dependability of the system of which it is part mostly through how it affects the mental processes of its users, quite independently of how friendly and "usable" it appears. In this case study, we would conjecture that design choices like the false positive/false negative trade-offs selected by designers for various categories of cases, and the way readers learn about the characteristics of the CAD tool, may be the determinant factors in the effects observed here.

From the methodological viewpoint, our experience confirms the usefulness of an interdisciplinary approach, exploiting the indications of diverse methods of investigation. Inspired by our experience modelling common-mode failure processes in diverse redundancy, we applied methods for exploratory data analysis that focused on the variation between behaviours of computers or people across the range of "cases". These methods revealed interesting patterns that would be missed by standard analyses focused on the average effect of using a computer tool. We must underscore that these methods are a source of useful conjectures about the mechanisms of action of computer support, which can then be subjected to independent confirmation, but they also provide indications of possible design problems while this confirmation is missing. Last but not least, recognising that the system of interest is the whole decision system, a composite human-machine, fault-tolerant system, rather than the computer part of it, is essential for a correct approach to its design.

Acknowledgements

The work described in this paper has been partly funded by the UK Engineering and Physical Sciences Research Council (EPSRC) through DIRC, the Dependability Interdisciplinary Research Collaboration. We thank: Mark Hartswood, Roger Slack and our DIRC colleagues at Edinburgh and Lancaster Universities for the ethnographic work mentioned here and for extensive discussions; R2 Technologies Inc. for their support in obtaining our test set; Paul Taylor and Jo Champness (from University College London) for granting us access to their data and facilitating the follow-up studies; and all the readers who volunteered for our studies.

References

[1] "Pre-market approval decision. Application P970058", US Food and Drug Administration, (1998).
[2] "Computer-aided diagnosis". Brit J Radiol. 78 (special issue), (2005).
[3] E. Alberdi, et al., "Effects of incorrect CAD output on human decision making in mammography". Acad Radiol. 11(8): pp. 909-918, (2004).
[4] E. Alberdi, et al., "The use of Computer Aided Detection tools in screening mammography: A multidisciplinary investigation". Brit J Radiol. 78: pp. 31-40, (2005).
[5] S.M. Astley and F.J. Gilbert, "Computer-aided detection in mammography". Clin Radiol. 59(5): pp. 390-399, (2004).
[6] M.T. Dzindolet, et al., "The role of trust in automation reliance". International Journal of Human-Computer Studies. 58(6): pp. 697-718, (2003).
[7] M. Hartswood, R. Procter, and L. Williams, "Prompting in mammography: Computer-aided detection or computer-aided diagnosis?", in Proceedings of Medical Image Understanding and Analysis, MIUA 98. 1998. Leeds, UK.
[8] M. Hartswood and R. Procter, "Computer-Aided Mammography: A Case Study of Error Management in a Skilled Decision-making Task". Journal of Topics in Health Information Management. 20(4): pp. 38-54, (2000).
[9] M. Hartswood, et al., "'Repairing' the Machine: A Case Study of the Evaluation of Computer-Aided Detection Tools in Breast Screening", in Eighth European Conference on Computer Supported Cooperative Work (ECSCW 2003). 2003. Helsinki, Finland.
[10] B. Littlewood, P. Popov, and L. Strigini, "Modelling software design diversity - a review". ACM Computing Surveys. 33(2): pp. 177-208, (2002).
[11] J. Meyer, "Effects of warning validity and proximity on responses to warnings". Hum Factors. 43: pp. 563-572, (2001).
[12] J. Meyer and Y. Bitan, "Why better operators receive worse warnings". Human Factors. 44(3): pp. 343-353, (2002).
[13] J. Meyer, L. Feinshreiber, and Y. Parmet, "Levels of automation in a simulated failure detection task", in IEEE International Conference on Systems, Man and Cybernetics. 2003. Washington DC, USA, pp. 2101-2106.
[14] J. Meyer, "Conceptual issues in the study of dynamic hazard warnings". Human Factors. 46(2): pp. 196-204, (2004).
[15] R. Parasuraman and C.A. Miller, "Trust and etiquette in high-criticality automated systems". Communications of the ACM. 47(4): pp. 51-55, (2004).
[16] R. Parasuraman and V. Riley, "Humans and automation: Use, misuse, disuse, abuse". Hum Factors. 39: pp. 230-253, (1997).
[17] A.A. Povyakalo, et al., "Evaluating 'Human + Advisory computer' system: A case study", in HCI2004, 18th British HCI Group Annual Conf. 2004. Leeds: British HCI Group, pp. 93-96.
[18] A.A. Povyakalo, et al., "Divergent effects of computer prompting on the sensitivity of mammogram readers", (in preparation).
[19] L.J. Skitka, K. Mosier, and M.D. Burdick, "Does automation bias decision making?". International Journal of Human-Computer Studies. 51(5): pp. 991-1006, (1999).
[20] L.J. Skitka, "Accountability and automation bias". Int J Hum-Comp Stud. 52: pp. 701-717, (2000).
[21] R. Sorkin and D. Woods, "Systems with human monitors: a signal detection analysis". Human-Computer Interaction. 1: pp. 49-75, (1985).
[22] L. Strigini, A.A. Povyakalo, and E. Alberdi, "Human-machine diversity in the use of computerised advisory systems: a case study", in 2003 Int. Conf. on Dependable Systems and Networks (DSN'03). 2003. San Francisco: IEEE.
[23] L. Strigini, A.A. Povyakalo, and E. Alberdi, "Human-machine diversity in the use of computerised advisory systems: a case study", in DSN 2003, International Conference on Dependable Systems and Networks. 2003. San Francisco, U.S.A., pp. 249-258.
[24] L. Strigini, "Fault Tolerance Against Design Faults", in Dependable Computing Systems: Paradigms, Performance Issues, and Applications, H. Diab and A. Zomaya, Editors, J. Wiley & Sons, (2005).
[25] P. Taylor, et al., "Assessing the impact of CAD on the sensitivity and specificity of film readers". Clin Radiol. 59(12): pp. 1099-1105, (2004).
[26] P. Taylor, et al., "Impact of computer-aided detection prompts on the sensitivity and specificity of screening mammography". Health Technology Assessment. 9(6), (2005).
[27] B. Zheng, et al., "Soft-Copy Mammographic Readings with Different Computer-assisted Detection Cuing Environments: Preliminary Findings". Radiology. 221(3): pp. 633-640, (2001).
