Statistical Comparison of ROC Curves from Multiple Readers

MARIJKE SWAVING, MSc, HANS VAN HOUWELINGEN, PhD, FENNO P. OTTES, PhD, TON STEERNEMAN, PhD

Receiver operating characteristic (ROC) analysis is the commonly accepted method for comparing diagnostic imaging systems. In general, ROC studies are designed in such a way that multiple readers read the same images and each image is presented by means of two different imaging systems. Statistical methods for the comparison of the ROC curves from one reader have been developed, but extension of these methods to multiple readers is not straightforward. A new method of analysis is presented for the comparison of ROC curves from multiple readers. This method includes a nonparametric estimation of the variances and covariances between the various areas under the curves. The method described is more appropriate than the paired t test, because it also takes the case-sample variation into account. Key words: ROC curves; ROC analysis; nonparametric estimation; area under the curve; comparison of ROC curves; EM algorithm. (Med Decis Making 1996;16:143-152)
Receiver operating characteristic (ROC) studies are frequently used to compare the diagnostic qualities of imaging systems.1 ROC studies can also be used to evaluate the performance of systems that support the diagnosis. In a conventional ROC study,1,2 several observers read all the selected images by means of each of the imaging systems that are to be compared. The diagnosis the observer makes depends on his or her confidence that the particular image shows an abnormality or a normal state and upon the confidence threshold he or she adopts. The diagnosis the observer makes can be correct or incorrect (table 1). For a given confidence threshold, the fraction of abnormal images that are correctly identified as abnormal is called the true-positive fraction (TPF = sensitivity) and the fraction of the normal images that are correctly identified is called the true-negative fraction (TNF = specificity). In the same way, the false-positive fraction (FPF) and the false-negative fraction (FNF) are defined. For the actually normal and the actually abnormal images, probability distributions can be derived for the various states of truth.
Table 1. The Types of Correct and Incorrect Diagnoses

                           Actual State
Diagnosis                  Positive                  Negative
Considered positive        True positive (TP)        False positive (FP)
Considered negative        False negative (FN)       True negative (TN)
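To make the definitions above concrete, the four fractions follow directly from the cell counts of table 1. The snippet below is a minimal illustrative sketch; the function and the counts are not from the paper.

```python
def diagnostic_fractions(tp, fp, fn, tn):
    """Fractions defined in the text, from the cell counts of table 1.

    tp and fn count the actually abnormal (positive) images;
    fp and tn count the actually normal (negative) images.
    """
    n_abnormal = tp + fn
    n_normal = fp + tn
    tpf = tp / n_abnormal   # true-positive fraction (sensitivity)
    tnf = tn / n_normal     # true-negative fraction (specificity)
    fpf = fp / n_normal     # false-positive fraction = 1 - TNF
    fnf = fn / n_abnormal   # false-negative fraction = 1 - TPF
    return tpf, tnf, fpf, fnf

# Hypothetical counts, for illustration only
print(diagnostic_fractions(tp=40, fp=10, fn=10, tn=40))  # (0.8, 0.8, 0.2, 0.2)
```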
Two such distributions are shown in figure 1. The horizontal axis represents the confidence the observer has that a particular image was taken from a patient who actually has an abnormal condition. The confidence threshold separates the abnormal diagnoses from the normal diagnoses. The inherent discriminatory capacity of a system depends on the extent to which the probability distributions of perceived evidence from the various states of truth are separated or overlap. The sensitivity and the specificity of a diagnostic imaging system depend upon the particular confidence threshold that the observer uses. According to this model, an image is read as positive if the observer's confidence in a positive diagnosis exceeds his or her confidence threshold. For a range of confidence thresholds, the ROC curve represents the link between the fraction of abnormal images that are correctly diagnosed as abnormal (TPF) and the fraction of normal images that are diagnosed as abnormal (FPF). Therefore, in an ROC study, an observer is asked to adopt several confidence thresholds at the same time, as shown in figure 2, and to classify each image into one of the categories denoting his or her confidence. The diagnostic quality of an imaging system can be represented by an ROC curve.
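In the rating method just described, each confidence threshold yields one empirical operating point (FPF, TPF). The sketch below shows this calculation for a five-category scale; the ratings are hypothetical and the code only illustrates the model, not the authors' procedure.

```python
import numpy as np

def operating_points(ratings_normal, ratings_abnormal, n_categories=5):
    """Empirical (FPF, TPF) operating points from confidence ratings.

    Ratings are categories 1..n_categories, with higher values meaning
    greater confidence that the image is abnormal. Calling an image
    positive when its rating is >= t gives one operating point per
    threshold t = n_categories, ..., 2.
    """
    x = np.asarray(ratings_normal)
    y = np.asarray(ratings_abnormal)
    points = [(0.0, 0.0)]                    # nothing called positive
    for t in range(n_categories, 1, -1):
        fpf = float(np.mean(x >= t))         # normal images called abnormal
        tpf = float(np.mean(y >= t))         # abnormal images called abnormal
        points.append((fpf, tpf))
    points.append((1.0, 1.0))                # everything called positive
    return points

# Hypothetical five-category ratings, for illustration only
print(operating_points([1, 1, 2, 2, 3, 1, 2, 4, 1, 3],
                       [3, 4, 5, 4, 5, 2, 5, 4, 3, 5]))
```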
Received June 22, 1994, from BAZIS, Experimental Developments, Leiden, The Netherlands (MS, FPO); the Department of Medical Statistics, Leiden University, Leiden, The Netherlands (JCVH); and the Department of Econometrics, University of Groningen, Groningen, The Netherlands (AGMS). Revision accepted for publication September 19, 1995. Dr. Steerneman was supported by the Stichting Verzekeringswetenschap and the C. R. Rao Foundation. Address correspondence and reprint requests to Dr. Ottes: HISCOM Research and Development, Schipholweg 97, 2316 XA Leiden, The Netherlands.
FIGURE 1. Probability densities of an observer's confidence in a positive diagnosis for a particular diagnostic task. (Horizontal axis: confidence in a positive decision, from less to more; the density for actually positive patients and one possible setting of the confidence threshold are labeled.)
The ROC curve can be estimated as a curve through the operating points, which are obtained from the classifications of the images by a reader. Usually it is assumed that the functional form of an ROC curve is given by two underlying normal distributions. The ROC curve denotes the discriminatory capacity of the imaging system between the normal and the abnormal images. A higher ROC curve indicates greater discriminatory capacity. To compare imaging systems, the area under the ROC curve is usually taken as an index of diagnostic quality. Two imaging systems yield the same diagnostic quality when the areas under their ROC curves are equal. If the same images are read and/or the same reader reads the images presented by both imaging systems, the ROC curves (and the areas beneath them) will be correlated. Metz et al.3 and Hanley and McNeil4 have developed several methods for the comparison of two correlated ROC curves. These methods, which are described below, cannot be generalized to multiple readers. A nonparametric approach to the comparison of the areas under two or more correlated ROC curves has been reported by DeLong et al.5 Until now, conclusions about the diagnostic qualities of imaging systems have been drawn by comparing ROC curves per reader, by comparing pooled ROC curves or the averaged areas under the curves and subsequently performing a test for two correlated ROC curves, or by using a paired t test to analyze the areas under the curves for all readers. Pooling6 is not a good alternative for handling correlated ROC curves, and a test for two correlated ROC curves is not designed for such pooled curves. The paired t test takes both between-reader variation and within-reader variation into account, but it does not take case-sample variation into account. Therefore, the need was felt for a more suitable method than the paired t test. We present a method that takes case-sample variation into account. We provide an example of eight correlated ROC curves derived from four readers' interpretations of mammograms, then discuss our proposed method.
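For reference, the nonparametric area estimate underlying the approach of DeLong et al.5 is the Mann-Whitney statistic: the proportion of all (normal, abnormal) image pairs in which the abnormal image receives the higher confidence rating, with ties counted as one half. The sketch below computes only this point estimate for hypothetical ratings; the variance-covariance estimation of DeLong et al. is not shown.

```python
import numpy as np

def nonparametric_auc(ratings_normal, ratings_abnormal):
    """Mann-Whitney estimate of the area under the ROC curve.

    Averages, over all (normal, abnormal) image pairs, 1 if the abnormal
    image is rated higher, 0.5 if the ratings are tied, and 0 otherwise.
    """
    x = np.asarray(ratings_normal, dtype=float)
    y = np.asarray(ratings_abnormal, dtype=float)
    higher = (y[:, None] > x[None, :]).mean()
    tied = (y[:, None] == x[None, :]).mean()
    return higher + 0.5 * tied

# Hypothetical ratings for one reader, for illustration only
print(nonparametric_auc([1, 1, 2, 2, 3, 1, 2, 4, 1, 3],
                        [3, 4, 5, 4, 5, 2, 5, 4, 3, 5]))
```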
FIGURE 2. The confidence-rating model, sketched for a five-category scale. (Probability densities for actually negative and actually positive patients, with confidence thresholds dividing the rating scale into five categories.)
Comparison of Correlated ROC Curves

Under the assumption of a bivariate normal ROC curve, the curve can be plotted as a straight line on double-normal-deviate paper. This straight line can be represented by means of two parameters, the intercept α and the slope β. The area under the ROC curve is equal to

    θ = Φ(α / √(1 + β²)),                                    (1)

where Φ denotes the standard normal cumulative distribution function. Since α and β determine the area under the curve, θ, we can substitute the estimates a and b for α and β, respectively. The statistic proposed by Metz3 tests the null hypothesis H0: θ1 = θ2.
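A minimal numeric sketch of equation (1) follows, assuming estimates a and b of the intercept and slope are available. The parameter values, variances, and covariance below are hypothetical, and the critical ratio shown in the comments is a generic z test of H0: θ1 = θ2, not necessarily the exact statistic of Metz.3

```python
import numpy as np
from scipy.stats import norm

def binormal_auc(a, b):
    """Equation (1): area under a binormal ROC curve with intercept a and
    slope b on double-normal-deviate axes, theta = Phi(a / sqrt(1 + b^2))."""
    return norm.cdf(a / np.sqrt(1.0 + b ** 2))

def binormal_roc(a, b, fpf):
    """The binormal ROC curve itself: TPF = Phi(a + b * Phi^{-1}(FPF))."""
    return norm.cdf(a + b * norm.ppf(fpf))

# Hypothetical parameter estimates for two correlated curves
a1, b1 = 1.2, 0.9
a2, b2 = 1.0, 1.0
theta1, theta2 = binormal_auc(a1, b1), binormal_auc(a2, b2)
print(binormal_roc(a1, b1, 0.10))   # TPF of curve 1 at FPF = 0.10

# A generic critical ratio for H0: theta1 = theta2,
# z = (theta1 - theta2) / sqrt(var1 + var2 - 2 * cov12);
# the variance and covariance values below are placeholders.
var1, var2, cov12 = 0.0009, 0.0010, 0.0004
z = (theta1 - theta2) / np.sqrt(var1 + var2 - 2.0 * cov12)
print(theta1, theta2, z, 2 * norm.sf(abs(z)))   # two-sided p-value
```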