Revisiting the ROC curve for diagnostic applications with an unbalanced class distribution

Christian O’Reilly & Tore Nielsen
Dream and Nightmare Laboratory, Center for Advanced Research in Sleep Medicine, Hôpital du Sacré-Coeur de Montréal - University of Montreal
[christian.oreilly, tore.nielsen]@umontreal.ca

1. Introduction
Table 3. Threshold-dependent statistics for the examples of Fig. 2.

Classifier   Acc    Sen    Spec   PPV    NPV    Pe     κ      F(βC=1)   MCC
A            1.00   0.00   1.00   NaN    1.00   1.00   0.00   NaN       NaN
B            0.99   0.99   0.99   0.09   1.00   0.99   0.16   0.17      0.30
Fig. 1. Confusion matrix.

                    Gold standard outcome
Test outcome        Positive               Negative                   Total
Positive            True positive (TP)     False positive (FP)        Bias P′
Negative            False negative (FN)    True negative (TN)         Negative bias N′
Total               Prevalence (P)         Negative prevalence (N)    Sample size
Table 2. Some more sophisticated threshold-dependent statistics.

Statistic                           Definition
Probability of random agreement     Pe = (P′·P + N′·N) / (P + N)²
Cohen’s kappa coefficient           κ = (Acc − Pe) / (1 − Pe)
F-measure                           FβC = (1 + βC²) · (PPV · Sen) / (βC² · PPV + Sen)
Matthews’ correlation coefficient   MCC = (TP·TN − FP·FN) / √(P′·P·N′·N)

Fig. 5. Example of a ROC curve complemented with information relative to the PPV, for a problem with a coefficient of asymmetry α = 25.2. The added iso-PPV lines are useful for rapidly evaluating the portion of the ROC curve over which an acceptable PPV can be obtained.
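As a numerical check, these definitions can be applied to classifier B (the "false detector") of Fig. 2; a minimal Python sketch, with counts taken from Fig. 2 and variable names chosen here for illustration:

```python
import math

# Confusion-matrix counts for classifier B ("false detector") of Fig. 2
TP, FP, FN, TN = 1000, 10_000, 10, 1_000_000
P, N = TP + FN, FP + TN      # prevalence (P) and negative prevalence (N)
Pp, Np = TP + FP, FN + TN    # bias P' and negative bias N'

acc = (TP + TN) / (P + N)
sen = TP / (TP + FN)
ppv = TP / (TP + FP)

# Probability of random agreement and Cohen's kappa
Pe = (Pp * P + Np * N) / (P + N) ** 2
kappa = (acc - Pe) / (1 - Pe)          # ≈ 0.165 (Table 3 reports 0.16)

# F-measure with beta_C = 1, i.e. the harmonic mean of PPV and sensitivity
f1 = 2 * ppv * sen / (ppv + sen)       # ≈ 0.17

# Matthews' correlation coefficient
mcc = (TP * TN - FP * FN) / math.sqrt(Pp * P * Np * N)   # ≈ 0.30
```

The low κ, F, and MCC values expose a classifier that the near-perfect accuracy, sensitivity, and specificity of Table 3 make look excellent.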
!  Often, only these statistics are reported in the literature. They are insufficient to give a correct evaluation in unbalanced problems.

+  Always report the PPV. Reporting the MCC is also good practice.
Fig. 4. ROC curves (A) and PT curves (B) of four real classifiers for the problem of sleep spindle detection, on a database with α = 25.2.
4. Threshold-independent analyses

The Receiver Operating Characteristic (ROC) curve links sensitivity and specificity as a function of the decision threshold (see Fig. 3). It is often used to perform threshold-independent analyses.
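A ROC curve is traced by sweeping the decision threshold over the classifier's scores; a minimal sketch of this procedure, with made-up scores and labels (not data from the poster):

```python
# Sketch: trace a ROC curve by sweeping the decision threshold.
# Scores and labels below are illustrative only.
scores = [0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]   # 1 = positive, 0 = negative

P = sum(labels)
N = len(labels) - P

points = []  # (1 - specificity, sensitivity), one point per threshold
for thr in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
    points.append((fp / N, tp / P))  # x = 1 - Spec, y = Sen

print(points)
```

Lowering the threshold moves the operating point from (0, 0) toward (1, 1); note that P and N enter only as normalizers, which is why the ROC curve itself is insensitive to class imbalance.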
Useful portions of the ROC curve are delimited by a straight line passing through the origin and having slope

    β = α · PPV⁻ / (1 − PPV⁻)

with
• α: the coefficient of asymmetry (α = N/P)
• PPV⁻: the minimal acceptable PPV

The β = 1 line corresponds to a random classifier.

!  With a large α, only a very small portion of the ROC curve is useful if we are not willing to accept a very small PPV (e.g., for α = 100 and PPV⁻ = 50%, only 0.5% of the ROC space is useful).
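The 0.5% figure quoted above can be reproduced directly: the useful region is the part of the unit ROC square lying above the line of slope β. A small sketch, assuming only the β-line geometry described above (the function name is mine):

```python
# Fraction of the unit ROC square above the line sen = beta * (1 - spec),
# i.e. the region where PPV >= PPV_minus.
def useful_roc_fraction(alpha, ppv_minus):
    beta = alpha * ppv_minus / (1 - ppv_minus)
    if beta <= 1:
        # Line exits through the right edge: area above = 1 - beta/2
        return 1 - beta / 2
    # Line exits through the top edge at x = 1/beta: triangle of area 1/(2*beta)
    return 1 / (2 * beta)

print(useful_roc_fraction(100, 0.5))   # 0.005, i.e. 0.5% of the ROC space
```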
5. Conclusion

Precautions are necessary when assessing classifiers on unbalanced datasets. Some important considerations are:
1. Reporting accuracy, specificity, and sensitivity is not sufficient. Ideally, these statistics should be complemented by the PPV. At the very minimum, PPV and sensitivity should be reported. It may also be useful to report the MCC, given the robustness of this measure with respect to a problem’s asymmetry.
Fig. 3. An example ROC curve. Also shown are some values of β and the associated regions of valid ROC space.
The 8th International Workshop on Systems, Signal Processing and their Applications, 12-15 May 2013, Zeralda, Algeria.
Table 1. Some elementary threshold-dependent statistics.

Statistic                   Definition
Accuracy                    Acc = (TN + TP) / (P + N)
Sensitivity                 Sen = TP / (TP + FN)
Specificity                 Spec = TN / (FP + TN)
Positive predictive value   PPV = TP / (FP + TP)
Negative predictive value   NPV = TN / (FN + TN)
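These definitions are one-liners in code; a sketch evaluating them on classifier B of Fig. 2, which shows how PPV catches what accuracy, sensitivity, and specificity miss:

```python
# Elementary statistics of Table 1, applied to classifier B of Fig. 2.
TP, FP, FN, TN = 1000, 10_000, 10, 1_000_000

acc  = (TN + TP) / (TP + FN + FP + TN)   # ≈ 0.99
sen  = TP / (TP + FN)                    # ≈ 0.99
spec = TN / (FP + TN)                    # ≈ 0.99
ppv  = TP / (FP + TP)                    # ≈ 0.09: only 9% of detections are real
npv  = TN / (FN + TN)                    # ≈ 1.00

print(round(acc, 2), round(sen, 2), round(spec, 2), round(ppv, 2), round(npv, 2))
```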
Fig. 2. Examples of problematic classifiers.

A: Trivial classifier
                       Gold standard outcome
Test outcome           Positive     Negative         Total
Positive               0 (TP)       0 (FP)           0
Negative               1010 (FN)    1 010 000 (TN)   1 011 010
Total                  1010 (P)     1 010 000 (N)    1 011 010

B: False detector
                       Gold standard outcome
Test outcome           Positive     Negative         Total
Positive               1000 (TP)    10 000 (FP)      11 000
Negative               10 (FN)      1 000 000 (TN)   1 000 010
Total                  1010 (P)     1 010 000 (N)    1 011 010
2. Threshold-dependent analyses

Most statistics used to evaluate two-class classifiers can be expressed using concepts related to the confusion matrix (see Fig. 1). Some elementary statistics derived from this matrix are shown in Table 1, whereas Table 2 gives some more sophisticated ones. These statistics are qualified as threshold-dependent since they usually depend (explicitly or implicitly) on a classification decision threshold.
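The four cells of the confusion matrix in Fig. 1 are simple tallies over paired gold-standard and test outcomes; a minimal sketch with illustrative data (not from the poster):

```python
# Tally the four cells of the confusion matrix (Fig. 1) from
# gold-standard labels and test outcomes (illustrative data).
gold = [1, 1, 0, 0, 1, 0, 0, 0]
test = [1, 0, 0, 1, 1, 0, 0, 0]

tp = sum(1 for g, t in zip(gold, test) if g == 1 and t == 1)
fp = sum(1 for g, t in zip(gold, test) if g == 0 and t == 1)
fn = sum(1 for g, t in zip(gold, test) if g == 1 and t == 0)
tn = sum(1 for g, t in zip(gold, test) if g == 0 and t == 0)

print(tp, fp, fn, tn)  # 2 1 1 4
```

Every statistic in Tables 1 and 2 is a function of these four counts, which is why they all depend on the decision threshold that produced the test outcomes.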
4.1 PT curves

4.2 Iso-PPV lines
Table 3 shows the inadequacy of reporting only accuracy, sensitivity, and specificity when dealing with an unbalanced dataset.
Caution is needed when dealing with such unbalanced class sizes. However, the scientific literature shows that the necessary precautions are often ignored. This poster points out some pitfalls (marked !) and proposes some remedies (marked +).
Two-class classification problems with highly unbalanced class distributions are frequent in applications such as:
• Diagnosis of rare diseases
• Sleep spindle detection
• EEG seizure detection
• Fraud detection
• Intrusion detection

3. Impact of unbalanced class distribution

Fig. 2 shows two particular examples of problematic classifiers.