
How Many Cases Need to Be Reviewed to Compare Performance in Surgical Pathology?

Andrew A. Renshaw, MD,1 Mary L. Young, MS,2 and Michael R. Jiroutek, MS2

Key Words: Anatomic pathology; Surgical pathology; Cytopathology; Accuracy; Diagnosis; Second opinion; Sensitivity; Quality assurance; Continuous quality improvement; Blinded review

DOI: 10.1309/QYYB3K0BHPCEGQG3

Abstract

Recent studies have shown increased interest in measuring error rates in surgical pathology. We sought to determine how many surgical pathology cases need to be reviewed to show a significant difference from published error rates for review of routine or biopsy cases. Results of 4 series with this type of diagnostic material, involving a total of 11,683 cases, were reviewed to determine the range of published false-negative, false-positive, typing error, threshold error, and clinically significant error rates. Error rates ranged from 0.00% to 2.36%; clinically significant error rates ranged from 0.34% to 1.19%. Assuming a power of 0.80 and a 1-sided alpha of 0.05, the number of cases that would need to be reviewed to show that a laboratory with either twice or one half the published error rate differed significantly from the range of published error rates varied from 330 to 50,158. For clinically significant errors, the number of cases varied from 665 to 5,886. Because the published error rates are low, a relatively large number of cases need to be reviewed and a relatively great difference in error rate needs to exist to show a significant difference in performance in surgical pathology.

Diagnostic accuracy is crucial in surgical pathology. Studies that have measured diagnostic accuracy have generated widely divergent results.1-11 There are 3 major causes for these differences. The first is case selection. Many studies have focused on a very select and biased population of cases, such as those seen in consultation for second opinion. Second is the method used to measure errors. Very few studies define the performance of the review method, a feature that has been shown to be critical in defining error in gynecologic cytology.12-16 For example, reviewing a difficult Papanicolaou smear with the knowledge that the patient has cancer is different from reviewing that case in a routine manner without this knowledge. In general, the screening error rate for nonblinded review with knowledge of a poor outcome is lower than in routine review, and the threshold for diagnosing an abnormality also may be lower. Third is the significant level of interobserver disagreement, which makes it difficult to define error as distinct from disagreement. This is a particular problem in surgical pathology, where many diagnoses do not have a separate "gold standard" with which to compare the results.

Despite these limitations, there is now a body of literature based on the review of relatively unselected material (ie, consecutive cases or consecutive types of cases, such as biopsies only). We were interested in comparing the performance of our laboratory with that in the published literature. As a first step, we sought to determine approximately how many cases would need to be reviewed to show a significant difference from the results in the literature.


Materials and Methods

We reviewed 4 studies of either consecutive general surgical pathology cases or biopsy cases only ❚Table 1❚. The errors were defined and classified as done previously.17 In brief, a false-negative error is a diagnosis of a lesion made on second review that was not made on the first review. A false-positive error is a diagnosis of a lesion that the reviewer thought was not present. An error of threshold refers to a difference of opinion, such as the difference between atypical ductal hyperplasia and ductal carcinoma in situ, that, while often clinically significant, does not reflect an error in identification of the lesion but rather an assessment of the degree of abnormality present. Differences of type and grade refer to differences in tumor type and grading. Clinically significant errors were defined as those that might have an impact on patient care or prognosis.

Unweighted averages of the false-negative rate (FNR), false-positive rate (FPR), diagnostic typing error rates, diagnostic threshold error rates, and potentially clinically significant error rates were calculated from these studies. A chi-square test was used to determine whether the error rates among studies were significantly different. All sample size calculations were done with a 1-sided alpha of 0.05. Power (the probability of rejecting the null hypothesis when the alternative is true) was set at 80%. nQuery software (Statistical Solutions, Saugus, MA) was used to calculate the sample sizes required to show a significant difference between assumed laboratory rates and the published rates, using the 1-sample chi-square test that a proportion equals a user-specified value (normal approximation).
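The exact nQuery procedure is not reproduced in the text; the sketch below uses the standard normal-approximation sample-size formula for a 1-sided, 1-sample test of a proportion, which reproduces the figures in Table 2 to within a case or two. The function name and the SciPy dependency are illustrative choices, not part of the original study.

```python
from math import ceil, sqrt

from scipy.stats import norm  # normal quantiles


def one_sample_proportion_n(p0, p1, alpha=0.05, power=0.80):
    """Cases needed for a 1-sided, 1-sample test (normal approximation)
    that a laboratory error rate p1 differs from a published rate p0.

    Assumed formula:
        n = [z_(1-alpha)*sqrt(p0*(1-p0)) + z_(power)*sqrt(p1*(1-p1))]^2 / (p1 - p0)^2
    """
    z_alpha = norm.ppf(1 - alpha)  # about 1.645 for a 1-sided alpha of 0.05
    z_beta = norm.ppf(power)       # about 0.842 for 80% power
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


if __name__ == "__main__":
    # Highest published clinically significant error rate (1.19%) versus a
    # laboratory rate twice as high (2.38%) -> 665 cases, as in Table 2.
    print(one_sample_proportion_n(0.0119, 0.0238))
```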

Results

The distribution of errors in the 4 studies is detailed in Table 1. Mean (SD) error rates were 0.38% (0.22%) for the FNR, 0.10% (0.12%) for the FPR, 0.19% (0.15%) for typing errors, 0.74% (1.09%) for threshold errors, and 0.70% (0.45%) for clinically significant errors. However, a chi-square test used to determine whether the error rates among the 4 studies were similar showed that the rates for each type of error were significantly different (P < .02 for all). Therefore, the power calculations were performed using the range of error rates in these studies rather than the mean.

❚Table 2❚ shows the number of reviewed cases required to show that an assumed laboratory error rate of half or twice the published range is significantly better or worse than the range of published rates for the FNR, FPR, typing errors, threshold errors, and clinically significant errors. These numbers ranged from 330 to 50,158. For clinically significant errors, the number of cases ranged from 665 to 5,886.
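As an illustration of the kind of chi-square comparison described above, a sketch follows. The per-study error counts are not given in this extract, so they are reconstructed here by rounding rate x N from Table 1 and are therefore approximate; the use of SciPy is likewise an illustrative choice.

```python
from scipy.stats import chi2_contingency

# False-negative errors, reconstructed from Table 1 (rate x N, rounded):
# Whitehead et al, Safrin and Bark, Lind et al, Renshaw.
errors = [9, 6, 16, 3]
totals = [3000, 5397, 2694, 592]
non_errors = [n - e for n, e in zip(totals, errors)]

# 2 x 4 contingency table: rows = error / no error, columns = studies.
chi2, p, dof, expected = chi2_contingency([errors, non_errors])
print(f"chi2 = {chi2:.1f}, df = {dof}, P = {p:.4f}")
# P falls well below .05, consistent with the reported P < .02 for each error type.
```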

Discussion

The purpose of this study was simple. We sought to determine how many cases must be reviewed in a laboratory to show a significant increase or decrease in diagnostic accuracy compared with the published literature. Although we began our study examining much smaller differences in error rates, it quickly became clear that because the published error rates are so low, showing a significant difference would require review of a very large number of cases. The differences of one half and twice the published data that we tabulated are admittedly arbitrary, but they give the reader a sense of the level of effort that would be needed to show significant differences in laboratory performance.

For example, in the experience of one of us (A.A.R.), it is common for many laboratories to review 2% of their surgical pathology material for quality assurance purposes. However, this simply may not be enough to detect significant differences in most smaller laboratories. Assuming a laboratory had a clinically significant error rate of 1.40% (twice the "mean" of the published literature), reviewing 2% of cases would generate enough data in 1 year to detect this difference only if the laboratory examined at least 56,850 cases a year (1,137 cases needed). With the data we present herein, one can tailor the level of reviewing effort in the laboratory to the difference in error rates one wants to detect.
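As a rough check on the arithmetic above, the snippet below reuses the illustrative one_sample_proportion_n helper sketched in the Materials and Methods section, assuming a published "mean" clinically significant rate of 0.70% and a laboratory rate of twice that.

```python
# Assumed rates: 0.70% published mean, 1.40% (twice that) in the laboratory.
n_needed = one_sample_proportion_n(0.0070, 0.0140)  # about 1,137 cases
annual_volume = n_needed / 0.02                      # volume covered by a 2% review
print(n_needed, int(annual_volume))                  # 1137 56850
```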

❚Table 1❚ Published Error Rates by Type

                                          Error Rates (%)
Study             Case Type    N      False-Negative  False-Positive  Typing       Threshold    Clinically Significant
Whitehead et al2  Consecutive  3,000  0.30            0.27            0.40         0.23         0.97
Safrin and Bark1  Consecutive  5,397  0.11            0.04            0.04         0.04         0.30
Lind et al3       Biopsy only  2,694  0.59            0.11            0.15         0.33         1.19
Renshaw17         Biopsy only    592  0.51            0.0             0.17         2.36         0.34
Mean (SD)                             0.38 (0.22)     0.10 (0.12)     0.19 (0.15)  0.74 (1.09)  0.70 (0.45)


❚Table 2❚ Minimum No. of Cases Required to Show a Laboratory Error Rate (LER) Is Significantly Better or Worse Than the Lowest or Highest Error Rate From Table 1*

                                  If LER Half of Published Rate   If LER Twice the Published Rate
Error Rates (%)                   LER     No. of Cases            LER     No. of Cases
Lowest
  False-negative: 0.11            0.06    22,580                  0.22    7,296
  False-positive: 0.0             0.0     —                       0.0     —
  Typing: 0.04                    0.02    50,158                  0.08    20,083
  Threshold: 0.04                 0.02    50,158                  0.08    20,083
  Clinically significant: 0.34    0.17    5,886                   0.68    2,353
Highest
  False-negative: 0.59            0.30    3,518                   1.18    1,351
  False-positive: 0.27            0.14    8,076                   0.54    2,966
  Typing: 0.40                    0.20    5,001                   0.80    1,999
  Threshold: 2.36                 1.18    834                     4.72    330
  Clinically significant: 1.19    0.60    1,702                   2.38    665

* Power = 0.80; 1-sided alpha = 0.05.

It is noteworthy that the data in this report are expressed in relation to the number of cases reviewed rather than disease prevalence, which generally is a better standard. Indeed, the FNR and FPR we document herein are not the same as 1 minus sensitivity and 1 minus specificity in most other studies. Unfortunately, in most of the series summarized herein, disease prevalence is not reported. Nevertheless, although the data are limited, it is interesting to note that disease prevalence does not seem to have as consistent an effect on these data as one might expect. Specifically, since disease or abnormality prevalence is greater in biopsy material than in consecutive cases, one would expect a higher error rate in biopsy material when the error rate is expressed in relation to the number of cases rather than to disease prevalence. However, no consistent trend in this regard was found. There are several possible explanations, including differences in case selection, bias in the review, and differences in diagnostic agreement. In our opinion, however, the most likely reason is that the number of studies to date is too limited to show a consistent trend.

It is also worthwhile to compare the results of the present study with those in other fields. In gynecologic cytology, it is estimated that valid comparisons of differences in error rates as low as 10% can be achieved by review of only 1,500 to 1,700 cases.16 The difference between gynecologic cytology and surgical pathology is due largely to the much higher error rate in gynecologic cytology specimens. Other authors have compared the error rate of the airline industry with the error rate in surgical pathology.18 These authors note that the error rate in many fields of human endeavor is on the order of 10,000 parts per million (ppm), whereas the airline fatality rate is on the order of 0.43 ppm, a far lower figure. The mean error rates summarized in the present study range from 1,000 to 7,400 ppm. Certainly these numbers are higher than those in the airline industry, and no one would argue against obtaining the lowest error rate possible within the time and financial constraints of the system. But is the comparison between airline fatality and error rates in surgical pathology, as just described, fair? As the authors of the airline comparison note, there are other components to service in the airline industry (such as baggage handling), and in these areas the airlines are no better than other fields (including surgical pathology) in terms of error rate (4,000 ppm). Fatality represents the most extreme and presumably smallest error rate one can generate in the airline industry. Certainly not all clinically significant errors in surgical pathology result in death, and direct comparison of these numbers may not be entirely appropriate. Indeed, the fatality rate for cervical cancer for women in the United States is only approximately 41 ppm,19 which is much closer to the airline standard.

Because the published error rates are very low, a relatively large number of cases need to be reviewed and a relatively great difference in error rate needs to exist to detect a significant difference in performance in surgical pathology. Laboratories that want to show a statistically significant difference between their results and those in the literature can use the data in Table 2 to tailor their level of reviewing effort to the differences in error rates they want to detect.

From the Departments of 1Pathology, Baptist Hospital of Miami, Miami, FL; and 2Biostatistics, University of North Carolina, Chapel Hill.

Address reprint requests to Dr Renshaw: Dept of Pathology, Baptist Hospital of Miami, 8900 N Kendall Dr, Miami, FL 33176.


References

1. Safrin RE, Bark CJ. Surgical pathology signout: routine review of every case by a second pathologist. Am J Surg Pathol. 1993;17:1190-1192.
2. Whitehead ME, Fitzwater JE, Lindley SK, et al. Quality assurance of histopathology diagnoses: a prospective audit of three thousand cases. Am J Clin Pathol. 1984;81:487-491.
3. Lind AC, Bewtra C, Healy JC, et al. Prospective peer review of surgical pathology. Am J Clin Pathol. 1995;104:560-566.
4. Abt AB, Abt LG, Olt GJ. The effect of interinstitution anatomic pathology consultation on patient care. Arch Pathol Lab Med. 1995;119:514-517.
5. Epstein JI, Walsh PC, Sanfilippo F. Clinical and cost impact of second-opinion pathology: review of prostate biopsies prior to radical prostatectomy. Am J Surg Pathol. 1996;20:851-857.
6. Kronz JD, Westra WH, Epstein JI. Mandatory second opinion surgical pathology at a large referral hospital. Cancer. 1999;86:2426-2438.
7. Jacques SM, Qureshi F, Munkarah A, et al. Value of second opinion pathology review of endometrial cancer diagnosed on uterine curettings and biopsies [abstract]. Mod Pathol. 1997;10:103A.
8. Bruner JM, Inouye L, Fuller GN, et al. Diagnostic discrepancies and their clinical impact in a neuropathology referral practice. Cancer. 1997;79:796-803.
9. Scott CB, Nelson JS, Farnan NC, et al. Central pathology review in clinical trials for patients with malignant glioma. Cancer. 1995;76:307-313.
10. Aldape K, Simmons ML, Davis RL, et al. Discrepancies in diagnosis of neuroepithelial neoplasms: the San Francisco Bay Area Adult Glioma Study. Cancer. 2000;88:2342-2349.
11. Hahm GK, Niemann TH, Lucas JG, et al. The value of second opinion in gastrointestinal and liver pathology. Arch Pathol Lab Med. 2001;125:736-739.
12. Renshaw AA. Analysis of error in calculating the false negative rate for interpretation of cervicovaginal smears: the need to review abnormal cases. Cancer. 1997;81:264-271.
13. Renshaw AA, DiNisco SA, Minter LJ, et al. A more accurate measure of the false negative rate of Pap smear screening is obtained by determining the false negative rate of the rescreening process. Cancer. 1997;81:272-276.
14. Renshaw AA. A practical problem with calculating the false negative rate of Papanicolaou smear interpretation by rescreening negative cases alone. Cancer. 1999;87:351-353.
15. Renshaw AA, Lezon KM, Wilbur DC. The human false negative rate of rescreening in a two arm prospective clinical trial. Cancer. 2001;93:106-110.
16. Renshaw AA. An accurate and precise methodology for routine determination of the false negative rate of Pap smear screening. Cancer. 2001;93:86-92.
17. Renshaw AA, Pinnar NE, Jiroutek MR, et al. Blinded review as a method for quality improvement in surgical pathology. Arch Pathol Lab Med. 2002;126:961-963.
18. Nevalainen D, Berte L, Kraft C, et al. Evaluating laboratory performance on quality indicators with the six sigma scale. Arch Pathol Lab Med. 2000;124:516-519.
19. Jemal A, Thomas A, Murray T, et al. Cancer statistics, 2002. CA Cancer J Clin. 2002;52:23-47.
