Eur Radiol (2008) 18: 1134–1143 DOI 10.1007/s00330-008-0878-0
BREAST
Per Skaane Felix Diekmann Corinne Balleyguier Susanne Diekmann Jean-Charles Piguet Kari Young Michael Abdelnoor Loren Niklason
Observer variability in screen-film mammography versus full-field digital mammography with soft-copy reading
Received: 19 September 2007 Revised: 20 November 2007 Accepted: 24 December 2007 Published online: 27 February 2008 © European Society of Radiology 2008
M. Abdelnoor Center for Clinical Research, Section of Epidemiology and Biostatistics, Ullevaal University Hospital, Oslo, Norway L. Niklason Hologic Inc., Hillsborough, NC, USA
Presented at ECR, Wien 2006. P. Skaane (*) . K. Young Department of Radiology, Breast Imaging Center, Ullevaal University Hospital, Kirkeveien 166, 0407 Oslo, Norway e-mail:
[email protected] Tel.: +47-22-119411 Fax: +47-23-016535 F. Diekmann . S. Diekmann Department of Diagnostic Radiology, University Charite, Berlin, Germany C. Balleyguier Institut Gustave Roussy, Villejuif, France J.-C. Piguet Institut Imagerive, Geneva, Switzerland
Abstract Full-field digital mammography (FFDM) with soft-copy reading is more complex than screen-film mammography (SFM) with hard-copy reading. The aim of this study was to compare inter- and intraobserver variability in SFM versus FFDM of paired mammograms from a breast cancer screening program. Six radiologists interpreted mammograms of 232 cases obtained with both techniques, including 46 cancers, 88 benign lesions, and 98 normals. Image interpretation included BI-RADS categories. A case consisted of standard two-view mammograms of one breast. Images were scored in two sessions separated by 5 weeks. Observer variability was substantial for SFM as well as for FFDM, but overall there was no significant difference between the observer variability at SFM and FFDM. Mean kappa values were lower, indicating less agreement, for microcalcifications compared with masses. The lower observer agreement for microcalcifications, and especially the low intraobserver concordance between the two imaging techniques for three readers, was noticeable. The level of observer agreement might be an indicator of radiologist performance and could confound studies designed to separate diagnostic differences between the two imaging techniques. The results of our study confirm the need for proper training for radiologists starting FFDM with soft-copy reading in breast cancer screening.
Keywords Breast neoplasms · Radiography · Breast radiography · Comparative studies · Cancer screening · Full-field digital mammography · Interobserver variation
Introduction The success of screening mammography depends upon the detection of small preclinical cancers. The depiction of fine microcalcifications and subtle soft tissue masses and densities on high quality mammography is key to the detection of these early breast cancers. Conventional screen-film mammography (SFM) has some distinct advantages, including hard-copy images conveniently displayed on motorized alternators in a rather simple way. Full-field digital mammography (FFDM) offers several advantages in mammography screening [1]. The true flexibility and benefit of digital technology is primarily realized in a soft-copy display of the images and consequently in soft-copy reading. Experimental clinical studies comparing SFM with FFDM hard-copy [2–5] or
soft-copy [6] reading have demonstrated comparable or slightly better results for FFDM regarding lesion detection and characterization. Interobserver variability is a serious problem in breast imaging, and radiologists differ substantially in their interpretation of mammograms [7–10]. Differences in mammogram interpretations among radiologists can influence the number of detected cancers and, consequently, the effect of screening mammography on breast cancer mortality. Inter- and intraobserver variability using the Breast Imaging Reporting and Data System (BI-RADS) is substantial for both mammographic feature analysis and final assessment recommendations [11]. The challenge of interobserver variation is best demonstrated in the relatively low agreement on assessment, even when the BI-RADS categories are dichotomized as a binary outcome [11]. FFDM soft-copy reading is more complex than SFM hard-copy reading. Radiologists may differ in their use of FFDM postprocessing, both in the extent of “roam-and-zoom” and in which steps they include in their soft-copy reading. Little is known about observer variability in FFDM soft-copy reading versus SFM hard-copy reading. The purpose of this study was to analyze observer variability in SFM and FFDM of paired mammograms from a population-based breast cancer screening program.
Materials and methods Study population All cases (one case=one breast) included two standard views [craniocaudal (CC) and mediolateral oblique (MLO)] of the breast, and were selected from the Norwegian Breast Cancer Screening Program (NBCSP). Most cases were obtained from the Oslo I study, a paired study in which the women underwent a two-view examination of each breast with SFM and FFDM on the same day [12]. Some cases were collected from the Oslo II study [13], and for these cases the time interval between the examinations with SFM and FFDM was less than three weeks. All women were informed about the study beforehand, and their participation was voluntary. Each woman enrolled in the two studies signed a written consent form, and both studies were approved by the Regional Ethical Committee. A total of 250 cases having two standard views acquired with both SFM and FFDM were selected from the database by a nonradiologist. The selected study material included initially 100 normal cases, 100 abnormal but benign cases, and 50 cancers. All 100 normal cases were randomly collected from the Oslo I study, which included independent double reading for SFM as well as FFDM. The inclusion criteria for a normal examination were that all four radiologists had
scored the examination as normal (category “1” on the five-point rating scale for probability of cancer used in the NBCSP), that the woman did not present with interval cancer, and that the case was scored as normal in the subsequent screening round. Among the 100 normal cases selected by the nonradiologist, two women with large breasts and large format images (24 by 30 cm) on SFM and more than the four standard images on FFDM were excluded from analysis. Of the 100 benign cases, 85 cases were selected from the Oslo I and 15 cases from the Oslo II study. A case was defined as benign if call-back assessment confirmed a benign abnormality, and the woman did not present with an interval cancer or with a cancer in the subsequent screening round. Twelve cases among the selected 100 benign lesions were excluded from further analysis since no abnormality was confirmed on diagnostic work-up and the abnormality suspected on screening mammograms proved to be superimposed glandular tissue rather than a true abnormality. Fifty screen-detected cancers, including both ductal carcinoma in-situ (DCIS) and invasive cancers, were randomly selected from the Oslo I study (n=27) or the Oslo II study (n=23). All cancers were confirmed by histology. The histological type of the cancers, the size of the cancers, and the breast parenchyma density were not known by the person who randomly selected the cases from the files. Four cancers were excluded from analysis: one cancer proved to be mammographically occult; one cancer proved to be a positioning failure at FFDM; and two cancers proved to have latero-medial views instead of MLO at diagnostic work-up so that comparison was inappropriate. Thus, the final study population consisted of 232 cases: 98 normals (mean age 56.2 years, range 49–67 years), 88 benign cases (mean age 56.4 years, range 45–68 years), and 46 cancers (mean age 59.2 years, range 51–70 years). Eighteen breasts were categorized as BI-RADS density 1 and 12 cases as BI-RADS density 4.
The distribution of the four density groups among the 232 cases has been described in a previous study [14]. Imaging SFM examinations were acquired on a Mammomat 300 (Siemens, Erlangen, Germany) with Kodak Min-R 2000 film and Min-R 2190 screens (Kodak Health Imaging, Rochester, N.Y.), using molybdenum target and a molybdenum filter at 29 kV. The physicists selected 29 kV in accordance with the recommendations of the NBCSP, and the SFM images had optical density in compliance with the NBCSP requirements [15]. FFDM images were acquired on a Senographe 2000D (GE Medical Systems, Milwaukee, Wis.), which uses a CsI-amorphous silicon detector. The unit is equipped with an AOP (automatic optimization of parameters), in which
anode-filter combination, kV and mAs are selected automatically after analysis of a short pre-exposure. Image interpretation Six radiologists (A–F) from four European countries participated in the study. The readers’ experience in SFM varied from 4 to 24 years and in FFDM soft-copy reading from 2 to 4 years. The number of screening mammograms interpreted by each radiologist in their own practice varied from 2,500 to 12,000 per year. Soft-copy reading was carried out in a darkened room, and a darkened room with high luminance view boxes was used for hard-copy SFM. The reading conditions were identical for all readers. The mammograms were made anonymous. No clinical information was available. The radiologists scored the images in two sessions separated by 5 weeks such that the same case was not seen twice in any session. Each session included six interpretation “rounds” alternating between SFM and FFDM images with a time limit of 60 min for about 40 cases. A magnifying glass was offered for SFM interpretation. FFDM images were interpreted using soft-copy reading on the GE review workstation, which included two high-resolution 2K×2.5K monitors and a dedicated keypad. Postprocessing (window-level adjustments, zooming, inversion) of the images was optional but recommended. The readers marked the localization of an abnormality (if present) on a sheet. For cases in which more than one lesion was suspected, the abnormality with the highest suspicion was considered. BI-RADS scores of 1–5 were given for all cases: category 1=negative finding; category 2=benign finding with no mammographic evidence of malignancy; category 3=probably benign finding and short-term follow-up would have been suggested in daily practice; category 4=suspicious abnormality with biopsy being considered; category 5=highly suggestive of malignancy. BI-RADS category 0 was omitted. Breast parenchyma density for each case was retrospectively determined by two readers (P.S. and K.Y.)
using the BI-RADS classification (category 1=fatty; category 2= scattered dense; category 3=heterogeneously dense; category 4=extremely dense). Statistical analysis Observer variability was evaluated using observed agreement and Cohen’s kappa statistic, including linear and quadratic weighting [16]. A preliminary power analysis was performed to estimate sample size required to determine whether a kappa of a given magnitude is significantly higher than 0.6. This was done taking into consideration the proportion of “Yes” observations of the two observers, the type 1 error and the power [17]. The
sample size was estimated so as to determine whether the lower 95% confidence limit of a given kappa exceeds 0.6. For our study we considered a power of 80%, a type 1 error of 5%, and an expected value of kappa of 0.8 with a frequency of “Yes” of 35%. We would then need a sample size of 138 patients. Weighted kappa statistic was used for the ordinal data [18]. We used quadratic weighting for the five-point BI-RADS scale. Additionally, we applied linear weighting for a three-level scale of the collapsed five-point BI-RADS scale for evaluation of concordance between the two imaging modalities for each reader. For this evaluation, the scores 1 and 2 were grouped together and the scores 4 and 5 were grouped together since their difference has little consequence on decision-making in daily practice. A kappa value of 0.81 or higher was taken to indicate very good observer agreement. Kappa values were compared using the U-test. Two-by-two table analysis was used for comparing SFM and FFDM interpretations based on BI-RADS categories for all lesions and subgroups of the study population. For binary outcome with two-by-two table analysis, a cut-off between BI-RADS 2 and 3 was used for positive versus negative score; i.e., for the cancer cases a BI-RADS score of 3 or higher was defined as true positive. McNemar test (P value less than 0.05 considered statistically significant) was used to compare the discordant pairs in the two-by-two table analysis (Epi Info, Version 6, Centers for Disease Control and Prevention, Atlanta, Ga.). The Wilcoxon signed-rank test for matched pairs was used to compare the scores for cancers at SFM and FFDM for the individual readers. Receiver operating characteristic (ROC) analysis was used for calculating the diagnostic performance of the individual reader (ROCKIT program, Macintosh PPC version 0.9.1 Beta, Charles E. Metz, University of Chicago, Ill.).
For comparing the mean area under the curve (Az) values for SFM and FFDM, a P value less than 0.05 was considered to show statistical significance.
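The weighted kappa used for the ordinal BI-RADS scores can be illustrated with a minimal Python sketch. It assumes the usual disagreement weights, |i-j|/(k-1) for linear and the square of that value for quadratic weighting, on k ordered categories. The function name `weighted_kappa` and the example scores are hypothetical and are not taken from the software used in the study.

```python
def weighted_kappa(rater1, rater2, categories, weighting="quadratic"):
    """Cohen's weighted kappa for two raters on an ordinal scale (illustrative sketch)."""
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater1)

    # Cross-tabulate the paired scores.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        observed[index[a]][index[b]] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the most distant categories.
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d * d

    cells = [(i, j) for i in range(k) for j in range(k)]
    disagreement_obs = sum(weight(i, j) * observed[i][j] for i, j in cells)
    disagreement_exp = sum(weight(i, j) * row[i] * col[j] / n for i, j in cells)
    if disagreement_exp == 0:
        # Both raters used a single identical category; kappa is undefined.
        return float("nan")
    return 1.0 - disagreement_obs / disagreement_exp


# Hypothetical BI-RADS scores (1-5) given by two readers to the same eight cases.
reader_x = [1, 2, 3, 4, 5, 1, 2, 3]
reader_y = [1, 2, 3, 5, 5, 2, 2, 4]
print(weighted_kappa(reader_x, reader_y, [1, 2, 3, 4, 5], "quadratic"))
print(weighted_kappa(reader_x, reader_y, [1, 2, 3, 4, 5], "linear"))
```

Quadratic weighting penalizes a two-category disagreement four times as heavily as a one-category disagreement, whereas linear weighting penalizes it only twice as heavily.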
Results All cases Overall diagnostic performance for all cases for the six readers using ROC analysis revealed that five readers performed better at FFDM, of which one reader demonstrated a statistically significantly higher performance at FFDM and one reader showed borderline significance in favor of FFDM (Table 1). The ROC analyses performed for a fixed single false-positive fraction (FPF) and consequently a given true-positive fraction (TPF) revealed that two readers (B and E) had a statistically significantly higher performance at FFDM compared with SFM (P value 0.01 and 0.03, respectively).
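As an illustration of how ordinal BI-RADS scores translate into ROC operating points and an area under the curve, the following Python sketch computes the empirical (nonparametric) area. This is a deliberate simplification: the study used the ROCKIT program, which fits a smooth curve, so the values produced here would not reproduce the Az values reported in Table 1. The function names and the example scores are hypothetical.

```python
def empirical_auc(cancer_scores, noncancer_scores):
    """Empirical area under the ROC curve from ordinal scores.

    Equals the probability that a randomly chosen cancer case is scored
    higher than a randomly chosen non-cancer case, counting ties as one half.
    """
    wins = 0.0
    for c in cancer_scores:
        for n in noncancer_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(noncancer_scores))


def roc_points(cancer_scores, noncancer_scores, thresholds=(5, 4, 3, 2, 1)):
    """(FPF, TPF) pairs obtained by calling a case positive at score >= threshold."""
    points = []
    for t in thresholds:
        tpf = sum(s >= t for s in cancer_scores) / len(cancer_scores)
        fpf = sum(s >= t for s in noncancer_scores) / len(noncancer_scores)
        points.append((fpf, tpf))
    return points


# Hypothetical BI-RADS scores for four cancers and four non-cancer cases.
cancers = [5, 4, 4, 3]
noncancers = [1, 2, 3, 1]
print(empirical_auc(cancers, noncancers))
print(roc_points(cancers, noncancers))
```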
Table 1 Comparison of the area under the ROC curve (Az) for FFDM and SFM for the six readers (A–F) using the BI-RADS classification (scores 1–5). The Az values are given for all cases (n=232), for masses (n=71), and for microcalcifications (n=57)

| Reader | All cases (n=232): FFDM | SFM | P value | Masses (n=71): FFDM | SFM | P value | Calc only (n=57): FFDM | SFM | P value |
|---|---|---|---|---|---|---|---|---|---|
| A | 0.85 | 0.88 | 0.47 | 0.72 | 0.81 | 0.15 | 0.88 | 0.77 | 0.26 |
| B | 0.96 | 0.91 | 0.09 | 0.94 | 0.90 | 0.38 | | | n.a. (a) |
| C | 0.92 | 0.90 | 0.32 | 0.85 | 0.82 | 0.66 | | | n.a. (a) |
| D | 0.93 | 0.89 | 0.11 | 0.92 | 0.90 | 0.81 | 0.82 | 0.75 | 0.45 |
| E | 0.93 | 0.86 | 0.04 | 0.90 | 0.78 | 0.08 | 0.88 | 0.82 | 0.37 |
| F | 0.90 | 0.89 | 0.66 | 0.83 | 0.80 | 0.66 | 0.79 | 0.81 | 0.71 |

(a) Nonapplicable; data degenerate and expelled from analysis (either clustered or tied values)
Interobserver agreement for FFDM versus SFM for all pairs of readers showed a slightly higher kappa value (quadratic weighting) for SFM in ten and a slightly higher kappa value for FFDM in five of the 15 pairs of readers (Fig. 1). The mean weighted kappa score for SFM was 0.74 (range, 0.68–0.81) and for FFDM 0.71 (range, 0.61–0.82). The mean kappa values for each reader are summarized in Table 2. Concordance for SFM versus FFDM for the six readers for all lesions using kappa with linear weighting (three-level scale of the collapsed five-point BI-RADS scale) showed a mean kappa value of 0.54 (range, 0.45–0.62). The individual results were: reader A, 0.51; reader B, 0.53; reader C, 0.60; reader D, 0.51; reader E, 0.62; and reader F, 0.45 (Fig. 2). Comparison of the discordant pairs for SFM versus FFDM in a two-by-two table analysis (BI-RADS scores 1 and 2 negative and 3–5 positive test result) using McNemar’s test for paired proportions revealed no significant difference for any of the readers.
Fig. 1 Observer agreement for all pairs of the six readers for all cases (n=232) presented as weighted kappa with quadratic weighting based on the five-point BI-RADS scale
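The collapsing of the five-point scale and the comparison of discordant pairs can be sketched in Python as follows. The sketch assumes the groupings stated in the Methods (scores 1 and 2 together, 3 alone, 4 and 5 together; positive test at score 3 or higher). The exact binomial form of McNemar's test is an assumption made here for brevity; the statistical software used in the study may have applied a chi-square form. All function names and example scores are hypothetical.

```python
from math import comb


def collapse_three_level(score):
    """Map a BI-RADS score 1-5 to the three-level scale: {1,2} -> 1, {3} -> 2, {4,5} -> 3."""
    if score <= 2:
        return 1
    if score == 3:
        return 2
    return 3


def is_positive(score):
    """Binary outcome with the cut-off between BI-RADS 2 and 3."""
    return score >= 3


def mcnemar_exact(sfm_scores, ffdm_scores):
    """Two-sided exact McNemar test on the discordant pairs of paired readings."""
    pairs = list(zip(sfm_scores, ffdm_scores))
    # b: positive at SFM only; c: positive at FFDM only.
    b = sum(1 for s, f in pairs if is_positive(s) and not is_positive(f))
    c = sum(1 for s, f in pairs if not is_positive(s) and is_positive(f))
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis each discordant pair falls on either side with probability 1/2.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)


# Hypothetical paired scores of one reader for the same eight cases.
sfm = [1, 2, 3, 4, 5, 1, 2, 3]
ffdm = [1, 3, 3, 2, 5, 4, 4, 3]
print([collapse_three_level(s) for s in sfm])
print(mcnemar_exact(sfm, ffdm))
```

Only the discordant pairs enter the test, which is why a reader can show moderate kappa concordance between the two techniques and still show no significant difference in the paired proportions.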
Subgroup: normal cases Concordance between the two imaging techniques for normal cases (n=98) as expressed by observed agreement ranged among the six radiologists from 57% to 86%: reader A, 57%; reader B, 57%; reader C, 70%; reader D, 62%; reader E, 86%; and reader F, 72%. The observed agreement using the three-level scale of the collapsed five-point BI-RADS scores showed the following values: reader A, 72%; reader B, 77%; reader C, 85%; reader D, 77%; reader E, 99%; and reader F, 87%. Kappa statistic with quadratic weighting was not calculated because the observed concordance was smaller than mean-chance concordance for some tables. The number of false-positive interpretations for normal cases (defined as BI-RADS score 3–5) for the six radiologists was in total 48 of 588 (8.2%) for SFM (range among readers, 1–14) versus 70 of 588 (11.9%) for FFDM (range among readers, 0–17). Three normal cases were given BI-RADS score 5 at SFM compared with 0 cases
[Fig. 1 plot: weighted kappa (y-axis, 0 to 0.9) for SFM and FFDM by reader pair (x-axis: AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF)]
Table 2 Mean kappa scores with quadratic weighting (the mean value of agreement for the five comparisons of each reader with the others) based on the five-point BI-RADS scores 1–5 for the six radiologists (A–F) at SFM and FFDM for all cases (n=232), masses (n=71), and for microcalcifications (n=57). The overall mean kappa value is based on all 30 possible pairs of comparisons for the six radiologists

| Reader | All cases (n=232): SFM | FFDM | Masses (densities) (n=71): SFM | FFDM | Microcalcifications (n=57): SFM | FFDM |
|---|---|---|---|---|---|---|
| A | 0.74 | 0.66 | 0.72 | 0.58 | 0.51 | 0.45 |
| B | 0.76 | 0.73 | 0.71 | 0.67 | 0.54 | 0.51 |
| C | 0.75 | 0.74 | 0.66 | 0.68 | 0.56 | 0.54 |
| D | 0.69 | 0.68 | 0.61 | 0.63 | 0.45 | 0.46 |
| E | 0.76 | 0.73 | 0.65 | 0.67 | 0.62 | 0.56 |
| F | 0.73 | 0.73 | 0.61 | 0.63 | 0.54 | 0.53 |
| Overall mean | 0.74 | 0.71 | 0.66 | 0.64 | 0.54 | 0.51 |
at FFDM (Fig. 3). The number of false-positive scores for both imaging modalities among the six radiologists ranged from 1 to 31 (Fig. 3). Subgroup: masses and densities ROC analysis for masses and densities (n=71: 45 benign and 26 malignant masses; six cancers manifesting as mass with microcalcifications excluded from this analysis) showed for FFDM Az values ranging from 0.72 to 0.94 and for SFM Az values from 0.78 to 0.90. Five of the six readers showed a higher Az value for FFDM. There was no statistically significant difference among the readers at FFDM versus SFM for masses, but the difference approached borderline significance in favor of FFDM for one reader (Table 1).
Fig. 2 Concordance at SFM versus FFDM for the six readers for all lesions (n=232), masses (densities) only (n=71), and microcalcifications only (n=57). Concordance is presented as kappa values (linear weighting) based on the three-level scale of the collapsed five-point BI-RADS scores
Observer agreement for FFDM versus SFM for pairs of readers for masses showed for SFM a mean kappa score (quadratic weighting) of 0.66 (range, 0.50–0.80) and for FFDM 0.64 (range, 0.54–0.77). The mean kappa values for each reader are summarized in Table 2. Concordance between the two imaging techniques for the individual readers using the kappa statistic (linear weighting based on the three-level scale of the collapsed five-point BI-RADS scale) showed, for masses, a mean kappa value of 0.40 for the six readers. Values for the individual readers were: reader A, 0.38; reader B, 0.56; reader C, 0.43; reader D, 0.45; reader E, 0.34; and reader F, 0.21 (Fig. 2). Comparison of discordant pairs in two-by-two table analysis using McNemar’s test for paired proportions (BI-RADS scores 1 and 2 negative and the BI-RADS scores 3–5 grouped as a positive test result) did not show any statistically significant difference for the readers.
[Fig. 2 plot: kappa (y-axis, 0 to 0.7) by reader A–F (x-axis) for all lesions, densities, and calcifications]
Fig. 3 Normal cases with false-positive interpretations (BI-RADS score 3 or higher) for the six readers (A–F). a SFM: a total of 48 false positives. b FFDM: a total of 70 false positives. There was no false-positive BI-RADS 5 score at FFDM
[Fig. 3 plots: number of false positives (y-axis, 0 to 18) by reader A–F (x-axis), broken down by BI-RADS score 3, 4, and 5; panel a SFM, panel b FFDM]
Subgroup: microcalcifications ROC analysis for microcalcifications only (n=57: 43 benign and 14 malignant cases; six cancers manifesting as mass with microcalcifications were excluded from analysis) showed higher Az value for FFDM compared with SFM for three of four readers. The data were “degenerate” (either “clustered” with most data representing just a few scores without enough range, or “tied values” with the same score for many analogue and digital interpretations) for two readers (B and C) and these data were expelled from analysis by the computer program. None of the differences between FFDM and SFM for the four readers were statistically significant (Table 1).
Observer agreement for FFDM versus SFM for pairs of readers using the kappa statistic (quadratic weighting) showed a mean weighted kappa for SFM of 0.54 (range, 0.40–0.68) and for FFDM 0.51 (range, 0.40–0.75) (Table 2). There was a higher kappa value at SFM for nine of the 15 pairs and a higher kappa value at FFDM for five of the 15 pairs of radiologists. For one pair (reader A versus D) the kappa value was identical (Fig. 4). The difference in the mean kappa value (quadratic weighting) for SFM versus FFDM was not statistically significant (U-test, P=0.89). However, interobserver variation was more marked for microcalcifications compared with all cases (Figs. 1, 4). Concordance at SFM versus FFDM for the six readers for microcalcifications only (n=57) showed a mean kappa
Fig. 4 Observer agreement for all pairs of readers for cases presenting as microcalcifications only (n=57). Interobserver agreement is presented as weighted kappa (quadratic weighting) based on the five-point BI-RADS scale. The difference between the mean kappa value for SFM and FFDM was not statistically significant (U-test, P=0.89)
[Fig. 4 plot: kappa (y-axis, tick labels 0 to 80 as printed) for SFM and FFDM by reader pair (x-axis: AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF)]
value (linear weighting based on the three-level scale of the collapsed five-point BI-RADS scale) of 0.36. The values for the six readers were: reader A, 0.29; reader B, 0.18; reader C, 0.58; reader D, 0.25; reader E, 0.54; and reader F, 0.33. For three readers (readers A, B and D) these kappa values were lower than for masses, whereas the corresponding values were higher for the three other readers, C, E, and F (Fig. 2). Comparison of discordant pairs in two-by-two table analysis (BI-RADS 1 and 2 negative and BI-RADS score 3–5 positive test) using McNemar’s test for paired proportions again did not show any significant difference among the readers. Comparison of diagnostic performance at SFM versus FFDM showed remarkable differences among readers for masses versus microcalcifications. The Az values for all lesions for reader A and reader F at FFDM versus SFM were comparable (Table 1). Comparison of the Az values for the subgroups masses and microcalcifications, however, showed noteworthy differences between the two readers: reader A had problems with masses at FFDM (Az for FFDM 0.72 versus 0.81 for SFM), but showed higher performance for microcalcifications at FFDM (Az for FFDM 0.88 versus 0.77 for SFM). Reader F, on the other hand, had comparable results at the two imaging techniques for both subgroups (Az for masses 0.83 versus 0.80 and for calcifications 0.79 versus 0.81, respectively) (Fig. 5). Subgroup: cancers Since the kappa statistic (observer agreement) does not reflect the conspicuity for cancer and consequently not the level of the BI-RADS scores, we compared the interpretations for SFM and FFDM for cancers (n=46) for the individual reader using the Wilcoxon signed-rank test for
matched pairs. The mean score at SFM and FFDM and the Wilcoxon P value for the six readers were, respectively: reader A, 4.04–3.96, P=0.76; reader B, 4.35–4.37, P=0.98; reader C, 3.87–4.11, P=0.14; reader D, 4.02–4.20, P=0.16; reader E, 3.74–4.09, P=0.08; and reader F, 4.11–4.20, P=0.93. The mean BI-RADS score for cancers was slightly higher at FFDM for five of the six readers, but none of the differences were statistically significant.
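For readers unfamiliar with the paired comparison used above, the following Python sketch shows one way to compute a Wilcoxon signed-rank test on paired BI-RADS scores. It assumes that zero differences are discarded, that tied absolute differences receive average ranks, and that the P value comes from the normal approximation with a tie correction; the study itself does not state which variant its software applied. The function name and the example scores are hypothetical.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(sfm_scores, ffdm_scores):
    """Return (W+, two-sided P) for paired scores using the normal approximation."""
    # Differences FFDM - SFM; pairs with no difference carry no information and are dropped.
    diffs = [f - s for s, f in zip(sfm_scores, ffdm_scores) if f != s]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    # W+ is the sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / sqrt(variance)
    p_value = erfc(abs(z) / sqrt(2))
    return w_plus, p_value


# Hypothetical BI-RADS scores of one reader for the same ten cancers.
sfm = [4, 5, 3, 4, 4, 5, 3, 4, 2, 5]
ffdm = [4, 5, 4, 4, 5, 5, 3, 5, 3, 4]
print(wilcoxon_signed_rank(sfm, ffdm))
```

With a five-point scale most differences are tied at one category, so the tie correction matters; for small numbers of nonzero differences an exact test would be preferable to the normal approximation used here.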
Discussion A high level of agreement between SFM and FFDM has been reported in a diagnostic evaluation of breast cancer, and significant disagreement affecting treatment approaches between the two imaging techniques was reported in only 4% [19]. Generalization of results from experimental clinical studies on observer variation is problematic, and results from such studies are difficult to compare with prospective interpretations performed in daily practice. Three large-scale trials with a paired study design comparing SFM and FFDM in a screening setting showed a noticeably high number of cancer cases detected only at one of the two imaging techniques [1, 12, 20, 21]. Observed agreement for cancers based on a two-by-two table analysis was 52%, 68%, and 66% for the Colorado–Massachusetts study, the Oslo I study, and the DMIST trial, respectively [12, 20, 21]. In a recently published paper it was concluded that using two mammograms, one screen-film and one digital, would significantly increase the detection of breast cancer [22]. In our study the diagnostic performance for all lesions was higher at FFDM compared with SFM for five of six readers, of which the difference in favor of FFDM was statistically significant for one reader (Table 1). Observer agreement among the readers for all cases was comparable
Fig. 5 ROC curves for readers A and F. Diagnostic performance (Az values) for all lesions was comparable for the two readers (0.88–0.85 and 0.89–0.90 at SFM and FFDM, respectively). ROC curves (Az values) for masses and microcalcifications are, however, quite different although the differences in Az values were not statistically significant
[Fig. 5 plots: four panels (Reader A: Masses; Reader A: Microcalcifications; Reader F: Masses; Reader F: Microcalcifications), each showing FFDM and SFM curves with FPF on the x-axis (0 to 1) and TPF on the y-axis (0 to 1)]
at SFM and FFDM (Fig. 1). Concordance for the BI-RADS scores with the two imaging techniques was, however, only moderate for all cases (Fig. 2). Detection and characterization of microcalcifications at FFDM with soft-copy reading has been a hot topic in the past. Studies have shown that, despite lower spatial resolution at FFDM compared with SFM, the diagnostic accuracy of FFDM is comparable to or even higher than that of SFM for microcalcifications [2–4, 23]. Higher numeric values of specificity in favor of FFDM for microcalcifications have been reported, although the differences were not statistically significant [6]. The observer performance for detecting microcalcifications of breast cancers is comparable between hard-copy film and five-megapixel soft-copy monitor reading [24], and monitor zooming of the FFDM images was reported to be equivalent to direct magnification in a previous study [25]. Thus, the higher contrast resolution and the ability to do image postprocessing with FFDM using soft-copy reading compensate for the limitations in spatial resolution. An interesting and important finding of our study was the low observer agreement for microcalcifications. Observer agreement for pairs of readers was lower for calcifications than for all cases (Figs. 1, 4), and the mean kappa values were lower for microcalcifications compared
with all cases and densities (Table 2). The mean kappa scores (quadratic weighting) at SFM and FFDM were, however, not statistically different. The concordance at SFM versus FFDM for microcalcifications was noticeably low for three of the six readers (Fig. 2). Studies on interobserver variability have shown lower kappa values for calcification descriptions than for other mammographic features [11, 26]. Characterization of microcalcifications is difficult, and BI-RADS training may improve the interobserver agreement for microcalcification morphology [27]. Observer agreement for masses (densities) was lower in our study than for all cases but higher than for microcalcifications. Concordance at SFM versus FFDM for masses was noticeably low for only one of the six readers, but the concordance varied considerably among the readers (Fig. 2). Diagnostic performance for all cases was nearly the same for reader A and F, but the difference between the two subgroups was substantial although these differences were not statistically significant (Fig. 5). The number of false-positive scores at SFM and FFDM for normal cases (Fig. 3) indicates that at least five of the six readers probably did not interpret the images as they would have done in daily practice (“expectancy bias”). The higher number of false-positive interpretations at FFDM compared with SFM (Fig. 3) seems to support the results of the Oslo I and II studies, which showed a higher recall rate for FFDM [12, 13]. The higher number of false positives at FFDM for normal cases could also reflect a learning curve effect. The six readers were all more experienced in SFM than FFDM, and the experience with FFDM soft-copy reading also varied among the readers. Close attention must be paid to proper reader training and systematic use of image display protocols when introducing FFDM with soft-copy reading, especially in order to avoid missing small cancers in mammography screening [28]. In conclusion, our study has shown a substantial observer variability for SFM hard-copy reading as well as for FFDM soft-copy reading in breast cancer screening cases. Overall, there was no statistically significant difference between
observer variability at SFM and FFDM. An important finding was the low observer agreement (low kappa values) for calcifications, and especially the low concordance for four of six readers between the two imaging techniques for microcalcifications. We suggest that a learning curve effect might have contributed to this finding. Consequently, our results confirm the need for proper training for radiologists starting FFDM with soft-copy reading in breast cancer screening. The level of observer agreement might be a useful indicator of radiologist performance in the absence of accuracy evaluation. The variability in mammography interpretation at SFM and FFDM may confound observer studies, and it is important to be aware of this problem when designing studies to evaluate the diagnostic differences between the two imaging techniques.
References
1. Bick U, Diekmann F (2007) Digital mammography: what do we and what don’t we know? Eur Radiol 17:1931–1942
2. Diekmann S, Bick U, von Heyden H, Diekmann F (2003) Visualization of microcalcifications on mammographies obtained by digital full-field mammography in comparison to conventional film-screen mammography (in German). Rofo 175:775–779
3. Fischer U, Baum F, Obenauer S, Luftner-Nagel S, von Heyden D, Vosshenrich R, Grabbe E (2002) Comparative study in patients with microcalcifications: full-field digital mammography vs screen-film mammography. Eur Radiol 12:2679–2683
4. Obenauer S, Luftner-Nagel S, von Heyden D, Munzel U, Baum F, Grabbe E (2002) Screen film vs full-field digital mammography: image quality, detectability and characterization of lesions. Eur Radiol 12:1697–1702
5. Obenauer S, Hermann KP, Marten K, Luftner-Nagel S, von Heyden D, Skaane P, Grabbe E (2003) Soft copy versus hard copy reading in digital mammography. J Digit Imaging 16:341–344
6. Kim HH, Pisano ED, Cole EB, Jiroutek MR, Muller KE, Zheng Y, Kuzmiak CM, Koomen MA (2006) Comparison of calcification specificity in digital mammography using soft-copy display versus screen-film mammography. AJR Am J Roentgenol 187:47–50
7. Beam CA, Layde PM, Sullivan DC (1996) Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 156:209–213
8. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR (1994) Variability in radiologists’ interpretations of mammograms. N Engl J Med 331:1493–1499
9. Elmore JG, Nakano CY, Koepsell TD, Desnick LM, D’Orsi CJ, Ransohoff DF (2003) International variation in screening mammography interpretations in community-based programs. J Natl Cancer Inst 95:1384–1393
10. Skaane P, Engedal K, Skjennald A (1997) Interobserver variation in the interpretation of breast imaging. Comparison of mammography, ultrasonography, and both combined in the interpretation of palpable noncalcified breast masses. Acta Radiol 38:497–502
11. Berg WA, Campassi C, Langenberg P, Sexton MJ (2000) Breast imaging reporting and data system: inter- and intraobserver variability in feature analysis and final assessment. AJR Am J Roentgenol 174:1769–1777
12. Skaane P, Young K, Skjennald A (2003) Population-based mammography screening: comparison of screen-film and full-field digital mammography with soft-copy reading—Oslo I study. Radiology 229:877–884
13. Skaane P, Skjennald A (2004) Screen-film mammography versus full-field digital mammography with soft-copy reading: randomized trial in a population-based screening program—The Oslo II study. Radiology 232:197–204
14. Skaane P, Balleyguier C, Diekmann F, Diekmann S, Piguet JC, Young K, Niklason LT (2005) Breast lesion detection and classification: comparison of screen-film mammography and full-field digital mammography with soft-copy reading—observer performance study. Radiology 237:37–44
15. Pedersen K, Nordanger J (2002) Quality control of the physical and technical aspects of mammography in the Norwegian breast-screening programme. Eur Radiol 12:463–470
16. Cohen J (1968) Weighted kappa. Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 70:213–220
17. Donner A, Eliasziw M (1992) A goodness-of-fit approach to inference procedures for the kappa statistic: confidence interval construction, significance-testing and sample size estimation. Stat Med 11:1511–1519
18. Cicchetti DV (1976) Assessing interrater reliability for rating scales: resolving some basic issues. Br J Psychiatr 129:452–456
19. Venta LA, Hendrick RE, Adler YT, DeLeon P, Mengoni PM, Scharl AM, Comstock CE, Hansen L, Kay N, Coveler A, Cutter G (2001) Rates and causes of disagreement in interpretation of full-field digital mammography and film-screen mammography in a diagnostic setting. AJR Am J Roentgenol 176:1241–1248
20. Lewin JM, Hendrick RE, D’Orsi CJ, Isaacs PK, Moss LJ, Karellas A, Sisney GA, Kuni CC, Cutter GR (2001) Comparison of full-field digital mammography with screen-film mammography for cancer detection: results of 4,945 paired examinations. Radiology 218:873–880
21. Pisano ED, Gatsonis C, Hendrick E, Yaffe M, Baum JK, Acharyya S, Conant EF, Fajardo LL, Bassett L, D’Orsi C, Jong R, Rebner M (2005) Diagnostic performance of digital versus film mammography for breast-cancer screening. N Engl J Med 353:1773–1783
22. Glueck DH, Lamb MM, Lewin JM, Pisano ED (2007) Two-modality mammography may confer an advantage over either full-field digital mammography or screen-film mammography. Acad Radiol 14:670–676
23. Vigeland E, Klaasen H, Klingen TA, Hofvind S, Skaane P (2008) Full-field digital mammography with flat-panel selenium detectors in a population-based screening programme: The Vestfold County Study. Eur Radiol 18:183–191
24. Kamitani T, Yabuuchi H, Soeda H, Matsuo Y, Okafuji T, Sakai S, Furuya A, Hatakenaka M, Ishii N, Hona H (2007) Detection of masses and microcalcifications of breast cancer on digital mammograms: comparison among hard-copy film, 3-megapixel liquid crystal display (LCD) monitors and 5-megapixel LCD monitors: an observer performance study. Eur Radiol 17:1365–1371
25. Fischer U, Baum F, Obenauer S, Funke M, Hermann KP, Grabbe E (2002) Full-field digital mammography (FFDM): intraindividual comparison of direct magnification versus monitor zooming in patients with microcalcifications (in German). Radiologe 42:261–264
26. Lazarus E, Mainiero MB, Schepps B, Koelliker SL, Livingston LS (2006) BI-RADS lexicon for US and mammography: interobserver variability and positive predictive value. Radiology 239:385–391
27. Berg WA, D’Orsi CJ, Jackson VP, Bassett LW, Beam CA, Lewis RS, Crewson PE (2002) Does training in the breast imaging reporting and data system (BI-RADS) improve biopsy recommendations or feature analysis agreement with experienced breast imagers at mammography? Radiology 224:871–880
28. Skaane P, Skjennald A, Young K, Egge E, Jebsen I, Sager EM, Scheel B, Søvik E, Ertzaas AK, Hofvind S, Abdelnoor M (2005) Follow-up and final results of the Oslo I study comparing screen-film mammography and full-field digital mammography with soft-copy reading. Acta Radiol 46:679–689