Comparison of Screen-Film Mammography and Full-Field Digital

Radiology

Breast Imaging

Per Skaane, MD, PhD; Corinne Balleyguier, MD; Felix Diekmann, MD; Susanne Diekmann, MD; Jean-Charles Piguet, MD; Kari Young, MD; Loren T. Niklason, PhD2

Published online before print 10.1148/radiol.2371041605

Radiology 2005; 237:37–44

Abbreviations: Az = area under ROC curve; BI-RADS = Breast Imaging Reporting and Data System; ROC = receiver operating characteristic

Breast Lesion Detection and Classification: Comparison of Screen-Film Mammography and Full-Field Digital Mammography with Soft-copy Reading— Observer Performance Study1 PURPOSE: To retrospectively compare screen-film and full-field digital mammography with soft-copy interpretation for reader performance in detection and classification of breast lesions in women in a screening program.

1

From the Department of Radiology, Ullevaal University Hospital, Breast Imaging Center, Kirkeveien 166, N-0407 Oslo, Norway (P.S., K.Y.); Institut Gustave Roussy, Villejuif, France (C.B.); Department of Diagnostic Radiology, University Charité, Berlin, Germany (F.D., S.D.); Institut Imagerive, Geneva, Switzerland (J.C.P.); and private consultant, Milwaukee, Wis (L.T.N.). From the 2003 RSNA Annual Meeting. Received September 19, 2004; revision requested November 5; revision received January 24, 2005; accepted February 23. Address correspondence to P.S. (e-mail: [email protected]). L.T.N. was a consultant for GE Medical Systems, Milwaukee, Wis. See Materials and Methods for pertinent disclosures. Current address: Hologic, Hillsborough, NC.

2

Author contributions: Guarantor of integrity of entire study, P.S.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; literature research, P.S., F.D., S.D., J.C.P., K.Y., L.T.N.; clinical studies, all authors; statistical analysis, P.S., L.T.N.; and manuscript editing, P.S., C.B., F.D., S.D., J.C.P., L.T.N.

MATERIALS AND METHODS: Regional ethics committee approved the study; signed patient consents were obtained. Two-view mammograms were obtained with digital and screen-film systems at previous screening studies. Six readers interpreted images. Interpretation included Breast Imaging Reporting and Data System (BI-RADS) and five-level probability-of-malignancy scores. A case was one breast, with two standard views acquired with both screen-film mammography and digital mammography. The standard for an examination with normal findings was classification of normal (category 1) assigned by two independent readers; for cases with benign findings, the standard was benign results at diagnostic work-up in patients who were recalled. Cases with normal or benign findings that manifested as neither interval cancer nor as cancer at subsequent screening were considered the standard. All cancers were confirmed histologically. Images were interpreted by readers in two sessions 5 weeks apart; the same case was not seen twice in any session. Receiver operating characteristic (ROC) analysis and, for a given true-positive fraction, 2 × 2 table analysis and the McNemar test were used. For binary outcome, classification of BI-RADS category 3 or higher was defined as positive for cancer.

RESULTS: Cases with proved findings (n = 232) were displayed: 46 with cancers, 88 with benign findings, and 98 with normal findings. ROC analysis for all readers and all cases revealed a higher area under ROC curve (Az) for digital mammography (0.916) than for screen-film mammography (0.887) (P = .22). Five of six readers had a higher performance rating with digital mammography; one of five demonstrated a significant difference in favor of digital mammography with Az values; two showed a significant difference in favor of digital mammography with ROC analysis for a given false-positive fraction (P = .01 and .03, respectively). For cases with cancer, digital mammography resulted in correct classification of an average of three additional cancers per reader. For digital versus screen-film mammography, 2 × 2 table analysis for cancers revealed a higher true-positive rate; for benign masses, a higher true-negative rate. Neither of these differences nor any others from analysis of subgroups between the modalities were significant.

CONCLUSION: Digital mammography allowed correct classification of more breast cancers than did screen-film mammography. Az value was higher for digital mammography; this difference was not significant.

© RSNA, 2005


Mammography is the established method for detection of unsuspected breast cancer at an early preclinical stage in asymptomatic women, and it has sufficient specificity to be used in screening. The success of screening mammography depends on the perception of small and sometimes subtle lesions. The depiction of fine microcalcifications and subtle soft-tissue masses at high-quality mammography is key to the detection of early breast cancer. The interval between screening rounds, the image quality, and the radiologist’s skill and training in the perception and analysis of subtle features of the lesion are important factors that must be optimized to achieve the goal of early cancer detection. Conventional screen-film mammography with high spatial resolution has so far been the modality of choice for screening programs. Screen-film mammography has some distinct advantages, including relatively inexpensive technology, high spatial resolution, convenient display of images with widely available illuminators, and the capability for simultaneous display of multiple images.

New technologies for detection and characterization are being pursued because a relatively large number of cancers are missed in screening programs (1,2). Full-field digital mammography is a promising new technology. The main advantage of digital mammography is that the processes of image acquisition, image processing, image display, and image storage are decoupled. Consequently, with a digital mammographic imaging system, each of these processes is performed independently, and this independent performance allows each step to be optimized individually. In contrast-detail studies (3,4), the digital mammographic equipment that has been developed during the past few years has demonstrated superior depiction of low-contrast objects compared with that of screen-film mammographic equipment.
It is likely that the benefits of digital technology may be best realized with soft-copy display and interpretation of images. Investigators in experimental and retrospective studies (5–7) of comparisons between screen-film mammography and digital mammography with soft-copy reading have demonstrated comparable results for both modalities in regard to lesion detection and characterization. Researchers in these studies mainly have investigated the performance of these two modalities in phantoms or in symptomatic women. So far, there have been only three large-scale studies (8–10) of a comparison between screen-film mammography and digital mammography in asymptomatic women in a screening situation. Results in these three trials showed no statistically significant differences between screen-film mammography and digital mammography in cancer detection rate. There were, however, some noticeable differences in other results of these trials. Fewer cancers were detected with digital mammography compared with screen-film mammography in the Colorado-Massachusetts study (8,11) and the Oslo I study (9), whereas more cancers were detected with digital mammography in the Oslo II study (10). With digital mammography, a significantly lower recall rate was observed in the Colorado-Massachusetts study (8,11), but the recall rates were higher for digital mammography in the Oslo I and II studies (9,10). Thus, the aim of our study was to retrospectively compare screen-film mammography and digital mammography by using soft-copy interpretation for reader performance in the detection and classification of breast lesions in women in a screening program.

MATERIALS AND METHODS

Study Population and Reference Standards

GE Healthcare assisted in the study by making available a suitable reading environment used by the radiologists to interpret the images obtained at screen-film and digital mammographic examinations and by providing logistic support for the reading sessions. The authors had full control of the data and the information submitted for publication.

All cases included in the study (one case represented one breast) were selected from the population-based Breast Cancer Screening Program in Oslo, Norway, which is part of the Norwegian Breast Cancer Screening Program. This program was for women 50–69 years old and was started in 1996, and the interval between the screening rounds was 2 years. Two standard views (craniocaudal and mediolateral oblique) of each breast were acquired. Most cases selected for this study were obtained from the Oslo I study, which was performed between January 3, 2000, and June 22, 2000, and was a paired study in which the women underwent a two-view examination of each breast with screen-film mammography and digital mammography on the same day. Cases were also collected from the Oslo II study, which was performed between November 27, 2000, and December 31, 2001, and was a randomized trial that involved a comparison between screen-film mammography and digital mammography in women in a population-based screening program. For these cases, the interval between the examination with screen-film mammography and that with digital mammography was less than 3 weeks.

All women included in the Oslo I and II studies were informed about the study beforehand, and their participation in the project was voluntary. Each woman who was enrolled in those two studies signed a written consent form, and both studies were approved by the regional ethics committee. The consent form specified that the data and images related to screening could be used for future research and scientific purposes.

A total of 250 cases, with two standard views acquired with both screen-film mammography and digital mammography, were selected from the database by a nonradiologist. The study initially included 100 cases with normal findings, 100 cases with abnormal but benign findings, and 50 cases with cancers. All 100 cases with normal findings were randomly collected from the Oslo I study, which included independent double reading for screen-film mammograms and for digital mammograms. Two of the authors (P.S. and K.Y.) had, together with six other radiologists, taken part in the image reading of the Oslo I study. The inclusion criterion as a reference standard for a normal examination was that all four independent readers in the screening program, the two readers for screen-film mammograms and the two readers for digital mammograms, had assigned a score of category 1 on the five-point rating scale of probability of cancer used in the Norwegian Breast Cancer Screening Program, which was considered normal.
A further inclusion criterion for the cases with normal findings was that such a case did not manifest as an interval cancer and that the case also was assigned a score of normal (category 1) by both independent readers in the subsequent screening round 2 years later.

Of the 100 cases with abnormal findings, 85 cases were selected from the Oslo I study and 15 cases were selected from the Oslo II study. A case with benign findings was defined as such if the woman had been recalled for diagnostic work-up because of an abnormal mammographic interpretation (rating score of >1) by at least one of the two independent readers in the screening program and the results of diagnostic work-up (at ultrasonography [US] or fine-needle aspiration cytologic analysis) indicated a benign abnormality. Furthermore, the case with benign findings was defined as such if the case did not manifest as an interval cancer or as a cancer in the subsequent screening round 2 years later.

The 50 cancers, which included both cases of ductal carcinoma in situ and invasive cancers, were detected at screening in the Oslo I study (n = 27) or the Oslo II study (n = 23). All cancers were confirmed at histologic analysis. The histologic type of the cancers, the size of the cancers, and the density of the breast parenchyma were not known by the person randomly selecting the cases from the files of the Oslo Breast Cancer Screening Program. The mean age and age range (minimum and maximum age) of the women with normal findings, abnormal findings, or cancers were calculated.

Imaging

Screen-film mammographic examinations were all performed with a unit (Mammomat 300; Siemens, Erlangen, Germany) with film (Min-R 2000; Kodak Health Imaging, Rochester, NY) and screens (Min-R 2190; Kodak Health Imaging). A molybdenum target and a molybdenum filter at 29 kV were always used. The hospital physicist chose 29 kV, in accordance with the recommendations of the Norwegian Breast Cancer Screening Program, to keep the dose low while acceptable image quality was maintained. Screen-film mammographic images had optical density in compliance with the Norwegian Breast Cancer Screening Program requirements (12).

Digital mammograms were acquired with a digital system with a cesium iodide–amorphous silicon detector (Senographe 2000D; GE Medical Systems, Milwaukee, Wis). The unit is equipped with an automatic mode (automatic optimization of parameters) in which anode-filter combination, tube voltage, and tube current–time product are selected automatically after analysis of results at a short exposure before actual imaging is performed. The standard automatic optimization of parameters mode was used according to the manufacturer’s recommendations. The area of the image detector was 19 × 23 cm. Mammograms obtained with both imaging modalities (screen-film mammography and digital mammography) included the two standard views (craniocaudal and mediolateral oblique) of each breast.

Image Interpretation

Six radiologists (readers A–F) from four European countries participated in the study. The readers’ experience in screen-film mammography varied from 4 to 24 years, and their experience in digital mammography with soft-copy reading varied from 2 to 4 years. The number of screening examinations interpreted by each radiologist in his or her own practice varied from 2500 to 12 000 examinations per year. The two Norwegian radiologists had experience from a population-based (every member of the eligible target population being invited, with batch interpretation) screening program, and the other four readers had experience from service-based (self-referred or referred asymptomatic women, with consecutive individual interpretation) mammographic screening.

Soft-copy review was performed in a dedicated darkened room for digital mammography, and a darkened room with high-luminance view boxes was used for hard-copy display of screen-film mammographic images. No clinical information was available, and the readers had no knowledge of the screening results. Patient identification information was removed from the mammograms so that they were anonymous. The radiologists assigned scores to the images in two sessions that were 5 weeks apart such that the same case was not seen twice in any session. Each reading session included six interpretation rounds, and screen-film mammographic images and digital mammographic images were alternated, with a time limit of 60 minutes for about 40 cases in each round. A magnifying glass was always offered for screen-film mammographic interpretation. Images from the digital mammographic examinations were interpreted by using soft-copy reading on the review workstation (GE Medical Systems), which included two high-resolution 2000 × 2500-pixel monitors and a dedicated keypad. Postprocessing of the images, which included window-level adjustments, zooming, and inversion, was optional but strongly recommended.
For interpretation, a data form was included on which the reader marked the localization of an abnormality, if present. For cases in which more than one lesion was suspected, the lesion with the highest suspicion was considered. A Breast Imaging Reporting and Data System (BI-RADS) category of 1–5 was assigned for all cases as follows: category 1, findings negative for disease; category 2, findings that were benign, with no mammographic evidence of malignancy; category 3, findings that were probably benign, with short-term follow-up suggested in daily practice; category 4, suspicious abnormality, with biopsy considered; and category 5, findings that were highly suggestive of malignancy. BI-RADS category 0 was omitted. A cutoff between BI-RADS category 2 and category 3 was used for interpretations with positive versus negative results in the binary outcome for analysis with 2 × 2 tables; that is, for the cancer cases, assignment of a BI-RADS category of 3 or higher was considered as an interpretation with true-positive results.

The readers also assigned a score for probability of cancer in all cases. To accomplish this task, they used the five-point rating scale applied by the Norwegian Breast Cancer Screening Program, according to the following: score 1, normal or definitely benign; score 2, probably benign; score 3, indeterminate finding; score 4, probably malignant; and score 5, malignant. The density of breast parenchyma in each case was retrospectively determined by two radiologists (P.S. and K.Y.) by using the BI-RADS classification, which was as follows: category 1, fatty; category 2, scattered dense; category 3, heterogeneously dense; and category 4, extremely dense.
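The binary cutoff described above (BI-RADS category 3 or higher counted as positive) and the pairing of each digital reading with the screen-film reading of the same case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary keys, and the example scores are hypothetical and do not come from the study data.

```python
from collections import Counter

# Cutoff taken from the text: BI-RADS category 3 or higher is positive.
POSITIVE_THRESHOLD = 3


def is_positive(birads_category: int) -> bool:
    """Dichotomize a BI-RADS category (1-5; category 0 was omitted)."""
    if birads_category not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected BI-RADS category: {birads_category}")
    return birads_category >= POSITIVE_THRESHOLD


def paired_table(digital, screen_film):
    """Tally paired readings of the same cases into the four 2 x 2 cells.

    `digital` and `screen_film` are equal-length sequences of BI-RADS
    categories assigned to the same cases, in the same order.
    """
    if len(digital) != len(screen_film):
        raise ValueError("both modalities must cover the same cases")
    counts = Counter()
    for d, s in zip(digital, screen_film):
        counts[(is_positive(d), is_positive(s))] += 1
    return {
        "both_positive": counts[(True, True)],
        "digital_only": counts[(True, False)],
        "screen_film_only": counts[(False, True)],
        "both_negative": counts[(False, False)],
    }


# Hypothetical scores for five cancer cases (not study data).
example = paired_table([4, 3, 2, 5, 1], [4, 2, 3, 5, 2])
print(example)
```

For cancer cases, the two off-diagonal cells ("digital_only" and "screen_film_only") form the discordant pair that the McNemar test compares.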

TABLE 1
Mammographic Features and Breast Parenchyma Density of Cases with Normal, Benign, and Malignant Findings

Columns 1 through 4 give the BI-RADS breast density category.

Finding                                          1     2     3     4   Total
Cases with normal findings                      10    40    42     6      98
Benign: mass or density                          3    28    14     0      45
Benign: microcalcifications                      4    13    22     4      43
Cancer: circumscribed mass                       0     2     1     0       3
Cancer: spiculated mass                          1     9     8     0      18
Cancer: distortion or asymmetric density         0     3     2     0       5
Cancer: microcalcifications                      0     8     4     2      14
Cancer: mass and microcalcifications             0     5     1     0       6
Total                                           18   108    94    12     232

Statistical Analysis

Receiver operating characteristic (ROC) analysis was used for calculation of the diagnostic performance rating of each reader and for determination of overall results for the comparison of the average performance rating between screen-film mammography and digital mammography. The diagnostic performance rating was calculated for the BI-RADS categories, as well as for the probability of malignancy scores, and for the subgroups of masses and microcalcifications only. ROC analysis for the individual readers was performed by using a software program (ROCKIT, Macintosh PPC version 0.9.1 Beta; Charles E. Metz, University of Chicago, Chicago, Ill), whereas multireader analysis was performed by using another software program (LABMRMC, Macintosh PPC version 1.0b3; Charles E. Metz, University of Chicago). The software program for multireader analysis was used to compare the area under the ROC curve (Az) for the six readers and yielded a mean Az value. For comparison of the mean Az values of screen-film mammography and digital mammography, a difference with a P value of less than .05 was considered statistically significant.

By using the software program for ROC analysis, comparison of diagnostic performance ratings was also performed for a fixed true-positive fraction at a given false-positive fraction on the ROC curve. This analysis was performed to compare performance ratings of screen-film mammography and digital mammography at a specific operating point on the ROC curve. Since a BI-RADS category of 3 or higher was defined as a true-positive classification for the cancer cases, the operating point was chosen as the false-positive fraction that corresponded to a BI-RADS category of 3. For comparison of scores of individual readers, the average false-positive fraction from the screen-film mammographic and digital mammographic scores was used. A paired t test was used to test for significance; a difference with a P value of less than .05 was considered statistically significant.

Analysis with a 2 × 2 table was applied for comparison of screen-film mammographic interpretations and digital mammographic interpretations that were based on the BI-RADS categories for subgroups of the study population. A BI-RADS category of 3 or higher was defined as a true-positive classification for the cancer cases. The McNemar test (a difference with a P value of less than .05 was considered statistically significant) was used to compare the discordant pairs in the analysis with the 2 × 2 tables (Epi Info, version 6; Centers for Disease Control and Prevention, Atlanta, Ga). For each case, the average value of the BI-RADS category over the six readers was computed, and a mean value of 3 or higher was considered positive for the probability that a cancer was present.
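To make the ROC quantities used in this section concrete, the following Python sketch computes an area under the ROC curve and a single operating point from ordinal scores. It is a stand-in, not the study's method: the authors report Az values fitted with the ROCKIT and LABMRMC programs, whereas this sketch uses the simpler nonparametric (Mann-Whitney) estimate, which would generally not reproduce the published Az values. The function names and example scores are hypothetical.

```python
def empirical_auc(cancer_scores, noncancer_scores):
    """Nonparametric area under the ROC curve for ordinal scores.

    Equals the probability that a randomly chosen cancer case receives a
    higher score than a randomly chosen noncancer case, with ties
    counted as one half (the Mann-Whitney statistic).
    """
    if not cancer_scores or not noncancer_scores:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for c in cancer_scores:
        for n in noncancer_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(noncancer_scores))


def operating_point(cancer_scores, noncancer_scores, threshold=3):
    """Return (true-positive fraction, false-positive fraction).

    A score at or above `threshold` is called positive; the default of 3
    mirrors the BI-RADS cutoff described in the text.
    """
    tpf = sum(s >= threshold for s in cancer_scores) / len(cancer_scores)
    fpf = sum(s >= threshold for s in noncancer_scores) / len(noncancer_scores)
    return tpf, fpf


# Hypothetical BI-RADS categories from one reader (not study data).
cancers = [5, 4, 3, 2]
noncancers = [1, 2, 3, 1]
print(empirical_auc(cancers, noncancers))
print(operating_point(cancers, noncancers))
```

Comparing two modalities at a common false-positive fraction, as the section describes, amounts to reading off each curve's true-positive fraction at the same horizontal position rather than comparing whole-curve areas.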




RESULTS

Final Study Population

Among the 100 cases with normal findings that were selected by the nonradiologist, those in two women who had large breasts that could not fit on the standard film format of 18 × 24 cm were excluded from analysis. In these two women, large-format images (24 × 30 cm) were obtained at screen-film mammography and more than the standard two images (mosaic) were obtained at digital mammography. A total of 12 cases among the 100 randomly selected cases with benign lesions were excluded from further analysis, since no abnormality was confirmed at diagnostic work-up, which included magnification views, cone-down views, and breast US. The suspicious abnormality seen on the screening mammograms thus proved to be superimposed glandular tissue and not a true abnormality.

A total of four cancers among the 50 malignant tumors randomly selected from the database were excluded from analysis: One cancer proved to be occult and, consequently, represented an incidental finding. One cancer proved to be outside the image on one view and at the margin on the other view at digital mammography; consequently, detection of this cancer was missed because of positioning failure. In two cancers, lateromedial views instead of mediolateral oblique views were obtained at diagnostic work-up, so comparison with the screening mammograms was inappropriate.

Thus, the final study population consisted of 232 cases: 98 cases with normal findings in women with a mean age of 56.2 years and an age range of 49–67 years, 88 cases with benign findings in women with a mean age of 56.4 years and an age range of 45–68 years, and 46 cancers in women with a mean age of 59.2 years and an age range of 51–70 years. In 18 cases, the density of the breast parenchyma was classified as BI-RADS category 1 (fatty), and in 12 cases, as BI-RADS category 4 (extremely dense). In the rest of the cases, the classification of the density was nearly equally distributed between categories 2 and 3. The mammographic features of the cases with benign and malignant findings and the distribution of breast parenchyma density according to the BI-RADS classification for the 232 cases included in the study are summarized in Table 1.

Reader Comparisons

All cases. The Az for the six individual readers with the BI-RADS categories for digital mammography and screen-film mammography, respectively, was as follows: reader A (0.846 and 0.878), reader B (0.963 and 0.913), reader C (0.921 and 0.895), reader D (0.933 and 0.891), reader E (0.931 and 0.860), and reader F (0.901 and 0.886). These values and the corresponding Az values determined by using the probability of malignancy score are shown in Table 2. One reader had a higher performance rating with screen-film mammography, but the difference was not statistically significant, whereas five readers had a higher performance rating with digital mammography. Reader E demonstrated a statistically significantly higher performance rating (P = .04) with digital mammography, whereas the differences in Az values between digital mammography and screen-film mammography that were based on BI-RADS categories, as well as on probability of malignancy scores, were not statistically significant for the other comparisons (Table 2).

The multireader analysis of BI-RADS categories for all cases and all readers showed a mean Az value of 0.916 for digital mammography, compared with a mean Az of 0.887 for screen-film mammography. This difference was not statistically significant (P = .22). The corresponding ROC analysis with the probability of malignancy scores provided similar results with a mean Az for all readers of 0.925 for digital mammography and 0.893 for screen-film mammography (P = .25). The ROC curves are shown in the Figure, parts a and b.

TABLE 2
Az Values for Digital Mammography and Screen-Film Mammography for Six Readers with BI-RADS Classification and Five-Point Scale for Probability of Malignancy

          Az with BI-RADS                    Az with Probability of Malignancy
Reader    Digital   Screen Film   P Value    Digital   Screen Film   P Value
A         0.846     0.878         .47        0.856     0.874         .67
B         0.963     0.913         .09        0.951     0.918         .26
C         0.921     0.895         .32        0.934     0.910         .35
D         0.933     0.891         .11        0.921     0.898         .38
E         0.931     0.860         .04        0.934     0.852         .05
F         0.901     0.886         .66        0.901     0.894         .80
All*      0.916     0.887         .22        0.925     0.893         .25

* Values for all readers combined are mean values.

Figure. Graphs show fitted ROC curves for full-field digital mammography (FFDM) and screen-film mammography (SFM). (a) Mean ROC curves for all cases (n = 232) and all six readers on the basis of the BI-RADS classification: Az value for digital mammography was 0.916; that for screen-film mammography, 0.887 (P = .22). (b) Mean ROC curves for all cases (n = 232) and all six readers on the basis of the five-point rating scale for probability of malignancy: Az value for digital mammography was 0.925; that for screen-film mammography, 0.893 (P = .25). (c) Mean ROC curves for all benign and malignant densities and masses (n = 71) and all six readers on the basis of the BI-RADS classification: Az value for digital mammography was 0.860; that for screen-film mammography, 0.835 (P = .646). The six cancers that manifested as masses with associated microcalcifications were excluded from analysis. (d) Mean ROC curves for cases with benign findings that manifested as microcalcifications (n = 43) and cancer cases that manifested as microcalcifications (n = 14) and four readers on the basis of the BI-RADS classification. Az value for digital mammography was 0.841; that for screen-film mammography, 0.787. This difference was not significant. FPF = false-positive fraction, TPF = true-positive fraction.

Subgroups. The ROC curves for the subgroups of benign and malignant masses and microcalcifications only are shown in the Figure, parts c and d. For densities and masses only (n = 71 cases), the Az value for digital mammography was 0.860, compared with an Az value of 0.835 for screen-film mammography. Comparison of these values, however, was problematic because the ROC curves crossed each other (Figure, part c). For the subgroup of microcalcifications only (n = 57 cases), the data were degenerate (either clustered values, with which most data represented just a few scores without enough range, or tied values, with which the same score was assigned for many screen-film mammographic and digital mammographic interpretations) for two readers, and these data were excluded from analysis by the computer program. The average ROC curve that was based on the BI-RADS category for the other four readers is shown in the Figure, part d. The ROC curves show a higher diagnostic performance rating for digital mammography, compared with screen-film mammography, for all false-positive fractions; the Az for digital mammography was 0.841, compared with the Az of 0.787 for screen-film mammography, but this difference was not significant.

Other analyses. In addition to the comparisons with Az, further analyses were performed for a fixed single false-positive fraction and, consequently, a given true-positive fraction. The threshold for a true-positive score with the BI-RADS classification was 3, so a BI-RADS category of 3 or higher for cancers was a true-positive rating. When the average false-positive fractions for digital mammography and screen-film mammography at this threshold level were entered into the program for ROC analysis, two readers (readers B and E) showed a statistically significantly higher performance rating with digital mammography, compared with screen-film mammography (P = .01 and .03, respectively). When the false-positive fraction was averaged over all six readers (0.337) for all cases, digital mammography showed a higher performance rating (mean true-positive fraction of 0.935) than did screen-film mammography (mean true-positive fraction of 0.889), but this difference was not significant (P = .088).

Analysis with 2 × 2 tables that were based on the BI-RADS categories for the subgroups of masses and microcalcifications only and for masses and microcalcifications in women with dense breast parenchyma (density of breast parenchyma BI-RADS categories 3 and 4) showed no significant differences between the two modalities. Analysis with 2 × 2 tables for all cancers (46 cases or 276 interpretations) showed that 235 of 276 interpretations determined by the six readers were true-positive with both modalities, six interpretations were false-negative with both modalities, and 27 interpretations were true-positive with digital mammography but false-negative with screen-film mammography, compared with eight interpretations that were false-negative with digital mammography but true-positive with screen-film mammography; the discordant pair was thus 27 and eight (Table 3). With application of the McNemar test, which was based on averaging of the BI-RADS categories, to the 46 cancers determined by the six readers, however, no statistically significant difference between screen-film mammography and digital mammography was shown.
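The mechanics of the McNemar test on a discordant pair can be sketched as follows. This is an illustration only, with hypothetical function names. Applying it directly to the pooled counts (27 and eight) treats all 276 interpretations as independent observations, which they are not, because the same 46 cancers were read by six readers. The pooled calculation therefore yields a smaller P value than the nonsignificant result the authors obtained from the test based on BI-RADS categories averaged over the six readers, and that case-level analysis cannot be reproduced from the published counts.

```python
from math import comb


def mcnemar_chi2(b: int, c: int) -> float:
    """McNemar chi-square statistic (1 df, no continuity correction).

    `b` and `c` are the two discordant counts of a paired 2 x 2 table.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs to compare")
    return (b - c) ** 2 / (b + c)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value.

    Under the null hypothesis each discordant pair is equally likely to
    fall in either cell, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Pooled discordant pair for all cancers, as reported in the text.
digital_only, screen_film_only = 27, 8
print(mcnemar_chi2(digital_only, screen_film_only))
print(mcnemar_exact(digital_only, screen_film_only))

# Net gain in true-positive interpretations, and the gain per reader.
net_gain = digital_only - screen_film_only
print(net_gain, net_gain / 6)
```

The last line reflects the abstract's statement that digital mammography allowed correct classification of about three additional cancers per reader (19 additional true-positive interpretations spread over six readers).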
Classification of the cancer cases according to masses only (26 cases or 156 interpretations) revealed that 13 interpretations of cancer were true-positive at digital mammography but false-negative at screen-film mammography, compared with six true-positive interpretations determined with screen-film mammography that were false-negative with digital mammography, whereas 132 interpretations were true-positive with both modalities, and five interpretations missed the cancer with both screen-film mammography and digital mammography. Hence, digital mammography depicted seven (4.5%) more mass cancers in 156 interpretations than did screen-film mammography. For malignant microcalcifications only (14 cases or 84 interpretations), the corresponding discordant pair was five versus one in favor of digital mammography, or four (4.8%) more malignant microcalcifications in 84 interpretations were correctly classified with digital mammography than with




TABLE 4
Analysis with 2 × 2 Table for Interpretations of Six Readers for Screen-Film and Digital Mammography in 45 Benign Masses (270 Interpretations)

TABLE 3
Analysis with 2 × 2 Table for Interpretations of Six Readers for Screen-Film and Digital Mammography in All 46 Cancers (276 Interpretations)

                          Screen-Film Mammography
Digital Mammography       True-Positive Results   False-Negative Results   Total
True-positive results     235                     27                       262
False-negative results    8                       6                        14
Total                     243                     33                       276

Note: Nineteen more cancers were depicted with digital mammography than with screen-film mammography (discordant pair, 27 and eight), but this difference was not statistically significant (P = .07).

screen-film mammography. Six malignancies that manifested with both microcalcifications and a mass or density were not included in this classificatory analysis of the cancers. In the subgroup of cancers in dense breast parenchyma (BI-RADS categories 3 and 4), 89 of 108 interpretations were true-positive with both modalities, five interpretations were false-negative with both modalities, and the discordant pair of 11 and three in favor of digital mammography was not statistically significant. For the benign and normal cases (186 cases or a total of 1116 interpretations determined by the six readers), there was virtually no difference between the two modalities. A total of 604 interpretations were true-negative with both modalities, and 238 were false-positive with both modalities, with the discordant pair of 140 and 134 interpretations in favor of digital mammography (one more correct true-negative interpretation with digital mammography per reader). For the 45 benign masses only (270 interpretations), the discordant pair of 67 versus 41 truenegative interpretations in favor of digital mammography (Table 4) was not statistically significant (P ⫽ .33, McNemar test that was based on averaging of the BI-RADS categories over the six readers). Overall, 26 (9.6%) of 270 more correct true-negative interpretations were observed with digital mammography, compared with screen-film mammography. Only 14 benign masses were found in women with dense breast parenchyma; in this number of masses, 20 of 84 interpretations were correctly true-negative with digital mammography and false-

Digital Mammography False-positive results True-negative results Total

FalsePositive Results 90

TrueNegative Results 41

Total 131

67

72

139

157

113

270

Note.—Twenty-six more benign masses were correctly classified as true-negative with digital mammography than with screen-film mammography (discordant pair, 67 and 41), but this difference was not statistically significant (P ⫽ .33).

positive with screen-film mammography, compared with 14 of 84 false-positive ratings with digital mammography and true-negative ratings with screen-film mammography (the difference was not statistically significant). For the benign microcalcifications only (43 cases or 258 interpretations), 50 interpretations were true-negative with both modalities and 136 interpretations were false-positive with both modalities, whereas the discordant pair was 38 versus 34 in favor of screen-film mammography. Thus, there was practically no difference between screen-film mammography and digital mammography in the characterization of benign microcalcifications. With screen-film mammography, a total of 48 (8.2%) of 588 interpretations for the cases with normal findings (n ⫽ 98) were classified as BI-RADS category 3 or higher, and the distribution was as follows: 31 cases, category 3; 14 cases, category 4; and three cases, category 5. For digital mammography, a total of 70 (11.9%) of 588 interpretations were classified as BI-RADS category 3 or higher, and the distribution was as follows: 48 cases, category 3; 22 cases, category 4; and zero cases, category 5.
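The cell counts reported in Tables 3 and 4 can be cross-checked against the totals and percentages quoted in the text. The following is a minimal sketch of that bookkeeping; the class `PairedTable` and its field names are hypothetical helpers introduced for illustration, with "correct" meaning true-positive for the cancers (Table 3) and true-negative for the benign masses (Table 4).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedTable:
    """Paired 2 x 2 counts for two modalities read on the same cases."""
    both_correct: int   # correct with both modalities
    digital_only: int   # correct with digital, incorrect with screen-film
    film_only: int      # correct with screen-film, incorrect with digital
    neither: int        # incorrect with both modalities

    @property
    def total(self) -> int:
        return self.both_correct + self.digital_only + self.film_only + self.neither

    @property
    def digital_correct(self) -> int:
        return self.both_correct + self.digital_only

    @property
    def film_correct(self) -> int:
        return self.both_correct + self.film_only

    @property
    def net_gain_digital(self) -> int:
        # Difference of the discordant pair, in favor of digital if positive.
        return self.digital_only - self.film_only

    @property
    def net_gain_percent(self) -> float:
        return 100.0 * self.net_gain_digital / self.total


# Counts taken from Table 3 (all cancers) and Table 4 (benign masses).
table3 = PairedTable(both_correct=235, digital_only=27, film_only=8, neither=6)
table4 = PairedTable(both_correct=72, digital_only=67, film_only=41, neither=90)

for name, t in (("Table 3", table3), ("Table 4", table4)):
    print(name, t.total, t.digital_correct, t.film_correct,
          t.net_gain_digital, round(t.net_gain_percent, 1))
```

Only the four inner cells are stored; the marginal totals are derived, so any transcription error in a table shows up as a mismatch with the published totals.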

DISCUSSION

The digital mammographic equipment developed over the past few years has demonstrated superior detection of low-contrast objects in contrast-detail studies (3,4). This improvement, along with a wider dynamic range, is expected to enhance the diagnostic quality of images, particularly for women with dense breast parenchyma. The true flexibility, and benefit, of digital technology is primarily realized in a soft-copy display of the images and, consequently, in soft-copy reading. This potential benefit of digital mammography should result in enhanced cancer detection, especially in dense breast parenchyma, because of increased contrast resolution.

The three large-scale trials in which screen-film mammography and digital mammography have been compared in screening populations so far, however, did not show any statistically significant differences in the cancer detection rate between the two modalities (8–11). The differences between screen-film mammography and digital mammography for cancer detection rates found in the first two trials (8,9) could be explained by random statistical variation or lack of power. The third and larger-scale trial, however, showed a cancer detection rate for women in the age group of 50–69 years that was close to statistical significance in favor of digital mammography, compared with screen-film mammography (10).

We present the results of ROC analyses that were based on the BI-RADS classification, since most breast imaging radiologists are familiar with this system. A principal requirement for performance of ROC analysis is that the measurements or interpretations are meaningfully ranked in magnitude (13). It is a matter of discussion, however, whether the BI-RADS categories of 1–5 should be considered as a linear, continuous scale suitable for ROC analysis (14). The overall mean Az values for all cases and all readers that were based on the BI-RADS categories were comparable with the Az values that were based on the somewhat more linear five-point rating scale for probability of cancer.
We, therefore, decided to present the diagnostic performance rating on the basis of the widely familiar categories of 1–5 in the BI-RADS lexicon. The ROC curves for digital mammography and screen-film mammography for all readers and all cases on the basis of the BI-RADS classification did not intersect, whereas the corresponding curves that were based on the probability scale crossed slightly.

In the present study, results of the ROC analyses showed a slightly higher overall reader diagnostic performance rating with digital mammography, compared with screen-film mammography, although the difference was not statistically significant. Comparison of the interpretations for the cancer cases revealed a higher true-positive rating for digital mammography, but the difference was not statistically significant. Five of the six readers had a higher diagnostic performance rating with digital mammography by using Az values for comparison, and the difference was statistically significant for one of the five readers. Two of the six readers showed a statistically significantly higher overall performance rating with digital mammography when the comparison was based on a fixed single false-positive fraction that defines BI-RADS category 3 or higher as a true-positive classification for cancers. Thus, the results of our study confirm those of previous experimental studies with a digital mammographic unit, which have shown that digital mammography is at least comparable with screen-film mammography in the detectability and characterization of microcalcifications and low-contrast objects (5–7,15–18). The differences among the six readers reflect the challenges of interobserver variation in the interpretation of screening mammograms (19,20).

The substantially higher number of true-negative interpretations for benign masses with digital mammography compared with screen-film mammography in our study is noteworthy. This better characterization of benign masses with digital mammography seems to support the results of a previous study, which showed a statistically significantly lower recall rate for women who underwent digital mammography, compared with those who underwent screen-film mammography, in a screening population (8).
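The ROC comparisons discussed above depend only on how ordinal ratings rank cancers against noncancers, and the fixed operating point depends on a single threshold (BI-RADS category 3 or higher). The sketch below illustrates both ideas on invented ratings. Note the substitution: it computes the empirical (Mann-Whitney) area under the ROC curve, a simpler estimator than a fitted-curve Az value, and it is not a reconstruction of the study's ROC software. The function names and the sample ratings are hypothetical.

```python
def empirical_auc(cancer, noncancer):
    """Empirical area under the ROC curve for ordinal ratings.

    Equals the fraction of (cancer, noncancer) pairs in which the cancer
    received the higher rating, with ties counted as one half.
    """
    wins = 0.0
    for c in cancer:
        for n in noncancer:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer) * len(noncancer))


def operating_point(cancer, noncancer, threshold=3):
    """Return (true-positive fraction, false-positive fraction) when a rating
    at or above `threshold` is called positive."""
    tpf = sum(r >= threshold for r in cancer) / len(cancer)
    fpf = sum(r >= threshold for r in noncancer) / len(noncancer)
    return tpf, fpf


# Hypothetical BI-RADS ratings (1-5) from one reader; not data from the study.
cancer_ratings = [5, 4, 4, 3, 2, 5, 4, 1]
noncancer_ratings = [1, 1, 2, 3, 1, 2, 4, 1, 2, 1]

print("empirical AUC:", empirical_auc(cancer_ratings, noncancer_ratings))
print("TPF, FPF at category >= 3:", operating_point(cancer_ratings, noncancer_ratings))
```

Because the area depends only on rank order, relabeling the categories with any increasing scale leaves it unchanged, which is why the requirement quoted from reference 13 is that the ratings be meaningfully ranked rather than evenly spaced.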
On the other hand, the higher number of false-positive classifications (BI-RADS category 3 or higher) of cases with normal findings in our study with digital mammography (11.9%) compared with screen-film mammography (8.2%) seems to support the results of the Oslo I and II studies, in which a higher recall rate was reported for women who underwent digital mammography (9,10). Interpretation of our results related to benign and malignant masses caused problems, partly because the ROC curves for masses intersected. The characterization of benign microcalcifications in our study showed no difference between the two modalities.

Digital mammography is expected to have a potential benefit over screen-film mammography for breast cancer detection in women with dense breast parenchyma because of increased contrast resolution and wider dynamic range. Classification of the population according to cancer cases in women with dense breast parenchyma (BI-RADS breast density categories 3 and 4) in our study, however, revealed no significant difference in the true-positive interpretations between the two modalities. Since there were only 18 cancers in this subgroup, a larger study would be required to verify improved detection and classification of cancers with digital mammography in women with dense breast parenchyma. There were slightly, but not significantly, more correct true-negative interpretations with digital mammography for the benign masses in women with dense breast parenchyma.

The experimental design of this study led to some limitations. An important aspect of our study was the comparison of diagnostic performance ratings with respect to screen-film mammography versus digital mammography for the detection of small asymptomatic lesions, such as those usually found in women in a mammographic screening population. The large number of cancers included in the study population sample, however, does not reflect a screening situation, and this feature may explain the high number of false-positive interpretations, and especially the relatively high number of false-positive interpretations for cases with normal findings with both modalities. This clearly shows, once again, that results from retrospective experimental studies are difficult to compare with prospective interpretations performed in daily practice. Also, significant results could not be derived for different mammographic types of lesions, because of a relatively low number of cases in various subgroups and, consequently, a lack of power.
In conclusion, our results demonstrated that digital mammography with soft-copy reading was slightly superior to screen-film mammography in the detection and the characterization of lesions that are representative of those in women in a mammographic screening program. Overall, however, there was no statistically significant difference in diagnostic performance ratings between the two imaging modalities. The small number of cases in our study limits the ability to determine true differences between the two imaging modalities for various subgroups of lesions.




References

1. Saarenmaa I, Salminen T, Geiger U, et al. The visibility of cancer on earlier mammograms in a population-based screening programme. Eur J Cancer 1999;35:1118–1122.
2. van Dijck JAAM, Verbeek ALM, Hendriks JHCL, Holland R. The current detectability of breast cancer in a mammographic screening program: a review of the previous mammograms of interval and screen-detected cancers. Cancer 1993;72:1933–1938.
3. Berns EA, Hendrick RE, Cutter GR. Performance comparison of full-field digital mammography to screen-film mammography in clinical practice. Med Phys 2002;29:830–834.
4. Suryanarayanan S, Karellas A, Vedantham S, Ved H, Baker SP, D’Orsi CJ. Flat-panel digital mammography system: contrast-detail comparison between screen-film radiographs and hard-copy images. Radiology 2002;225:801–807.
5. Feig SA, Eskola-Feig C. Visualization of ductal carcinoma in situ and early invasive carcinoma: comparison of film-screen mammography and full-field digital mammography. In: Yaffe MJ, ed. IWDM 2000: 5th International Workshop on Digital Mammography. Madison, Wis: Medical Physics Publishing, 2001; 451–460.
6. Fischer U, Baum F, Obenauer S, et al. Comparative study in patients with microcalcifications: full-field digital mammography vs screen-film mammography. Eur Radiol 2002;12:2679–2683.
7. Obenauer S, Luftner-Nagel S, von Heyden D, Munzel U, Baum F, Grabbe E. Screen film vs full-field digital mammography: image quality, detectability and characterization of lesions. Eur Radiol 2002;12:1697–1702.
8. Lewin JM, Hendrick RE, D’Orsi CJ, et al. Comparison of full-field digital mammography with screen-film mammography for cancer detection: results of 4,945 paired examinations. Radiology 2001;218:873–880.
9. Skaane P, Young K, Skjennald A. Population-based mammography screening: comparison of screen-film and full-field digital mammography with soft-copy reading—Oslo I study. Radiology 2003;229:877–884.
10. Skaane P, Skjennald A. Screen-film mammography versus full-field digital mammography with soft-copy reading: randomized trial in a population-based screening program—the Oslo II study. Radiology 2004;232:197–204.
11. Lewin JM, D’Orsi CJ, Hendrick RE, et al. Clinical comparison of full-field digital mammography and screen-film mammography for detection of breast cancer. AJR Am J Roentgenol 2002;179:671–677.
12. Pedersen K, Nordanger J. Quality control of the physical and technical aspects of mammography in the Norwegian breast-screening programme. Eur Radiol 2002;12:463–470.
13. Obuchowski NA. Receiver operating characteristic curves and their use in radiology. Radiology 2003;229:3–8.
14. Cole EB, Pisano ED, Kistner EO, et al. Diagnostic accuracy of digital mammography in patients with dense breasts who underwent problem-solving mammography: effects of image processing and lesion type. Radiology 2003;226:153–160.
15. Bick U, Diekmann F, Marth F, Le Roux A, Juran R, Friedrich M. Comparison of a new full-field digital mammography system and conventional film-screen mammography: a phantom study (abstr). Radiology 1999;213(P):152.
16. Diekmann S, Bick U, von Heyden H, Diekmann F. Visualization of microcalcifications on mammographies obtained by digital full-field mammography in comparison to conventional film-screen mammography [in German]. Rofo 2003;175:775–779.
17. Rosol MS, Niklason LT, Venkatakrishnan V, Silvenoinnen H, Kopans DB, Hamberg LM. Contrast-detail comparison of a full-field mammography system and a screen-film system (abstr). Radiology 1999;213(P):151.
18. Venta LA, Hendrick RE, Adler YT, et al. Rates and causes of disagreement in interpretation of full-field digital mammography and film-screen mammography in a diagnostic setting. AJR Am J Roentgenol 2001;176:1241–1248.
19. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists. Arch Intern Med 1996;156:209–213.
20. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists’ interpretation of mammograms. N Engl J Med 1994;331:1493–1499.