Evaluation of a Prospective Scoring System Designed for ... - CiteSeerX

0 downloads 0 Views 208KB Size Report
(MITR), and pattern of contrast material washout (POCW). The system was ... Cancer Research, Royal Marsden Hospital, London, En- gland (S.R.L.); and Mount ...
ORIGINAL RESEARCH

Ruth M. L. Warren, MD Deborah Thompson, PhD Linda J. Pointon, MPhil Rebecca Hoff, BSc Fiona J. Gilbert, FRCR Anwar R. Padhani, FRCR Douglas F. Easton, PhD Sunil R. Lakhani, MD Martin O. Leach, PhD Collaborators in the United Kingdom Medical Research Council Magnetic Resonance Imaging in Breast Screening (MARIBS) Study

Purpose:

To evaluate prospectively the accuracy of a lesion classification system designed for use in a magnetic resonance (MR) imaging high-breast-cancer-risk screening study.

Materials and Methods:

All participating patients provided written informed consent. Ethics committee approval was obtained. The results of 1541 contrast material– enhanced breast MR imaging examinations were analyzed; 1441 screening examinations were performed in 638 women aged 24 –51 years at high risk for breast cancer, and 100 examinations were performed in 100 women aged 23– 81 years. Lesion analysis was performed in 991 breasts, which were divided into design (491 breasts) and testing (500 breasts) sets. The reference standard was histologic analysis of biopsy samples, fine-needle aspiration cytology, or minimal follow-up of 24 months. The scoring system involved the use of five features: morphology (MOR), pattern of enhancement (POE), percentage of maximal focal enhancement (PMFE), maximal signal intensity–time ratio (MITR), and pattern of contrast material washout (POCW). The system was evaluated by means of (a) assessment of interreader agreement, as expressed in ␬ statistics, for 315 breasts in which both readers analyzed the same lesion, (b) assessment of the diagnostic accuracy of the scored components with receiver operating characteristic curve analysis, and (c) logistic regression analysis to determine which components of the scoring system were critical to the final score. A new simplified scoring system developed with the design set was applied to the testing set.

Results:

There was moderate reader agreement regarding overall lesion outcome (ie, malignant, suspicious, or benign) (␬ ⫽ 0.58) and less agreement regarding the scored components. The area under the receiver operating characteristic curve (AUC) for the overall lesion score, 0.88, was higher than the AUC for any one component. The components MOR, POE, and POCW yielded the best overall result. PMFE and MITR did not contribute to diagnostic utility. Applying a simplified scoring system to the testing set yielded a nonsignificantly (P ⫽ .2) higher AUC than did applying the original scoring system (sensitivity, 84%; specificity, 86.0%).

Conclusion:

Good diagnostic accuracy can be achieved by using simple qualitative descriptors of lesion enhancement, including POCW. In the context of screening, quantitative enhancement parameters appear to be less useful for lesion characterization.

1

From the Department of Radiology, Addenbrooke’s Hospital, Cambridge, England (R.M.L.W.); Cancer Research-UK Genetic Epidemiology Unit, Cambridge, England (D.T., D.F.E.); Study Coordinating Office, Section of Magnetic Resonance, Institute of Cancer Research, Royal Marsden Hospital, Downs Rd, Sutton Surrey SM2 5PT, England (L.J.P., R.H., M.O.L.); Department of Radiology, University of Aberdeen, Aberdeen, Scotland (F.J.G.); Breakthrough Breast Cancer Research Centre, Institute of Cancer Research, Royal Marsden Hospital, London, England (S.R.L.); and Mount Vernon Hospital, Northwood, Middlesex, England (A.R.P.). The complete list of MARIBS group members and their affiliations is listed in Appendix E1 (radiology.rsnajnls.org/cgi/content/full/2393042007/ DC1). Received November 25, 2004; revision requested January 27, 2005; revision received May 5; accepted June 3; final version accepted August 1. Supported by a project grant from the United Kingdom Medical Research Council. D.F.E. and D.T. supported by Cancer Research U.K. Genetic Epidemiology Unit. Address correspondence to M.O.L. (e-mail: [email protected]).

娀 RSNA, 2006

Supplemental material: radiology.rsnajnls.org/cgi/ content/full/2393042007/DC1

姝 RSNA, 2006 Radiology: Volume 239: Number 3—June 2006

677

䡲 BREAST IMAGING

Evaluation of a Prospective Scoring System Designed for a Multicenter Breast MR Imaging Screening Study1

BREAST IMAGING: Scoring System for Breast MR Screening Study

A

nalysis of dynamic magnetic resonance (MR) imaging features to discriminate between benign and malignant breast lesions is critical to the effective use of breast MR imaging for screening patients at high risk for cancer with high sensitivity and without reduced specificity. In a United Kingdom study of MR imaging screening for breast cancer in women at high risk, the Magnetic Resonance Imaging in Breast Screening (MARIBS) Study, a scoring system to help discriminate benign from malignant lesions was devised at the outset (1). When the study protocol was written in 1996, the scoring system was devised by using the literature of that period as the basis (2). It should be noted that the American College of Radiology Breast Imaging Reporting and Data System (BIRADS) lexicon (3) was developed after our study had commenced (4). The enhancement characteristics of individual breast lesions were scored in five categories, and the aggregated score was used to classify the lesions as malignant, suspicious, or benign. The details of the scoring system are given in Appendix E2 (Fig E2, radiology.rsnajnls.org/cgi/content/full /2393042007/DC1). The recall rate for MR imaging in the high-risk population has been analyzed previously and was 10% (5). The background for the scoring decisions is described in the published protocol (2). The same features were subsequently used to create the BI-RADS breast MR imaging lexicon (3,4). The system for scoring was based on three morphologic features: lesion shape, uniformity or heterogeneity of the contrast material uptake, and shape of the contrast material washout curve in the dynamic examination. Two numerical features also were included: percentage of maximal focal enhancement (PMFE) and maximal signal intensity–time ratio (MITR); definitions of these features are given in the worksheets in Appendix E2 (radiology.rsnajnls.org/cgi/content/full /2393042007/DC1). Thus, an interpretation based on integrated morphologic and quantitative pharmacokinetic features was made to maximize specificity while preserving clinically acceptable sensitivity. The cutoff points chosen in the scoring system were to be validated,

678

as they are in the present analysis, by using a symptomatic cohort of patients. An analysis to study reader performance by using the scoring system and the overall conclusions was performed previously (6). The sensitivity and specificity of our scoring system have already been published (6) and are based on analyses of the results of 1441 examinations performed in women at high risk for breast cancer and the results of 100 examinations performed in women treated at symptom clinics. Briefly, the sensitivity of the double reading of mammograms was 91% (95% confidence interval [CI]: 83%, 96%), and the specificity was 81% (95% CI: 79%, 83%). The purpose of our current study was to evaluate prospectively the accuracy of a lesion classification system designed for use in the United Kingdom MARIBS study.

Materials and Methods This national study was supported by a grant from the Medical Research Council. The MR imaging examinations were paid for from subvention funding for research from the United Kingdom National Health Service. The study protocol was based in part on developments supported by Cancer Research United Kingdom and Yorkshire Cancer Research. Schering Healthcare (Burgess Hill, United Kingdom) and Oracle Education (Bracknell, United Kingdom) made contributions for training and education in the study. None of the authors had a financial interest in the companies associated with the research, and only the authors had full control of the data that were used and published. The study was a United Kingdom–wide collaboration of 22 genetics centers and their associated MR imaging and mammography departments. The key clinical contributors are named in Appendix E1 (radiology.rsnajnls.org/cgi/content/full /2393042007/DC1).

Study Population All participating patients provided written informed consent. The study received ethical approval from the North Thames Multicenter research ethics

Warren et al

committee and the local research ethical committees of the 22 participating centers. The data analyzed comprised those from 1541 breast MR imaging examinations performed in 738 women who were examined as part of the MARIBS study between August 6, 1997 (start of the study), and February 25, 2003. Each imaging study was read prospectively by two independent radiologists. A total of 44 radiologists (including R.M.L.W., F.J.G., and A.R.P.) from 22 centers in the United Kingdom participated in the study; a median of 36.5 studies (range, 1–509) were read per radiologist. The reference standard was either the abnormal finding in any case in which cytologic or histologic results were available or a verified negative result of a minimum follow-up of 24 months. The women at high risk for breast cancer (n ⫽ 638) were monitored until the end of the study (March 31, 2004) at genetics clinics. Twenty-six breast cancers were detected in this highrisk group during the study period. Of 352 women who underwent one subsequent annual screening with MR imaging and/or mammography (32 women underwent MR imaging only, nine underwent

Published online 10.1148/radiol.2393042007 Radiology 2006; 239:677– 685 Abbreviations: BI-RADS ⫽ Breast Imaging Reporting and Data System CI ⫽ confidence interval MARIBS ⫽ Magnetic Resonance Imaging in Breast Screening MITR ⫽ maximal signal intensity–time ratio MOR ⫽ morphology PMFE ⫽ percentage of maximal focal enhancement POCW ⫽ pattern of contrast material washout POE ⫽ pattern of enhancement ROC ⫽ receiver operating characteristic Author contributions: Guarantor of integrity of entire study, M.O.L.; study concepts/study design or data acquisition or data analysis/ interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; manuscript final version approval, all authors; literature research, R.M.L.W., L.J.P.; clinical studies, R.M.L.W., F.J.G., A.R.P.; statistical analysis, D.T., D.F.E.; and manuscript editing, R.M.L.W., D.T., L.J.P., F.J.G., A.R.P. Authors stated no financial relationship to disclose. Radiology: Volume 239: Number 3—June 2006

BREAST IMAGING: Scoring System for Breast MR Screening Study

mammography only, 311 underwent both examinations), five had cancer that was detected at screening with contrast material– enhanced MR imaging. Thirty-six women underwent two subsequent annual screenings with MR imaging and/or mammography (three women underwent MR imaging only, 33 underwent both examinations), and no cancers were detected. Of the women who did not undergo subsequent imaging or who withdrew from the study, 67 had 1–2 years of follow-up and 157 had more than 2 years of follow-up. Most of these women will have been offered annual mammographic examinations outside the context of this study during the follow-up period. Two interval cancers arose in this group: one at 10 months and one at 27 months. No other breast cancers were reported. The majority of the examinations (n ⫽ 1441) were performed in a screening cohort of 638 women aged 24 –51 years (mean age, 40.5 years) who were at high genetic risk for breast cancer; this group formed the high-risk cohort. Eligibility for this cohort was based on the following: carrying of a known BRCA1, BRCA2, or p53 mutation; being a first-degree relative of a known carrier of either mutation; or having a family history suggestive of a 50% or higher probability of being a carrier of either mutation (7). Symptomatic women in the high-risk cohort were excluded. The remaining 100 examinations were performed in 100 women recruited at the start of the study (not on the basis of family history) who had an indeterminate lesion at mammography and/or ultrasonography or a possible breast cancer for which a histologic diagnosis would be obtained at excision biopsy, core-needle biopsy, or fine-needle aspiration cytology and were attending a symptomatic breast treatment clinic. These women were aged 23– 81 years (mean, 49.2 years) and are herein referred to as the symptomatic cohort. Additional details about these patient cohorts have been published (6). The pathologists (including S.R.L.) from all 22 centers either participated in the United Kingdom Breast Screening Radiology: Volume 239: Number 3—June 2006

Program or analyzed specimens and participated in the United Kingdom Pathology and Cytology Quality Assurance Program. A pathologist (S.R.L.) reviewed the pathology reports to classify lesions as benign or malignant (see Appendix E3 [radiology.rsnajnls.org/cgi/content/full /2393042007/DC1]). At the time of this writing, it had been more than 24 months since the last MR imaging examination described herein was performed; thus, there has been adequate time for false-negative findings to emerge. There were 91 histologically proved cancers and 86 biopsy-proved histologically benign lesions. For 307 lesions analyzed at additional examinations, the methods of validation were as follows: Of the benign lesions, 126 were diagnosed with further imaging; 36, with fine-needle aspiration cytology only; 23, with core-needle biopsy (five with MR imaging guidance); 12, with core-needle biopsy and surgical biopsy; and 19, with surgical biopsy. Of the malignant lesions, 46 were diagnosed with core-needle biopsy and surgery (19 of these lesions were analyzed at fine-needle aspiration cytology also), and 45 were diagnosed with surgery (20 of these lesions were analyzed at fine-needle aspiration cytology also).

MR Imaging Protocols The MR imaging screening examination (protocol A) consisted of high-spatialresolution (512 ⫻ 256 matrix) T1weighted sequences performed before and after contrast medium injection, with two three-dimensional coronal acquisitions of lower spatial resolution (256 ⫻ 128 matrix) performed before the intravenous bolus injection of 0.2 mmol of gadopentetate dimeglumine (Magnevist; Schering, Berlin, Germany) per kilogram of body weight and four to six acquisitions performed immediately after the injection (Appendix E4 [radiology.rsnajnls.org /cgi/content/full/2393042007/DC1]) (2). An optional additional fat-suppressed T1weighted high-spatial-resolution sequence performed approximately 8 minutes after the injection also was allowed. This combination of sequences enabled analysis of the signal intensity–time char-

Warren et al

acteristics of any region in the imaging volume of either breast and morphologic assessment of high-spatial-resolution images. Patients who were recalled because of an indeterminate MR imaging study that was deemed suspicious (Appendix E2 [radiology.rsnajnls.org/cgi/content /full/2393042007/DC1]) underwent either a high-temporal-resolution examination with 0.1 mmol/kg gadopentetate dimeglumine (protocol B, Appendix E4 [radiology.rsnajnls.org/cgi/content /full/2393042007/DC1]), with attention focused on the area of the breast where the abnormality was detected at the initial MR imaging screening examination (by preference, the protocol recommends that this second MR imaging examination be performed in the sagittal plane, but the radiologist may opt to use the coronal plane), or a repeat of the initial protocol A MR imaging screening examination with 0.2 mmol/kg gadopentetate dimeglumine. Either alternative examination was to be performed during a phase of the patient’s menstrual cycle different from that during which the initial screening examination was performed. The supervising radiologist chose the diagnostic pathway, but in general, diffuse or multiple abnormalities were addressed by repeating protocol A and focal lesions were addressed by performing the high-temporal-resolution examination, which covered only the portion of the breast specified by the supervising radiologist. Benign lesions were sampled at biopsy, where possible, rather than managed with surveillance, but the protocol allowed for a repeat protocol A examination at 6 months (8). Full details of the MR imaging protocol are given in Appendix E4 (radiology.rsnajnls.org/cgi /content/full/2393042007/DC1) and have been described previously (2).

Scoring System Radiologists read the initial MR studies while blinded to other MR imaging, clinical, and mammographic findings. The scoring system used was based on five morphologic and dynamic contrast material uptake characteristics and derived from the literature (9) that was 679

BREAST IMAGING: Scoring System for Breast MR Screening Study

available before the start of the study. Findings were recorded prospectively on standard worksheets (Appendix E2 [radiology.rsnajnls.org/cgi/content/full /2393042007/DC1]), which were developed to ensure consistency in the choice and analysis of regions of interest. Although a reading radiologist may have analyzed multiple regions of interest in a breast, only the highest scoring lesion was included in the analysis. Since this was a multicenter study, the data are those from examinations performed at two field strengths (1.0 and 1.5 T) on MR imaging units from four different manufacturers (GE Medical Systems, Milwaukee, Wis; Siemens, Erlangen, Germany; Philips Medical Systems, Eindhoven, the Netherlands; Picker, Cleveland, Ohio). The two elements of the scoring system that were intended to reflect the morphologic characteristics of the lesion were morphology (MOR)—that is, whether the lesion was lobulated or well defined, poorly defined, or spiculated— and pattern of enhancement (POE)— that is, whether the enhancement was centrifugal, absent, minimal, or homogeneous; heterogeneous; or ring like. The pattern of contrast material washout (POCW) was a qualitative variable that indicated whether the signal intensity showed (a) a monotonic increase with time, (b) an increase followed by a plateau, or (c) an increase followed immediately by a decrease (ie, washout). The two quantitative variables were PMFE and MITR, the calculations for which are given in Figure E1 (radiology.rsnajnls .org/cgi/content/full/2393042007/DC1). The weightings of some variables were chosen on the basis of findings from the literature, and the weightings of other variables were obtained from a center in the United Kingdom and subsequently published (10). The total score for the five weighted variables was used to classify each lesion as malignant, suspicious, or benign.

Statistical Analyses From a total of 1541 double-read MR imaging examinations performed in 738 women, the maximal number of breasts in which a lesion could have been ana680

lyzed was 3082, in 2091 of which a lesion was not analyzed by either reader. When a reader analyzed more than one lesion in a single breast, only the highest scoring lesion was considered in the analysis. For 550 breasts, just one reader analyzed a lesion, and for 126 breasts, both readers analyzed different lesions. Thus, in 315 breasts, both readers chose the same lesion to analyze. The levels of between-reader agreement across different elements of the scoring system were evaluated by using ␬ statistics and on the basis of the 315 MR imaging examinations at which the same lesion was judged by both readers to be the highest scoring (11). Weighted ␬ values were used for outcomes with more than two levels, with the assumption of evenly spaced categories (12): A ␬ value of less than or equal to 0.20 indicated poor agreement; 0.21– 0.40, fair agreement; 0.41– 0.60, moderate agreement; 0.61– 0.80, substantial agreement; and greater than 0.80, almost perfect agreement. For the analyses aimed at improving the scoring system, it was necessary to divide the data into a design set, which was used to design the new scoring system, and a testing set, with which the new system would be tested. From the complete set of 991 breasts in which a lesion was analyzed by one or both readers, 500 breasts were randomly sampled to create the testing set and the remaining 491 breasts served as the design set. The 114 lesion-containing breasts of the 100 women in the symptomatic cohort were evenly distributed between the two groups. To avoid the nonindependence of two readings of the same lesion, only one reading was used for each breast. When both readers analyzed a lesion in the same breast (441 breasts), the reader whose results were used was selected randomly. Both continuous variables—PMFE and MITR—were separated into three categories for the purposes of the scoring system in the main MARIBS screening study, according to cutoff points chosen by using the best information available before the start of the study. The results suggested that this may not have been the most appropriate way of

Warren et al

dividing the data, so the 33.3rd and 66.7th percentiles of the PMFE and MITR distributions in the design set were used to select more informative cutoff points and to investigate whether it would be more suitable to use MR unit–specific cutoff points. With the original scoring system, only the absolute value of the MITR was used: [(SIpost ⫺ SIpre) 䡠 100]/T, where SIpost is the maximal signal intensity after the contrast material administration, SIpre is the average signal intensity before the contrast material administration, and T is the time between the contrast material administration and the time at which the maximal signal intensity after contrast material administration was achieved. As an alternative approach, we also considered whether the normalized MITR— defined as [(SIpost ⫺ SIpre) 䡠 100]/(T 䡠 SIpre)—was a more useful measure. Multiple logistic regression analysis with backward stepwise selection was used to determine the optimal combination of variables needed to predict a malignant lesion, and a new simplified scoring system was devised on the basis of these results. The sandwich variance estimator was used to allow for the nonindependence of examinations performed in the same woman (when lesions were analyzed either in both breasts or at more than one annual examination) (13). Statistical significance was set at the 5% level. The discriminatory abilities of the different scoring system elements, of the new simplified scoring system, and of the existing system were evaluated by comparing the areas under the nonparametric receiver operating characteristic (ROC) curves (14). The asymptotic 95% CIs assumed a normal distribution for the area under the curve, with standard errors derived from the DeLong et al method (14). The performance of the new simplified scoring system was evaluated in the testing set of 500 breasts. The 95% CIs for sensitivity and specificity in the detection of malignancy were exact and based on the binomial distribution. All analyses were performed by using Stata, version 8.2, software (StataCorp, College Station, Tex). Radiology: Volume 239: Number 3—June 2006

BREAST IMAGING: Scoring System for Breast MR Screening Study

For ␬ statistic agreement analysis, we used only the 315 breasts for which both readers chose the same highest scoring lesion. To redefine the scoring system, we used only one reading for each breast. If only one reader analyzed a lesion in a particular breast, then that reading was used. If both readers analyzed lesions (regardless of whether they both chose the same highest scoring lesion), the lesion used in the analysis was randomly chosen from the two highest scoring lesions.

Results Interreader Agreement For the 315 lesions that were analyzed by both readers, the ␬ statistic for interreader agreement regarding the overall lesion outcome (ie, malignant, suspicious, or benign) was a moderate 0.56 (95% CI: 0.47, 0.65), or 0.58 (weighted) if the three outcome categories were considered separately. The levels of agree-

ment between readers 1 and 2 regarding the morphologic parameters (␬ ⫽ 0.51 for MOR, ␬ ⫽ 0.49 for POE) and the POCW (␬ ⫽ 0.53) were similar but slightly higher regarding the categorized MITR with use of the three original categories (␬ ⫽ 0.65). Reader agreement regarding the categorized PMFE with use of the three original categories was very poor (␬ ⫽ 0.12), but the choice of cutoff points led to all but 15 of the 630 readings scoring four points for this element of the scoring system. Figure 1a shows the PMFE values calculated for the 491 breasts in the design set according to MR unit. By chance, none of the three examinations performed with the Picker machine was included in the design set. The distribution of values appeared to be broadly similar among the three MR units; therefore, it seemed reasonable to use the same categories for all machines. The values in the overall 33.3rd and 66.7th percentiles were 158% and 263%, respectively; thus, after round-

Warren et al

ing, the PMFE values for the three proposed new scoring categories were less than 160%, greater than or equal to 160% and less than 260%, and greater than or equal to 260%. These are very different from the original values: less than 40%, 40%– 60%, and greater than 60%, respectively. Changing the categories in this way substantially improved interreader agreement (weighted ␬ ⫽ 0.49). Figure 1b shows the same data for MITRs. Since the distribution of values differed markedly among the MR units, it was more appropriate to define the categories separately for each machine. Rounding the values in the 33.3rd and 66.7th percentiles yielded the following ratio categories: for the GE Medical Systems unit, less 20, between greater than or equal to 20 and less than 35, and greater than or equal to 35; for the Siemens unit, less than 40, between greater than or equal to 40 and less than 60, and greater than or equal to 60; and for the Philips Medical Systems unit,

Figure 1

Figure 1: (a) Graph shows distribution of PMFE values in the design set according to MR unit, including the 33.3rd and 66.7th percentiles on which the cutoff values were based. The distribution of PMFE values was very similar for the three MR units; thus, it was appropriate to use a single set of cutoff points for the categories in the revised scoring system. (b) Graph shows distribution of MITRs in the design set according to MR unit, including the 33.3rd and 66.7th percentiles on which the cutoff values were based. The wide disparities between the values measured by using the different machines indicate that different cutoff points were used for each unit in the revised scoring system. (c) Graph shows distribution of normalized MITRs in the design set according to MR unit, including the 33.3rd and 66.7th percentiles on which the cutoff values were based. Normalizing the MITR made the distribution similar across all three units; thus, a single set of cutoff points was used for the categories in the revised scoring system. E ⫽ benign lesions, ⴱ ⫽ malignant lesions. Radiology: Volume 239: Number 3—June 2006

681

BREAST IMAGING: Scoring System for Breast MR Screening Study

less than 250, between greater than or equal to 250 and less than 400, and greater than or equal to 400. With use of the new categories, the ␬ statistic for agreement regarding this variable was reduced (weighted ␬ ⫽ 0.56). With use of the original scoring system, only the absolute MITR was used. However, as an alternative approach, we also decided to consider the normalized MITR (Fig 1c). Normalizing the MITR made the distribution of values more similar across the different MR units, eliminating the need for machinespecific category definitions. On the basis of the overall 33.3rd and 66.7th percentiles, the normalized MITR categories were less than 0.45, between greater than or equal to 0.45 and less than 0.75, and greater than or equal to 0.75, yielding a weighted ␬ statistic of 0.49.

Lesion Size The widest diameters of 968 of the 991 lesions considered in this analysis were recorded. The lesions for which size

was not recorded were not significantly different in terms of either patient age distribution (P ⫽ .17, Kruskal-Wallis test) or total lesion score (P ⫽ .44), but they were about three times more likely to have been examined with a 1.0-T rather than 1.5-T MR imaging unit (P ⫽ .018, adjusted for MR unit). In the design set, the median size for the 478 lesions of known size was 7 mm in diameter (range, 1– 63 mm; interquartile range, 5–12 mm). With small lesions defined as those 10 mm or less in diameter, there were 346 small lesions, three of which were malignant, and 132 larger lesions, 39 of which were malignant.

Test Performance The diagnostic accuracies of each element of the scoring system are illustrated by the ROC curves in Figure 2, which are based on the design set. The total score was more useful for distinguishing between malignant and nonmalignant lesions than any individual component of the scoring system. However, the area under the ROC curve for the

Warren et al

total score was not significantly greater than that for the POE curve (P ⫽ .1); this result suggests that the POE was almost as useful a predictor of malignancy as the total score. The categorical PMFE and MITR values and the total score illustrated in Figure 2 are based on the new category cutoff points. Using the original category cutoff points did not cause a significant reduction in the area under the curve for the total score (area for original total score, 0.88; 95% CI: 0.83, 0.94; P ⫽ .9 for difference in area with use of old vs new category cutoff points) or the MITR (area for original value, 0.56; 95% CI: 0.49, 0.63; P ⫽ .1). However, use of the new definition caused a significant increase in the area under the curve for the PMFE (area for original value, 0.53; 95% CI: 0.52, 0.54; P ⫽ .03). The normalized MITR was a significantly better predictor than the absolute MITR (P ⫽ .005).

Diagnostic Accuracy Six covariates (MOR, POE, POCW, PMFE, MITR, and normalized MITR)

Figures 2, 3

Figure 2: Design set– based ROC curves for the total lesion score and for each element of the scoring system. The qualitative components (POE, POCW, and MOR) appeared to be more diagnostically useful than the quantitative components (PMFE and MITR). Figure 3: Testing set– based ROC curves for the new total scoring system and the original scoring system. The area under the ROC curve for the new scoring system is slightly larger, but the difference in areas under the curve for the two systems is not significant (P ⫽ .02). 682

Radiology: Volume 239: Number 3—June 2006

BREAST IMAGING: Scoring System for Breast MR Screening Study

were assessed in a multiple logistic regression analysis by using the design set, with backward stepwise selection at the 5% significance level. All of these covariates were categorical variables with three levels, so there were a total of 12 parameters to be estimated. After removing the least significant term at each step until no nonsignificant terms remained, the model contained terms for spiculated MOR and for both levels of POE, POCW, and normalized MITR. However, the coefficients for the two levels of normalized MITR were very similar (1.7 and 1.6 for the 2nd and 3rd levels, respectively), so the covariates for the three categories were replaced by a single covariate, which indicated whether or not the normalized MITR was greater than or equal to 0.45. The coefficients for this model are given in Table 1.

Simplified Scoring System A new total score was computed for each lesion on the basis of just the terms shown in Table 1 and with the regression coefficients considered to be the optimal weightings for the different categories. The area under the ROC curve for this new score was 0.91 (95% CI: 0.86, 0.97). In reality, a scoring system that involved the allocation of integer points would be preferable, so a simplified total score was also computed for each lesion on the basis of rounded regression coefficient values (Table 1). Use of these approximate values did not affect the area under the ROC curve (P ⫽ .96). It should be noted that three of the four terms in this model (MOR, POE, and POCW) did not require the radiologist to perform any numerical analysis, whereas the sole quantitative parameter—normalized MITR—was also the least significant in the logistic regression (P ⫽ .028). Removing the normalized MITR from the scoring system had virtually no effect on the area under the ROC curve (area, 0.91; 95% CI: 0.85, 0.96) (Fig 3). Assigning all lesions with a score of 5 or higher to the malignant or suspicious category yielded a sensitivity of 81% (95% CI: 66.6%, 91.6%) and a specificity of 84.4% (95% CI: 80.7%, 87.6%) (Table 2). Radiology: Volume 239: Number 3—June 2006

Warren et al

Table 1 Results from Multiple Logistic Regression Analysis Based on the Design Set and Points Allocation for the New Scoring System Variable Spiculated MOR Heterogeneous POE Ringlike POE POCW type 2 POCW type 3 Normalized MITR ⱖ 0.45‡

Coefficient*

95% CI

P Value

No. of Points†

1.8 2.2 3.2 2.0 2.4 1.6

0.64, 2.90 1.2, 3.1 2.0, 4.4 0.91, 3.0 1.3, 3.4 0.18, 3.1

.002 ⬍.001 ⬍.001 ⬍.001 ⬍.001 .028

4 4 6 4 5 3 (0)‡

* Each coefficient from the logistic regression analysis is equal to the natural logarithm of the odds ratio associated with the presence of this characteristic. † Number of points allocated to this observation in the new scoring system, derived by rounding the regression coefficient to the nearest 0.5 and then doubling it to obtain a whole number for ease of computation. ‡ A normalized MITR score of greater than or equal to 0.45 was initially assigned three points, but the scoring system without this term was found to perform almost equally well.

Table 2 Sensitivities and Specificities of Design and Testing Sets in New and Original Scoring Systems Scoring System and Set New scoring system Design set (n ⫽ 491) Testing set (n ⫽ 500) Original scoring system Design set (n ⫽ 491) Testing set (n ⫽ 500)

Sensitivity

Specificity

81 (35/43) 66.6, 91.6 84 (37/44) 69.9, 93.4

84.4 (378/448) 80.7, 87.6 86.0 (392/456) 82.4, 89.0

79 (34/43) 64.0, 90.0 86 (38/44) 72.6, 94.8

79.9 (358/448) 75.7, 83.3 76 (347/456) 71.9, 79.9

Note.—Sensitivity and specificity values, expressed as percentages, were calculated on the basis of 43 malignant cases in the design set and 44 malignant cases in the testing set. Numbers in parentheses are the numbers of cases used to calculate the percentage. Numbers on the second line are 95% CIs.

With use of the simplified scoring system, the area under the ROC curve for the small lesions was the same as that for the entire design set (area, 0.91; 95% CI: 0.77, 1.00), but that for the larger lesions was slightly lower (area, 0.83; 95% CI: 0.74, 0.91; P ⫽ .3 for comparison of small vs large lesions). The rarity of small malignant lesions made it impossible to repeat the logistic regression analysis for this group alone, but an analysis restricted to the larger lesions yielded an optimal model similar to that in Table 1 (data not shown). The normalized MITR was

not significant in this group, but the biggest change occurred with regard to the POE: Heterogeneous POE was no longer a significant predictor of malignancy, and while a ringlike POE remained significant, the regression coefficient was reduced from 3.2 to 1.4 (95% CI: 0.11, 2.6). This reduced effect size may have been due in part simply to stochastic variation, particularly given the smaller number of lesions, but it may also have been an indication that the POE was useful primarily for the detection of smaller lesions. The new scoring system was tested 683

BREAST IMAGING: Scoring System for Breast MR Screening Study

on the 500 breasts in the testing set, and the area under the ROC curve for this set was very close to that for the design set. The new scoring system performed slightly better than the original scoring system in the testing set (Fig 3), although not significantly so (P ⫽ .2). However, the small gain in diagnostic ability was accompanied by a substantial reduction in computational complexity, since the new scoring system did not require the calculation of PMFE or MITR values. When the lesions that were assigned total scores of 5 or higher were described as malignant or suspicious, the sensitivity in the testing set was 84% (95% CI: 69.9%, 93.4%) and the specificity was 86.0% (95% CI: 82.4%, 89.0%) (Table 2). With use of the 315 lesions that were analyzed by both readers, the ␬ statistic for this definition of malignant or suspicious lesion was moderate (0.60; 95% CI: 0.50, 0.69).

Discussion Our study, in which we examined the different morphologic and dynamic uptake variables that were recorded as part of the original scoring system used in the MARIBS study, revealed the original scoring system to be effective and enabled us to devise a simplified scoring system that had similar or slightly improved diagnostic usefulness. An important consideration in evaluating the suitability of breast MR imaging for screening is the repeatability of the observations that lead to the diagnosis. In the current study, almost all levels of interreader agreement regarding the individual elements of the scoring system could be described as moderate (10), and even the new scoring system yielded a ␬ statistic of 0.60. It must be recognized that diagnoses made with breast MR imaging are sufficiently imprecise such that the allocation of diagnostic categories involves the use of terms with which only moderate agreement can be achieved, and this fact applies to both descriptive and calculated numerical observations and detailed and aggregated features. There is clearly a subjective element to the de684

scribed MR imaging screening technique despite attempts to minimize it. The double reading of all breast MR imaging studies was the practice throughout the MARIBS study. The analyses related to lesion size are interesting. We found the POE to be a more useful discriminator of lesions smaller than 10 mm in diameter. It is interesting that the frequency of small lesions that were analyzed but found to be cancers was low. This finding is not reflected in a paucity of small cancers detected with MR imaging in our main study. By chance, our case material included no histologic lesions with atypia. Histologic atypias have been a source of false-positive findings (in other studies) that was not represented in our present analyses (15). With the new simplified scoring system, integer scores were assigned for spiculated MOR, heterogeneous POE, ringlike POE, and POCW types 2 and 3, with a summed score higher than 5 indicating a malignant or suspicious lesion. With the new simplified scoring system, the sensitivities in both the design set and the testing set were within the range of the previously published singlereader sensitivities in the main MARIBS study (80% for reader 1, 89% for reader 2), but the specificities in both sets were lower than the 88% specificity previously reported for both reader 1 and reader 2. However, this was more a consequence of the different data sets than a consequence of the behavior of the scoring system itself, as can be seen from the comparisons between the new simplified and original scoring systems. Because the design and testing sets described herein included only breasts with lesions that radiologists had decided to analyze, the study was necessarily enriched for interesting observations, as opposed to the main study, which included also all the breasts that did not have analyzed lesions. At the design stage of the study protocol, there was debate within the study advisory group (members listed in Appendix E1 [radiology.rsnajnls.org /cgi/content/full/2393042007/DC1]) and in the literature regarding the precise

Warren et al

method of calculating the parameter that included both the peak signal intensity and the time to achieve it. The MITR was chosen because it was the preferred measure reported in the literature at the time. The normalized MITR takes into account the signal intensity before the contrast material injection and thus includes adjustments for MR unit–related factors, and the use of this parameter eliminates the problem of a lack of uniform signal intensity related to the site within the magnet. We found that the normalized MITR is a more informative measure than the absolute MITR and has the advantage of having the same distribution on all MR unit types. Nevertheless, it does not appear to be an important independent predictor of malignancy. Interestingly, the variables that required the most computational effort on the part of radiologists (PMFE, MITR, and normalized MITR) apparently were also the least diagnostically useful. The proposed new scoring system should therefore be quicker, easier to perform, and less prone to calculation errors while maintaining sensitivity in lesion detection and characterization. The described analyses were performed after the full publication of the American College of Radiology BI-RADS breast MR imaging lexicon (3), which is designed to be a common format for reporting breast MR imaging data. The introductory statement in the published document recognizes the existing diversity of interpretations and the need for a more uniform approach. A comparison of the conclusions of our scoring system analysis with the data cited in the first (2003) edition of the published BIRADS lexicon revealed that our study and the extensive work on the design and testing of the MR lexicon in BIRADS yield similar conclusions. Our scoring system included a very simple descriptive categorization that was intended to yield a simple integer score. Since our radiologists were extensively involved in mammography, it is likely that the morphologic features familiar to them from mammography guided the use of the simple MOR score requested on our forms. POE and washout characteristics are an essential part of the BIRADS lexicon report. Our study results Radiology: Volume 239: Number 3—June 2006

BREAST IMAGING: Scoring System for Breast MR Screening Study

show that these are important breast lesion discriminators. On the other hand, numerical scores and calculations of the contrast dynamics are discarded in the BIRADS lexicon, and the associated text recognizes the complexity of the literature and the local importance of these features in some medical centers. Our results therefore independently support the approach used by the authors of the BI-RADS lexicon. Our study had limitations. Our initial plan was to use the 100 images from the symptomatic cohort to design an improved scoring system, which would then be tested in the early cases in the screening study. This proved to be impractical, however, largely because the age distribution of the recruited symptomatic cohort was not representative of the age distribution of the screening cohort. There also may have been cancers with different morphologic characteristics. The revised scoring system could not be based purely on data from the screening cohort because the small number of malignant cases yielded little information about the sensitivity of the system. Thus, the material that we used was a combination of lesions from screening cases (n ⫽ 877) and lesions from symptomatic cases (n ⫽ 114). While it was often the case that a radiologist identified and scored multiple lesions in the same breast, the analysis described herein was based on only the highest scoring lesions. This was reasonable for the main analysis, since in practice the maximal score determines whether a woman will be recalled for further examination. However, in the present study, this meant that the ␬ statistics for interreader agreement could be based on only those breasts in which both readers identified the same lesion as the highest scoring or the only lesion in the given breast. Thus, it is possible that the reported ␬ values were biased in favor of the more unambiguously high-scoring lesions.

Radiology: Volume 239: Number 3—June 2006

In conclusion, the scoring system designed at the outset of the MARIBS study was validated and effective. However, it could be improved in minor ways—namely, by changing the weightings and the cutoff points. The quantitative parameters PMFE and MITR were shown to be less useful compared with three qualitative descriptors: MOR, POE, and POCW. The kinetic calculation shown to be the most discriminative was the normalized MITR, which can contribute to an effective lesionscoring system. Results of the current analyses therefore suggest that a simple integer-based calculation of lesion morphologic features, heterogeneity, and kinetic curve shapes can be used effectively in the context of breast MR imaging screening. Acknowledgments: We acknowledge the work of many radiographers, nurses, clerical staff members, physicists, engineers, and others, who have not been named but whose contributions were important. We extend a special thank you to the patients and screening subjects and the surgeons and oncologists who referred them, without whom this study would not have been possible.

References 1. MARIBS Study Group. A prospective multicentre cohort study to evaluate the performance of screening with breast MRI and mammography in a population at high familial risk of breast cancer: the UK study of MRI screening for breast cancer in women at high risk (MARIBS). Lancet; 2005;365(9473): 1769 –1778. 2. Brown J, Buckley D, Coulthard A, et al. Magnetic resonance imaging screening in women at genetic risk of breast cancer: imaging and analysis protocol for the UK multicentre study. Magn Reson Imaging 2000;18(7):765–776. 3. American College of Radiology. Breast Imaging Reporting and Data System atlas: report P-BIRADSA. Reston, Va: American College of Radiology, 2003. 4. Ikeda DM, Hylton NM, Kinkel K, et al. Development, standardization, and testing of a lexicon for reporting contrast-enhanced

Warren et al

breast magnetic resonance imaging studies. J Magn Reson Imaging 2001;13(6):889 – 895. 5. Warren RM, Pointon L, Caines R, Hayes C, Thompson D, Leach MO. What is the recall rate of breast MRI when used for screening asymptomatic women at high risk? Magn Reson Imaging 2002;20(7):557–565. 6. Warren RM, Pointon L, Thompson D, et al. Reading protocol for dynamic contrast-enhanced MR images of the breast: sensitivity and specificity analysis. Radiology 2005; 236(3):779 –788. 7. Antoniou AC, Pharoah PP, Smith P, Easton DF. The BOADICEA model of genetic susceptibility to breast and ovarian cancer. Br J Cancer 2004;91(8):1580 –1590. 8. Liberman L, Morris EA, Benton CL, Abramson AF, Dershaw DD. Probably benign lesions at breast magnetic resonance imaging: preliminary experience in high-risk women. Cancer 2003;98(2):377–388. 9. Liu PF, Debatin JF, Caduff RF, Kacl G, Garzoli E, Krestin GP. Improved diagnostic accuracy in dynamic contrast enhanced MRI of the breast by combined quantitative and qualitative analysis. BR J Radiol 1998;71:501–509. 10. Gibbs P, Liney GP, Lowry M, Kneeshaw PJ, Turnbull LW. Differentiation of benign and malignant sub-1 cm breast lesions using dynamic contrast enhanced MRI. Breast 2004; 13(2):115–121. 11. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33(1):159 –174. 12. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70:213–220. 13. Huber P. The behaviour of maximum likelihood estimates under non-standard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967. Berkeley, Calif: University of California Press, 1967; 221–223. 14. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44(3):837– 845. 15. Hartman AR, Daniel BL, Kurian AW, et al. Breast magnetic resonance image screening and ductal lavage in women at high genetic risk for breast carcinoma. Cancer 2004; 100(3):479 – 489.

685

Suggest Documents