Ultrasound Obstet Gynecol 1999;13:11–16

Subjective assessment of adnexal masses with the use of ultrasonography: an analysis of interobserver variability and experience

D. Timmerman, P. Schwärzler*, W. P. Collins†, F. Claerhout, M. Coenen, F. Amant, I. Vergote and T. H. Bourne*

Department of Obstetrics and Gynecology, University Hospitals Leuven, Belgium; *Academic Department of Obstetrics and Gynaecology, St. George’s Hospital Medical School, London; †Academic Department of Obstetrics and Gynaecology, Guy’s, King’s and St. Thomas’ School of Medicine, King’s College Hospital, London, UK

Key words: OBSERVER VARIABILITY, ADNEXAL MASSES, SUBJECTIVE IMPRESSION, ULTRASONOGRAPHY, COLOR DOPPLER SONOGRAPHY

ABSTRACT

Objective The aim of the study was to evaluate the subjective assessment of ultrasonographic images for discriminating between malignant and benign adnexal masses.

Study design The study was prospective. Initially, one ultrasonographer preoperatively assessed 300 consecutive patients with adnexal masses. Subsequently, the recorded transparent photographic prints were independently assessed by five investigators with different qualifications and levels of experience, who were also given a brief clinical history of the patients (i.e. the age, menstrual status, family history of ovarian cancer, previous pelvic surgery and the presenting symptoms). The diagnostic performance of the observers was compared with the histopathological classification of the tumors as malignant or benign. The end-points were accuracy, interobserver agreement and the possible effect of experience.

Results The first ultrasonographer and the most experienced investigator both obtained an accuracy of 92%. There was very good agreement between these two investigators in the classification of the adnexal masses (Cohen’s kappa 0.85). The less experienced observers obtained a significantly lower accuracy, which varied between 82% and 87%. Their interobserver agreement was moderate to good (Cohen’s kappa 0.52 to 0.76).

Conclusion Experienced ultrasonographers using some clinical information and their subjective assessment of ultrasonographic images can differentiate malignant from benign masses in most cases. The accuracy and the level of interobserver agreement are both correlated with experience. About 10% of masses were extremely difficult to classify (only ≤ 50% of assessors were correct).

Correspondence: Dr D. Timmerman, Department of Obstetrics and Gynecology, University Hospitals Leuven, Herestraat 49, B-3000 Leuven, Belgium

Received 17–6–98 Revised 2–10–98 Accepted 13–10–98

INTRODUCTION

The debate over the ability to characterize persistent cystic ovarian tumors continues, and it is an issue of considerable relevance in view of the increasing interest in the conservative management of benign ovarian cysts, as well as in the selection of patients for less invasive surgical techniques. Transvaginal sonography (TVS) in conjunction with color Doppler imaging is now widely used to evaluate adnexal masses. The masses have been classified according to the presence or absence of septa and/or solid tissue into the following categories: unilocular, unilocular–solid, multilocular, multilocular–solid or solid [1]. In addition, morphological scoring systems which include the wall thickness of the cyst, septal thickness and echogenicity have been proposed [2]. This approach was modified by the inclusion of weighted point value assignments, by the introduction of fewer point values per variable, by deleting one variable found not to be significant (wall thickness) and by the inclusion of a new variable called shadowing [3].

If TVS is used in conjunction with color Doppler imaging, it is possible to identify the presence of vascular changes in new tissue growth. It has been proposed that these changes, imaged as fluctuating areas of color, represent neovascularization [4]. Indices derived from the flow velocity waveform provide a quantitative estimate of blood flow velocity and impedance. Nevertheless, and despite the availability of scoring systems, most sonographers base their diagnosis on a subjective assessment of adnexal masses by using ultrasonography and the available information, including a medical history.

However, like all measurement techniques, TVS and color Doppler imaging in particular are subject to errors. Systematic errors such as the malfunctioning of technical equipment and examiner error can lead to variations, the size of which affects the diagnostic accuracy. High interobserver variability would make it impossible to incorporate measurements into a routine ultrasound schedule performed by several examiners, because such variability would vitally affect the interpretation of results. There is little information on the reliability of transvaginal sonographic findings regarding malignant and benign adnexal masses as obtained by different examiners. Moreover, the definition of an acceptable method of measuring reliability is still a matter of debate [5,6]. Therefore, no proper guidelines can be given as to the acceptable level of bias in TVS and pulsed Doppler measurements to discriminate between malignant and benign adnexal masses. Although major patient management decisions are based on sonographic interpretations, few detailed statistical analyses of (inter- and intra-) observer variability have been performed [7–10] and little is known about the impact of experience on the subjective assessment of adnexal masses. In particular, there are no data regarding the variability between examiners with different levels of experience and qualification.

The aims of the present study were, first, to determine whether subjective assessment of an adnexal mass by ultrasonography can be used to discriminate accurately between malignant and benign pelvic tumors; second, to assess the effect of different levels of ultrasound experience on the ability of the observer to make this discrimination; and, third, to assess the proportion of masses that are difficult to classify as malignant or benign.

METHODS

Study design

All patients were preoperatively scanned by the same ultrasonographer (observer A) with either an Acuson 128 XP/10 Acoustic Response Technology (ART) ultrasound system or an Acuson Sequoia ultrasound system (Acuson Inc., Mountain View, CA, USA), both equipped for color Doppler imaging. A 5-MHz or a MultiHertz intravaginal probe was used, incorporating a field of view of 90° and 140°, respectively. The wall filter was set at 50 Hz and the pulsed Doppler sample volume at 2 mm. Initially, an extensive transabdominal and transvaginal gray-scale examination with attention to morphological criteria was performed. Briefly, attention was given to the size, locularity, echogenicity, papillary structures and internal surface of the tumor. Ascites was judged to be present if there was at least 5 mm of free fluid in the pouch of Douglas. Subsequently, the entire tumor was surveyed by color Doppler imaging. The flow parameters included the pulsatility index (PI), resistance index (RI), peak systolic velocity (PSV) and time-averaged maximum velocity (TAMXV). All indices were calculated electronically from a smooth curve fitted to the Doppler shift spectrum. In tumors with multiple sampled areas, the set of results with the highest PSV and the corresponding values for TAMXV, PI and RI were selected. Multiple transparent photographic prints were made of relevant structures and Doppler signals. Immediately after the examination, the ultrasonographer noted a personal assessment of the adnexal mass and classified the tumor as malignant or benign. Malignancies were subsequently staged according to criteria recommended by the International Federation of Gynecology and Obstetrics (FIGO) [11].
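The text does not state how the scanner derived PI and RI. As an illustration only, the sketch below uses the standard textbook definitions (Gosling's pulsatility index and Pourcelot's resistance index), which is an assumption about the equipment rather than something reported by the authors. The end-diastolic velocity (EDV) is not given in the paper; here it is back-calculated from the RI and PSV quoted in the legend of Figure 1, and the function name `doppler_indices` is hypothetical.

```python
def doppler_indices(psv, edv, tamxv):
    """Return (PI, RI) using the standard definitions (assumed, not from the paper).

    PI = (PSV - EDV) / TAMXV   (Gosling pulsatility index)
    RI = (PSV - EDV) / PSV     (Pourcelot resistance index)
    All velocities must be in the same unit (here m/s).
    """
    if psv <= 0 or tamxv <= 0:
        raise ValueError("PSV and TAMXV must be positive")
    pi = (psv - edv) / tamxv
    ri = (psv - edv) / psv
    return pi, ri


# Values quoted in the legend of Figure 1.
psv, tamxv, reported_ri = 0.23, 0.19, 0.30

# Hypothetical EDV: not reported in the paper, back-calculated from RI and PSV.
edv = psv * (1 - reported_ri)

pi, ri = doppler_indices(psv, edv, tamxv)
print(f"PI = {pi:.2f}, RI = {ri:.2f}")  # PI = 0.36, RI = 0.30
```

Under these assumptions the recomputed PI agrees with the value of 0.36 in the figure legend, which suggests that the reported indices are at least consistent with the standard definitions.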

Patients

The study group consisted of 300 consecutive patients with an extrauterine pelvic tumor, who were referred to a single institution between August 1994 and June 1997. We included pre- and postmenopausal patients who gave informed consent to undergo transvaginal color Doppler sonography. Hysterectomized patients older than 50 years of age were regarded as postmenopausal. Only data from those patients who had an adnexal mass removed at surgery were included.

Investigators

The prospective evaluation of the masses was performed by observer A, who had performed more than 5000 transvaginal ultrasound scans on women with gynecological conditions over a period of 8 years. The ultrasound images derived from the masses were assessed independently by each of a group of ultrasonographers with different levels of experience. Investigator B had been performing transvaginal ultrasonography in gynecology for over 10 years and had undertaken over 15 000 scans. Investigator C was moderately experienced, and had carried out approximately 1000 scans over a period of 6 years. Finally, three doctors who were still training in ultrasonography, who had each performed between 200 and 300 scans, mostly under supervision, evaluated the images. The aim was for the masses to be evaluated by doctors who had marked differences in experience (i.e. highly qualified (A, B), trained (C) and still training (D, E, F)).

When assessing the images, each investigator was given the age, the menopausal status, and a brief family and medical history of the patient, together with the presenting symptoms and the findings from B-mode and color Doppler imaging. In order to test the value of subjective assessment, no criteria for malignancy regarding morphology and vascularity of the lesion (e.g. cut-off values for PI or RI) were given to the investigators. They had access to color prints of color flow imaging and to prints of the representative Doppler shift spectra if color flow had been detected within the adnexal mass. Thus, a qualitative analysis of the color distribution and the Doppler waveforms was performed. Each investigator gave his personal assessment of the mass as being malignant or benign, and was blind to the opinion of the other observers. During the study, no feedback regarding the histopathological classification of the tumors was given to the participants, in order to avoid any learning effect.


Statistical analysis

Baseline characteristics were summarized by the mean and standard deviation (SD) for normally distributed variables. Sensitivity was defined as the percentage of women with ovarian cancer who had a positive scan result. Accuracy was defined as the sum of the true negatives and true positives divided by the total number of studied cases. The accuracy of each investigator was compared by McNemar’s test (exact inference). The interobserver agreement was evaluated with Cohen’s kappa [12], calculated by entering the following equations in an Excel 7.0 workbook (Microsoft Office for Windows 95, Microsoft Corporation, USA):

$$\kappa = \frac{OA - AC}{1 - AC}, \qquad OA = \frac{a + d}{N}, \qquad AC = \frac{1}{N}\left[\frac{(a + c)(a + b)}{N} + \frac{(b + d)(c + d)}{N}\right], \qquad N = a + b + c + d$$

where OA is the observed agreement, AC is the agreement by chance and N is the total number of cases (a, true positives; b, false positives; c, false negatives; d, true negatives). All tests were performed two-sided at a significance level of 0.05. McNemar’s test was performed with the use of StatXact (StatXact 3 for Windows, Cytel Software Corporation, USA). Cohen’s kappa was used to assess interobserver agreement for categorical data (kappa values of 0.81–1.0 indicate very good agreement, kappa values of 0.61–0.80 good agreement and kappa values of 0.41–0.60 moderate agreement) [13].

RESULTS

The total study group of 300 patients included 166 premenopausal women (mean age 39 years; range 18–57 years) and 134 postmenopausal women (mean age 66 years; range 44–93 years). Some patient characteristics and the histological classification of the tumors are summarized in Table 1. In 61 women, the ultrasound examination revealed a bilateral mass. Only data from the more complex lesion were taken into account. None of the women had a benign tumor on one ovary and a malignancy on the other.

Subclassification of masses

Histological analysis revealed 83 malignancies. The subclassification of masses by histological and surgical criteria is shown in Table 2. Fifty-five malignancies were primary invasive (31% stage I, 5% stage II, 49% stage III and 15% stage IV). Forty-four of the primary invasive cancers (27% stage I) and eight of the tumors of low malignant potential (100% stage I) were classified as cystadenocarcinomas. Eleven (55%) of the metastatic cancers were of gastrointestinal origin and four (20%) were endometrial. The most common benign tumors were subclassified as cystadenoma (59), endometrioma (36), benign teratoma (23), abscess and/or hydrosalpinx (21), corpus luteum cyst (14) and cystadenofibroma (12). Amongst the less common histological findings were Brenner tumors (4), fibrothecomas (3), a tuberculous granuloma (1) and a chronic ectopic pregnancy (1).

Table 1 Prevalence of malignancy in adnexal masses by histological criteria and the age and menopausal status of patients in both groups

Tumor type    n      %      Mean age (years)    Age range (years)    % postmenopausal
Malignant     83     28     59                  23–86                70
Benign        217    72     48                  18–93                35
All           300    100    51                  18–93                45

Table 2 Subclassification of malignant masses by histological and surgical criteria and the age of patients in each group

Subclassification       n     %      Stage I (n)    Stage I (%)    Mean age (years)    SD (years)
Primary invasive        55    66     17*            68             60                  13
Borderline malignant    8     10     8              32             42                  14
Metastatic invasive     20    24     NA             NA             62                  16
All                     83    100    25             100            59                  15

NA, not applicable; *31% of primary tumors were stage I

Diagnostic accuracy of different observers

Table 3 shows the performance of the different assessors in discriminating between malignant and benign adnexal masses. The most experienced operators (A and B) obtained an accuracy of 92%. The less experienced observers obtained accuracies between 82% and 87%. Observers A and B both achieved significantly better accuracies than observers D (p = 0.0001), E (p < 0.0001) and F (p = 0.03). However, there was no significant difference with observer C (p = 0.26), who scored significantly better than both observers D (p = 0.009) and E (p = 0.008).

Table 3 Ability of individual observers to discriminate between benign and malignant adnexal masses (assessment outcomes in %)

Assessor    Experience of TVS (approximate number of scans)    Sensitivity    Specificity    PPV     NPV     Accuracy
A           5 000                                              96.4           89.9           78.4    98.5    91.7
B           15 000                                             97.6           89.4           77.9    99.0    91.7
C           1 000                                              81.9           91.7           79.1    93.0    89.0
D           200                                                86.7           80.6           63.2    94.1    82.3
E           300                                                86.7           80.6           63.2    94.1    82.3
F           300                                                90.4           85.3           70.1    95.9    86.7

TVS, transvaginal sonography; PPV, positive predictive value; NPV, negative predictive value

Interassessor agreement

Interassessor agreement for the discrimination between malignant and benign adnexal masses is shown in Table 4. There was very good agreement between the first ultrasonographer (A) and the most experienced investigator (B) in the classification of the adnexal masses (Cohen’s kappa 0.85). For the other investigators, the interassessor agreement was moderate to good (Cohen’s kappa 0.52–0.76). The proportion of assessors who correctly classified the adnexal masses as malignant or benign is shown in Table 5.

Table 4 Interassessor agreement between scan results from the study of 300 adnexal masses (number of cases)

             Agreement          Disagreement
Assessors    (+,+)    (−,−)     (−,+)    (+,−)     Cohen’s kappa    Interpretation
A, B         93       187       11       9         0.852            very good
A, C         77       189       9        25        0.737            good
A, D         81       165       33       21        0.610            good
A, E         79       163       35       23        0.581            moderate
A, F         83       174       24       19        0.684            good

+, classified as malignant; −, classified as benign

Table 5 Proportion of assessors who correctly classified the adnexal masses as malignant or benign

Proportion of assessors    Masses classified correctly (n)    %
6/6                        196                                65
5/6                        39                                 13
4/6                        35                                 12
3/6                        12                                 4
2/6                        9                                  3
1/6                        6                                  2
0/6                        3                                  1

A correct classification was made by all six assessors in 65% of all 300 tumors (in 73% of all primary invasive cancers, in 50% of the borderline malignant tumors, in 50% of the metastatic invasive cancers and in 65% of all benign tumors). Only 25% of all cystadenofibromas and 25% of the tubo-ovarian abscesses were correctly classified by all assessors, whereas a correct classification was obtained by all assessors in 81% of the endometriomas, 76% of the cystadenomas and 70% of the benign teratomas. In 10% of the masses, ≤ 50% of assessors made a correct classification, whilst in 1% of the masses all six assessors made an incorrect classification. Examples of misclassified adnexal masses are shown in Figures 1 and 2.
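The kappa values in Table 4 can be rechecked from the published cell counts. The following is a minimal sketch in Python rather than the Excel workbook the authors used; it applies the kappa formula from the Statistical analysis section, reading the divisor of the chance-agreement term as N. The function names `cohen_kappa` and `interpret` are illustrative, and the label returned for values below 0.41 is a placeholder, since the paper gives no category for that range.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 agreement table.

    a = both observers positive, d = both negative,
    b and c = the two discordant cells.
    """
    n = a + b + c + d
    observed = (a + d) / n
    chance = ((a + c) * (a + b) / n + (b + d) * (c + d) / n) / n
    return (observed - chance) / (1 - chance)


def interpret(kappa):
    """Agreement categories quoted in the Statistical analysis section."""
    if kappa >= 0.81:
        return "very good"
    if kappa >= 0.61:
        return "good"
    if kappa >= 0.41:
        return "moderate"
    return "below moderate"  # placeholder: no label is given in the paper


# Table 4 counts per pair: (+,+), (-,-), (-,+), (+,-)
table4 = {
    "A, B": (93, 187, 11, 9),
    "A, C": (77, 189, 9, 25),
    "A, D": (81, 165, 33, 21),
    "A, E": (79, 163, 35, 23),
    "A, F": (83, 174, 24, 19),
}

for pair, (pp, nn, np_, pn) in table4.items():
    k = cohen_kappa(a=pp, b=np_, c=pn, d=nn)
    print(f"{pair}: kappa = {k:.3f} ({interpret(k)})")
```

Run on the five pairs, this gives values that match the published kappas to three decimals, which supports reading the garbled divisor in the printed formula as N.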

Figure 1 Ultrasonographic image of a benign mucinous cystadenoma (20.2 × 17 × 10 cm) in a 42-year-old patient; the mass was falsely classified as scan positive by all assessors. There was a large amount of ascites in the pouch of Douglas and very strong arterial vascularization with low-resistance flow in the septa and cyst wall (pulsatility index 0.36; resistance index 0.30; peak systolic velocity 0.23 m/s; time-averaged maximum velocity 0.19 m/s)

Figure 2 Ultrasonographic image of a mainly borderline malignant serous cystadenocarcinoma but with a small focus of stromal microinvasion in a 57-year-old patient; the mass was classified by five investigators as benign. The whole ovary measured 3.6 × 2.7 × 2.3 cm and contained a multilocular cyst of 2.2 × 1.9 × 1.3 cm. No arterial flow could be detected by color Doppler imaging within the septa or the cyst wall. Note the small irregularities in the cyst wall

DISCUSSION

Initial efforts to distinguish malignant from benign lesions were based on the morphological assessment of tumors by criteria such as the presence or absence of septa or papillary projections. Thus, the hypothesis was tested that malignant tumors have morphological characteristics that make it possible for them to be identified. Several studies have reported that this is the case [1]. However, even if all tumors other than simple unilocular cysts are considered malignant, it must be accepted that a small number of cancers will be missed. The response to the limitations of simple morphological classifications has been the introduction of scoring systems [2,3]. The use of color Doppler imaging has added to our ability to evaluate ovarian masses; however, the data in relation to this technique are variable [14–17]. More recently, attempts have been made to combine clinical, morphological and color Doppler information from a particular patient to produce a post-test probability of malignancy. Logistic regression analysis has been applied in this context [18]. Little has been published on the validity of subjective evaluations, but there have been numerous reports on the effectiveness of different mathematical algorithms. However, most ultrasonographers base their interpretation of ultrasound images on a subjective estimation rather than on the use of decision levels or scoring systems.

The primary aim of this study was to assess whether subjective impressions of recorded ultrasound images could be used to classify adnexal masses as malignant or benign. A further aim was to evaluate the influence of observer experience on overall test performance. The study design did not involve each observer examining the patients, as this would not be technically feasible or acceptable, and for the same reasons we did not test the potential variability resulting from the use of different ultrasound equipment.

Ultrasonography is a dynamic process. The examiner has to mentally create a three-dimensional image of the structures of interest by moving a transducer that produces two-dimensional images. In our study, five of the observers had to assess static images. This was almost certainly a disadvantage to the experienced observers, who were unable to select informative images for themselves. In contrast, it was an advantage to the less experienced observers, who were presented with optimal images to interpret. This bias would lead to an underestimation of the impact of experience. The excellent agreement between the two experienced ultrasonographers suggests that an adnexal mass can be characterized with a high degree of accuracy if informative images are available. Our data also suggest that experience is important for the accurate interpretation of ultrasound images.

The study population had a very high prevalence of adnexal malignancies, including an important fraction of stage I ovarian cancers (31%). This finding might be explained by the fact that the study was conducted in a regional referral center for gynecological oncology. The large number of stage I ovarian cancers is important, as such tumors are more clinically relevant and more difficult to recognize than stage III or stage IV disease.

The finding that subjective assessment of stored images is reliable is highly relevant for telemedicine applications, and for training in gynecological ultrasonography. Whilst learning to obtain high-quality representative images is of prime importance, our data suggest that training should focus on recognizing the constituent morphological features of a mass, rather than on any particular scoring system. This aim may be achieved by using examples of complex ultrasonographic images from different adnexal masses so that trainees may establish their own database of experience. Exposure to stored images, perhaps ‘online’ or in CD-ROM format, could be formally implemented into relevant training programs.

Our data have implications for clinical audit. For example, in our study it was apparent that some investigators used different mental decision levels: observer C classified a mass as being malignant only if he was certain that it was malignant; hence the analysis of his data gave a very high specificity and a lower sensitivity for malignancy. On the other hand, observer F had a lower specificity, but a higher sensitivity. The systematic feedback of information about the histology of any mass and clinical outcome for a patient may be important to help observers modify their decision-making.

It is perhaps surprising that observers with relatively little experience were able to classify so many masses accurately. Indeed, 65% of all masses were correctly classified by all six observers. We feel that this probably means that most masses are in fact quite easy to characterize, leaving a proportion (10%) that are difficult (arbitrarily defined by us as being correctly classified by three or fewer of the six assessors). Our data suggest that the lesions most likely to cause difficulties are cystadenofibromas and tubo-ovarian abscesses (25% correctly classified as malignant or benign by all six observers). It is with these lesions that greater experience may enable an operator to produce better results. This subset of lesions merits further analysis and should be the subject of future studies. Large increases in operator experience appear from our data to have little impact on diagnostic accuracy. However, this may not be the case for those lesions that may be considered difficult to assess.

We conclude that the subjective evaluation of an ovarian mass by an experienced ultrasonographer is an accurate method for discriminating between benign and malignant ovarian masses. Therefore, any analysis using mathematical approaches such as logistic regression or neural networks would have to obtain very high levels of test performance to be comparable to an expert. Where such models are likely to be of value is in helping those operators with less experience to mimic a more experienced operator and so obtain better overall diagnostic accuracy.
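The different decision levels of observers C and F noted in the Discussion can be made concrete with a small calculation. The sketch below is illustrative only: the paper reports percentages, not 2 × 2 counts, so the counts used here are an assumption, back-calculated from the Table 3 percentages together with the 83 malignant and 217 benign masses. The names `Counts` and `metrics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # malignant, classified malignant
    fp: int  # benign, classified malignant
    fn: int  # malignant, classified benign
    tn: int  # benign, classified benign


def metrics(c):
    """Return the five Table 3 measures as percentages."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "sensitivity": 100 * c.tp / (c.tp + c.fn),
        "specificity": 100 * c.tn / (c.tn + c.fp),
        "ppv": 100 * c.tp / (c.tp + c.fp),
        "npv": 100 * c.tn / (c.tn + c.fn),
        "accuracy": 100 * (c.tp + c.tn) / total,
    }


# ASSUMED counts, reconstructed from the Table 3 percentages (not published).
observers = {
    "C": Counts(tp=68, fp=18, fn=15, tn=199),  # stricter threshold
    "F": Counts(tp=75, fp=32, fn=8, tn=185),   # more liberal threshold
}

for name, counts in observers.items():
    row = ", ".join(f"{k} {v:.1f}%" for k, v in metrics(counts).items())
    print(f"Observer {name}: {row}")
```

With these reconstructed counts, observer C misses 15 of 83 malignancies but raises only 18 false alarms, whereas observer F misses 8 and raises 32. This is the sensitivity and specificity trade-off that feedback on histology might help observers to adjust.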

REFERENCES

1. Granberg S, Wikland M, Jansson I. Macroscopic characterization of ovarian cancer and relation to the histological diagnosis. Gynecol Oncol 1989;35:139–44
2. Sassone AM, Timor-Tritsch IE, Artner A, Westhoff C, Warren WB. Transvaginal sonographic characterization of ovarian diseases: evaluation of a new scoring system to predict ovarian malignancy. Obstet Gynecol 1991;78:70–6
3. Lerner JP, Timor-Tritsch IE, Federman A, Abramovich G. Transvaginal ultrasonographic characterization of ovarian masses with an improved, weighted scoring system. Am J Obstet Gynecol 1994;170:81–5
4. Kurjak A, Predanic M, Kupesic-Urek S, Jukic S. Transvaginal color and pulsed Doppler assessment of adnexal tumor vascularity. Gynecol Oncol 1993;50:3–9
5. Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall, 1991:403–9
6. Bland M. An Introduction to Medical Statistics. Oxford: Oxford Medical Publications, 1995:266–9
7. Higgins RV, van Nagell JR, Woods CH, Thompson EA, Kryscio RK. Interobserver variation in ovarian measurements using transvaginal sonography. Gynecol Oncol 1990;39:69–71
8. Sladkevicius P, Valentin L. Interobserver agreement in the results of Doppler examinations of extrauterine pelvic tumors. Ultrasound Obstet Gynecol 1995;6:91–6
9. Tekay A, Jouppila P. Intraobserver reproducibility of transvaginal Doppler measurements in uterine and intraovarian arteries in regularly menstruating women. Ultrasound Obstet Gynecol 1996;7:129–34
10. Tekay A, Jouppila P. Intraobserver variation in transvaginal Doppler blood flow measurements in benign ovarian tumors. Ultrasound Obstet Gynecol 1997;9:120–4
11. Petersson F. Annual report on the results of treatment in gynecological cancer. Int J Gynecol Obstet 1991;21(Suppl):238–77
12. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960;20:37–46
13. Brennan P, Silman A. Statistical methods for assessing observer variability in clinical measures. Br Med J 1992;304:1491–4
14. Bourne TH, Campbell S, Steer C, Whitehead M, Collins WP. Transvaginal colour flow imaging: a possible new screening for ovarian cancer. Br Med J 1989;299:1367–70
15. Kurjak A, Zalud I, Jurkovic D, Alfirevic Z, Miljan M. Transvaginal color Doppler for the assessment of pelvic circulation. Acta Obstet Gynecol Scand 1989;68:131–5
16. Tekay A, Jouppila P. Validity of pulsatility and resistance indices in classification of adnexal tumors with transvaginal color Doppler ultrasound. Ultrasound Obstet Gynecol 1992;2:338–44
17. Valentin L, Sladkevicius P, Marsal K. Limited contribution of Doppler velocimetry to the differential diagnosis of extrauterine pelvic tumors. Obstet Gynecol 1994;83:425–33
18. Tailor A, Jurkovic D, Bourne TH, Collins WP, Campbell S. Sonographic prediction of malignancy in adnexal masses using multivariate logistic regression analysis. Ultrasound Obstet Gynecol 1997;10:41–7