Computerized Diagnostic Decision Support System for the Classification of Preinvasive Cervical Squamous Lesions

G. J. PRICE, BSC, W. G. MCCLUGGAGE, MB, MRCPATH, M. L. MORRISON, BSC, G. MCCLEAN, MB, L. VENKATRAMAN, MB, J. DIAMOND, PHD, H. BHARUCHA, MD, FRCPATH, R. MONTIRONI, MD, FRCPATH, P. H. BARTELS, PHD, D. THOMPSON, MSC, AND P. W. HAMILTON, PHD

Previous studies have revealed considerable interobserver and intraobserver variation in the histological classification of preinvasive cervical squamous lesions. The aim of the present study was to develop a decision support system (DSS) for the histological interpretation of these lesions. Knowledge and uncertainty were represented in the form of a Bayesian belief network that permitted the storage of diagnostic knowledge and, for a given case, the collection of evidence in a cumulative manner that provided a final probability for the possible diagnostic outcomes. The network comprised 8 diagnostic histological features (evidence nodes) that were each independently linked to the diagnosis (decision node) by a conditional probability matrix. Diagnostic outcomes comprised normal; koilocytosis; and cervical intraepithelial neoplasia (CIN) I, CIN II, and CIN III. For each evidence feature, a set of images was recorded that represented the full spectrum of change for that feature. The system was designed to be interactive in that the histopathologist was prompted to enter evidence into the network via a specifically designed graphical user interface (i-Path Diagnostics, Belfast, Northern Ireland). Membership functions were used to derive the relative likelihoods for the alternative feature outcomes, the likelihood vector was entered into the network, and the updated diagnostic belief was computed for the diagnostic outcomes and displayed. A cumulative probability graph was generated throughout the diagnostic process and presented on screen. The network was tested on 50 cervical colposcopic biopsy specimens, comprising 10 cases each of normal, koilocytosis, CIN I, CIN II, and CIN III. These had been preselected by a consultant gynecological pathologist. Using conventional morphological assessment, the cases were classified on 2 separate occasions by 2 consultant and 2 junior pathologists. The cases were also then classified using the DSS on 2 occasions by the 4 pathologists and by 2 medical students with no experience in cervical histology.

Interobserver and intraobserver agreement using morphology and using the DSS was calculated with κ statistics. Intraobserver reproducibility using conventional unaided diagnosis was reasonably good (κ range, 0.688 to 0.861), but interobserver agreement was poor (κ range, 0.347 to 0.747). Use of the DSS improved overall reproducibility between individuals. The DSS did not, however, enhance the diagnostic performance of junior pathologists when their DSS-based diagnoses were compared against those of an experienced consultant. The generation of a cumulative probability graph also allowed a comparison of individual performance, of how individual features were assessed in the same case, and of how this contributed to diagnostic disagreement between individuals. Diagnostic features such as nuclear pleomorphism were shown to be particularly problematic and poorly reproducible. DSSs such as this therefore have a role to play not only in enhancing decision making but also in the study of diagnostic protocols, education, self-assessment, and quality control.

HUM PATHOL 34:1193-1203. © 2003 Elsevier Inc. All rights reserved.

Key words: cervix, cervical intraepithelial neoplasia, interobserver and intraobserver variation, diagnosis, expert system support, Bayesian belief network, quantitative pathology, informatics.

Abbreviations: DSS, decision support system; CIN, cervical intraepithelial neoplasia; HPV, human papilloma virus; BBN, Bayesian belief network; CPM, conditional probability matrix; NP, nuclear pleomorphism; MA, mitotic activity; NU, nucleoli; LP, loss of polarity; SM, superficial maturation; KC, presence of koilocytes; MF, abnormal mitotic figures; NC, nuclear chromatin; CL, confidence limits.

From the Quantitative Pathology Laboratory, Cancer Research Centre and Centre for Health Care Informatics, The Queen's University, and Department of Pathology, Royal Group of Hospitals Trust, Belfast, United Kingdom; Institute of Pathological Anatomy and Histopathology, University of Ancona, Ancona, Italy; and Optical Sciences Centre, University of Arizona, Tucson, Arizona. Accepted for publication June 30, 2003.

Address correspondence and reprint requests to Prof. Peter Hamilton, Quantitative Pathology Laboratory, Cancer Research Centre and Centre for Health Care Informatics, Department of Pathology, The Queen's University of Belfast, Grosvenor Road, Belfast, Northern Ireland, BT12 6BA.

© 2003 Elsevier Inc. All rights reserved. 0046-8177/03/3411-0017$30.00/0 doi:10.1016/S0046-8177(03)00421-0

Organized cervical screening programs have meant that increasing numbers of women with abnormal cervical smears are being referred for colposcopic examination. This has resulted in increasing numbers

of cervical colposcopically directed biopsies that, in many laboratories, account for a significant proportion of the surgical pathology workload. Histopathological examination of these specimens is important in determining treatment, clinical management, and subsequent follow-up of patients. Cervical intraepithelial neoplasia (CIN) is the preferred designation in the United Kingdom for the range of squamous intraepithelial abnormalities of the cervix that are associated with an increased risk of the subsequent development of invasive squamous carcinoma. Traditionally, intraepithelial abnormalities are graded as CIN I, CIN II, or CIN III, depending on the degree of differentiation. It is apparent that there is poor agreement between histopathologists in the interpretation of cervical biopsies.1-4 Previous studies have revealed considerable interobserver and intraobserver variation in the classification of CIN and in the separation of CIN from



normal cervices and from cervices showing features of human papilloma virus (HPV) infection, the histological hallmark of which is koilocytosis.

Histopathology is based on the visual identification of morphological clues and the assimilation of this information to diagnose the underlying disease process. This ability requires visual acumen and experience but is affected by the vagueness and uncertainty associated with visual interpretation, the subjective terminology that is used to describe morphology, and the relationship between observed clues and diagnostic alternatives. This is particularly so in the grading of morphological abnormalities, such as nuclear pleomorphism, which exhibit a spectrum of changes.

To classify CIN and to separate this from normal cervices and from koilocytosis, pathologists take account of a number of histological features.5 These include the degree of nuclear pleomorphism, the number of mitotic figures, the level of mitotic figures within the squamous epithelium, and the presence of abnormal mitotic figures. Loss of polarity and cellular disorganization are also assessed, as is the nuclear chromatin pattern. Pathologists also look for the presence or absence of superficial maturation and for the presence or absence of cells showing features of koilocytosis. Most of these criteria are subjective, and difficulties arise in assimilating all the relevant diagnostic information in a consistent and reproducible manner. This has prompted us to attempt to develop alternative, more objective methods for the assessment of preinvasive cervical squamous lesions using digital image analysis6 and decision support systems (DSSs; described herein).

DSSs are computer programs that are designed to store, access, and process knowledge using a variety of different approaches such as rules, inference networks, and artificial neural networks.7 One of the most widely used forms of knowledge representation in pathology has been the Bayesian belief network (BBN).8-13 BBNs can provide a framework for representing histological diagnostic knowledge in a logical manner that is familiar to pathologists and can have considerable potential in aiding pathologists in making more accurate and consistent diagnoses.8-11,14 They can combine many pieces of evidence to provide a final probability for a range of diagnostic alternatives.

In addition to the structuring of knowledge, there is a need to enter observational evidence into the network. Diagnostic evidence in histopathology is not clear-cut but often involves uncertainty because of the need for visual interpretation of patterns and the use of language to describe these patterns. In our experience, uncertainty caused by this type of vagueness is best managed using fuzzy set theory. In the DSS described in this paper, we have used fuzzy set theory to derive a realistic measure of morphological evidence and have used this evidence to reach a diagnostic probability through the use of a BBN. This approach has been used previously in other areas of pathology and is described in detail elsewhere.8,12,15

We have recently developed a DSS software package (i-Path Diagnostics, Belfast, Northern Ireland) that

allows a BBN and associated image sets to be loaded and presented to the user. Each morphological clue is presented to the user as a series of images representing the possible spectrum of morphological abnormalities. Evidence to be entered into the BBN is derived from the stored images using fuzzy set theory, which defines likelihood vectors that are then used to update the belief in the diagnosis. This DSS was originally designed for breast cytology8,16 but can easily be adapted for other applications.17

In this study, we have developed a new module for the classification of cervical biopsies. In addition, we wished to determine what advantages were conferred, if any, in the classification of cervical lesions through the use of the DSS as compared with conventional morphological assessment. Here we were interested not only in reproducibility of overall diagnostic outcome but also in how the assessment of individual clues contributes to diagnostic accuracy and whether a decision support model such as this might be used to objectively assess diagnostic ability.

MATERIALS AND METHODS

Bayesian Belief Network Design

The BBN was constructed with the diagnostic clues represented as a series of evidence nodes comprising various histological features (see http://www.i-path.co.uk). These nodes are linked to a parent node (the decision node) that contains the diagnostic outcomes (normal; koilocytosis; and CIN I, CIN II, and CIN III; Fig 1). The relationship between the decision node and each evidence node is represented numerically in the form of a conditional probability matrix (CPM). The CPM expresses the probability of finding a particular evidence feature (such as nuclear pleomorphism in basal cell layers: none) given a diagnostic outcome (such as normal; see Table 1). In this study, algorithmic implementation followed that designed by Morawski18,19 from the concepts originally developed by Pearl.20 In the present study, the decision node contained 5 possible diagnostic outcomes, that is, normal, koilocytosis, CIN I, CIN II, and CIN III. Eight diagnostic features (evidence nodes) that are used for the histological diagnosis of preinvasive cervical squamous lesions were defined by a gynecological pathologist (W.G.M.). These are listed in Table 1. These evidence nodes were linked to the decision node by CPMs (Table 2) that were defined by the gynecological pathologist based on his previous experience in the classification of these specimens. Reference images were also selected by the gynecological pathologist to represent the possible morphological grades for each diagnostic clue (Fig 2).
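The structure described above, with 8 evidence nodes each linked independently to a single decision node by a CPM, can be written down compactly. The following sketch is illustrative only and is not the i-Path Diagnostics implementation: the probabilities are copied from Table 2 for 2 of the 8 features, and the uniform prior over the 5 diagnostic outcomes is an assumption made purely for illustration.

```python
# Illustrative sketch of the network structure described in the text;
# this is not the i-Path Diagnostics implementation.
DIAGNOSES = ["normal", "koilocytosis", "CIN I", "CIN II", "CIN III"]

# Each evidence node is tied to the decision node by a conditional
# probability matrix (CPM): P(feature outcome | diagnosis).
# The numbers below are taken from Table 2 for two of the eight features.
CPMS = {
    "nuclear_pleomorphism": {
        "outcomes": ["none", "mild", "moderate", "severe"],
        "p": {
            "normal":       [0.90, 0.05, 0.04, 0.01],
            "koilocytosis": [0.20, 0.70, 0.08, 0.02],
            "CIN I":        [0.01, 0.80, 0.15, 0.04],
            "CIN II":       [0.01, 0.20, 0.50, 0.29],
            "CIN III":      [0.01, 0.10, 0.29, 0.60],
        },
    },
    "superficial_maturation": {
        "outcomes": ["present", "absent"],
        "p": {
            "normal":       [0.90, 0.10],
            "koilocytosis": [0.90, 0.10],
            "CIN I":        [0.80, 0.20],
            "CIN II":       [0.70, 0.30],
            "CIN III":      [0.20, 0.80],
        },
    },
}

# Belief at the decision node before any evidence is entered; a uniform
# prior is assumed here purely for illustration.
PRIOR = {d: 1.0 / len(DIAGNOSES) for d in DIAGNOSES}
```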

Pathological Cases

Fifty colposcopic biopsy specimens were chosen by the gynecological pathologist. These comprised 10 cases each of normal, koilocytosis, CIN I, CIN II, and CIN III. For the purposes of the study, these diagnoses were considered the gold standard. In each case, a mark was put on the slide beside the area to be examined to eliminate problems with selection of appropriate areas.

Entering Evidence Into the DSS

The DSS is interactive in that the user examines the histological features of a case and enters evidence on screen.


FIGURE 1. Illustration of BBN network structure.

By referring to the on-screen images, the user selects a position on the spectrum where the features of the case under examination are thought to lie. This conveys the likelihood that each of the possible feature outcomes is present in the case under investigation. The value for each outcome is an independent estimate, and together these values define in numerical form the subjective impression of the histopathologist. When evidence is entered into an evidence node, the belief in the outcomes of that node is then propagated to the diagnostic decision node. This in turn updates the belief in each of the diagnostic outcomes. At the decision node, the arrival of new evidence will result in a change in the outcome belief until all the evidence is evaluated and final diagnostic probabilities are reached for each diagnostic outcome. The outcome with the highest probability is chosen as the most likely diagnosis. A cumulative probability graph is maintained throughout the diagnostic process (Fig 3) that shows the changes in belief as each morphological clue is assessed. The cumulative probability graph quantitatively maps the diagnostic pathway.

TABLE 1. Diagnostic Clues Used in the Classification of Cervical Biopsies Together With Possible Outcomes

Nuclear pleomorphism in basal cell layers: 1, None; 2, Mild; 3, Moderate; 4, Severe
Presence of nucleoli: 1, Occasional; 2, Many
Loss of polarity in basal cell layers: 1, None; 2, Partial loss; 3, Complete loss
Superficial maturation: 1, Present; 2, Absent
Presence of koilocytes: 1, Absent; 2, Possibly present; 3, Present
Presence of abnormal mitotic figures: 1, Absent; 2, Occasional; 3, Many
Nuclear chromatin pattern: 1, Vesicular; 2, Some coarsening; 3, Marked coarsening
Mitotic activity: 1, None; 2, Confined to lower one-third of epithelium; 3, Confined to lower two-thirds of epithelium; 4, Involves full epithelial thickness
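Continuing the sketch above, the evidence-entry step can be illustrated as follows. A slider position on a feature's image spectrum is converted into a relative likelihood vector by membership functions, and the vector is propagated to the decision node. Because each evidence node is linked only to the decision node, the update reduces to P(d | evidence) being proportional to P(d) multiplied by the sum over outcomes k of P(outcome k | d) times the entered likelihood for outcome k. The triangular membership curves and the slider values used below are assumptions for illustration and are not the curves used by the system.

```python
def membership(slider: float, n_outcomes: int) -> list[float]:
    # Triangular membership functions, one per feature outcome, with peaks
    # spread evenly along the [0, 1] image spectrum. The real membership
    # curves used by the system are not published, so these shapes are an
    # assumption made for illustration.
    peaks = [k / (n_outcomes - 1) for k in range(n_outcomes)]
    width = 1.0 / (n_outcomes - 1)
    return [max(0.0, 1.0 - abs(slider - p) / width) for p in peaks]


def enter_evidence(belief: dict, feature: str, slider: float) -> dict:
    # Soft-evidence update for one evidence node:
    # P(d | evidence) is proportional to P(d) * sum_k P(outcome_k | d) * lambda_k
    cpm = CPMS[feature]
    lam = membership(slider, len(cpm["outcomes"]))
    updated = {d: belief[d] * sum(p * l for p, l in zip(cpm["p"][d], lam))
               for d in belief}
    total = sum(updated.values())
    return {d: v / total for d, v in updated.items()}


# Hypothetical case: pleomorphism judged between mild and moderate,
# superficial maturation clearly present.
belief = dict(PRIOR)
trace = [belief]  # the cumulative probability graph, one entry per clue
for feature, slider in [("nuclear_pleomorphism", 0.45),
                        ("superficial_maturation", 0.0)]:
    belief = enter_evidence(belief, feature, slider)
    trace.append(belief)

# The outcome with the highest final probability is reported; ties are
# broken in favour of the higher grade, as described under Testing the DSS.
diagnosis = max(DIAGNOSES, key=lambda d: (belief[d], DIAGNOSES.index(d)))
```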

Testing the DSS

The DSS was tested on the 50 cases. Initially, each case was evaluated morphologically by 2 consultant pathologists (one of whom was the gynecological pathologist) and 2 junior pathologists. The results were recorded, and the cases were relabeled at random and reassessed a week later. The cases were relabeled a second and third time and were evaluated by the 4 pathologists and by 2 medical students using the DSS on 2 separate occasions. On the rare occasion on which 2 diagnostic outcomes were calculated as having the same probability, the higher grade was reported as the diagnosis. The user's second diagnostic run on the DSS was used for interobserver analysis to allow for potential unfamiliarity with the system on the first run.


TABLE 2. Conditional Probability Matrices (CPMs) Defining the Probability of Finding a Feature Outcome Given the Diagnosis for Each Diagnostic Clue

NUCLEAR PLEOMORPHISM IN BASAL CELL LAYERS
                None    Mild    Moderate    Severe
NORMAL          0.90    0.05    0.04        0.01
KOILOCYTOSIS    0.20    0.70    0.08        0.02
CIN I           0.01    0.80    0.15        0.04
CIN II          0.01    0.20    0.50        0.29
CIN III         0.01    0.10    0.29        0.60

MITOTIC ACTIVITY
                None    Lower 1/3    Lower 2/3    Full thickness
NORMAL          0.70    0.28         0.01         0.01
KOILOCYTOSIS    0.20    0.50         0.20         0.10
CIN I           0.15    0.50         0.25         0.10
CIN II          0.10    0.20         0.50         0.20
CIN III         0.05    0.05         0.40         0.50

PRESENCE OF NUCLEOLI
                Occasional    Many
NORMAL          0.85          0.15
KOILOCYTOSIS    0.70          0.30
CIN I           0.60          0.40
CIN II          0.65          0.35
CIN III         0.50          0.50

LOSS OF POLARITY IN BASAL CELL LAYERS
                None    Partial loss    Complete loss
NORMAL          0.70    0.25            0.05
KOILOCYTOSIS    0.40    0.50            0.10
CIN I           0.05    0.75            0.20
CIN II          0.01    0.49            0.50
CIN III         0.01    0.29            0.70

SUPERFICIAL MATURATION
                Present    Absent
NORMAL          0.90       0.10
KOILOCYTOSIS    0.90       0.10
CIN I           0.80       0.20
CIN II          0.70       0.30
CIN III         0.20       0.80

PRESENCE OF KOILOCYTOSIS
                Absent    Possibly present    Present
NORMAL          0.90      0.09                0.01
KOILOCYTOSIS    0.01      0.29                0.70
CIN I           0.20      0.50                0.30
CIN II          0.30      0.40                0.30
CIN III         0.60      0.30                0.10

PRESENCE OF ABNORMAL MITOTIC FIGURES
                Absent    Occasional    Many
NORMAL          0.95      0.04          0.01
KOILOCYTOSIS    0.70      0.25          0.05
CIN I           0.80      0.15          0.05
CIN II          0.50      0.40          0.10
CIN III         0.30      0.40          0.30

NUCLEAR CHROMATIN PATTERN
                Vesicular    Some coarsening    Marked coarsening
NORMAL          0.70         0.25               0.05
KOILOCYTOSIS    0.60         0.35               0.05
CIN I           0.50         0.35               0.15
CIN II          0.30         0.40               0.30
CIN III         0.10         0.30               0.60

Statistics

The degree of interobserver and intraobserver agreement was evaluated using κ statistics. κ is an index of observer variation that has been corrected for chance and indicates the degree of observer variation over and above that which would be expected by chance alone.21 The value of κ can range from −1.0 to +1.0. A value of 0 indicates chance agreement only, whereas a value of +1.0 indicates perfect agreement. It is generally accepted that a value of >0.75 reflects strong agreement, a value of 0.4 to 0.75 suggests fair to good agreement, and a value of <0.4 means agreement is poor. Weighted κ analysis was carried out. Weighted κ takes into account the fact that some disagreements are more serious than others.22 In unweighted analysis, a diagnostic disagreement of normal and CIN III is treated in the same way as a diagnostic disagreement of CIN II and CIN III. In weighted κ analysis, this is adjusted to ensure that larger differences in a ranked grading scheme will have a stronger influence on the final κ value. In this study, a linear weighting scheme was used.

RESULTS

Intraobserver Agreement Using Unaided Conventional Morphology

Table 3 (first column) shows weighted κ values for intraobserver agreement using conventional diagnosis. Consultant 1 (the consultant gynecological pathologist) achieved strong agreement between first and second runs, with a weighted κ of 0.861. Disagreement only resulted in cases being reclassified into contiguous categories. One of the junior pathologists also achieved strong agreement with a weighted κ of 0.757, although reclassification into a diagnostic group that was more than 1 class away from the original diagnosis was more common. Consultant 2 and the other junior pathologist achieved fair to good agreement with weighted κ values of 0.693 and 0.688, respectively, although a closer look at the data shows diagnostic disagreement that is clinically significant (Table 4); for instance, 2 cases were diagnosed as normal on 1 occasion and as CIN II on another.

Interobserver Agreement Using Unaided Conventional Morphology

Table 5 (first column) shows that diagnostic agreement between individuals was poorer than for intraobserver assessment, with lower κ values and major discrepancies in diagnostic classification between pathologists. Major inconsistencies occurred primarily in the koilocytosis, CIN I, and CIN II groups, but there was also disagreement over the diagnosis of normality and CIN III.
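The κ values reported here and in Tables 3 to 5 use the linear weighting scheme described under Statistics. A minimal sketch of that calculation is given below; it is not the statistical software used in the study. The example matrix is the Consultant 2 cross-tabulation reconstructed in Table 4 and, with linear weights, it returns approximately 0.693, in keeping with the value reported in Table 3.

```python
def weighted_kappa(matrix: list[list[int]]) -> float:
    # Linearly weighted kappa for a square confusion matrix whose rows and
    # columns share the same ordered categories (normal ... CIN III).
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_marg = [sum(row) for row in matrix]
    col_marg = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Full credit on the diagonal, linearly decreasing credit with the
    # distance between the two assigned grades.
    w = [[1.0 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    p_obs = sum(w[i][j] * matrix[i][j]
                for i in range(k) for j in range(k)) / total
    p_exp = sum(w[i][j] * row_marg[i] * col_marg[j]
                for i in range(k) for j in range(k)) / total ** 2
    return (p_obs - p_exp) / (1.0 - p_exp)


# Consultant 2 cross-tabulation from Table 4 (rows: second run,
# columns: first run); categories are normal, koilocytosis, CIN I-III.
consultant2 = [
    [17, 3, 0, 1, 0],
    [3, 14, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 5],
]
print(round(weighted_kappa(consultant2), 3))  # approximately 0.693
```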

Intraobserver Agreement Using DSS

Table 3 (second column) shows weighted κ values for intraobserver agreement using the DSS. This was consistently good for most observers, including 1 of the medical students who had little or no experience in cervical histopathology. For most pathologists, the DSS did not improve intraobserver consistency but provided results comparable with the conventional morphological approach.


FIGURE 2. Image set used for the DSS grading of the diagnostic clue “loss of polarity”. Each image and diagnostic category has an underlying membership curve. By positioning a slider mark on the spectrum, a relative likelihood vector is generated from the membership curve and entered into the BBN as evidence.

A point to note, however, is that intraobserver agreement does not measure diagnostic accuracy but rather diagnostic consistency by an individual (ie, an individual may consistently get the diagnosis wrong). Even with the use of on-screen images as a means of support, individuals can significantly vary in their interpretation of morphological clues for the same case material.

FIGURE 3. Cumulative probability graph illustrating the change in diagnostic probability for all of the diagnostic outcomes as each feature is assessed. The diagnostic category with the highest final probability is koilocytosis (see code at bottom of graph). As is shown, the normal, CIN II, and CIN III outcomes are determined as unlikely by the first few clues. The presence of koilocytes (KC) clue differentiates the koilocytosis outcome from the CIN I. Before this point, however, the assessment of loss of polarity suggested that CIN I was the more likely outcome. This illustrates the importance of assessing all available clues. NP, nuclear pleomorphism; MA, mitotic activity; NU, nucleoli; LP, loss of polarity; SM, superficial maturation; KC, presence of koilocytes; MF, abnormal mitotic figures; NC, nuclear chromatin.


TABLE 3. A Comparison of Intraobserver Weighted κ Values for Unaided Diagnostic Assessment and DSS-Aided Diagnosis

Observer             Conventional diagnosis, κ (95% CL)    DSS, κ (95% CL)
Consultant 1         0.861 (0.78, 0.94)                    0.778 (0.65, 0.90)
Consultant 2         0.693 (0.52, 0.87)                    0.686 (0.55, 0.82)
Junior 1             0.757 (0.63, 0.88)                    0.731 (0.61, 0.85)
Junior 2             0.688 (0.56, 0.82)                    0.720 (0.59, 0.85)
Medical student 1    Not assessed                          0.602 (0.43, 0.77)
Medical student 2    Not assessed                          0.505 (0.33, 0.68)

Interobserver Agreement Using DSS

Table 5 (second column) shows weighted κ values for interobserver agreement using the DSS. Using the DSS, agreement between the 2 consultants increased from 0.465 to 0.541 compared with morphological assessment alone. Moderate to good agreement was achieved between all paired combinations of pathologists (κ value range: 0.467 to 0.681). Agreement was improved in 5 of the 6 possible interpathologist comparisons, indicating that increased consistency between individuals was achieved when using the DSS.

Analysis of Feature Assessment

In addition to marginally enhancing reproducibility between individuals, the DSS provides the additional ability to analyze feature assessment and the impact that this has on diagnostic classification. Figure 3 shows the cumulative probability graphs for each of the diagnostic outcomes for an example koilocytosis case, as assessed by a single observer. This illustrates that assessment of the first 2 clues supported a diagnosis of either CIN I or koilocytosis. It is the presence of koilocytes (KC) that identifies the case as most likely representing koilocytosis, with a final diagnostic probability of 0.70.

It is possible to use this objective mapping of the diagnostic procedure to compare discrepancies between different individuals. Figure 4 shows a comparison of cumulative probability graphs (for the CIN II outcome) between the 4 pathologists for a single case. It can be seen that Consultant 1 reached a very strong diagnostic probability for CIN II, with all features supportive of the final outcome. The diagnostic profile for the same case by Consultant 2 is very different. The reason for this is evident when we consider the slider positions for each diagnostic feature (shown on the right of Fig 4). Consultant 1 grades nuclear pleomorphism (NP) as between moderate and severe, whereas Consultant 2 grades NP as mild. This has a major initial effect on the probability graph. Other discrepancies, particularly in the features superficial maturation (SM), presence of koilocytes (KC), and nuclear chromatin (NC), contribute to the disagreement in diagnosis. Interestingly, Junior 1 has a very similar profile to Consultant 1 until the assessment of SM, which Junior 1 misinterprets. Here Consultant 1 is confident that SM is present, whereas Junior 1 is unsure, placing the pointer between present and absent. Apart from this 1 morphological clue, Junior 1 has assessed the case very well. Having observed this discrepancy, the opportunity now exists to review this feature with Junior 1 to ensure accurate assessment in the future.

By plotting slider positions for each clue over a number of assessments, the interobserver and intraobserver variation in clues can be visualized (Fig 5). Whereas there is reasonably good agreement in the assessment of nuclear chromatin, there appears to be considerable disparity in the assessment of nucleoli and superficial maturation.

Training and the Objective Measure of Diagnostic Ability Using DSS

Acknowledging that there is inherent variability in the classification of preinvasive cervical lesions, even by experienced pathologists, it is conceivable that a "correct diagnosis" and a "correct pathway to diagnosis" for a given case can be objectively defined by a leading expert using the DSS, together with a corresponding cumulative probability graph. Consequently, this could be used as an objective means to compare the performance of an individual against that of an expert. For example, Fig 6 shows the probability graph (diagnostic pathway) and the final diagnostic probability for a case defined by the gynecological pathologist in this study. It also shows the pathway taken by a junior pathologist who has clearly reached the wrong diagnosis. This not only highlights the difference in the final diagnostic classification but also differences in the assessment of individual clues leading to the misclassification. Here, inaccurate assessment of MA, LP, and SM led to a wrong decision being reached. In Fig 4, Consultant 2 and Junior 2 are clearly misinterpreting the features and so misclassifying the case, when compared with the expert Consultant 1. This ability to compare diagnostic profiles has potential, not only in training inexperienced pathologists in correct feature assessment but also in objectively assessing the performance of more senior pathologists within a quality assurance program.

TABLE 4. Intraobserver Variation Between First (Horizontal) and Second (Vertical) Runs Using Conventional Morphological Diagnosis for Consultant 2

First run (columns) versus second run (rows):
                Normal    Koilocytosis    CIN I    CIN II    CIN III
Normal          17        3               0        1         0
Koilocytosis    3         14              0        0         1
CIN I           1         0               0        0         0
CIN II          1         1               0        1         1
CIN III         0         0               0        1         5

TABLE 5. A Comparison of Interobserver Weighted κ Values for Unaided Conventional Diagnostic Assessment and DSS-Aided Diagnosis (the Second Run Was Used for the DSS Comparisons to Allow for Learning)

Observer pair               Conventional diagnosis, κ (95% CL)    DSS, κ (95% CL)
Cons 1 versus Cons 2        0.465 (0.31, 0.62)                    0.541 (0.38, 0.70)
Cons 1 versus Junior 1      0.747 (0.60, 0.90)                    0.681 (0.46, 0.75)
Cons 1 versus Junior 2      0.570 (0.44, 0.70)                    0.677 (0.55, 0.80)
Cons 2 versus Junior 1      0.347 (0.17, 0.53)                    0.589 (0.43, 0.74)
Cons 2 versus Junior 2      0.465 (0.30, 0.63)                    0.467 (0.30, 0.63)
Junior 1 versus Junior 2    0.503 (0.34, 0.66)                    0.563 (0.41, 0.71)

Abbreviation: Cons, consultant.
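The pathway comparison described above, in which an expert-defined correct trajectory is set against a trainee's, amounts to replaying 2 sequences of slider entries through the update step sketched earlier and locating the clue at which the 2 belief trajectories diverge most. The entries below are hypothetical, and the code is an illustration of the idea rather than the teaching mode of the i-Path system.

```python
def pathway(entries: list[tuple[str, float]]) -> list[dict]:
    # Replay a sequence of (feature, slider) entries and return the belief
    # at the decision node after each clue, i.e. a cumulative probability graph.
    belief, path = dict(PRIOR), []
    for feature, slider in entries:
        belief = enter_evidence(belief, feature, slider)
        path.append(belief)
    return path


def largest_divergence(expert_entries, user_entries):
    # Identify the clue at which the user's belief trajectory departs most
    # from the expert's (summed absolute difference over the five outcomes).
    gaps = [sum(abs(e[d] - u[d]) for d in DIAGNOSES)
            for e, u in zip(pathway(expert_entries), pathway(user_entries))]
    worst = gaps.index(max(gaps))
    return expert_entries[worst][0], gaps[worst]


# Hypothetical entries for the same case by the expert and by a trainee.
expert = [("nuclear_pleomorphism", 0.60), ("superficial_maturation", 0.0)]
trainee = [("nuclear_pleomorphism", 0.20), ("superficial_maturation", 0.5)]
feature, gap = largest_divergence(expert, trainee)
```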


FIGURE 4. Comparison of cumulative probability (decision) graphs for the 4 pathologists, allowing a comparison of how decisions were made on a single case, in this case a CIN II case. Slider positions for each of the features are shown on the right and indicate how misinterpretation of individual morphological clues leads to inconsistent diagnosis. (NP, nuclear pleomorphism; MA, mitotic activity; NU, nucleoli; LP, loss of polarity; SM, superficial maturation; KC, presence of koilocytes; MF, abnormal mitotic figures; NC, nuclear chromatin.)


FIGURE 5. Comparison of slider positions for each diagnostic clue for each run in a selected case. The interobserver and intraobserver variation in the assessment of each individual feature can be visualized. There appears to be considerable disparity in the assessment of nucleoli and superficial maturation and in the presence of koilocytes with slider values between the consultants at each of the extremes. There appeared to be good agreement in the assessment of nuclear chromatin. With the exception of Junior 2, there was little variation in the assessment of abnormal mitotic figures. This approach may not only help identify contentious clues but may discover the gaps in a pathologist’s knowledge or experience and further aid in their personal development and education. C, consultant; J, junior; R, run.

DISCUSSION

Previous studies have revealed considerable interobserver and intraobserver variation in the classification of CIN and in the separation of CIN from nonneoplastic lesions, including koilocytosis.1-4 Other studies have investigated the use of alternative, more objective, methods as an adjunct to histology in classifying cervical biopsies. Immunohistochemical studies using antibodies against cell cycle–related antigens have been employed. The most extensively used of these antibodies are MIB-1 (reacting against the Ki-67 antigen) and

PCNA (reacting against proliferating cell nuclear antigen). With these antibodies, proliferating cells can be demonstrated at progressively higher levels within cervical squamous epithelia in accordance with the degree of CIN.23-30 Other investigators have used alternative methods of assessing cell proliferation, such as AgNOR staining, in an attempt to develop more objective methods for the classification of CIN.24,31,32 Image analysis methods have also been used in an attempt to bring more objectivity to the classification of CIN.6,33 However, these methods have not been adopted in routine surgical pathology practice.

Intraobserver agreement indicates the degree to which a pathologist grades a given case into the same diagnostic group on 2 separate occasions. This study showed that using conventional morphology, intraobserver agreement (as measured using κ statistics) ranged from 0.69 to 0.86, with the gynecological pathologist achieving the highest κ value. Even with this observer, 11 (22%) of 50 cases were classified differently on the second run. The DSS did not significantly improve intraobserver agreement except in the case of 1 individual (see Table 3). It is important here to realize the limitations of κ statistics in intraobserver studies, because these do not take into account the accuracy of the diagnoses but only the precision with which the diagnosis is made.


FIGURE 6. Direct comparison of the cumulative probability graphs for a koilocytosis diagnostic outcome for Consultant 1 (Correct) and Junior 2 (User) using the teaching mode of the DSS. The cumulative probabilities for each observer are shown in the inset table. In this case, Junior 2 diagnosed the case as CIN II with a cumulative probability of 0.46. (NP, nuclear pleomorphism; MA, mitotic activity; NU, nucleoli; LP, loss of polarity; SM, superficial maturation; KC, presence of koilocytes; MF, abnormal mitotic figures; NC, nuclear chromatin.)

For example, a pathologist could incorrectly classify the same case into the same diagnostic category on 2 or more separate occasions, thus yielding a high κ score. The hope for the DSS was that by providing a more consistent means for feature interpretation through the on-screen images and interactive slider, this would in turn enhance consistency in diagnostic classification. Although there is little deterioration in performance using the DSS, inconsistent feature assessment is still problematic for some pathologists, even when using standardized image sets for comparison.

Interobserver reproducibility (agreement between pathologists) provides different information. Using conventional diagnostic methods, κ values for agreement ranged from 0.35 to 0.75, showing variation in the classification of cases between both consultant and junior staff. Even between consultant pathologists, there was disagreement in 18 (36%) of 50 cases. Although disagreement between consultants was mainly in the grading of CIN, junior pathologists clearly had difficulties across the range of diagnostic groups. Use of the DSS improved agreement between pathologists in 5 of

the 6 comparisons, with κ values ranging from 0.54 to 0.68 (see Table 5). This illustrates that although the DSS had little effect on intraobserver performance, it does instill some additional consistency between observers. When comparing the DSS with the expert-defined gold standard, all 4 pathologists achieved fair to good agreement, although variation in feature assessment remains a problem leading to misclassification. Both medical students using the DSS also achieved fair to good interobserver agreement against the gold standard, although the κ values were lower than for the trained pathologists. This illustrates that the system allows entirely inexperienced observers to achieve some diagnostic accuracy.

This study has used a DSS to arrive at a diagnostic probability. It is important to appreciate that the DSS relies entirely on the information entered by the user. If a user incorrectly assesses a given morphological clue and enters this into the DSS, the system will incorporate this and use it to arrive at the diagnostic probability. In this study, the intraobserver and interobserver disagreement was therefore due to differences in the assessment and grading of specific morphological features. The hope was that with on-screen image templates guiding the grading of morphological clues, some additional


consistency in feature assessment could be obtained over and above conventional diagnostic practice, which only uses templates within a pathologist's mind.

An additional advantage of the DSS is that it collects information that allows us to study feature assessment and its contribution to diagnosis in a more objective fashion. Not only can we examine a quantitative record of the decision route represented by the cumulative probability graph, but we can examine the slider position indicating the grade of each feature and observe how this influences the diagnostic probability (see Fig 4). In this study, we found that assessment and grading of the feature "nuclear pleomorphism" was particularly problematic and that this contributed significantly to intraobserver and interobserver variation. A similar observation was also recorded in a previous study by our group using a DSS in the classification of endometrial hyperplasia.17 In the current study, other features such as superficial maturation and nucleoli were also shown to pose difficulties, either due to inexperience or subjectivity. Clearly there is a need to further train pathologists in the recognition and grading of these features if there is any prospect of an improvement in diagnostic consistency and accuracy in these lesions.

Currently the DSS can operate in 2 modes, 1 for diagnostic decision support and the other for teaching. In the teaching mode, a correct diagnostic pathway can be set as a gold standard by an experienced pathologist, against which other, less experienced pathologists can evaluate their diagnostic ability. Errors made in assessing particular clues can then be highlighted by the DSS through direct comparison of the diagnostic pathways using the cumulative probability graphs (see Fig 6). This approach to training using our DSS has been investigated in breast cytology.16 Training modalities such as this offer a valuable resource for self-directed learning for trainee pathologists. In this context, the DSS could also be used for self-assessment and quality control of diagnostic performance by more experienced pathologists. Provided that there is a perceived correct diagnostic profile as defined by an expert or a consensus panel of experts, individuals might wish to assess their personal ability to match this profile. This approach has the potential to identify gaps in a pathologist's knowledge, experience, or ability. It could simply be part of a continuing educational program or, alternatively, part of an accreditation program that assesses the standards of pathologists. This may be the first time that an objective method for assessing diagnostic performance in pathology has been developed.

In conclusion, DSS methodology has been shown to have a significant role in studying reproducibility in the diagnosis and grading of preinvasive cervical squamous lesions. With better training in the diagnostic assessment of morphological features or through the introduction of quantitative techniques, the DSS has the potential to enhance the precision and accuracy of diagnosis. Additionally, such systems have considerable potential in a teaching and training role, enabling inexperienced observers, including pathology trainees

and medical students, to be instructed in diagnostic decision making in pathology. This is enhanced through the provision of visual examples of diagnostic features. Other professional groups may also derive benefit from such a system. For example, colposcopists, as part of their training, are now expected to have a formal understanding of cervical pathology, and DSSs may play a role in this. BBNs that form the basis of the current DSS can easily be modified as new criteria (including immunohistochemical and molecular) with diagnostic and prognostic value become available. Finally, it is hoped that future quality control programs might benefit from the use of DSSs for assessing diagnostic performance in a more objective and reliable manner.

REFERENCES

1. McCluggage WG, Bharucha H, Caughley LM, et al: Interobserver variation in the reporting of cervical colposcopic biopsies: A comparison of grading systems. J Clin Pathol 49:833-835, 1996
2. McCluggage WG, Walsh MY, Thornton CM, et al: Inter- and intra-observer variation in the histopathological reporting of cervical squamous intraepithelial lesions using a modified Bethesda grading system. Br J Obstet Gynaecol 105:206-210, 1998
3. Robertson AJ, Anderson JM, Swanson Beck J, et al: Observer variability in histopathological reporting of cervical biopsy specimens. J Clin Pathol 42:231-238, 1989
4. Ismail SM, Colclough AB, Dinnen JS, et al: Observer variation in histopathological diagnosis and grading of cervical intraepithelial neoplasia. BMJ 298:707-710, 1989
5. Anderson MC, Brown CL, Buckley CH, et al: Current views on cervical intraepithelial neoplasia. J Clin Pathol 44:969-978, 1991
6. Keenan SJ, Diamond J, McCluggage WG, et al: An automated machine vision system for the histological grading of cervical intraepithelial neoplasia (CIN). J Pathol 192:351-362, 2000
7. Bartels PH, Hiessl H: Expert systems in histopathology II. Knowledge representation and rule based systems. Anal Quant Cytol Histol 11:147-153, 1989
8. Hamilton PW, Anderson N, Bartels PH, et al: Expert system support using Bayesian belief networks in the diagnosis of fine-needle aspiration biopsy specimens of the breast. J Clin Pathol 47:329-336, 1994
9. Bartels PH, Thompson D, Montironi R, Hamilton PW, Scarpelli M: Diagnostic decision support for prostate lesions. Pathol Res Pract 191:945-957, 1995
10. Hamilton PW, Montironi R, Abmayr W, et al: Clinical applications of Bayesian belief networks in pathology. Pathologica 87:237-245, 1995
11. Whimster WF, Hamilton PW, Anderson NA, et al: Reproducibility of Bayesian belief network assessment of breast fine needle aspirates. Anal Quant Cytol Histol 18:267-274, 1996
12. Bartels PH, Thompson D, Weber JE: Expert systems in histopathology IV. The management of uncertainty. Anal Quant Cytol Histol 14:1-13, 1992
13. Montironi R, Whimster WF, Collan Y, et al: How to develop and use a Bayesian belief network. J Clin Pathol 49:194-201, 1996
14. Montironi R, Pomante R, Diamanti L, et al: Evaluation of prostatic intraepithelial neoplasia after treatment with a 5-α-reductase inhibitor (finasteride). Anal Quant Cytol Histol 18:461-470, 1996
15. Hamilton PW, Bartels PH, Montironi R, et al: Improved diagnostic decision making in pathology: Do inference networks hold the key? J Pathol 175:1-5, 1995
16. Diamond J, Anderson NH, Thompson D, et al: A computer-based training system for fine needle aspiration cytology. J Pathol 196:113-121, 2002
17. Morrison ML, McCluggage WG, Price GJ, et al: Expert system support using a Bayesian belief network for the classification of endometrial hyperplasia. J Pathol 167:403-414, 2002


18. Morawski P: Understanding Bayesian belief networks. Artif Intell Expert August:44, 1989
19. Morawski P: Programming Bayesian belief networks. Artif Intell Expert May:74-79, 1989
20. Pearl J: Probabilistic Reasoning in Intelligent Systems. San Mateo, CA, Morgan Kaufmann, 1988
21. Fleiss JL: Statistical Methods for Rates and Proportions, 2nd ed. New York, Wiley, 1981
22. Fleiss JL, Cohen J: The equivalence of weighted kappa and intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 33:613-619, 1973
23. McCluggage WG, Buhidma M, Tang L, et al: Monoclonal antibody MIB1 in the assessment of cervical squamous intraepithelial lesions. Int J Gynecol Pathol 15:131-136, 1996
24. Kobayashi I, Matsuo K, Ishibashi Y, et al: The proliferative activity in dysplasia and carcinoma in situ of the uterine cervix analysed by proliferating cell nuclear antigen immunostaining and silver-binding argyrophilic nucleolar organiser region staining. Hum Pathol 25:198-202, 1994
25. Payne S, Kernohan NM, Walker F: Proliferation in the normal cervix and in preinvasive cervical lesions. J Clin Pathol 49:667-671, 1996
26. Al-Saleh W, Delvenne P, Greimers R, et al: Assessment of Ki-67 antigen immunostaining in squamous intraepithelial lesions of the uterine cervix. Correlation with the histologic grade and human papillomavirus type. Am J Clin Pathol 104:154-160, 1995
27. Mittal KR, Demopoulos RI, Goswami S: Proliferating cell nuclear antigen (cyclin) expression in normal and abnormal cervical squamous epithelia. Am J Surg Pathol 17:117-122, 1993
28. McCluggage WG, Maxwell P, Bharucha H: Immunohistochemical detection of metallothionein and MIB1 in preinvasive and invasive uterine cervical squamous lesions. Int J Gynecol Pathol 17:29-35, 1998
29. Raju GC: Expression of the proliferating cell nuclear antigen in cervical neoplasia. Int J Gynecol Pathol 13:337-341, 1994
30. Heatley MK: Proliferation in the normal cervix and in preinvasive cervical lesions. J Clin Pathol 49:957, 1996
31. Bharucha H, McCluggage G, Lee J, et al: Grading cervical dysplasia with AgNORs using a semi-automated image analysis system. Anal Quant Cytol Histol 15:323-328, 1993
32. Rowlands DC: Nucleolar organising regions in cervical intraepithelial neoplasia. J Clin Pathol 41:1200-1202, 1988
33. Bulten J, Van der Laak JAWM, Gemmink JH, et al: MIB1, a promising marker for the classification of cervical intraepithelial neoplasia. J Pathol 178:268-273, 1996
