A procedure for evaluating the reliability of a gingivitis index

Albert Kingman Epidemiology and Oral Disease Prevention Program, National Institute of Dental Research, National Institutes of Health, Bethesda, MD, USA

Kingman A: A procedure for evaluating the reliability of a gingivitis index. J Clin Periodontol 1986; 13: 385-391.

Abstract. A methodology is presented for assessing the reliability of an ordinally-scaled index and is illustrated by using data from a clinical trial in which gingival inflammation was assessed with the PMGI index, independently, by 5 examiners. One of the examiners was an experienced examiner, the others newly trained. All subjects were evaluated by each examiner initially and at the end of the study period. The reliability of the average score per subject, the maximum score per subject, and the % of affected sites per person are estimated by the intraclass correlation coefficient. Procedures are presented that utilize various forms of the weighted kappa statistic for dissecting patterns in examiner agreement for specific sites, types of site, all sites, and for the individual components and categories of the index. It is shown how these procedures can be useful for training and calibrating multiple examiners, who will be using such an index in a clinical study, so that adequate reliability levels can be realized.

Key words: Methodology - statistical methods - kappa statistic - correlation.

Several indices assessing plaque and gingival inflammation have been developed over the years. Reviews of these indices can be found in Fischman (1986) and Ciancio (1986). Ciancio pointed out that all gingival indices rely on gingival color, contour, bleeding or extent of involvement. Since there is a plethora of plaque and gingivitis indices available, one needs to decide which index is appropriate for any specific clinical study. The choice will depend upon several factors, including the purpose of the study, its duration, the type and extent of change expected, and the reliability that can be achieved with the index. The degree of reliability competent examiners can achieve with a particular index is basic to many design considerations, and will be the focus of this discussion. Typically, indices used to assess plaque or gingivitis are based on an ordinally scaled variable. That is, the categories used to describe a site examined represent ordered gradations in a sign or symptom (extent or weight of plaque) or clinical manifestation (gingival inflammation) of the "disease" being evaluated. These categories have traditionally been assigned the scores 0, 1, 2, 3, representing none, mild, moderate or severe levels of the particular characteristic assessed. The average score of the sites assessed is commonly used to summarize the current status of the person. In comparative studies, group means observed at the conclusion of the study period are often compared by using the analysis of covariance.

The purpose of this paper is to present a methodology for assessing the reliability of such an index, be it a plaque, gingival inflammation, or periodontal index. The PMGI (De la Rosa & Sturzenberger 1976) will be used to illustrate the methods, using data from a study in which gingivitis was assessed with a non-invasive form of this index (Sturzenberger et al. 1985). A schematic of the study design is given in Fig. 1. In this comparative clinical study, a group of volunteers was randomly assigned to one of two groups. All subjects were examined initially and after 40 days. Immediately after the initial exam, subjects in group I were given a prophylaxis, and nothing was given to those in group II; 30 days later those in group II were given a prophylaxis, and nothing was given to subjects in group I. All subjects were examined again after an additional 10 days. 5 examiners, A, B, C, D and E, participated in the study, with each examiner evaluating all subjects once at the initial and final exams. Examiner A was experienced in the use of the PMGI, the others were newly trained.

Fig. 1. Study design for an evaluation of proficiency of gingivitis examiners.


Methods

The PMGI index is based on an ordinally-scaled variable having four categories, representing increasing degrees of gingival inflammation. The clinical criteria adopted are those given by Loe & Silness (1963) and are described as:

Score  Description
0      No inflammation; normal gingiva
1      Mild inflammation; slight change in color and little change in texture
2      Moderate inflammation; moderate glazing, redness, edema and hypertrophy; bleeding on pressure*
3      Severe inflammation; marked redness and hypertrophy; tendency to spontaneous bleeding; ulceration

*This provocation was omitted in this study.

These criteria are applied to all available papillary and marginal sites of the subject, both lingually and labially. Thus there are 116 possible sites scored per subject. This set of scores given a subject can be summarized in several ways. The simplest, conceptually, is by the mean (PMGI). A second, termed the percent occurrence (PO), focuses on a dichotomy of the original scale, and represents the proportion of affected sites. Here, an affected site is one showing any degree of inflammation. The third measure investigated was the maximum of the site-specific scores. We will focus our attention on assessing examiner reliability of the mean score per subject (PMGI), since this is the usual summary measure. However, the reliabilities for the percent occurrence (PO) and the maximum score (MAX) will also be evaluated. Examiner reliability will be assessed at two levels: for scores representing a subject, and for scores representing individual sites. Intraclass correlation coefficients will be used to assess the reliability of subjects' scores, and various kappa statistics for specific sites within and/or between subjects.
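The three summary measures just defined are simple functions of a subject's site scores. A minimal sketch (not from the paper; numpy-based, with hypothetical site scores) of how they might be computed:

```python
import numpy as np

def summary_measures(site_scores):
    """Per-subject summaries of 0-3 site scores: mean (PMGI),
    percent occurrence (PO, proportion of sites with any inflammation),
    and the maximum site score (MAX)."""
    s = np.asarray(site_scores)
    return s.mean(), (s > 0).mean(), s.max()

# Hypothetical subject with 116 scored sites
rng = np.random.default_rng(1)
sites = rng.choice([0, 1, 2, 3], size=116, p=[0.65, 0.30, 0.045, 0.005])
pmgi, po, mx = summary_measures(sites)
```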

Subject-based scores

The averages for the 3 summary measures are presented in Table 1 for each examiner at the initial and final exam, together with those for the difference score (initial - final). Here, the summary measures for a subject are based on all sites. There seem to be systematic differences among these examiners. These will be tested first.

In general, if each of n subjects is examined by m examiners at s independent times, the PMGI score (PO or max) for the ith subject when examined by the jth examiner on the kth occasion is represented by Y_ijk. The model assumed for such data is given as

Y_ijk = μ + S_i + X_j + I_ij + e_ijk,   (1)

where S_i refers to the effect of the ith subject, X_j the effect of the jth examiner, I_ij an interaction effect, and e_ijk a random error. This model is either a random effects or a mixed model. The effects due to subjects and error are assumed random effects, distributed independently of each other, having mean 0 and variances σ_s² and σ_e², respectively. The examiner effect is either assumed random (if considered a "random" sample from a larger group of examiners) or fixed (only examiners of interest). When the examiner effect is assumed random, it has mean 0 and variance σ_x², and is distributed independently of subjects and error. The interaction term is assumed to have mean 0 and variance σ_I² in either case, and is distributed independently of the other effects. If examiners are fixed, a correlation is imposed among the I_ij by the constraint Σ_j I_ij = 0.

In the special case where one examiner makes all s assessments, that is, m = 1, there are ns observations, and model (1) becomes

Y_ik = μ + S_i + e_ik,   (2)

where S_i is defined as previously and e_ik represents the within-subject variation and is assumed to have mean 0 and variance σ_e². In general, for either model (1) or (2), the analysis of variance table can be written as

Source            d.f.         SS   MS
between subjects  n-1          SSB  MSB
within subjects   n(ms-1)      SSW  MSW
  examiners       m-1          SSX  MSX
  interaction     (n-1)(m-1)   SSI  MSI
  error           nm(s-1)      SSE  MSE

using the customary notation for sums of squares and mean squares. Tests for differences among examiners can be derived depending upon the model assumed. In this study model (1) is assumed. Since each examiner evaluated each subject once (s = 1) at a particular exam, the interaction and error factors are confounded. In such a case one can still use the AOV table described above, provided one removes the row labeled "error" and uses the row labeled "interaction" in its stead. Also, the row labeled "within subjects" is redundant, having no direct relevance here. Thus, the AOV table for this data set becomes:

Source            d.f.         SS   MS
between subjects  n-1          SSB  MSB
examiners         m-1          SSX  MSX
interaction       (n-1)(m-1)   SSI  MSI
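To make the AOV bookkeeping concrete, here is a minimal sketch (not part of the original analysis; the data are simulated) that computes the mean squares for the s = 1 layout and the F statistic for examiner differences, tested against the interaction mean square:

```python
import numpy as np

def aov_subjects_by_examiners(y):
    """Mean squares for model (1) with s = 1.
    y: (n, m) array, rows = subjects, columns = examiners.
    Returns (MSS, MSX, MSI)."""
    n, m = y.shape
    grand = y.mean()
    ssb = m * ((y.mean(axis=1) - grand) ** 2).sum()   # between subjects, df = n-1
    ssx = n * ((y.mean(axis=0) - grand) ** 2).sum()   # examiners, df = m-1
    ssi = ((y - grand) ** 2).sum() - ssb - ssx        # interaction/error, df = (n-1)(m-1)
    return ssb / (n - 1), ssx / (m - 1), ssi / ((n - 1) * (m - 1))

rng = np.random.default_rng(0)
scores = rng.normal(0.4, 0.15, size=(137, 5))   # hypothetical per-subject PMGI means
MSS, MSX, MSI = aov_subjects_by_examiners(scores)
F_examiners = MSX / MSI   # compare with F(m-1, (n-1)(m-1))
```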

Table 1. Average scores for specific summary measures of the PMGI index based on all sites

                 Initial exam           Final exam             Difference score
Examiner   N    PMGI   PO     max     PMGI   PO     max     PMGI    PO      max
A         137   0.362  0.305  1.489   0.250  0.225  1.226   0.112   0.080   0.263
B         137   0.580  0.545  1.467   0.472  0.442  1.387   0.108   0.104   0.080
C         137   0.372  0.342  1.496   0.464  0.436  1.445   -0.09   -0.094  0.051
D         137   0.327  0.309  1.372   0.297  0.294  1.102   0.030   0.015   0.270
E         137   0.538  0.508  1.496   0.234  0.226  1.131   0.304   0.282   0.365


Table 2. Analyses of variance for average PMGI scores on all sites

                    Initial exam             Final exam               Difference score
Source     d.f.   MS      F      P        MS      F       P        MS      F      P
subjects   136    0.2160  10.02  0.000    0.2034  12.80   0.000    0.1058  3.70   0.000
examiners  4      1.8030  83.61  0.000    1.8504  116.41  0.000    2.8500  99.77  0.000
error      544    0.0215                  0.0158                   0.0286

The analyses presented in Table 2 clearly show that there are significant differences among the 5 examiners at the initial and final exams (p < 0.001 and p < 0.001). Examiner differences were also found for the difference score (p < 0.001). Scheffé multiple comparison tests indicated that the PMGI scores for examiners B and E were significantly larger than those of the other three at the initial exam. However, at the final exam, the PMGI scores of examiners B and C were significantly larger than those of the other three. For the difference scores, examiner E's scores were larger than the others, examiners A and B next, and examiners C and D the smallest. A more complete analysis in which prophylaxis group was included, although not presented here, showed similar patterns existed among the scores of these examiners for each group, separately.

Examiner reliability is examined next. Inter- or intra-examiner reliability is assessed by using the intraclass correlation coefficient, defined as the correlation between measurements made on the same subject (Scheffé 1959). The exact form of this coefficient will depend upon the model assumed: that is, whether the model is 1-way or 2-way, and whether examiners are considered as a fixed or random effect. For the 2-way model (1) the intraclass correlation coefficients are (Fleiss et al. 1979)

ρ_r = σ_s² / (σ_s² + σ_x² + σ_I² + σ_e²),   (3)

ρ_f = (σ_s² - σ_I²/(m-1)) / (σ_s² + σ_I² + σ_e²),   (4)

for examiners considered as random and fixed, respectively. In the special case involving one examiner (the 1-way model), the corresponding intraclass correlation coefficient is given as (Shrout & Fleiss 1979)

ρ = σ_s² / (σ_s² + σ_e²).   (5)

These correlation coefficients can be estimated by using the mean squares from the analysis of variance table. Unbiased estimates of the variance components exist and can be used to obtain consistent estimates of the intraclass correlation coefficients in (3), (4) and (5), respectively. They are

r = n(MSS - MSI) / [n MSS + m MSX + (nm - n - m) MSI + nm(s-1) MSE],   (6)

r = (MSS - MSI) / [MSS + (m-1) MSI + m(s-1) MSE],   (7)

r = (MSS - MSW) / [MSS + (s-1) MSW],   (8)

respectively. For our example, n = 137, m = 5 and s = 1. The corresponding estimates of the reliability coefficients are obtained by using (6) or (7). Since the interaction and error factors are confounded, the variance components for error and interaction are not separately estimable. In this case one must either assume the examiners are a random effect or assume the interaction is zero. For examiners as random, (6) reduces to

r = n(MSS - MSI) / [n MSS + m MSX + (nm - n - m) MSI].

The reliability of the 5 examiners is presented in Table 3 for the three summary measures based on all sites, papillary sites and marginal sites. The reliability of the examiners appears fairly consistent for the two examinations, but its magnitude indicates examiner reliability is, at best, moderate. For these data the reliability of the difference score is lower than either prevalence score. Furthermore, these data suggest that the average PMGI score is more reliable than either the percent occurrence (PO) or the maximum score. Normally, if one had 5 examiners, one would probably investigate the reliability levels for various subsets of examiners to investigate the source of the differences. It could well be that one or two examiners are scoring differently than the others, and that, after further training, such deficiencies can be eliminated. In this study, however, 4 inexperienced examiners and 1 experienced examiner are participating. Here it makes sense to compare each of the inexperienced examiners with the experienced examiner separately. The intraclass correlation coefficients for pairs of examiners are presented in Table 4 for the initial and final exam, and also the difference score. There is minor variation among the four reliability magnitudes for the pairs of examiners using the mean PMGI at the initial and final exam.
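Formula (6) can be checked directly against the published mean squares. A sketch (plain Python) using the initial-exam values of Table 2:

```python
def icc_random_examiners(MSS, MSX, MSI, n, m):
    """Estimate of rho_r from (6) with s = 1, so the MSE term drops out."""
    return n * (MSS - MSI) / (n * MSS + m * MSX + (n * m - n - m) * MSI)

# Initial-exam PMGI mean squares from Table 2 (n = 137 subjects, m = 5 examiners)
r = icc_random_examiners(MSS=0.2160, MSX=1.8030, MSI=0.0215, n=137, m=5)
print(round(r, 3))  # ~0.53, matching the all-sites initial PMGI entry of Table 3
```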

Table 3. Reliability estimates for specific summary measures of the PMGI by types of sites at each examination

             Initial exam           Final exam             Difference score
Site type    PMGI   PO     max     PMGI   PO     max     PMGI   PO     max
papillary    0.443  0.532  0.318   0.441  0.400  0.413   0.217  0.167  0.038
marginal     0.592  0.545  0.504   0.490  0.387  0.334   0.231  0.190  0.102
all sites    0.529  0.562  0.417   0.422  0.486  0.407   0.239  0.188  0.111

Table 4. Reliability estimates for pairs of examiners and specific summary measures of the PMGI index based on all sites

                Initial exam         Final exam           Difference score
Examiner pair   PMGI  PO    max     PMGI  PO    max     PMGI  PO    max
A versus B      0.49  0.31  0.44    0.58  0.44  0.49    0.33  0.27  0.09
A versus C      0.63  0.56  0.42    0.53  0.42  0.40    0.23  0.16  0.05
A versus D      0.52  0.47  0.50    0.64  0.61  0.32    0.18  0.10  0.12
A versus E      0.54  0.38  0.33    0.66  0.60  0.42    0.27  0.17  0.02


However, there is much more variation among those for the difference score, both in an absolute and a relative sense. These data would suggest that the final PMGI be used as the dependent variable rather than the difference score. Since there are substantial differences among the 5 examiners (the variance component for examiners is 18% to 20% of the total), it is useful to identify the probable sources of these differences. This can be done by a systematic study of the patterns among scores on specific sites, or collections of sites.

Specific sites

For assessing examiner agreement on a site-specific basis, the scale of measurement is ordinal, and a categorical data analytic method is appropriate. In this setting, we assume we have a set of n sites, possibly representing n subjects at a specific site or a smaller number of subjects at several sites, which have been evaluated s times. The variable can take on one of k ordered values (for example, none, mild, moderate or severe gingival inflammation). In this context s could be the number of different examiners who evaluated these sites, or the number of times one examiner evaluated these sites, or some combination. For s = 2, we can display the relative frequencies of pairs of diagnoses in a 2-way table. For s > 2, the methods become much more involved and will not be discussed here. However, excellent discussions of such cases can be found in Davies & Fleiss (1982), Landis & Koch (1977b), and Tanner & Young (1985). In this paper, only the case of 2 examiners will be discussed. The data from examiners A (experienced) and B (inexperienced) when evaluating all sites at the initial exam will be used to illustrate the procedures.

Table 5. Joint frequency distribution of diagnoses for examiners A and B on all sites at the initial exam (count and proportion)

A/B         0             1             2            3           Total
0       5726 (0.387)  4481 (0.303)   68 (0.005)    1 (0.000)   10276 (0.695)
1        927 (0.063)  2600 (0.176)  173 (0.012)    0 (0.000)    3700 (0.251)
2         61 (0.004)   498 (0.034)  240 (0.016)    6 (0.001)     805 (0.055)
3          0 (0.000)     0 (0.000)   18 (0.001)    5 (0.000)      23 (0.001)
Total   6714 (0.454)  7579 (0.512)  499 (0.034)   12 (0.001)   14804

The data are presented in Table 5, and represent the joint frequency distribution for all pairs of diagnoses for these examiners. The entries on the main diagonal represent those sites for which the examiners agree, the others some type of disagreement. If we let p_ij denote the proportion of sites that examiner A diagnosed as i and examiner B as j, then the proportion of sites on which agreement is realized for these examiners, denoted by p_o, is

p_o = Σ_i p_ii.

The extent of examiner agreement can then be evaluated. This is done by comparing the proportion of observed agreement, p_o, with the proportion of expected agreement, p_c, where expected agreement is calculated under the assumption that the two examiners call independently of each other. Thus, expected agreement is given by

p_c = Σ_i p_i. p_.i,

where p_i. and p_.i denote the marginal proportions for examiners A and B, respectively. The magnitude of expected agreement depends on the marginal distributions for the examiners. For example, if 1/4 of the sites were diagnosed into each of the four categories by each examiner, the expected agreement would be 0.25. However, if 1/2 of the sites were diagnosed with mild inflammation and 1/2 with no inflammation by both examiners, then expected agreement would be 0.50. Therefore, the proportion of observed agreement, p_o, needs to be adjusted to compensate for such chance agreement. This can be done in two ways.

The first is by using a kappa statistic. Kappa statistics use a standardized ratio approach (Cohen 1960). One considers the excess of observed agreement over that expected by chance. This is done by computing the difference p_o - p_c and comparing its magnitude with the maximum possible excess agreement one could have obtained, 1 - p_c, by considering their ratio. That is,

κ = (p_o - p_c) / (1 - p_c).

A second approach consists of using a log-linear model. Here one would fit a linear model to the logarithm of the expected agreement proportions, incorporating the independence model, together with additive terms which can account for possible discrepancies in agreement from those attributed to the independence model (Tanner & Young 1985).

We will use various weighted kappa statistics (Landis & Koch 1977a) to dissect patterns in examiner agreement. The weighted kappa statistic, which incorporates a set of weights, w_ij, given to the cells of the 2-way table of frequencies, is a generalization of the kappa statistic. This technique provides one with the possibility of awarding partial credit to pairs of disagreement diagnoses that are "close" to being agreements. Without loss of generality, the w_ij are selected to satisfy 0 ≤ w_ij ≤ 1, with w_ii = 1. For the unrestricted linear weights given in Table 6a, the associated weighted kappa for the data of Table 5 is κ_w = 0.29.

For a dichotomous-ordinal type variable, Cicchetti (1976) has derived an associated set of weights which considers disagreements that involve a "disease/no disease" type worse than disagreements involving two gradations of "disease". The pattern among these weights for k = 4 is given in Table 6b. These are termed restricted linear weights. The corresponding weighted kappa is κ_w = 0.31.

Table 6. Unrestricted and restricted linear weights for a gingival index with 4 categories

(a) Unrestricted weights
A/B   0     1     2     3
0     1.00  0.67  0.33  0.00
1     0.67  1.00  0.67  0.33
2     0.33  0.67  1.00  0.67
3     0.00  0.33  0.67  1.00

(b) Restricted weights
A/B   0     1     2     3
0     1.00  0.60  0.20  0.00
1     0.60  1.00  0.80  0.40
2     0.20  0.80  1.00  0.80
3     0.00  0.40  0.80  1.00

If a score of 1 is evidence of disease, as may be appropriate for gingival inflammation assessed by the PMGI, the restricted linear weights represent the more appropriate measure of agreement. A kappa value of 0.31 represents fair agreement (Landis & Koch 1977a). In Table 7, weighted kappas using restricted weights are presented for all sites, papillary and marginal sites for examiners A and B at the initial and final examination. Thus, there is fair agreement between examiners A and B on individual sites at both examinations.

Another set of weights, given by w_ij = 1 - (i - j)²/(k - 1)², can be used to assign graduated penalties to the disagreements (Fleiss 1981). The associated weighted kappa is approximately equivalent to the intraclass correlation coefficient one would have computed using the 2-way analysis of variance model (1) on the actual scores (Fleiss & Cohen 1973). To illustrate this point, the data representing the maximum PMGI scores for examiners A and B from the initial exam, given in Table 8a, will be used. Here the weights are those given in Table 8b, and the value of the corresponding weighted kappa is κ_w = 0.44. This is the same value as that given in Table 4 for examiners A and B using the maximum PMGI score at the initial examination.

So far we have considered overall examiner agreement. That is, agreement has been assessed over the entire range of the diagnostic variable. Methods for dissecting the patterns among examiner disagreement would also include an investigation of the reliability of the individual categories. The reliability of the ith category is estimated by dichotomizing the scale into category i versus all others, and computing the basic kappa for the resulting 2 x 2 table (Fleiss 1981). Equivalently, this can be done using weighted kappa. In Table 9a, b, c and d, the sets of weights corresponding to the categories "none", "mild", "moderate" and "severe" inflammation are illustrated, respectively, together with their computed weighted kappas. The least reliable category is the one representing "mild" inflammation. The reliabilities of the categories representing "no" inflammation and "moderate" inflammation were significantly larger (p < 0.001 and p < 0.001). However, this method considers a particular category versus all others, and, as such, does not identify the particular type of disagreement.

Alternatively, one can use an iterative approach to investigate the pattern in examiner agreement for specific subsets of the range of the diagnostic variable. To do this one successively expands the number of cells which are assigned a weight of 1, or given full credit for agreement. Differences between successive weighted kappas can be used to assess the relative importance of that particular type of disagreement. The data from Table 5 will be used to illustrate the procedure. One begins with the basic kappa having weights given in Table 10a. Then the sets of weights given in Tables 10b, c and d can be used to obtain kappas which ignore disagreements between the categories "none"-"mild", "mild"-"moderate" and "moderate"-"severe", respectively. The differences between these kappas and the original kappa are used to assess the relative importance of that specific type of disagreement. Clearly, the most significant disagreement is that of distinguishing between the "none" and "mild" categories.
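The weighted kappas of this section differ only in the weight matrix supplied. Continuing the sketch above (reusing `table5`), the restricted linear weights of Table 6b reproduce κ_w = 0.31:

```python
import numpy as np

def weighted_kappa(counts, w):
    """Weighted kappa for a k x k table of joint counts and weight matrix w."""
    p = counts / counts.sum()
    po = (w * p).sum()                                        # weighted observed agreement
    pc = (w * np.outer(p.sum(axis=1), p.sum(axis=0))).sum()   # weighted chance agreement
    return (po - pc) / (1 - pc)

w_restricted = np.array([[1.0, 0.6, 0.2, 0.0],
                         [0.6, 1.0, 0.8, 0.4],
                         [0.2, 0.8, 1.0, 0.8],
                         [0.0, 0.4, 0.8, 1.0]])
print(round(weighted_kappa(table5, w_restricted), 2))  # 0.31
```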

Table 7. Weighted kappas for examiners A and B corresponding to the restricted linear weights for types of sites

                  Initial exam   Final exam
papillary sites   0.21           0.26
marginal sites    0.36           0.33
all sites         0.31           0.30

Table 8. Distribution of maximum PMGI scores for examiners A and B at the initial examination

(a) Maximum PMGI scores
A/B   1    2    3
1     53   20   0
2     21   40   0
3     0    2    1

(b) Intraclass correlation weights
A/B   1     2     3
1     1.00  0.75  0.00
2     0.75  1.00  0.75
3     0.00  0.75  1.00

Table 9. Reliabilities of the individual categories for examiners A and B for all sites at the initial examination

(a) "none" versus all others (κ_w = 0.28)
A/B  0  1  2  3
0    1  0  0  0
1    0  1  1  1
2    0  1  1  1
3    0  1  1  1

(b) "mild" versus all others (κ_w = 0.19)
A/B  0  1  2  3
0    1  0  1  1
1    0  1  0  0
2    1  0  1  1
3    1  0  1  1

(c) "moderate" versus all others (κ_w = 0.34)
A/B  0  1  2  3
0    1  1  0  1
1    1  1  0  1
2    0  0  1  0
3    1  1  0  1

(d) "severe" versus all others (κ_w = 0.28)
A/B  0  1  2  3
0    1  1  1  0
1    1  1  1  0
2    1  1  1  0
3    0  0  0  1

Table 10. Reliabilities of combinations of categories for examiners A and B for all sites at the initial examination

(a) Basic kappa weights (κ = 0.24)
A/B  0  1  2  3
0    1  0  0  0
1    0  1  0  0
2    0  0  1  0
3    0  0  0  1

(b) "none"-"mild" disagreements ignored (κ_w = 0.36)
A/B  0  1  2  3
0    1  1  0  0
1    1  1  0  0
2    0  0  1  0
3    0  0  0  1

(c) "mild"-"moderate" disagreements ignored (κ_w = 0.28)
A/B  0  1  2  3
0    1  0  0  0
1    0  1  1  0
2    0  1  1  0
3    0  0  0  1

(d) "moderate"-"severe" disagreements ignored (κ_w = 0.24)
A/B  0  1  2  3
0    1  0  0  0
1    0  1  0  0
2    0  0  1  1
3    0  0  1  1

(e) Both "none"-"mild" and "moderate"-"severe" ignored (κ_w = 0.38)
A/B  0  1  2  3
0    1  1  0  0
1    1  1  0  0
2    0  0  1  1
3    0  0  1  1
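The successive kappas of Table 10 can be generated with the same helper (a sketch reusing `weighted_kappa` and `table5` from the sketches above):

```python
import numpy as np

base = np.eye(4)                                        # (a) basic kappa weights
print(round(weighted_kappa(table5, base), 2))           # 0.24
for name, (i, j) in {"none-mild": (0, 1),
                     "mild-moderate": (1, 2),
                     "moderate-severe": (2, 3)}.items():
    w = base.copy()
    w[i, j] = w[j, i] = 1.0                             # give full credit to this disagreement
    print(name, round(weighted_kappa(table5, w), 2))    # (b)-(d): 0.36, 0.28, 0.24
w_e = np.eye(4)                                         # (e) collapse {0,1} and {2,3}
w_e[0, 1] = w_e[1, 0] = w_e[2, 3] = w_e[3, 2] = 1.0
print(round(weighted_kappa(table5, w_e), 2))            # 0.38
```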


Table 11. Marginal % distribution of diagnoses for all sites at the initial examination by examiner

Examiner   None   Mild   Moderate   Severe
A          69.4   25.0   5.4        0.1
B          45.4   51.2   3.4        0.1
C          65.6   31.5   2.7        0.1
D          69.1   29.2   1.7        0.0
E          49.0   48.1   2.8        0.1
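The comparison of marginal distributions discussed below can be approximated with a standard χ² test. A hedged sketch (the paper does not state the exact test used; this version treats the two examiners' marginal counts from Table 5 as independent samples, ignoring the pairing of sites):

```python
from scipy.stats import chi2_contingency

# Marginal site counts (none, mild, moderate, severe) for examiners A and B
a_marginal = [10276, 3700, 805, 23]
b_marginal = [6714, 7579, 499, 12]
chi2, p, dof, expected = chi2_contingency([a_marginal, b_marginal])
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.3g}")  # p << 0.001
```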

"mind" categories. This pattern was consistent on a site-specific basis for examiners A and B with their restricted linear weighted kappa values ranging from -0.06 to 0.48. The marginal distributions of diagnostic calls at the initial exam for the five examiners are illustrated in Table 11, X^ tests comparing the marginal distributions for examiners A and B showed they were significantly different (p< 0,001), Examiner As proportion of sites diagnosed with no inflammation was significantly larger than examiner B's (/)< 0.001), whereas the proportions for sites diagnosed with mild inflammation were significantly different (/7< 0,001), but in the opposite direction. Thus, examiner B was systematically diagnosing more inflammation than examiner A. Whenever the marginal distributions are different for a pair of examiners, substantial or excellent examiner agreement is precluded. However, excellent comparability levels between the marginal distributions for a pair of examiners does not necessarily imply a substantial examiner agreement level. An example is provided by the data for examiners A and D, illustrated in Table 12. Clearly, the marginal distributions are nearly identical, but the corresponding kappa (restricted linear weights) is ;
