A GEOMETRIC APPROACH TO CONDITIONAL INTER-RATER AGREEMENT

DAVID EUBANKS
Abstract. Use of multiple raters to assign categorical or ordinal data is common, e.g. in rubric scoring for assessment in education. While several measures exist to compare the actual rate of matching to a random baseline, they often produce a statistic that is hard to interpret and gives little information about internal reliability of the scale (that is, agreement between the individual categories). We derive a new statistical test of inter-rater agreement that permits detailed analysis of rater ability to distinguish between each pair of categories. It is based on a geometrical approach that is simple to understand and leads to an intuitive and informative visual presentation of results. Asymptotic formulas for mean and variance are given, so that standard hypothesis testing can be done. Use is illustrated with a large sample of rubric ratings from composition courses. The supplementary material includes complete code in R for calculation of the statistics, Monte Carlo simulation (for small sample sizes), and graphical reports.
1. Introduction

Measures of inter-rater agreement are important in distinguishing accurate classification from random assignment. In education, student work is often judged by multiple raters using an ordinal rubric scale, and in medicine multiple experts may each make a categorical diagnosis of a subject. In cases where the number of scale outcomes is small (e.g. one to five on a typical Likert scale), it makes sense to ask how often raters agree in comparison to the frequency of agreement one would expect from a random distribution of ratings. There is a long history of statistics devoted to this task, perhaps starting with Galton [7], as noted by Smeeton in [13]. The best known measures of inter-rater agreement may be Cohen's Kappa [3] and a more general statistic by Fleiss [5]. There is considerable literature on the general topic, spanning statistics, education, psychology, medicine, and other fields. Readers are pointed to the book [2] for the general topic of analysis of categorical data, and to [8] for a recent survey of inter-rater reliability measures. The literature on the approach of Cohen and Fleiss (and others in the same vein) is often critical of the design of the Kappa statistics, while simultaneously admitting that use of these measures has become standard.

In this paper we propose a new measure of inter-rater agreement with an eye toward the practitioner who has to make sense of it. There are two "mundane" issues that are a bother to the Kappa consumer. The first is that a single measure of agreement is not very helpful. As an analogy, it is more useful to have a graph of a random variable's distribution than just a point estimate of its mean. Our solution is to provide more detail, including graphs that show where and how agreement happens, and where it does not, within a rating scale. The second practical issue is how to interpret the agreement statistic. In [8] one can find some "paradoxes" that
Table 1. Coin Flips

Paper    "Good" Ratings    "Poor" Ratings
1        0                 2
2        1                 1
3        1                 1
4        2                 0
arise from the assumptions built into the Cohen kappa. We will adopt a simple and intuitive approach that permits easily understood formulas (e.g. for sample mean and standard deviation) and lends itself to transparent graphical presentation. In short, the goal is to provide a useful tool that is easily understood.

Throughout the paper, we will refer to those making the categorical assignments as raters, who assign ratings to subjects of study, using a common scale with a small number of ordinal or categorical outcomes. For example, a rubric for assessing a writing sample might have a rating for "Use of Evidence", with accompanying descriptions of five outcomes that are encoded with integers one through five, such as a "poor" to "excellent" Likert scale. We will need to refer to the sizes of these parameters, and use S for the number of subjects, k for the number of outcomes in the scale, and n_{ij} for the number of raters who assigned outcome j to subject i, a table with subjects as rows and outcomes as columns, with counts in the body. In the literature it is conventional to replace an index with a symbol like "*" to denote the marginal sum, and we will do the same here. For example, n_{*1} is the sum over all rows of column one. Generally we will be working with only two outcomes at a time, using only those subjects with at least two ratings within those outcomes, and use N for the total number of ratings in that set as a convenience. We will not insist on a constant number of raters per subject, nor take their particular characteristics into consideration (e.g. not all raters have to rate all subjects). In many practical cases, those assumptions are too restrictive. The intent is to create a multi-dimensional way to examine issues of inter-rater agreement that will be useful to those who work with such data.

2. Characteristics of Random Ratings

Cohen's idea was that it is not good enough to simply assess how often two raters agree on a categorical outcome for a particular subject, but that we should take into account how often we might see the same agreement "by chance". Exactly what that means has been the subject of debate [11], so we will first consider that question. In the following, we will proceed more along the lines of the Fleiss Kappa than the original Cohen Kappa. See [8] chapter 2 for a nice development of the Fleiss Kappa.

Consider "ratings" done by two lazy raters of student papers who simply flip coins and then assign a 1 for "Good" or 2 for "Poor", depending on whether the coin is heads or tails. With two ratings for each paper, each row must be populated with either (2, 0), (1, 1) or (0, 2). With random coin tosses, we would expect the (1, 1) rows to occur twice as often as the other two types, according to the binomial distribution. Therefore, if we saw rows that look like table 1, we might suspect that they were random (imagine this pattern aggregated over many rows in various orders of appearance).
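As a quick illustration of this baseline, a few lines of R (our own toy simulation, not the supplementary code) generate coin-flip ratings for many papers and tabulate the row types; the (1, 1) rows appear roughly twice as often as (2, 0) or (0, 2):

```r
# Simulate two "lazy raters" who flip fair coins for each paper and record
# each paper's row as (count of "Good" ratings, count of "Poor" ratings).
set.seed(1)
n_papers <- 10000
ratings  <- matrix(sample(1:2, 2 * n_papers, replace = TRUE), ncol = 2)

good <- rowSums(ratings == 1)   # "Good" (outcome 1) ratings per paper
poor <- rowSums(ratings == 2)   # "Poor" (outcome 2) ratings per paper

# Relative frequencies of the row types; roughly 1/4, 1/2, 1/4.
table(paste0("(", good, ",", poor, ")")) / n_papers
```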
Figure 1. Random Ratings with Two Raters

We will create a visualization of this random reference data by plotting it as if each row were a 2-vector on a plane, added head-to-tail in the usual way vectors are summed. This is an intuitive way to compare rating sets, and it makes sense to adjust the scale so as not to depend on the number of rows in the table or the number of raters involved. We will do this by dividing the elements of the table by the sum of its elements (N) so that the scaled version sums to one, like a probability density. On the adjusted scale, the axes of table 1 would range over (0, .5) instead of (0, 4). If we had three raters instead of two, a representative table of random coin flips would be (0, 3), (1, 2), (1, 2), (1, 2), (2, 1), (2, 1), (2, 1), (3, 0), according to the binomial distribution. Of course, we would not expect actual coin flips to be so accommodating as to present us with this precise pattern, but over a large number of observations we would expect a similar distribution of rows to become evident. Curves created by summing these vectors for increasing numbers of raters are shown in figure 2. We can see from the flattening of the curves as n increases that they approach the straight diagonal (dotted) line, which is the asymptotic case (an infinite number of raters). We will be interested in the lengths of vector paths like these as a statistical measure, which we will call λ.
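A minimal R sketch of this construction (our own illustration, not the supplementary code) scales a two-column count table by N, sorts the row vectors by slope, and accumulates them into the path; the data here mimic table 1:

```r
# Given a two-column matrix of rating counts (rows = subjects, columns = the
# two outcomes), return the slope-sorted cumulative vector path, scaled by N.
vector_path <- function(counts) {
  N      <- sum(counts)
  scaled <- counts / N                       # the path ends at the column proportions
  ord    <- order(scaled[, 2] / scaled[, 1], decreasing = TRUE)  # steepest vectors first
  data.frame(x = c(0, cumsum(scaled[ord, 1])),                   # head-to-tail addition
             y = c(0, cumsum(scaled[ord, 2])))                   # = cumulative sums
}

# Rows in the spirit of table 1 (columns: "Good" and "Poor" counts).
rows <- rbind(c(0, 2), c(1, 1), c(1, 1), c(2, 0))
path <- vector_path(rows)

# The path length lambda is the sum of the scaled row-vector lengths.
lambda <- sum(sqrt(rowSums((rows / sum(rows))^2)))
```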
Figure 2. The Effect of the Number of Raters on λ

Note that by construction, all possible sets of rows with the given proportions can be represented within the rectangular dimensions of the graph. In fact, by sorting the rows (which we can do since the order carries no meaning for us), we can arrange them with the steepest slopes first, so that the vector sums trace out a convex curve on or above the diagonal, as has been done with the ones in figure 2. Moreover, these curves have a very useful property. Although each of them shares the same proportion of "good" and "poor" ratings (or whatever we are assessing), the match rates between raters increase with the length of the curve. Conceptually, a straight diagonal represents the worst possible agreement, with the two outcomes occurring at the same rate for every subject. By contrast, a curve that makes an inverted 'L' by tracing a line straight up and then horizontally to the right represents the maximum possible match rate as well as the longest curve (with length one). By creating a reference line for expected random curves like the ones in figure 2, we can compare the curve for the actual data and form an idea of how much matching is due to rater agreement.
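Reference curves like those in figure 2 can be sketched in R along the following lines (a toy construction of our own with equal column proportions, not the supplementary code):

```r
# Expected vector path under the null hypothesis: for each subject, the number
# of ratings landing in column 1 is Binomial(n_raters, p1). The expected counts
# of each row type are accumulated exactly as in the observed path.
reference_path <- function(n_raters, n_subjects, p1 = 0.5) {
  j      <- 0:n_raters                                  # possible column-1 counts in a row
  prob   <- dbinom(j, n_raters, p1)                     # expected share of each row type
  rows   <- cbind(j, n_raters - j) * prob * n_subjects  # expected row-type contributions
  scaled <- rows / (n_subjects * n_raters)              # divide by N = total ratings
  ord    <- order(scaled[, 2] / scaled[, 1], decreasing = TRUE)
  data.frame(x = c(0, cumsum(scaled[ord, 1])),
             y = c(0, cumsum(scaled[ord, 2])))
}

# The curves flatten toward the diagonal as the number of raters grows.
plot(c(0, 0.5), c(0, 0.5), type = "n",
     xlab = "outcome 1 (share of ratings)", ylab = "outcome 2 (share of ratings)")
for (n in c(2, 3, 5, 10)) lines(reference_path(n, n_subjects = 100))
abline(0, 1, lty = 3)   # asymptotic diagonal (infinite raters)
```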
3. Calculating Agreement

Our understanding of what a random data set would look like for two outcomes is a basis for determining significant non-random matching. To do this, we follow Fleiss [5] in finding the column frequencies in the table. For simplicity, we focus on the two outcomes in column one and column two of the table, writing b_{ij} for the table of counts n_{ij} restricted to those two columns and to the subjects with at least two ratings in them. The fraction of ratings in column one is p_1 := b_{*1}/N, and the column two fraction is p_2 = 1 − p_1, conditional on only considering these two columns at the moment. The null hypothesis we will test against is that the ratings are just binomial assignments across the two categories with the given probabilities (p_1, p_2). Using the binomial distribution for the number of raters in each row, b_{i*}, i = 1, ..., R, we can construct a path like the ones in figure 2 and calculate an expected length μ_{λ_0}. If our data contain more agreement than this expected random path, that is evidence for agreement greater than chance. By comparing the observed length λ to μ_{λ_0} and using the standard deviation σ_{λ_0}, we can perform a hypothesis test.

Of course, most rating scales have more than two outcomes. This problem could be solved by using a multinomial distribution and giving up on the idea of simple graphs, but we will go in a different direction as a service to the users of the report. Instead of assessing the rating scale as a whole, we will study it in detail by computing conditional rater agreement between each pair of outcomes. This will allow us to detect the agreement quality of ratings within the scale. It could easily be, for example, that raters find good agreement on the ends of an ordinal scale, but that ratings in the middle are not well distinguished from one another. For ordinal scales we would expect adjacent outcomes, like 1 and 2, to have less agreement than more distant ones like 1 and 5. That is, it is easier to distinguish "excellent" from "awful" than it is "excellent" from "very good."

With the path length λ of the rows of rating data in hand, we can compute a version of Kappa that applies to each pair of outcomes, following Cohen's original formulation:

\kappa_\lambda := \frac{\lambda - \mu_{\lambda_0}}{1 - \mu_{\lambda_0}}.

This statistic is provided as just another piece of information about the rater agreement. In combination with the graph and confidence (p-value), it can help inform decisions about the quality of agreement among raters. Derivations of the statistics are found in section 5. First we illustrate with actual ratings.

4. An Example Using Student Peer Reviews

The illustration of λ as a rater agreement statistic uses data drawn from student peer reviews of writing in a college first-year composition course at the University of South Florida, administered through the online system MyReviewers (myreviewers.com). The first sample includes 100 papers that were rated independently by student peers on a rubric scale with outcomes 1 to 4, where 4 represents the highest quality (there is also a rating of zero, but it is used for a different purpose and is excluded from the analysis here). The rubric has eight traits to be assessed, but we will only use the one that pertains to the use of evidence in supporting claims, as described in [9] and [10].

For illustration, first consider the λ-analysis for outcomes 1 and 2, the lowest scores. This is graphed in figure 3. The graph is generated by first identifying those subjects with at least two ratings of either 1 or 2 within the data sample. The conditional column frequencies are computed as (b_{*1}/N, b_{*2}/N), where N is the total number of ratings remaining in these two columns.
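In R, the conditional set-up just described and the κ_λ calculation might be sketched as follows (the toy table and helper names are ours, not the supplementary code):

```r
# Toy full ratings table: 6 subjects (rows) by 4 outcomes (columns).
tab <- rbind(c(2, 1, 0, 0),
             c(0, 2, 1, 0),
             c(1, 1, 0, 0),
             c(0, 0, 2, 1),
             c(3, 0, 0, 0),
             c(0, 1, 0, 0))

# Keep the two outcome columns of interest and only those subjects with
# at least two ratings within that pair.
conditional_pair <- function(tab, j1, j2) {
  sub <- tab[, c(j1, j2), drop = FALSE]
  sub[rowSums(sub) >= 2, , drop = FALSE]
}

sub <- conditional_pair(tab, 1, 2)
N   <- sum(sub)              # total ratings in the conditional pair
p1  <- sum(sub[, 1]) / N     # conditional column frequency for outcome 1
p2  <- 1 - p1

# Kappa-like effect size, once lambda and its null mean are known (section 5).
kappa_lambda <- function(lambda, mu0) (lambda - mu0) / (1 - mu0)
```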
For each subject, the total number of ratings for that subject (b_{i*} in our notation) is used together with the column frequencies to compute the expected binomial vectors for the given distribution.
Figure 3. Rubric Score Analysis for Outcomes 1 and 2

These are sorted by slope and combined to make the thin solid reference line that appears on the graph as a visual guide. Above this reference line are dots that represent the vector addition of the actual data, again sorted by slope to trace out a convex shape. Dots are used so that the density of data is evident from the graph. Dots above the reference line comprise evidence of inter-rater agreement higher than random. The p-value gives the statistical significance of this, and the Kappa statistic is the fraction of the available "room" above the reference line that is accounted for by the data, a kind of effect size.

The proportions of the graph are due to the different relative frequencies of the two outcomes: there are more 2s than 1s. We can see along the top of the graph that several dots overlap the horizontal segment, showing subjects where there was perfect agreement on outcome 2, but that all but three of those were expected by chance. In fact, we can see that the divergence of the dot path from the reference line is due to about five subjects along the vertical and horizontal axes, in excess of the chance reference. The remaining dots create a diagonal roughly parallel to the reference line. This difference accounts for 46% of the possible length available. Perfect inter-rater agreement on the two outcomes would show up as dots exclusively along the vertical and horizontal axes, with a path of length one and a Kappa of one.

With that understanding, we turn to the paired comparisons between all outcomes found in figure 4. The comparison of each of the different outcomes with the others is obtained by consulting the row and column indexes. Within each plot, as before, the dots represent individual papers being rated, and trace out the observed λ path. The thinner line is the expected curve for random assignments with the given distribution of scores, numbers of raters (which varies by paper), and number of papers. The calculated p-value is given on each plot, where a low enough p-value (depending on the chosen alpha) rejects the hypothesis that the assigned scores are binomial random variables. The N is the total number of ratings (not subjects).
Figure 4. Peer Review Sample
The diagonal of the multi-chart display contains outcomes that are adjacent; for example, the top left is outcome 1 compared to outcome 2, which we saw earlier. Because these ratings are ordinal, we would expect less agreement between raters on outcomes that are neighbors, like 1 and 2, than on more distant ones like 1 and 4. The p-values and Kappas show that this is true. More interestingly, we see that peer raters have an easier time finding agreement between the 1 and 2 ratings than they do with 3 and 4, which looks like what we would expect from random scoring. This sort of information could be very valuable in training raters, evaluating the rubric, and using scores for grading or other purposes where supposed meaning is important. The skewed proportions that are especially evident in outcome pair (1,3) are due to the relative paucity of outcome 1 ratings in the paired columns. We can see that Kappa is .56, but that the raters don't really have much opportunity to show real agreement, since outcome 1 appears so rarely in the distribution. The warped proportions are visual cues to the user of the report that the data are unbalanced, which is an important context for interpretation and analysis of the rating performance.
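For readers who want to reproduce this style of display, a single facet can be sketched in base R roughly as follows, given observed and reference paths as x-y data frames like those produced by the earlier sketches (the function and argument names are ours):

```r
# Plot one facet in the style of figure 3: the expected random reference curve
# as a thin line, the observed cumulative path as dots, and the p-value and
# kappa printed in the corner.
plot_pair <- function(obs, ref, p_value, kappa) {
  plot(ref$x, ref$y, type = "l",
       xlim = c(0, max(obs$x, ref$x)), ylim = c(0, max(obs$y, ref$y)),
       xlab = "outcome j1 (share of ratings)", ylab = "outcome j2 (share of ratings)")
  points(obs$x, obs$y, pch = 16, cex = 0.6)
  legend("bottomright", bty = "n",
         legend = c(sprintf("p = %.3f", p_value), sprintf("kappa = %.2f", kappa)))
}
```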
Figure 5. All Peer Reviews

Using the whole data set of peer reviews instead of sampling the first 100 subjects, we obtain figure 5. There are too many dots to be seen individually; now each subject appears as a continuation of the darker line in each graph facet. These are untrained raters who, with a large number of samples, show matching propensity barely above random for adjacent outcomes, but as the ordinal distance between outcomes increases, so does Kappa. The small p-values are due to the large number of ratings.

5. Formulas

The formal definitions of the statistics are not complex. The length of the vector arc for outcomes J_1 and J_2 derived from the ratings data table is given by a sum over the R rows, as
\lambda_{J_1 J_2} := \frac{1}{N} \sum_{i=1}^{R} \sqrt{ b_{iJ_1}^2 + b_{iJ_2}^2 }.
Note that for these calculations, the R rows only include those that have at least two ratings for comparison. This is likely to be smaller than the total number of
rows present in the whole data set. Similarly, N is the total number of ratings in the remaining rows. This actual length is compared to the mean length obtained under the assumption of randomness. The null hypothesis assumes that the conditional marginal column frequencies are the fixed proportions used in binomial sampling, and that the number of samples is the number of ratings in a given row, summing over the two columns of interest. For convenience we will write n_i := b_{i*} for these row sums, and define the column marginal frequencies p := b_{*J_1}/N and 1 − p. Then, for each row i = 1, ..., R we define

l_i := \sum_{j=0}^{n_i} \binom{n_i}{j} p^j (1-p)^{n_i - j} \sqrt{ j^2 + (n_i - j)^2 },

and calculate the mean path length under the null hypothesis by averaging over all ratings with
\mu_{\lambda_0} = \frac{1}{N} \sum_{i=1}^{R} l_i.
Calculating the variance is equally straightforward, with

\sigma_{\lambda_0}^2 = \frac{1}{N^2} \sum_{i=1}^{R} \left[ \sum_{j=0}^{n_i} \binom{n_i}{j} p^j (1-p)^{n_i - j} \left( j^2 + (n_i - j)^2 \right) - l_i^2 \right].
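These formulas translate directly into R. The following sketch (ours, not the supplementary code) computes the observed λ, its null mean and variance, and a normal-approximation p-value for a small two-column conditional table:

```r
# Observed path length lambda for a two-column conditional count table.
lambda_obs <- function(sub) {
  N <- sum(sub)
  sum(sqrt(sub[, 1]^2 + sub[, 2]^2)) / N
}

# Null mean and variance of the path length: each row's column-1 count is
# treated as Binomial(n_i, p), with p the conditional column frequency.
lambda_null <- function(sub) {
  N  <- sum(sub)
  p  <- sum(sub[, 1]) / N
  ni <- rowSums(sub)
  li <- sapply(ni, function(n) {              # expected row-vector lengths l_i
    j <- 0:n
    sum(dbinom(j, n, p) * sqrt(j^2 + (n - j)^2))
  })
  second <- sapply(ni, function(n) {          # expected squared row-vector lengths
    j <- 0:n
    sum(dbinom(j, n, p) * (j^2 + (n - j)^2))
  })
  list(mean = sum(li) / N, var = sum(second - li^2) / N^2)
}

# One-sided test of agreement in excess of chance, by normal approximation.
sub  <- rbind(c(0, 2), c(1, 1), c(1, 1), c(2, 0), c(0, 3), c(3, 0))
null <- lambda_null(sub)
pval <- pnorm(lambda_obs(sub), mean = null$mean, sd = sqrt(null$var),
              lower.tail = FALSE)
```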
For a comparison to the corresponding statistics for Cohen's Kappa, see [6]. Using these statistics we can generate p-values in the usual way to test whether the observed path length λ exceeds its null mean μ_{λ_0}. For large enough N, the central limit theorem allows us to approximate the distribution of random lengths by a normal distribution.

Because the random variables that comprise the random baseline (null hypothesis) case are simple binomials, we can easily simulate data to compare to the calculated values. This was done as a check on the code that produced the diagrams. The histogram in figure 6 shows the frequency of 10,000 simulated values for outcomes 3 and 4 from figure 4, and the solid line traces the values calculated from the formulas given above for the mean and variance of the length. We can see that there is good agreement between them. Either method can be used to find p-values, although simulation can be quite slow, and it is most appropriate when the number of ratings is small. For code in R and data to reproduce the figures, contact the author.

6. Lambda's Connection to Match Rates

Visually, the meaning of the λ statistic is intuitive as the length of a path. We will refer to it as an asymptotic match propensity averaged over all ratings, to distinguish it from match probability, which is used almost universally in the literature. It is clear that the maximum length of λ is one (the inverted L shape), and that this corresponds to perfect matching. With the Fleiss Kappa, one can have a match rate of zero, whereas the minimum λ is the length of the diagonal, \sqrt{p_1^2 + p_2^2}, where (p_1, p_2) are the column proportions of the two outcomes under consideration. When squared, this quantity is the base match rate considered random for Fleiss. However, Fleiss compares this baseline probability to the actual combinatorial match rates within each row (i.e. ratings on a single subject), using a sum of terms \binom{k}{2}/\binom{n}{2} over the outcomes, for n raters in the row, k of whom assigned the given outcome.
Figure 6. Rubric Score Analysis

There are some undesirable small-n effects of this. With two raters, a row (1, 1) counts as zero matches, whereas using the same method with four rater responses of (2, 2) gives a match rate of (1 + 1)/6 = 1/3. Assuming that the column proportions are (.5, .5), both of these represent the worst possible match rate. In terms of the match propensity, both are accumulating outcome 1 and outcome 2 at the same rate, which would be drawn as a diagonal from (0, 0) toward (.5, .5). Asymptotically (and graphically), two cases of (1, 1) would be counted the same as one case of (2, 2). Thus, the match propensity is more self-consistent than combinatorial matches. Rather than a worst case of zero matches, as with Fleiss, the lambda match propensity reaches its worst case when the accumulation of the outcomes exactly matches the column marginal rates (p_1, p_2). We saw in figure 2 that as the number of raters increases, the binomial distribution approaches this worst-case rate, meaning that there is more room to show actual match propensity for tests with larger numbers of raters.

The asymptotic match rates obtained by squaring the lengths of the λ segments (or equivalently, by squaring and summing the two components) could be averaged instead of the unsquared lengths as we have done here. Using match rates instead of what we call match propensities has been the traditional approach, but it comes with baggage. Conger in [4] wrote that "agreement among raters can actually be considered to be an arbitrary choice along a continuum ranging from agreement
for a pair of raters to agreement among all raters." This topic is also addressed in [1], and a survey of problems with Kappa is given in [11]. Since the match propensity calculation does not need to decide how many raters must agree, that problem goes away. Additionally, an asymptotic approach is more reasonable if we are concerned with the generalizability of the results.

The match rate calculation per Fleiss has another philosophical problem. Because matches are, by definition, counted within a single column before being summed, it does not matter what row they are in. For example, consider a data set with two rows (1, 1, 4, 1) and (0, 2, 5, 0), where seven raters have been employed to assess two subjects on four outcomes. The count of exact matches for the first row is six, and for the second row eleven, for a total of 17 out of the 2 · 21 = 42 rater pairs available. Mathematically, the numbers in the two rows can be scrambled and still give the same result. For example, we could swap the last two entries of the rows to get (1, 1, 5, 0) and (0, 2, 4, 1) and the resulting Kappa would be unchanged. With Fleiss, these permutations are constrained by the requirement of a constant number of raters, although as long as we divide everything by the number of raters-choose-two at the end, we could actually put the numbers anywhere we want in the rows. The issue is more obvious when that requirement is lifted, in the asymptotic case: if we compute the length of each row and square and add these, it is the same as squaring each of the normalized elements in the table and summing them, so their order makes no difference. Philosophically, the idea of matching should be tied to the specific characteristics of the row where the ratings are found, and λ preserves that association, whereas the match probability (exact or asymptotic) generally does not. This seems to ignore important information about rater agreement. By contrast, the square root in the propensity formula algebraically binds all the information about a subject to its row statistic. Although we only use a binomial treatment for the sake of making understandable graphs, a multinomial single propensity index over all outcomes would similarly use all the information about a subject.

Others have computed rater agreement by using pairs of columns as we have done here. Fleiss formulates a single-column statistic in [5], and Roberts gives a formula for pairs of outcomes in [12]. In [8] one can find a chapter on the topic of conditional rater agreement. The approach here is different from those, however. In chapter 3 of [8] there is a description of Kappa with the data conceived as vectors, but the vector lengths are not used, only their squares, as a calculation of rater agreement.

It should be noted that it is an intentional choice to leave out rows with fewer than two observations in the outcome pair, meaning rows of (0,0), (0,1), or (1,0). It is not obvious that these should be deleted, and the analysis could incorporate them. Philosophically, a (0,0) row can be interpreted as a sign of inter-rater agreement, since both raters agreed that neither of the outcomes was appropriate for the given subject. Since the length of the vector is zero, including it only serves to increase the denominator. However, this leads to more problems, since we would then need to incorporate all the ratings that went to other outcomes than the pair under study. This seems to over-complicate the simple comparison of one outcome to another.
The (0,1) and (1,0) rows could easily be included, but since these are purely horizontal or vertical vectors, they would count as evidence of inter-rater agreement even though a single rating cannot demonstrate agreement; this would inflate the match propensity, and it seems more appropriate to omit them.
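As a concrete check on the small-n discrepancy discussed in this section, a few lines of R (hypothetical helper names of ours) compare the Fleiss-style within-row match rate with the per-rating vector-length contribution for a (1, 1) row and a (2, 2) row:

```r
# Fleiss-style match rate for one row: sum over outcomes of choose(k, 2),
# divided by choose(n, 2), where n is the number of raters in the row.
row_match_rate <- function(row) sum(choose(row, 2)) / choose(sum(row), 2)

# Per-rating vector-length contribution ("match propensity") for one row.
row_propensity <- function(row) sqrt(sum(row^2)) / sum(row)

rbind("(1,1)" = c(match_rate = row_match_rate(c(1, 1)),
                  propensity = row_propensity(c(1, 1))),
      "(2,2)" = c(match_rate = row_match_rate(c(2, 2)),
                  propensity = row_propensity(c(2, 2))))
# The match rate jumps from 0 to 1/3, while the propensity is 1/sqrt(2) either way.
```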
One limitation of the match propensity that remains is that we are agnostic as to the characteristics of the raters. They could all be the same person or machine, or completely heterogeneous. The Cohen Kappa used rater characteristics as an essential (and controversial) element of the statistic, and in some cases it may be useful to modify the scheme above to incorporate rater identities and associated marginal rating distributions.

References

1. Alan Agresti, A model for agreement between ratings on an ordinal scale, Biometrics (1988), 539–548.
2. Alan Agresti and Maria Kateri, Categorical data analysis, Springer, 2011.
3. Jacob Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960), no. 1, 37–46.
4. Anthony J Conger, Integration and generalization of kappas for multiple raters, Psychological Bulletin 88 (1980), no. 2, 322.
5. Joseph L Fleiss, Measuring nominal scale agreement among many raters, Psychological Bulletin 76 (1971), no. 5, 378.
6. Joseph L Fleiss, Jacob Cohen, and BS Everitt, Large sample standard errors of kappa and weighted kappa, Psychological Bulletin 72 (1969), no. 5, 323.
7. Francis Galton, Finger prints, Macmillan and Company, 1892.
8. Kilem L Gwet, Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters, Advanced Analytics, LLC, 2014.
9. J M Moxley, Composition rubrics, website.
10. J M Moxley and D A Eubanks, On keeping score, Writing Program Administration (in press).
11. David MW Powers, The problem with kappa, Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2012, pp. 345–355.
12. Chris Roberts, Modelling patterns of agreement for nominal scales, Statistics in Medicine 27 (2008), no. 6, 810–830.
13. Nigel C Smeeton, Early history of the kappa statistic, 1985, p. 795.

Furman University
E-mail address:
[email protected]