3 Best Practices in Interrater Reliability: Three Common Approaches

Steven E. Stemler & Jessica Tsai

In: Jason Osborne (Ed.), Best Practices in Quantitative Methods. Thousand Oaks, CA: SAGE Publications, Inc., 2008 (print pages 29-50). Print ISBN: 9781412940658. Online ISBN: 9781412995627. DOI: http://dx.doi.org/10.4135/9781412995627.d5
The concept of interrater reliability permeates many facets of modern society. For example, court cases based on a trial by jury require unanimous agreement from jurors regarding the verdict, life–threatening medical diagnoses often require a second or third opinion from health care professionals, student essays written in the context of high–stakes standardized testing receive points based on the judgment of multiple readers, and Olympic competitions, such as figure skating, award medals to participants based on quantitative ratings of performance provided by an international panel of judges. Any time multiple judges are used to determine important outcomes, certain technical and procedural questions emerge. Some of the more common questions are as follows: How many raters do we need to be confident in our results? What is the minimum level of agreement that my raters should achieve? And is it necessary for raters to agree exactly, or is it acceptable for them to differ from each other so long as their difference is systematic and can therefore be corrected?
Key Questions to Ask Before Conducting an Interrater Reliability Study

If you are at the point in your research where you are considering conducting an interrater reliability study, then there are three important questions worth considering:

• What is the purpose of conducting your interrater reliability study?
• What is the nature of your data?
• What resources do you have at your disposal (e.g., technical expertise, time, money)?
The answers to these questions will help determine the best statistical approach to use for your study.
What Is the Purpose of Conducting Your Interrater Reliability Study?

There are three main reasons why people may wish to conduct an interrater reliability study. Perhaps the most popular reason is that the researcher is interested in getting a single final score on a variable (such as an essay grade) for use in subsequent data analysis and statistical modeling but first must prove that the scoring is not “subjective” or “biased.” For example, this is often the goal in the context of educational testing where large–scale state testing programs might use multiple raters to grade student essays for the ultimate purpose of providing an overall appraisal of each student's current level of academic achievement. In such cases, the documentation of interrater reliability is usually just a means to an end—the end of creating a single summary score for use in subsequent data analyses—and the researcher may have little inherent interest in the details of the interrater reliability analysis per se. This is a perfectly acceptable reason for wanting to conduct an interrater reliability study; however,
researchers must be particularly cautious about the assumptions they are making when summarizing the data from multiple raters to generate a single summary score for each student. For example, simply taking the mean of the ratings of two independent raters may, in some circumstances, actually lead to biased estimates of student ability, even when the scoring by independent raters is highly correlated (we return to this point later in the chapter). A second common reason for conducting an interrater reliability study is to evaluate a newly developed scoring rubric to see if it is “working” or if it needs to be modified. For example, one may wish to evaluate the accuracy of multiple ratings in the absence of a “gold standard.” Consider a situation in which independent judges must rate the creativity of a piece of artwork. Because there is no objective rule to indicate the “true” creativity of a piece of art, a minimum first step in establishing that there is such a thing as creativity is to demonstrate that independent raters can at least reliably classify objects according to how well they meet the assumptions of the construct. Thus, independent observers must subjectively interpret the work of art and rate the degree to which an underlying construct (e.g., creativity) is present. In situations such as these, the establishment of interrater reliability becomes a goal in and of itself. If a researcher is able to demonstrate that independent parties can reliably rate objects along the continuum of the construct, this provides some good objective evidence for the existence of the construct. A natural subsequent step is to analyze individual scores according to the criteria. Finally, a third reason for conducting an interrater reliability study is to validate how well ratings reflect a known “true” state of affairs (e.g., a validation study). For example, suppose that a researcher believes that he or she has developed a new colon cancer screening technique that should be highly predictive. The first thing the researcher might do is train another provider to use the technique and compare the extent to which the independent rater agrees with him or her on the classification of people who have cancer and those who do not. Next, the researcher might attempt to predict the prevalence of cancer using a formal diagnosis via more traditional methods (e.g., biopsy) to compare the extent to which the new technique is accurately predicting the diagnosis generated by the known technique. In other words, the reason for conducting an interrater reliability study in this circumstance is that it is not enough that independent raters have high levels of interrater reliability; what really matters is the
level of reliability in predicting the actual occurrence of cancer as compared with a “gold standard”—in this case, the rate of classification based on an established technique. Once you have determined the primary purpose for conducting an interrater reliability study, the next step is to consider the nature of the data that you have or will collect.
What Is the Nature of Your Data?

There are four important points to consider with regard to the nature of your data. First, it is important to know whether your data are considered nominal, ordinal, interval, or ratio (Stevens, 1946). Certain statistical techniques are better suited to certain types of data. For example, if the data you are evaluating are nominal (i.e., the differences between the categories you are rating are qualitative), then there are relatively few statistical methods for you to choose from (e.g., percent agreement, Cohen's kappa). If, on the other hand, the data are measured at the ratio level, then the data meet the criteria for use by most of the techniques discussed in this chapter. Second, once you have determined the type of data used for the rating scale, you should then examine the distribution of your data using a histogram or bar chart. Are the ratings of each rater normally distributed, uniformly distributed, or skewed? If the rating data exhibit restricted variability, this can severely affect consistency estimates as well as consensus–based estimates, threatening the validity of the interpretations made from the interrater reliability estimates. Thus, it is important to have some idea of the distribution of ratings in order to select the best statistical technique for analyzing the data. The third important thing to investigate is whether the judges who rated the data agreed on the underlying trait definition. For example, if two raters are judging the creativity of a piece of artwork, one rater may believe that creativity is 50% novelty and 50% task appropriateness. By contrast, another rater may judge creativity to consist of 50% novelty, 35% task appropriateness, and 15% elaboration. These differences in perception will introduce extraneous error into the ratings. The extent to which your raters are defining the construct in a similar way can be empirically evaluated using
measurement approaches to interrater reliability (e.g., factor analysis, a procedure that is further described later in this chapter). Finally, even if the raters agree as to the structure, do they assign people into the same category along the continuum, or does one judge assign a person “poor” in mathematics while another judge classifies that same person as “good”? In other words, are they using the rating categories the same way? This can be evaluated using consensus estimates (e.g., via tests of marginal homogeneity). After specifying the purpose of the study and thinking about the nature of the data that will be used in the analysis, the final question to ask is the pragmatic question of what resources you have at your disposal.
What Resources Do You Have at Your Disposal?

As most people know from their life experience, “best” does not always mean most expensive or most resource intensive. Similarly, within the context of interrater reliability, it is not always necessary to choose a technique that yields the maximum amount of information or that requires sophisticated statistical analyses in order to gain useful information. There are times when a crude estimate may yield sufficient information—for example, within the context of a low–stakes, exploratory research study. There are other times when the estimates must be as precise as possible—for example, within the context of situations that have direct, important stakes for the participants in the study. The question of resources often has an influence on the way that interrater reliability studies are conducted. For example, if you are a newcomer who is running a pilot study to determine whether to continue on a particular line of research, and time and money are limited, then a simpler technique such as the percent agreement, kappa, or even correlational estimates may be the best match. On the other hand, if you are in a situation where you have a high–stakes test that needs to be graded relatively quickly, and money is not a major issue, then a more advanced measurement approach (e.g., the many–facets Rasch model) is most likely the best selection.
As an additional example, if the goal of your study is to understand the underlying nature of a construct that to date has no objective, agreed–on definition (e.g., wisdom), then achieving consensus among raters in applying a scoring criterion will be of paramount importance. By contrast, if the goal of the study is to generate summary scores for individuals that will be used in later analyses, and it is not critical that raters come to exact agreement on how to use a rating scale, then consistency or measurement estimates of interrater reliability will be sufficient.
Summary

Once you have answered the three main questions discussed in this section, you will be in a much better position to choose a suitable technique for your project. In the next section of this chapter, we will discuss (a) the most popular statistics used to compute interrater reliability, (b) the computation and interpretation of the results of statistics using worked examples, (c) the implications for summarizing data that follow from each technique, and (d) the advantages and disadvantages of each technique.
Choosing the Best Approach for the Job

Many textbooks in the field of educational and psychological measurement and statistics (e.g., Anastasi & Urbina, 1997; Cohen, Cohen, West, & Aiken, 2003; Crocker & Algina, 1986; Hopkins, 1998; von Eye & Mun, 2004) describe interrater reliability as if it were a unitary concept lending itself to a single, “best” approach across all situations. Yet, the methodological literature related to interrater reliability constitutes a hodgepodge of statistical techniques, each of which provides a particular kind of solution to the problem of establishing interrater reliability.
Building on the work of Uebersax (2002) and J. R. Hayes and Hatch (1999), Stemler (2004) has argued that the wide variety of statistical techniques used for computing interrater reliability coefficients may be theoretically classified into one of three broad categories: (a) consensus estimates, (b) consistency estimates, and (c) measurement estimates. Statistics associated with these three categories differ in their assumptions about the purpose of the interrater reliability study, the nature of the data, and the implications for summarizing scores from various raters.
Consensus Estimates of Interrater Reliability

Consensus estimates are often used when one is attempting to demonstrate that a construct that traditionally has been considered highly subjective (e.g., creativity, wisdom, hate) can be reliably captured by independent raters. The assumption is that if independent raters are able to come to exact agreement about how to apply the various levels of a scoring rubric (which operationally defines behaviors associated with the construct), then this provides some defensible evidence for the existence of the construct. Furthermore, if two independent judges demonstrate high levels of agreement in their application of a scoring rubric to rate behaviors, then the two judges may be said to share a common interpretation of the construct. Consensus estimates tend to be the most useful when data are nominal in nature and different levels of the rating scale represent qualitatively different ideas. Consensus estimates also can be useful when different levels of the rating scale are assumed to represent a linear continuum of the construct but are ordinal in nature (e.g., a Likert–type scale). In such cases, the judges must come to exact agreement about each of the quantitative levels of the construct under investigation. The three most popular types of consensus estimates of interrater reliability found in the literature include (a) percent agreement and its variants, (b) Cohen's kappa and its variants (Agresti, 1996; Cohen, 1960, 1968; Krippendorff, 2004), and (c) odds ratios. Other less frequently used statistics that fall under this category include Jaccard's J and the G–Index (see Barrett, 2001).
Percent Agreement. Perhaps the most popular method for computing a consensus estimate of interrater reliability is through the use of the simple percent agreement statistic. For example, in a study examining creativity, Sternberg and Lubart (1995) asked sets of judges to rate the level of creativity associated with each of a number of products generated by study participants (e.g., draw a picture illustrating Earth from an insect's point of view, write an essay based on the title “2983”). The goal of their study was to demonstrate that creativity could be detected and objectively scored with high levels of agreement across independent judges. The authors reported percent agreement levels across raters of .92 (Sternberg & Lubart, 1995, p. 31). The percent agreement statistic has several advantages. For example, it has a strong intuitive appeal, it is easy to calculate, and it is easy to explain. The statistic also has some distinct disadvantages, however. If the behavior of interest has a low or high incidence of occurrence in the population, then it is possible to get artificially inflated percent agreement figures simply because most of the values fall under one category of the rating scale (J. R. Hayes & Hatch, 1999). Another disadvantage to using the simple percent agreement figure is that it is often time–consuming and labor–intensive to train judges to the point of exact agreement. One popular modification of the percent agreement figure found in the testing literature involves broadening the definition of agreement by including the adjacent scoring categories on the rating scale. For example, some testing programs include writing sections that are scored by judges using a rating scale with levels ranging from 1 (low) to 6 (high) (College Board, 2006). If a percent adjacent agreement approach were used to score this section of the exam, this would mean that the judges would not need to come to exact agreement about the ratings they assign to each participant; rather, so long as the ratings did not differ by more than one point above or below the other judge, then the two judges would be said to have reached consensus. Thus, if Rater A assigns an essay a score of 3 and Rater B assigns the same essay a score of 4, the two raters are close enough together to say that they “agree,” even though their agreement is not exact. The rationale for the adjacent percent agreement approach is often a pragmatic one. It is extremely difficult to train independent raters to come to exact agreement, no matter how good one's scoring rubric. Yet, raters often give scores that are “pretty close” to
the same, and we do not want to discard this information. Thus, the thinking is that if we have a situation in which two raters never differ by more than one score point in assigning their ratings, then we have a justification for taking the average score across all ratings. This logic holds under two conditions. First, the difference between raters must be randomly distributed across items. In other words, Rater A should not give systematically lower scores than Rater B. Second, the scores assigned by raters must be evenly distributed across all possible score categories. In other words, both raters should give equal numbers of 1s, 2s, 3s, 4s, 5s, and 6s across the population of essays that they have read. If both of these assumptions are met, then the adjacent percent agreement approach is defensible. If, however, either of these assumptions is violated, this could lead to a situation in which the validity of the resultant summary scores is dubious (see the box below). Consider a situation in which Rater A systematically assigns scores that are one point lower than Rater B. Assume that they have each rated a common set of 100 essays. If we average the scores of the two raters across all essays to arrive at individual student scores, this seems, on the surface, to be defensible because it really does not matter whether Rater A or Rater B is assigning the high or low score; averaged together, the two ratings yield the same score either way. However, suppose that dozens of raters are used to score the essays. Imagine that Rater C is also called in to rate the same essay for a different sample of students. Rater C is paired up with Rater B within the context of an overlapping design to maximize rater efficiency (e.g., McArdle, 1994). Suppose that we find a situation in which Rater B is systematically lower than Rater C in assigning grades. In other words, Rater A is systematically one point lower than Rater B, and Rater B is systematically one point lower than Rater C. On the surface, again, it seems logical to average the scores assigned by Rater B and Rater C. Yet, we now find ourselves in a situation in which the students rated by the Rater B/C pair score systematically one point higher than the students rated by the Rater A/B pair, even though neither combination of raters differed by more than one score point in their ratings, thereby demonstrating “interrater reliability.” Which student would you rather be? The one who was lucky enough to draw the B/C rater combination or the one who unfortunately was scored by the A/B combination?
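This chained-severity problem can be made concrete with a small simulation. The sketch below (in Python, with entirely hypothetical raters and essay scores, not data from any study cited here) assumes Rater A is a constant one point more severe than Rater B, and Rater B a constant one point more severe than Rater C.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "true" essay quality on a 1-6 rubric for two samples of 100 students.
true_ab = rng.integers(2, 6, size=100)   # students routed to the A/B rater pair
true_bc = rng.integers(2, 6, size=100)   # students routed to the B/C rater pair

# Rater A is one point more severe than Rater B; Rater B is one point more severe than C.
rater_a = np.clip(true_ab - 1, 1, 6)
rater_b_ab = true_ab
rater_b_bc = true_bc
rater_c = np.clip(true_bc + 1, 1, 6)

# Every pairing differs by exactly one point, so adjacent agreement looks perfect...
print("A/B adjacent agreement:", np.mean(np.abs(rater_a - rater_b_ab) <= 1))   # 1.0
print("B/C adjacent agreement:", np.mean(np.abs(rater_b_bc - rater_c) <= 1))   # 1.0

# ...yet the averaged summary scores for the two samples differ by roughly a full point.
print("Mean summary score, A/B sample:", np.mean((rater_a + rater_b_ab) / 2))
print("Mean summary score, B/C sample:", np.mean((rater_b_bc + rater_c) / 2))
```

Both pairings satisfy the adjacent agreement criterion, yet the students scored by the B/C pair come out about a point ahead of those scored by the A/B pair, which is exactly the validity problem described above.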
Thus, in order to make a validity argument for summarizing the results of multiple raters, it is not enough to demonstrate adjacent percent agreement between rater pairs; it must also be demonstrated that there is no systematic difference in rater severity between the rater set pairs. This can be demonstrated (and corrected for in the final score) through the use of the many–facet Rasch model. Now let us examine what happens if the second assumption of the adjacent percent agreement approach is violated. If you are a rater for a large testing company, and you are told that you will be retained only if you are able to demonstrate interrater reliability with everyone else, you would naturally look for your best strategy to maximize interrater reliability. If you are then told that your scores can differ by no more than one point from the other raters, you would quickly discover that your best bet then is to avoid giving any ratings at the extreme ends of the scale (i.e., a rating of 1 or a rating of 6). Why? Because a rating at the extreme end of the scale (e.g., 6) has two potential scores with which it can overlap (i.e., 5 or 6), whereas a rating of 5 would allow you to potentially “agree” with three scores (i.e., 4, 5, or 6), thereby maximizing your chances of agreeing with the second rater. Thus, it is entirely likely that the scale will go from being a 6–point scale to a 4–point scale, reducing the overall variability in scores given across the spectrum of participants. If only four categories are used, then the percent agreement statistics will be artificially inflated due to chance factors. For example, when a scale is 1 to 6, two raters are expected to agree on ratings by chance alone only 17% of the time. When the scale is reduced to 1 to 4, the percent agreement expected by chance jumps to 25%. With three categories, a 33% chance agreement is expected; with two categories, a 50% chance agreement is expected. In other words, a 6–point scale that uses adjacent percent agreement scoring is most likely functionally equivalent to a 4–point scale that uses exact agreement scoring. This approach is advantageous in that it relaxes the strict criterion that the judges agree exactly. On the other hand, percent agreement using adjacent categories can lead to inflated estimates of interrater reliability if there are only a limited number of categories to choose from (e.g., a 1–4 scale). If the rating scale has a limited number of points,
then nearly all points will be adjacent, and it would be surprising to find agreement lower than 90%.

Cohen's Kappa. Another popular consensus estimate of interrater reliability is Cohen's kappa statistic (Cohen, 1960, 1968). Cohen's kappa was designed to estimate the degree of consensus between two judges and determine whether the level of agreement is greater than would be expected to be observed by chance alone (see Stemler, 2001, for a practical example with calculation). The interpretation of the kappa statistic is slightly different from the interpretation of the percent agreement figure (Agresti, 1996). A value of zero on kappa does not indicate that the two judges did not agree at all; rather, it indicates that the two judges did not agree with each other any more than would be predicted by chance alone. Consequently, it is possible to have negative values of kappa if judges agree less often than chance would predict. Kappa is a highly useful statistic when one is concerned that the percent agreement statistic may be artificially inflated due to the fact that most observations fall into a single category. Kappa is often useful within the context of exploratory research. For example, Stemler and Bebell (1999) conducted a study aimed at detecting the various purposes of schooling articulated in school mission statements. Judges were given a scoring rubric that listed 10 possible thematic categories under which the main idea of each mission statement could be classified (e.g., social development, cognitive development, civic development). Judges then read a series of mission statements and attempted to classify each sampling unit according to the major purpose of schooling articulated. If both judges consistently rated the dominant theme of the mission statement as representing elements of citizenship, then they were said to have communicated with each other in a meaningful way because they had both classified the statement in the same way. If one judge classified the major theme as social development, and the other judge classified the major theme as citizenship, then a breakdown in shared understanding occurred. In that case, the judges were not coming to a consensus on how to apply the levels of the scoring rubric. The authors chose to use the kappa statistic to evaluate the degree of consensus because they did not expect the frequency of the major themes of the mission statements to be evenly distributed across the 10 categories of their scoring rubric.
Although some authors (Landis & Koch, 1977) have offered guidelines for interpreting kappa values, other authors (Krippendorff, 2004; Uebersax, 2002) have argued that the kappa values for different items or from different studies cannot be meaningfully compared unless the base rates are identical. Consequently, these authors suggest that although the statistic gives some indication as to whether the agreement is better than that predicted by chance alone, it is difficult to apply rules of thumb for interpreting kappa across different circumstances. Instead, Uebersax (2002) suggests that researchers using the kappa coefficient look at it for an up–or–down evaluation of whether ratings are different from chance, but they should not get too invested in its interpretation. Krippendorff (2004) has introduced a new coefficient, alpha, into the literature that is claimed to be superior to kappa because alpha is capable of incorporating the information from multiple raters, dealing with missing data, and yielding a chance–corrected estimate of interrater reliability. The major disadvantage of Krippendorff's alpha is that it is computationally complex; however, statistical macros that compute Krippendorff's alpha have been created and are freely available (K. Hayes, 2006). In addition, some research suggests that in practice, alpha values tend to be nearly identical to kappa values (Dooley, 2006).

Odds Ratios. A third consensus estimate of interrater reliability is the odds ratio. The odds ratio is most often used in circumstances where raters are making dichotomous ratings (e.g., presence/absence of a phenomenon), although it can be extended to ordered category ratings. In a 2 × 2 contingency table, the odds ratio indicates how much the odds of one rater making a given rating (e.g., positive/negative) increase for cases when the other rater has made the same rating. For example, suppose that in a music competition with 100 contestants, Rater 1 gives 90 of them a positive score for vocal ability, while in the same sample of 100 contestants, Rater 2 only gives 20 of them a positive score for vocal ability. The odds of Rater 1 giving a positive vocal ability score are 90 to 10, or 9:1, while the odds of Rater 2 giving a positive vocal ability score are only 20 to 80, or 1:4 = 0.25:1. Now, 9/0.25 = 36, so the odds ratio is 36. Within the context of interrater reliability, the important idea captured by the odds ratio is whether it deviates substantially from 1.0. It would be most desirable to have an odds ratio that is close to 1.0, which would indicate that Rater 1 and Rater 2 rated the same proportion of contestants as having high vocal ability. The
larger the odds ratio value, the larger the discrepancy there is between raters in terms of their level of consensus. The odds ratio has the advantage of being easy to compute and is familiar from other statistical applications (e.g., logistic regression). The disadvantage to the odds ratio is that it is most intuitive within the context of a 2 × 2 contingency table with dichotomous rating categories. Although the technique can be generalized to ordered category ratings, it involves extra computational complexity that undermines its intuitive advantage. Furthermore, as Osborne (2006) has pointed out, although the odds ratio is straightforward to compute, the interpretation of the statistic is not always easy to convey, particularly to a lay audience.
Computing Common Consensus Estimates of Interrater Reliability

Let us now turn to a practical example of how to calculate each of these coefficients. As an example data set, we will draw from Stemler, Grigorenko, Jarvin, and Sternberg's (2006) study in which they developed augmented versions of the Advanced Placement Psychology Examination. Participants were required to complete a number of essay items that were subsequently scored by different sets of raters. Essay Question 1, Part d was a question that asked participants to give advice to a friend who is having trouble sleeping, based on what they know about various theories of sleep. The item was scored using a 5–point scoring rubric. For this particular item, 75 participants received scores from two independent raters.

Percent Agreement. Percent agreement is calculated by adding up the number of cases that received the same rating by both judges and dividing that number by the total number of cases rated by the two judges. Using SPSS, one can run the crosstabs procedure and generate a table to facilitate the calculation (see Table 3.1). The percent agreement on this item is 42%; however, the percent adjacent agreement is 87%.

Cohen's Kappa. The formula for computing Cohen's kappa is listed in Formula 1.
Formula 1: κ = (P_A − P_C) / (1 − P_C),

where P_A = proportion of units on which the raters agree, and P_C = the proportion of units for which agreement is expected by chance. It is possible to compute Cohen's kappa in SPSS by simply specifying in the crosstabs procedure the desire to produce Cohen's kappa (see Table 3.1). For this data set, the kappa value is .23, which indicates that the two raters agreed on the scoring only slightly more often than we would predict based on chance alone.

Table 3.1 SPSS Code and Output for Percent Agreement and Percent Adjacent Agreement and Cohen's Kappa
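The chapter obtains these values from the SPSS crosstabs procedure (Table 3.1). For readers working outside SPSS, the following Python sketch computes percent exact agreement, percent adjacent agreement, and Cohen's kappa; the ratings shown are hypothetical stand-ins, since the 75-case data set itself is not reproduced in the text, so the printed values will not match the .42, .87, and .23 reported above.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings from two judges on a 0-4 rubric (illustrative only).
rater1 = np.array([0, 1, 2, 2, 3, 4, 1, 2, 3, 0, 2, 4])
rater2 = np.array([0, 2, 2, 3, 3, 3, 1, 1, 4, 1, 2, 4])

exact = np.mean(rater1 == rater2)                  # percent exact agreement
adjacent = np.mean(np.abs(rater1 - rater2) <= 1)   # percent adjacent agreement

# Cohen's kappa via scikit-learn, and by hand following Formula 1.
kappa = cohen_kappa_score(rater1, rater2)
categories = np.union1d(rater1, rater2)
p1 = np.array([np.mean(rater1 == c) for c in categories])   # marginal proportions, rater 1
p2 = np.array([np.mean(rater2 == c) for c in categories])   # marginal proportions, rater 2
p_chance = np.sum(p1 * p2)                                   # P_C: agreement expected by chance
kappa_by_hand = (exact - p_chance) / (1 - p_chance)          # (P_A - P_C) / (1 - P_C)

print(f"Exact agreement: {exact:.2f}, adjacent agreement: {adjacent:.2f}")
print(f"Kappa: {kappa:.2f} (by hand: {kappa_by_hand:.2f})")
```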
Odds Ratios. The formula for computing an odds ratio is shown in Formula 2.

Formula 2: odds ratio = [p_1 / (1 − p_1)] / [p_2 / (1 − p_2)],

where p_1 is the proportion of cases given a positive rating by Rater 1 and p_2 is the proportion of cases given a positive rating by Rater 2.
The SPSS code for computing the odds ratio is shown in Table 3.2. In order to compute the odds ratio using the crosstabs procedure in SPSS, it was necessary to recode the data so that the ratings were dichotomous. Consequently, ratings of 0, 1, and 2 were assigned a value of 0 (failing) while ratings of 3 and 4 were assigned a value of 1 (passing). The odds ratio for the current data set is 30, indicating that there was a substantial difference between the raters in terms of the proportion of students classified as passing versus failing.
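A hand calculation of the same kind of odds ratio, following the dichotomization just described and the earlier music-competition example, might look like the sketch below; the 0-4 ratings shown are hypothetical, and the marginal-odds formulation mirrors the chapter's worked examples.

```python
import numpy as np

# Dichotomize a 0-4 rubric as described above: 0-2 = failing (0), 3-4 = passing (1).
rater1_pass = np.array([0, 1, 2, 2, 3, 4, 1, 2, 3, 0, 2, 4]) >= 3
rater2_pass = np.array([0, 2, 2, 3, 3, 3, 1, 1, 4, 1, 2, 4]) >= 3

# Each rater's odds of assigning a passing score, and the ratio of those odds.
odds1 = rater1_pass.mean() / (1 - rater1_pass.mean())
odds2 = rater2_pass.mean() / (1 - rater2_pass.mean())
print("Odds ratio:", odds1 / odds2)

# The music-competition example from the text: 90 of 100 positive (odds 9:1) versus
# 20 of 100 positive (odds 0.25:1), giving an odds ratio of 9 / 0.25 = 36.
print("Music example:", (90 / 10) / (20 / 80))
```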
Implications for Summarizing Scores From Various Raters

If raters can be trained to the point where they agree on how to assign scores from a rubric, then scores given by the two raters may be treated as equivalent. This fact has practical implications for determining the number of raters needed to complete a study. Thus, the remaining work of rating subsequent items can be split between the raters without both raters having to score all items. Furthermore, the summary scores may be calculated by simply taking the score from one of the judges or by averaging the scores given by all of the judges, since high interrater reliability indicates that the judges agree about how to apply the rating scale. A typical guideline found in the literature for evaluating the quality of interrater reliability based on consensus estimates is that they should be 70% or greater. If raters are shown to reach high levels of consensus, then adding more raters adds little extra information from a statistical perspective and is probably not justified from the perspective of resources.
Table 3.2 SPSS Code and Output for Odds Ratios
Advantages of Consensus Estimates

One particular advantage of the consensus approach to estimating interrater reliability is that the calculations are easily done by hand.
A second advantage is that the techniques falling within this general category are well suited to dealing with nominal variables whose levels on the rating scale represent qualitatively different categories. A third advantage is that consensus estimates can be useful in diagnosing problems with judges’ interpretations of how to apply the rating scale. For example, inspection of the information from a crosstab table may allow the researcher to realize that the judges may be unclear about the rules for when they are supposed to score an item as zero as opposed to when they are supposed to score the item as missing. A visual analysis of the output allows the researcher to go back to the data and clarify the discrepancy or retrain the judges. When judges exhibit a high level of consensus, it implies that both judges are essentially providing the same information. One implication of a high consensus estimate of interrater reliability is that both judges need not score all remaining items. For example, if there were 100 tests to be scored after the interrater reliability study was finished, it would be most efficient to ask Judge A to rate exams 1 to 50 and Judge B to rate exams 51 to 100 because the two judges have empirically demonstrated that they share a similar meaning for the scoring rubric. In practice, however, it is usually a good idea to build in a 30% overlap between judges even after they have been trained, in order to provide evidence that the judges are not drifting from their consensus as they read more items.
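One way to set up such a design is sketched below; it assumes 100 exams, two trained judges, and an interpretation of “30% overlap” as 30% of the exams being double-scored, all of which are illustrative choices rather than prescriptions from the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical assignment plan: 100 exams, 30 of them double-scored to monitor drift.
exam_ids = np.arange(1, 101)
overlap = rng.choice(exam_ids, size=30, replace=False)     # exams scored by both judges
remaining = np.setdiff1d(exam_ids, overlap)                # 70 exams scored once

judge_a = np.sort(np.concatenate([overlap, remaining[:35]]))   # 65 exams for Judge A
judge_b = np.sort(np.concatenate([overlap, remaining[35:]]))   # 65 exams for Judge B

shared = np.intersect1d(judge_a, judge_b)
print(len(judge_a), len(judge_b), len(shared))   # 65 65 30
```

The double-scored exams can then be checked periodically with the same consensus statistics used in the original training study.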
Disadvantages of Consensus Estimates

One disadvantage of consensus estimates is that interrater reliability statistics must be computed separately for each item and for each pair of judges. Consequently, when reporting consensus–based interrater reliability estimates, one should report the minimum, maximum, and median estimates for all items and for all pairs of judges. A second disadvantage is that the amount of time and energy it takes to train judges to come to exact agreement is often substantial, particularly in applications where exact agreement is unnecessary (e.g., if the exact application of the levels of the scoring rubric is not important, but rather a means to the end of getting a summary score for each respondent).
Third, as Linacre (2002) has noted, training judges to a point of forced consensus may actually reduce the statistical independence of the ratings and threaten the validity of the resulting scores. Finally, consensus estimates can be overly conservative if two judges exhibit systematic differences in the way that they use the scoring rubric but simply cannot be trained to come to a consensus. As we will see in the next section, it is possible to have a low consensus estimate of interrater reliability while having a high consistency estimate and vice versa. Consequently, sole reliance on consensus estimates of interrater reliability might lead researchers to conclude that “interrater reliability is low” when it may be more precisely stated that the consensus estimate of interrater reliability is low.
Consistency Estimates of Interrater Reliability

Consistency estimates of interrater reliability are based on the assumption that it is not really necessary for raters to share a common interpretation of the rating scale, so long as each judge is consistent in classifying the phenomenon according to his or her own definition of the scale. For example, if Rater A assigns a score of 3 to a certain group of essays, and Rater B assigns a score of 1 to that same group of essays, the two raters have not come to a consensus about how to apply the rating scale categories, but the difference in how they apply the rating scale categories is predictable. Consistency approaches to estimating interrater reliability are most useful when the data are continuous in nature, although the technique can be applied to categorical data if the rating scale categories are thought to represent an underlying continuum along a unidimensional construct. Values greater than .70 are typically acceptable for consistency estimates of interrater reliability (Barrett, 2001). The three most popular types of consistency estimates are (a) correlation coefficients (e.g., Pearson, Spearman), (b) Cronbach's alpha (Cronbach, 1951), and (c) intraclass correlation. For information regarding additional consistency estimates of interrater
reliability, see Bock, Brennan, and Muraki (2002); Burke and Dunlap (2002); LeBreton, Burgess, Kaiser, Atchley, and James (2003); and Uebersax (2002).

Correlation Coefficients. Perhaps the most popular statistic for calculating the degree of consistency between raters is the Pearson correlation coefficient. Correlation coefficients measure the association between independent raters. Values approaching +1 or –1 indicate that the two raters are following a systematic pattern in their ratings, while values approaching zero indicate that it is nearly impossible to predict the score one rater would give by knowing the score the other rater gave. It is important to note that even though the correlation between scores assigned by two judges may be nearly perfect, there may be substantial mean differences between the raters. In other words, two raters may differ in the absolute values they assign to each rating by two points; however, so long as there is a 2–point difference for each rating they assign, the raters will have achieved high consistency estimates of interrater reliability. Thus, a large value for a measure of association does not imply that the raters are agreeing on the actual application of the rating scale, only that they are consistent in applying the ratings according to their own unique understanding of the scoring rubric. The Pearson correlation coefficient can be computed by hand (Glass & Hopkins, 1996) or can easily be computed using most statistical packages. One beneficial feature of the Pearson correlation coefficient is that the scores on the rating scale can be continuous in nature (e.g., they can take on partial values such as 1.5). Like the percent agreement statistic, the Pearson correlation coefficient can be calculated only for one pair of judges at a time and for one item at a time. A potential limitation of the Pearson correlation coefficient is that it assumes that the data underlying the rating scale are normally distributed. Consequently, if the data from the rating scale tend to be skewed toward one end of the distribution, this will attenuate the upper limit of the correlation coefficient that can be observed. The Spearman rank coefficient provides an approximation of the Pearson correlation coefficient but may be used in circumstances where the data under investigation are not normally distributed. For example, rather than using a continuous rating scale, each judge may rank order the essays that he or she has scored from best to worst. In this case, then, since both ratings being correlated are in the form of rankings, a correlation coefficient can be computed that is governed by the number of pairs of ratings (Glass & Hopkins, 1996).
The major disadvantage to Spearman's rank coefficient is that it requires both judges to rate all cases.

Cronbach's Alpha. In situations where more than two raters are used, another approach to computing a consistency estimate of interrater reliability would be to compute Cronbach's alpha coefficient (Crocker & Algina, 1986). Cronbach's alpha coefficient is a measure of internal consistency reliability and is useful for understanding the extent to which the ratings from a group of judges hold together to measure a common dimension. If the Cronbach's alpha estimate among the judges is low, then this implies that the majority of the variance in the total composite score is really due to error variance and not true score variance (Crocker & Algina, 1986). The major advantage of using Cronbach's alpha comes from its capacity to yield a single consistency estimate of interrater reliability across multiple judges. The major disadvantage of the method is that each judge must give a rating on every case, or else the alpha will only be computed on a subset of the data. In other words, if just one rater fails to score a particular individual, that individual will be left out of the analysis. In addition, as Barrett (2001) has noted, “because of this ‘averaging’ of ratings, we reduce the variability of the judges’ ratings such that when we average all judges’ ratings, we effectively remove all the error variance for judges” (p. 7).

Intraclass Correlation. A third popular approach to estimating interrater reliability is through the use of the intraclass correlation coefficient. An interesting feature of the intraclass correlation coefficient is that it confounds two ways in which raters differ: (a) consensus (or bias—i.e., mean differences) and (b) consistency (or association). As a result, the value of the intraclass correlation coefficient will be decreased in situations where there is a low correlation between raters and in situations where there are large mean differences between raters. For this reason, the intraclass correlation may be considered a conservative estimate of interrater reliability. If the intraclass correlation coefficient is close to 1, then chances are good that this implies that excellent interrater reliability has been achieved. The major advantage of the intraclass correlation is its capacity to incorporate information from different types of rater reliability data. On the other hand, as Uebersax (2002) has noted, “If the goal is to give feedback to raters to improve future ratings,
one should distinguish between these two sources of disagreement” (p. 5). In addition, because the intraclass correlation represents the ratio of within–subject variance to between–subject variance on a rating scale, the results may not look the same if raters are rating a homogeneous subpopulation as opposed to the general population. Simply by restricting the between–subject variance, the intraclass correlation will be lowered. Therefore, it is important to pay special attention to the population being assessed and to understand that this can influence the value of the intraclass correlation coefficient (ICC). For this reason, ICCs are not directly comparable across populations. Finally, it is important to note that, like the Pearson correlation coefficient, the intraclass correlation coefficient will be attenuated if assumptions of normality in rating data are violated.
Computing Common Consistency Estimates of Interrater Reliability

Let us now turn to a practical example of how to calculate each of these coefficients. We will use the same data set and compute each estimate on the data.

Correlation Coefficients. The formula for computing the Pearson correlation coefficient is listed in Formula 3.

Formula 3: r = Σ(X − M_X)(Y − M_Y) / √[Σ(X − M_X)² · Σ(Y − M_Y)²],

where X and Y are the scores assigned by the two raters and M_X and M_Y are their respective mean ratings.
Using SPSS, one can run the correlate procedure and generate a table similar to Table 3.3. One may request both Pearson and Spearman correlation coefficients. The Pearson correlation coefficient on this data set is .76; the Spearman correlation coefficient is .74.
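The same two coefficients are available in scipy. The sketch below uses hypothetical ratings rather than the chapter's data set, with the second rater shifted upward by roughly one point to illustrate the point made earlier: a systematic offset leaves the consistency estimates high even though exact agreement is low.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical ratings from two judges; rater2 tends to score about one point higher.
rater1 = np.array([0, 1, 2, 2, 3, 4, 1, 2, 3, 0, 2, 4])
rater2 = np.array([1, 2, 2, 3, 4, 4, 2, 2, 4, 1, 3, 4])

r, _ = pearsonr(rater1, rater2)
rho, _ = spearmanr(rater1, rater2)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
print("Exact agreement:", np.mean(rater1 == rater2))   # low, despite high correlations
```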
Table 3.3 SPSS Code and Output for Pearson and Spearman Correlations

Cronbach's Alpha. The Cronbach's alpha value is calculated using Formula 4.

Formula 4: α = [N / (N − 1)] × [1 − (Σ σ²_Yi / σ²_X)],

where N is the number of components (raters), σ²_X is the variance of the observed total scores, and σ²_Yi is the variance of component i. In order to compute Cronbach's alpha using SPSS, one may simply specify in the reliability procedure the desire to produce Cronbach's alpha (see Table 3.4). For this example, the alpha value is .86.
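Formula 4 is simple enough to implement directly. The Python sketch below computes Cronbach's alpha across raters for a small hypothetical cases-by-raters matrix; it mirrors the formula itself rather than the chapter's SPSS output, so the printed value will not equal the .86 reported above.

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha following Formula 4; ratings is a cases x raters array."""
    ratings = np.asarray(ratings, dtype=float)
    n_raters = ratings.shape[1]
    component_vars = ratings.var(axis=0, ddof=1)   # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of the summed (total) scores
    return (n_raters / (n_raters - 1)) * (1 - component_vars.sum() / total_var)

# Hypothetical ratings from three judges on ten cases.
ratings = np.array([
    [0, 1, 1], [1, 1, 2], [2, 2, 2], [2, 3, 3], [3, 3, 4],
    [4, 4, 4], [1, 2, 1], [2, 2, 3], [3, 4, 4], [0, 0, 1],
])
print(round(cronbach_alpha(ratings), 2))
```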
Table 3.4 SPSS Code and Output for Cronbach's Alpha
Intraclass Correlation. Formula 5 presents the equation used to compute the intraclass correlation value.

Formula 5: ICC = σ²_b / (σ²_b + σ²_w),

where σ²_b is the between–subjects variance in the ratings, and σ²_w is the pooled within–subjects variance (i.e., the variance among the judges' ratings of the same subject). In order to compute the intraclass correlation, one may specify the procedure in SPSS using the code listed in Table 3.5. The intraclass correlation coefficient for this data set is .75.
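There are several ICC variants, and the exact model behind the chapter's SPSS output is not shown in the text, so the sketch below should be read as one common choice rather than a reproduction of that analysis: it computes the one-way random-effects, single-rater ICC from ANOVA mean squares for a hypothetical cases-by-raters matrix.

```python
import numpy as np

def icc_oneway(ratings):
    """One-way random-effects, single-rater ICC: (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    case_means = ratings.mean(axis=1)
    ms_between = k * np.sum((case_means - grand_mean) ** 2) / (n - 1)         # between cases
    ms_within = np.sum((ratings - case_means[:, None]) ** 2) / (n * (k - 1))  # within cases
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical ratings from two judges on ten cases.
ratings = np.array([[0, 1], [1, 2], [2, 2], [2, 3], [3, 4],
                    [4, 4], [1, 2], [2, 2], [3, 4], [0, 1]])
print(round(icc_oneway(ratings), 2))
```

Restricting the spread of the cases being rated shrinks the between-cases mean square, which is the mechanism behind the caution above about homogeneous subpopulations.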
Implications for Summarizing Scores From Various Raters

It is important to recognize that although consistency estimates may be high, the means and medians of the different judges may be very different. Thus, if one judge consistently gives scores that are 2 points lower on the rating scale than does a second judge, the scores will ultimately need to be corrected for this difference in judge severity if the final scores are to be summarized or subjected to further analyses.

Table 3.5 SPSS Code and Output for Intraclass Correlation
Advantages of Consistency Estimates

There are three major advantages to using consistency estimates of interrater reliability. First, the approach places less stringent demands on the judges in that they need not be trained to come to exact agreement with one another so long as each judge is consistent within his or her own definition of the rating scale (i.e., exhibits high intrarater reliability). It is sometimes the case that the exact application of the levels of the scoring rubric is not important in itself. Instead, the scoring rubric is a means to the end of creating scores for each participant that can be summarized in a meaningful way. If summarization is the goal, then what is most important is that each judge apply the rating scale consistently within his or her own definition of the rating scale, regardless of whether the two judges exhibit exact agreement. Consistency estimates allow for the detection of systematic differences between judges, which may then be adjusted statistically. For example, if Judge A consistently gives scores that are 2 points lower than Judge B does, then adding 2 extra points to the exams of all students who were scored by Judge A would provide an equitable adjustment to the raw scores. A second advantage of consistency estimates is that certain methods within this category (e.g., Cronbach's alpha) allow for an overall estimate of consistency among multiple judges. The third advantage is that consistency estimates readily handle continuous data.
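The statistical adjustment described in the first advantage amounts to estimating the constant gap between judges on the cases they both rated and adding it back to the more severe judge's scores. A minimal Python sketch, assuming hypothetical scores and a purely constant difference in severity:

```python
import numpy as np

# Hypothetical double-scored essays: Judge A runs about 2 points more severe than Judge B.
judge_a = np.array([1, 2, 2, 3, 1, 4, 2, 3])
judge_b = np.array([3, 4, 4, 5, 3, 6, 4, 5])

severity_gap = np.mean(judge_b - judge_a)    # estimated constant difference (2.0 here)

# Equate Judge A's single-scored essays to Judge B's scale by adding the estimated gap.
judge_a_only = np.array([2, 3, 1, 4])
adjusted_scores = judge_a_only + severity_gap
print(adjusted_scores)
```

As the next subsection points out, a constant shift corrects only mean severity; it does nothing for judges who also differ in the spread of scores they use.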
Disadvantages of Consistency Estimates

One disadvantage of consistency estimates is that if the construct under investigation has some objective meaning, then it may not be desirable for the two judges to “agree to disagree.” Instead, it may be important for the judges to come to an exact agreement on the scores that they are generating. A second disadvantage of consistency estimates is that judges may differ not only systematically in the raw scores they apply but also in the number of rating scale categories they use. In that case, a mean adjustment for a severe judge may provide a
partial solution, but the two judges may also differ on the variability in scores they give. Thus, a mean adjustment alone will not effectively correct for this difference. A third disadvantage of consistency estimates is that they are highly sensitive to the distribution of the observed data. In other words, if most of the ratings fall into one or two categories, the correlation coefficient will necessarily be deflated due to restricted variability. Consequently, a reliance on the consistency estimate alone may lead the researcher to falsely conclude that interrater reliability was poor without specifying more precisely that the consistency estimate of interrater reliability was poor and providing an appropriate rationale.
Measurement Estimates of Interrater Reliability

Measurement estimates are based on the assumption that one should use all of the information available from all judges (including discrepant ratings) when attempting to create a summary score for each respondent. In other words, each judge is seen as providing some unique information that is useful in generating a summary score for a person. As Linacre (2002) has noted, “It is the accumulation of information, not the ratings themselves, that is decisive” (p. 858). Consequently, under the measurement approach, it is not necessary for two judges to come to a consensus on how to apply a scoring rubric because differences in judge severity can be estimated and accounted for in the creation of each participant's final score. Measurement estimates are also useful in circumstances where multiple judges are providing ratings, and it is impossible for all judges to rate all items. They are best used when different levels of the rating scale are intended to represent different levels of an underlying unidimensional construct (e.g., mathematical competence). The two most popular types of measurement estimates are (a) factor analysis and (b) the many–facets Rasch model (Linacre, 1994; Linacre, Englehard, Tatem, & Myford, 1994; Myford & Cline, 2002) or log–linear models (von Eye & Mun, 2004).
Factor Analysis. One popular measurement estimate of interrater reliability is computed using factor analysis (Harman, 1967). Using this method, multiple judges may rate a set of participants. The judges’ scores are then subjected to a common factor analysis in order to determine the amount of shared variance in the ratings that could be accounted for by a single factor. The percentage of variance that is explainable by the first factor gives some indication of the extent to which the multiple judges are reaching agreement. If the shared variance is high (e.g., greater than 60%), then this gives some indication that the judges are rating a common construct. The technique can also be used to check the extent to which judges agree on the number of underlying dimensions in the data set. Once interrater reliability has been established in this way, each participant may then receive a single summary score corresponding to his or her loading on the first principal component underlying the set of ratings. This score can be computed automatically by most statistical packages. The advantage of this approach is that it assigns a summary score for each participant that is based only on the relevance of the strongest dimension underlying the data. The disadvantage to the approach is that it assumes that ratings are assigned without error by the judges.

Many–Facets Rasch Measurement and Log–Linear Models. A second measurement approach to estimating interrater reliability is through the use of the many–facets Rasch
model (Linacre, 1994). Recent advances in the field of measurement have led to an extension of the standard Rasch measurement model (Rasch, 1960/1980; Wright & Stone, 1979). This new, extended model, known as the many–facets Rasch model, allows judge severity to be derived using the same scale (i.e., the logit scale) as person ability and item difficulty. In other words, rather than simply assuming that a score of 3 from Judge A is equally difficult for a participant to achieve as a score of 3 from Judge B, the equivalence of the ratings between judges can be empirically determined. Thus, it could be the case that a score of 3 from Judge A is really closer to a score of 5 from Judge B (i.e., Judge A is a more severe rater). Using a many–facets analysis, each essay item or behavior that was rated can be directly compared.
In addition, the difficulty of each item, as well as the severity of all judges who rated the items, can also be directly compared. For example, if a history exam included five essay questions and each of the essay questions was rated by 3 judges (2 unique judges per item and 1 judge who scored all items), the facets approach would allow the researcher to directly compare the severity of a judge who rated only Item 1 with the severity of a judge who rated only Item 4. Each of the 11 judges (2 unique judges per item + 1 judge who rated all items = 5∗2 + 1 = 11) could be directly compared. The mathematical representation of the many–facets Rasch model is fully described in Linacre (1994). Finally, in addition to providing information that allows for the evaluation of the severity of each judge in relation to all other judges, the facets approach also allows one to evaluate the extent to which each of the individual judges is using the scoring rubric in a manner that is internally consistent (i.e., an estimate of intrarater reliability). In other words, even if judges differ in their interpretation of the rating scale, the fit statistics will indicate the extent to which a given judge is faithful to his or her own definition of the scale categories across items and people. The many–facets Rasch approach has several advantages. First, the technique puts rater severity on the same scale as item difficulty and person ability (i.e., the logit scale). Consequently, this feature allows for the computation of a single final summary score that is already corrected for rater severity. As Linacre (1994) has noted, this provides a distinct advantage over generalizability studies since the goal of a generalizability study is to determine the error variance associated with each judge's ratings, so that correction can be made to ratings awarded by a judge when he is the only one to rate an examinee. For this to be useful, examinees must be regarded as randomly sampled from some population of examinees which means that there is no way to correct an individual examinee's score for judge behavior, in a way which would be helpful to an examining board. This approach, however, was developed for use in contexts in which only estimates of population parameters are of interest to researchers. (p. 29)
Second, the item fit statistics provide some estimate of the degree to which each individual rater was applying the scoring rubric in an internally consistent manner. In other words, high fit-statistic values are an indication of rater drift over time. Third, the technique works with multiple raters and does not require all raters to evaluate all objects. In other words, the technique is well suited to overlapping research designs, which allows the researcher to use resources more efficiently. So long as there is sufficient connectedness in the data set (Engelhard, 1997), the severity of all raters can be evaluated relative to each other. The major disadvantage to the many–facets Rasch approach is that it is computationally intensive and therefore is best implemented using specialized statistical software (Linacre, 1988). In addition, this technique is best suited to data that are ordinal in nature.
Computing Common Measurement Estimates of Interrater Reliability
Measurement estimates of interrater reliability tend to be much more computationally complex than consensus or consistency estimates. Consequently, rather than present the detailed formulas for each technique in this section, we instead refer to some excellent sources that are devoted to fully expounding the detailed computations involved. This will allow us to focus on the interpretation of the results of each of these techniques.
Factor Analysis. The mathematical formulas for computing factor–analytic solutions are expounded in several excellent texts (e.g., Harman, 1967; Kline, 1998). When using factor analysis to estimate interrater reliability, the data set should be structured in such a way that each column in the data set corresponds to the score given by Rater X on Item Y to each object in the data set (objects each receive their own row). Thus, if five raters were to score three essays from 100 students, the data set should contain 15 columns (e.g., Rater1_Item1, Rater2_Item1, Rater1_Item2) and 100 rows. In this example, we would run a separate factor analysis for each essay item (e.g., a 100 × 5
data matrix). Table 3.6 shows the SPSS code and output for running the factor analysis procedure. There are two important pieces of information generated by the factor analysis. The first important piece of information is the value of the explained variance in the first factor. In the example output, the shared variance of the first factor is 76%, indicating that independent raters agree on the underlying nature of the construct being rated, which is also evidence of interrater reliability. In some cases, it may turn out that the variance in ratings is distributed over more than one factor. If that is the case, then this provides some evidence to suggest that the raters are not interpreting the underlying construct in the same manner (e.g., recall the example about creativity mentioned earlier in this chapter). The second important piece of information comes from the factor scores. Each object that has been rated will have a score on each underlying factor. Assuming that the first factor explains most of the variance, the score to be used in subsequent analyses should be the object's score on the primary factor.
Many–Facets Rasch Measurement. The mathematical formulas for computing results using the many–facets Rasch model may be found in Linacre (1994). In practice, the many–facets Rasch model is best implemented through the use of specialized software (Linacre, 1988). An example output of a many–facets Rasch analysis is listed in Table 3.7. The example output presented here is derived from the larger Stemler et al. (2006) data set. The key values to interpret within the context of the many–facets Rasch approach are rater severity measures and fit statistics. Rater severity indices are useful for estimating the extent to which systematic differences exist between raters with regard to their level of severity. For example, rater CL was the most severe rater, with an estimated severity measure of +0.89 logits. Consequently, students whose test items were scored by CL would be more likely to receive lower raw scores than students who had the same test item scored by any of the other raters used in this project. At the other extreme, rater AP was the most lenient rater, with a rater severity measure of –0.91 logits. Consequently, simply using raw scores would lead to biased estimates of student proficiency since student estimates would depend, to an important degree, on
which rater scored their essay. The facets program corrects for these differences and incorporates them into student ability estimates. If these differences were not taken into account when calculating student ability, students who had their exams scored by AP would be more likely to receive substantially higher raw scores than if the same item were rated by any of the other raters.
Table 3.6 SPSS Code and Output for Factor Analysis
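The SPSS syntax in Table 3.6 is one way to run this analysis. As a rough illustration of the same computation outside SPSS, the following sketch (in Python, using simulated ratings and invented variable names rather than the chapter's data) extracts the proportion of variance explained by the first component and a first-component score for each essay:

```python
import numpy as np

# Illustrative data: 100 essays (rows) rated by 5 raters (columns).
rng = np.random.default_rng(0)
true_quality = rng.normal(size=(100, 1))
ratings = true_quality + 0.5 * rng.normal(size=(100, 5))

# Correlation matrix among the raters and its eigen-decomposition.
R = np.corrcoef(ratings, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Proportion of variance explained by the first component:
# high values (e.g., greater than 60%) suggest the raters are rating a common construct.
share_first = eigenvalues[0] / eigenvalues.sum()
print(f"Variance explained by the first component: {share_first:.0%}")

# A summary score for each essay: its score on the first component.
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0, ddof=1)
first_component_scores = z @ eigenvectors[:, 0]
```

In SPSS or any other package, the analogous quantities are the percentage of variance attributed to the first factor and the saved factor scores.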
The results presented in Table 3.7 show a spread of about 1.8 logits in systematic differences in rater severity (from –0.91 to +0.89). Consequently, the assumption that all raters define the rating scale in the same way is not tenable, and differences in rater severity must be taken into account in order to come up with precise estimates of student ability.
In addition to providing information that allows us to evaluate the severity of each rater in relation to all other raters, the facets approach also allows us to evaluate the extent to which each of the individual raters is using the scoring rubric in a manner that is internally consistent (i.e., intrarater reliability). In other words, even if raters differ in their own definition of how they use the scale, the fit statistics will indicate the extent to which a given rater is faithful to his or her own definition of the scale categories across items and people. Rater fit statistics are presented in columns 5 and 6 of Table 3.7.
Table 3.7 Output for a Many–Facets Rasch Analysis
Fit statistics provide an empirical estimate of the extent to which the expected response patterns for each individual match the observed response patterns. These fit statistics are interpreted much the same way as item or person infit statistics are interpreted (Bond & Fox, 2001; Wright & Stone, 1979). An infit value of 1.4 indicates that there is 40% more variation in the data than predicted by the Rasch model. Conversely, an infit value of 0.5 indicates that there is 50% less variation in the data than predicted by the Rasch model. Infit mean squares that are greater than 1.3 indicate that there is more unpredictable variation in the raters' responses than we would expect
based on the model. Infit mean square values that are less than 0.7 indicate that there is less variation in the raters' responses than we would predict based on the model. Myford and Cline (2002) note that high infit values may suggest that ratings are noisy as a result of the raters' overuse of the extreme scale categories (i.e., the lowest and highest values on the rating scale), while low infit mean square indices may be a consequence of overuse of the middle scale categories (e.g., moderate response bias). The infit and outfit mean–square indices are unstandardized (the infit index is information–weighted, whereas the outfit index is unweighted); by contrast, the standardized infit and outfit indices are standardized toward a unit–normal distribution. These standardized indices are sensitive to sample size and, consequently, the accuracy of the standardization is data dependent. The expectation for the mean square index is 1.0; the range is 0 to infinity (Myford & Cline, 2002, p. 14). The results in Table 3.7 reveal that 6 of the 12 raters had infit mean–square indices that exceeded 1.3. Raters CL (infit of 3.4), JW (infit of 2.4), and AM (infit of 2.2) appear particularly problematic. Their high infit values suggest that these raters are not using the scoring rubrics in a consistent way. The table of misfitting ratings provided by the facets computer program output allowed for an investigation of the exact nature of the highly unexpected response patterns associated with each of these raters. The table of misfitting ratings provides information on discrepant ratings based on two criteria: (a) how the other raters scored the item and (b) the particular raters' typical level of severity in scoring items of similar difficulty.
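To make the arithmetic behind these indices concrete, the following sketch (not taken from the chapter; it assumes you already have each observation's model-expected rating and model variance, as reported by a Rasch program) computes the two mean-square statistics for a single rater:

```python
import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean-square fit statistics for one rater.

    observed: ratings the rater actually awarded
    expected: model-expected rating for each observation
    variance: model variance of each rating
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)

    squared_residuals = (observed - expected) ** 2
    infit = squared_residuals.sum() / variance.sum()   # information-weighted mean square
    outfit = np.mean(squared_residuals / variance)     # unweighted mean of squared standardized residuals
    return infit, outfit
```

Values near 1.0 indicate roughly the amount of variation the model predicts; an infit of 1.4, for example, corresponds to the 40% excess variation described above.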
Implications for Summarizing Scores From Various Raters
Measurement estimates allow for the creation of a summary score for each participant that represents that participant's score on the underlying factor of interest, taking into account the extent to which each judge influences the score.
Advantages of Measurement Estimates
There are several advantages to estimating interrater reliability using the measurement approach. First, measurement estimates can take into account errors at the level of each judge or for groups of judges. Consequently, the summary scores generated from measurement estimates of interrater reliability tend to more accurately represent the underlying construct of interest than do the simple raw score ratings from the judges. Second, measurement estimates effectively handle ratings from multiple judges by simultaneously computing estimates across all of the items that were rated, as opposed to calculating estimates separately for each item and each pair of judges. Third, measurement estimates have the distinct advantage of not requiring all judges to rate all items in order to arrive at an estimate of interrater reliability. Rather, judges may rate a particular subset of items, and as long as there is sufficient connectedness (Linacre, 1994; Linacre et al., 1994) across the judges and ratings, it will be possible to directly compare judges.
Disadvantages of Measurement Estimates
The major disadvantage of measurement estimates is that they are unwieldy to compute by hand. Unlike the percent agreement figure or correlation coefficient, measurement approaches typically require the use of specialized software to compute. A second disadvantage is that certain methods for computing measurement estimates (e.g., facets) can handle only ordinal–level data. Furthermore, the file structure required to use facets is somewhat counterintuitive.
Summary and Conclusion
In this chapter, we have attempted to outline a framework for thinking about interrater reliability as a multifaceted concept. Consequently, we believe that there is no silver bullet “best” approach for its computation. There are multiple techniques for computing interrater reliability, each with its own assumptions and implications. As Snow, Cook, Lin, Morgan, and Magaziner (2005) have noted, “Percent/proportion agreement is affected by chance; kappa and weighted kappa are affected by low prevalence of condition of interest; and correlations are affected by low variability, distribution shape, and mean shifts” (p. 1682). Yet each technique (and class of techniques) has its own strengths and weaknesses. Consensus estimates of interrater reliability (e.g., percent agreement, Cohen's kappa, odds ratios) are generally easy to compute and useful for diagnosing rater disparities; however, training raters to exact consensus requires substantial time and energy and may not be entirely necessary, depending on the goals of the study. Consistency estimates of interrater reliability (e.g., Pearson and Spearman correlations, Cronbach's alpha, and intraclass correlations) are familiar and fairly easy to compute. They have the additional advantage of not requiring raters to perfectly agree with each other but only require consistent application of a scoring rubric within raters—systematic variance between raters is easily tolerated. The disadvantage to consistency estimates, however, is that they are sensitive to the distribution of the data (the more it departs from normality, the more attenuated the results). Furthermore, even if one achieves high consistency estimates, further adjustment to an individual's raw scores may be required in order to arrive at an unbiased final score that may be used in subsequent data analyses. Measurement estimates of interrater reliability (e.g., factor analysis, many–facets Rasch measurement) can deal effectively with multiple raters, easily derive adjusted summary scores that are corrected for rater severity, and allow for highly efficient designs (e.g., not all raters need to rate all objects); however, this comes at the expense of added computational complexity and increased demands on resources (e.g., time and expertise).
In the end, the best technique will always depend on (a) the goals of the analysis (e.g., the stakes associated with the study outcomes), (b) the nature of the data, and (c) the desired level of information based on the resources available. The answers to these three questions will help to determine how many raters one needs, whether the raters need to be in perfect agreement with each other, and how to approach creating summary scores across raters. We conclude this chapter with a brief table that is intended to provide rough interpretive guidance with regard to acceptable interrater reliability values (see Table 3.8). These values simply represent conventions the authors have encountered in the literature and via discussions with colleagues and reviewers; however, keep in mind that these guidelines are just rough estimates and will vary depending on the purpose of the study and the stakes associated with the outcomes. The conventions articulated here assume that the interrater reliability study is part of a low–stakes, exploratory research study.
Table 3.8 General Guidelines for Interpreting Various Interrater Reliability Coefficients
Notes
1. Also known as interobserver or interjudge reliability or agreement.
2. Readers interested in this model can refer to Chapters 4 and 5 on Rasch measurement for more information.
References
Agresti, A. (1996). An introduction to categorical data analysis (2nd ed.). New York: John Wiley.
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Barrett, P. (2001, March). Assessing the reliability of rating data. Retrieved June 16, 2003, from http://www.liv.ac.uk/~pbarrett/rater.pdf
Bock, R., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26(4), 364-375.
Bond, T., & Fox, C. (2001). Applying the Rasch model. Mahwah, NJ: Lawrence Erlbaum.
Burke, M. J., & Dunlap, W. P. (2002). Estimating interrater agreement with the average deviation index: A user's guide. Organizational Research Methods, 5(2), 159-172.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
College Board. (2006). How the essay is scored. Retrieved November 4, 2006, from http://www.collegeboard.com/student/testing/sat/about/sat/essay_scoring.html
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. Orlando, FL: Harcourt Brace Jovanovich.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Dooley, K. (2006). Questionnaire Programming Language: Interrater reliability report. Retrieved November 4, 2006, from http://qpl.gao.gov/ca050404.htm
Engelhard, G. (1997). Constructing rater and task banks for performance assessment. Journal of Outcome Measurement, 1(1), 19-33.
Glass, G. V., & Hopkins, K. H. (1996). Statistical methods in education and psychology. Boston: Allyn & Bacon.
Harman, H. H. (1967). Modern factor analysis. Chicago: University of Chicago Press.
Hayes, J. R., & Hatch, J. A. (1999). Issues in measuring reliability: Correlation versus percentage of agreement. Written Communication, 16(3), 354-367.
Hayes, K. (2006). SPSS macro for computing Krippendorff's alpha. Retrieved from http://www.comm.ohio-state.edu/ahayes/SPSS%20programs/kalpha.htm
Hopkins, K. H. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston: Allyn & Bacon.
Kline, R. (1998). Principles and practice of structural equation modeling. New York: Guilford.
Krippendorff, K. (2004). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30(3), 411-433.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
LeBreton, J. M., Burgess, J. R., Kaiser, R. B., Atchley, E., & James, L. R. (2003). The restriction of variance hypothesis and interrater reliability and agreement: Are ratings from multiple sources really dissimilar? Organizational Research Methods, 6(1), 80-128.
Linacre, J. M. (1988). FACETS: A computer program for many-facet Rasch measurement (Version 3.3.0). Chicago: MESA Press.
Linacre, J. M. (1994). Many-facet Rasch measurement. Chicago: MESA Press.
Linacre, J. M. (2002). Judge ratings with forced agreement. Rasch Measurement Transactions, 16(1), 857-858.
Linacre, J. M., Englehard, G., Tatem, D. S., & Myford, C. M. (1994). Measurement with judges: Many-faceted conjoint measurement. International Journal of Educational Research, 21(4), 569-577.
McArdle, J. J. (1994). Structural factor analysis experiments with incomplete data. Multivariate Behavioral Research, 29(4), 409-454.
Myford, C. M., & Cline, F. (2002, April 1-5). Looking for patterns in disagreements: A facets analysis of human raters' and e-raters' scores on essays written for the Graduate Management Admission Test (GMAT). Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Osborne, J. W. (2006). Bringing balance and technical accuracy to reporting odds ratios and the results of logistic regression analyses. Practical Assessment, Research & Evaluation, 11(7). Retrieved from http://pareonline.net/getvn.asp?v=11&n=17
Rasch, G. (1980). Probabilistic models for some intelligence and attainment tests (Expanded ed.). Chicago: University of Chicago Press. (Original work published 1960)
Snow, A. L., Cook, K. F., Lin, P.-S., Morgan, R. O., & Magaziner, J. (2005). Proxies and other external raters: Methodological considerations. Health Services Research, 40(5), 1676-1693.
Stemler, S. E. (2001). An overview of content analysis. Practical Assessment, Research & Evaluation, 7(17). Retrieved from http://ericae.net/pare/getvn.asp?v=7&n=17
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). Retrieved from http://pareonline.net/getvn.asp?v=9&n=4
Stemler, S. E., & Bebell, D. (1999, April). An empirical approach to understanding and analyzing the mission statements of selected educational institutions. Paper presented at the New England Educational Research Organization (NEERO), Portsmouth, NH.
Stemler, S. E., Grigorenko, E. L., Jarvin, L., & Sternberg, R. J. (2006). Using the theory of successful intelligence as a basis for augmenting AP exams in psychology and statistics. Contemporary Educational Psychology, 31(2), 75-108.
Sternberg, R. J., & Lubart, T. I. (1995). Defying the crowd: Cultivating creativity in a culture of conformity. New York: Free Press.
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Uebersax, J. (2002). Statistical methods for rater agreement. Retrieved August 9, 2002, from http://ourworld.compuserve.com/homepages/jsuebersax/agree.htm
von Eye, A., & Mun, E. Y. (2004). Analyzing rater agreement: Manifest variable methods. Mahwah, NJ: Lawrence Erlbaum.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
http://dx.doi.org/10.4135/9781412995627.d5
Quantitative Methods for Estimating the Reliability of Qualitative Data
Jason W. Davey, Fenwal, Inc.
P. Cristian Gugiu, Western Michigan University
Chris L. S. Coryn, Western Michigan University
Background: Measurement is an indispensable aspect of conducting both quantitative and qualitative research and evaluation. With respect to qualitative research, measurement typically occurs during the coding process.
Purpose: This paper presents quantitative methods for determining the reliability of conclusions from qualitative data sources. Although some qualitative researchers disagree with such applications, a link between the qualitative and quantitative fields is successfully established through data collection and coding procedures.
Setting: Not applicable.
Intervention: Not applicable.
Research Design: Case study.
Data Collection and Analysis: Narrative data were collected from a random sample of 528 undergraduate students and 28 professors.
Findings: The calculation of the kappa statistic, weighted kappa statistic, ANOVA Binary Intraclass Correlation, and Kuder-Richardson 20 is illustrated through a fictitious example. Formulae are presented so that the researcher can calculate these estimators without the use of sophisticated statistical software.
Keywords: qualitative coding; qualitative methodology; reliability coefficients
Journal of MultiDisciplinary Evaluation, Volume 6, Number 13 ISSN 1556-8180 February 2010
The rejection of using quantitative methods for assessing the reliability of qualitative findings by some qualitative researchers is both frustrating and perplexing from the vantage point of quantitative and mixed method researchers. While the need to distinguish one methodological approach from another is understandable, the reasoning sometimes used to justify the wholesale rejection of all concepts associated with quantitative analysis is replete with mischaracterizations, overreaching arguments, and inadequate substitutions. One of the first lines of attack against the use of quantitative analysis centers on philosophical arguments. Healy and Perry (2000), for example, characterize qualitative methods as flexible, inductive, and multifaceted, whereas quantitative methods are often characterized as inflexible and fixed. Moreover, most qualitative researchers view quantitative methods as characteristic of a positivist paradigm (e.g., Stenbacka, 2001; Davis, 2008; Paley, 2008), a term that has come to take on a derogatory connotation. Paley (2008) states that “doing quantitative research entails commitment to a particular ontology and, specifically, to a belief in a single, objective reality that can be described by universal laws” (p. 649). However, quantitative analysis should not be synonymous with the positivist paradigm because statistical inference is concerned with probabilistic, as opposed to deterministic, conclusions. Nor do statisticians believe in a universal law measured free of error. Rather, statisticians believe that multiple truths may exist and that despite the best efforts of the researcher these truths are measured with some degree of error. If that were not the case, statisticians would ignore interaction effects, assume that measurement errors do not exist, and fail
to consider whether differences may exist between groups. Yet, most statisticians consider all these factors before formulating their conclusions. While statisticians may be faulted for paying too much attention to measures of central tendency (e.g., mean, median) at the expense of interesting outliers, this is not the same as believing in one all-inclusive truth. The distinction between objective research and subjective research also appears to emerge from this paradigm debate. Statisticians are portrayed as detached and neutral investigators while qualitative researchers are portrayed as embracing personal viewpoints and even biases to describe and interpret the subjective experience of the phenomena they study (Miller, 2008). While parts of these characterizations do, indeed, differentiate between the two groups of researchers, they fail to explain why a majority of qualitative researchers dismiss the use of statistical methods. After all, the formulas used to conduct such analyses do not know or care whether the data were gathered using an objective rather than a subjective method. Moreover, certain statistical methods lend themselves to, and were even specifically developed for, the analysis of qualitative data (e.g., reliability analysis). Other qualitative researchers have come to equate positivism, and by extension quantitative analysis, with causal explanations (Healy & Perry, 2000). To date, the gold standard for substantiating causal claims is through the use of a well-conducted experimental design. However, the implementation of an experimental design does not necessitate the use of quantitative analysis. Furthermore, quantitative analysis may be conducted for any type of research design, including
qualitative research, as is the central premise of this paper. For some qualitative researchers (e.g., Miller, 2008; Stenbacka, 2001), the wholesale rejection of all concepts perceived to be quantitative has extended to general research concepts like reliability and validity. According to Stenbacka (2001), “reliability has no relevance in qualitative research, where it is impossible to differentiate between researcher and method” (p. 552). From the perspective of quantitative research, this statement is inaccurate because several quantitative methods have been developed for differentiating between the researcher, data collection method, and informant (e.g., generalizability theory), provided, of course, data are available for two or more researchers and/or methods. Stenbacka (2001) also objected to traditional forms of validity because “the purpose in qualitative research never is to measure anything. A qualitative method seeks for a certain quality that is typical for a phenomenon or that makes the phenomenon different from others” (p. 551). It would seem to the present authors, however, that this notion is inconsistent with traditional qualitative research. Measurement is an indispensable aspect of conducting research, regardless of whether it is quantitative or qualitative. With respect to qualitative research, measurement occurs during the coding process. Illustrating the integral nature of coding in qualitative research, Benaquisto (2008) noted: The coding process refers to the steps the researcher takes to identify, arrange, and systematize the ideas, concepts, and categories uncovered in the data. Coding consists of identifying potentially interesting events, features, phrases, behaviors, or stages of a process and distinguishing them with labels. These are
then further differentiated or integrated so that they may be reworked into a smaller number of categories, relationships, and patterns so as to tell a story or communicate conclusions drawn from the data. (p. 85)
Clearly, in the absence of a coding process, researchers would be forced to provide readers with all of the data, which, in turn, would place the burden of interpretation on the reader. However, while the importance of coding to qualitative research is self-evident to all those who have conducted such research, the role of measurement may not be as obvious. In part, this may be attributed to a misunderstanding on the part of many researchers as to what measurement is. Measurement is the process of assigning numbers, symbols, or codes to phenomena (e.g., events, features, phrases, behaviors) based on a set of prescribed rules (i.e., a coding rubric). There is nothing inherently quantitative about this process or, at least, there does not need to be. Moreover, it does not limit qualitative research in any way. In fact, many times, measurement may only be performed in a qualitative context. For example, suppose that a researcher conducts an interview with an informant who states that “the bathrooms in the school are very dirty.” Now further suppose that the researcher developed a coding rubric, which, for the sake of simplicity, only contained two levels: cleanliness and academic performance. Clearly, the informant’s statement addressed the first level (cleanliness) and not the second. Whether the researcher chooses to assign this statement a checkmark for the cleanliness category or a 1, and an ‘X’ or 0 (zero) for the academic performance category, does not make a difference. The researcher clearly used his or her judgment to transform the raw
statement made by the informant into a code. However, when the researcher decided that the statement best represented cleanliness and not academic performance, he or she also performed a measurement process. Therefore, if one accepts this line of reasoning, qualitative research depends upon measurement to render judgments. Furthermore, three questions may be asked. First, does statement X fit the definition of code Y? Second, how many of the statements collected fit the definition of code Y? And third, how reliable is the definition of code Y for differentiating between statements within and across researchers (i.e., intrarater and interrater reliability, respectively)? Fortunately, not every qualitative researcher has accepted Stenbacka’s notion, in part, because qualitative researchers, like quantitative researchers, compete for funding and therefore must persuade funders of the accuracy of their methods and results (Cheek, 2008). Consequently, the concepts of reliability and validity permeate qualitative research. However, owing to the desire to differentiate their work from quantitative research, qualitative researchers have espoused the use of “interpretivist alternative” terms (Seale, 1999). Some of the most popular terms substituted for reliability include confirmability, credibility, dependability, and replicability (Coryn, 2007; Golafshani, 2003; Healy & Perry, 2000; Morse, Barrett, Mayan, Olson, & Spiers, 2002; Miller, 2008; Lincoln & Guba, 1985). In the qualitative tradition, confirmability is concerned with confirming that the researcher’s interpretations and conclusions are grounded in actual data that can be verified (Jensen, 2008; Given & Saumure, 2008). Researchers may address this
reliability indicator through the use of multiple coders, transparency, audit trails, and member checks. Credibility, on the other hand, is concerned with the research methodology and data sources used to establish a high degree of harmony between the raw data and the researcher’s interpretations and conclusions. Various means can be used to enhance credibility, including accurately and richly describing data, citing negative cases, using multiple researchers to review and critique the analysis and findings, and conducting member checks (Given & Saumure, 2008; Jensen, 2008; Saumure & Given, 2008). Dependability recognizes that the most appropriate research design cannot be completely predicted a priori. Consequently, researchers may need to alter their research design to meet the realities of the research context in which they conduct the study, as compared to the context they predicted to exist a priori (Jensen, 2008). Dependability can be addressed by providing a rich description of the research procedures and instruments used so that other researchers may be able to collect data in similar ways. The idea is that if a different set of researchers uses similar methods, then they should reach similar conclusions (Given & Saumure, 2008). Finally, replicability is concerned with repeating a study on participants from a similar background as the original study. Researchers may address this reliability indicator by conducting the new study on participants with similar demographic variables, asking similar questions, and coding data in a similar fashion to the original study (Firmin, 2008). Like qualitative researchers, quantitative researchers have developed numerous definitions of reliability, including interrater and intrarater
reliability, test-retest reliability, internal consistency, and intraclass correlations, to name a few (Crocker & Algina, 1986; Hopkins, 1998). A review of the qualitative alternative terms revealed them to be indirectly associated with quantitative notions of reliability. However, although replicability is conceptually equivalent to test-retest reliability, the other three terms appear to describe research processes tangentially related to reliability. Moreover, they have two major liabilities. First, they place the burden of assessing reliability squarely on the reader. For example, if a reader wanted to determine the confirmability of a finding they would need to review the audit trail and make an independent assessment. Similar reviews of the data would be necessary if a reviewer wanted to assess the credibility of a finding or dependability of a study design. Second, they fail to consider interrater reliability, which, in our experience, accounts for a considerable amount, if not a majority, of the variability in findings in qualitative studies. Interrater reliability is concerned with the degree to which different raters or coders appraise the same information (e.g., events, features, phrases, behaviors) in the same way (van den Hoonaard, 2008). In other words, do different raters interpret qualitative data in similar ways? The process of conducting an interrater reliability analysis, which is detailed in the next section, is relatively straightforward. Essentially, the only additional step beyond development and finalization of a coding rubric is that at least two raters must independently rate all of the qualitative data using the coding rubric. Although collaboration, in the form of consensus agreement, may be used to finalize ratings after each rater has had an opportunity to rate all data, each rater
must work independently of the other to reduce bias in the first phase of analysis. Often, this task is greatly facilitated by use of a database system that, for example, (1) displays the smallest codable unit of a transcript (e.g., a single sentence), (2) presents the available coding options, and (3) records the rater’s code before displaying the next codable unit. While it is likely that qualitative researchers who subscribe to a constructionist paradigm may object to the constraint of forcing qualitative researchers to use the same coding rubric for a study, rather than developing their own, this is an indispensable process for attaining a reasonable level of interrater reliability. An example of the perils of not attending to this issue may be found in an empirical study conducted by Armstrong, Gosling, Weinman, and Marteau (1997). Armstrong and his colleagues invited six experienced qualitative researchers from Britain and the United States to analyze a transcript (approximately 13,500 words long) from a focus group composed of adults living with cystic fibrosis that was convened to discuss the topic of genetic screening. In return for a fee, each researcher was asked to prepare an independent report in which they identified and described the main themes that emerged from the focus group discussion, up to a maximum of five. Beyond these instructions, each researcher was permitted to use any method for extracting the main themes they felt was appropriate. Once the reports were submitted, they were thematically analyzed by one of the authors, who deliberately abstained from reading the original transcript to reduce external bias. The results uncovered by Armstrong and his colleagues paint a troubling picture. On the surface, it was clear that a
reasonable level of consensus in the identification of themes was achieved. Five of the six researchers identified five themes, while one identified four themes. Consequently, only four themes are discussed in the article: visibility; ignorance; health service provision; and genetic screening. With respect to the presence of each theme, there was unanimous agreement for the visibility and genetic screening themes, while the agreement rates were slightly lower for the ignorance and health service provision themes (83% and 67%, respectively). Overall, these are good rates of agreement. However, a deeper examination of the findings revealed two troubling issues. First, a significant amount of disagreement existed with respect to how the themes were organized. Some researchers classified a theme as a basic structure whereas others organized it under a larger basic structure (i.e., gave it less importance than the overarching theme they assigned it to). Second, a significant amount of disagreement existed with respect to the manner in which themes were interpreted. For example, some of the researchers felt that the ignorance theme suggested a need for further education, other researchers raised concern about the eugenic threat, and the remainder thought it provided parents with choice. Similar inconsistencies with regard to interpretability occurred for the genetic screening theme where three researchers indicated that genetic screening provided parents with choice while one linked it with the eugenic threat. These results serve as an example of how “reality” is relative to the researcher doing the interpretation. However, they also demonstrate that judging the quality of a research finding requires knowledge of the degree to which consensus is reached
by knowledgeable researchers. Clearly, by this statement, we are assuming that reliability of findings across different researchers is a desirable quality. There certainly may be instances in which reliability is not important because one is only interested in the findings of a specific researcher, and the perspectives of others are not desired. That being the case, one may consider examining intrarater reliability. In all other instances, however, it is reasonable to assume that it is desirable to differentiate between the perspectives of the informants and those of the researcher. In other words, are the researcher’s findings truly grounded in the data or do they reflect his or her personal ideological perspectives. For a politician, for example, knowing the answer to this question may mean the difference between passing and rejecting a policy that allows parents to genetically test embryos. Although qualitative researchers can address interrater reliability by following the method used by Armstrong and his colleagues, the likelihood of achieving a reasonable level of reliability will be low simply due to researcher differences (e.g., the labels used to describe themes, structural organization of themes, importance accorded to themes, interpretation of data). In general, given the importance of reducing the variability in research findings attributed solely to researcher variability, it would greatly benefit qualitative researchers to utilize a common coding rubric. Furthermore, use of a common coding rubric does not greatly interfere with normal qualitative procedures, particularly if consensus is reached beforehand by all the researchers on the rubric that will be used to code all the data. Of equal importance, this procedure permits the researcher to
remain the instrument by which data are interpreted (Brodsky, 2008). Reporting the results of this qualitative process up to this point should considerably improve the credibility of research findings. However, three issues still remain. First, reporting the findings of multiple researchers places the burden of synthesis on the reader. Therefore, researchers should implement a method to synthesize all the findings through a consensus-building procedure or averaging results, where appropriate and possible. Second, judging the reliability of a study requires that deidentified data are made available to anyone who requests it. While no one, to the best of our knowledge, has studied the degree to which this is practiced, our experience suggests it is not prevalent in the research community. Third, reporting the findings of multiple researchers will only permit readers to get an approximate sense of the level of interrater reliability or whether it meets an acceptable standard. Moreover, comparisons of the reliability of one study with that of another qualitative study are impractical for complex studies. Fortunately, simple quantitative solutions exist that enable researchers to report the reliability of their conclusions rather than shift the burden to the reader. The present paper will expound upon four quantitative methods for calculating interrater reliability that can be specifically applied to qualitative data and thus should not be regarded as products of a positivist position. In fact, reliability estimates, which can roughly be conceptualized as the degree to which variability of research findings is or is not due to differences in researchers, illustrate the degree to which reality is socially constructed or not. Data that are subject to a wide range of interpretations will likely produce low reliability
estimates, whereas data whose interpretations are consistent will likely produce high reliability estimates. Finally, calculating interrater reliability in addition to reporting a narrative of the discrepancies and consistencies between researchers can be thought of as a form of methodological triangulation.
Method
Data Collection Process
Narrative data were collected from 528 undergraduate students and 28 professors randomly selected from a university population. Data were collected with the help of an open-ended survey that asked respondents to identify the primary challenges facing the university that should be immediately addressed by the university’s administration. Data were transcribed from the surveys to an electronic database (Microsoft Access) programmed to resemble the original questionnaire. Validation checks were performed by upper-level graduate students to assess the quality of the data entry process. Corrections to the data entered into the database were made by the graduate students in the few instances in which discrepancies were found between the responses noted on the survey and those entered in the database. Due to the design of the original questionnaire, which encouraged respondents to bullet their responses, little additional work was necessary to further break responses into the smallest codable units (typically 1-3 sentences). That said, it was possible for the smallest codable units to contain multiple themes although the average number of themes was less than two per unit of analysis.
Coding Procedures
Coding qualitative data is an arduous task that requires iterative passes through the raw data in order to generate a reliable and comprehensive coding rubric. This task was conducted by two experienced qualitative researchers who independently read the original narratives and identified primary and secondary themes, categorized these themes based on their perception of the internal structure (selective coding; Benaquisto, 2008), and produced labels for each category and subcategory based on the underlying data (open coding; Benaquisto, 2008). Following this initial step, the two researchers further differentiated or integrated their individual coding rubric (axial coding; Benaquisto, 2008) into a unified coding rubric. Using the unified coding rubric, the two researchers attempted an initial coding of the raw data to determine (1) the ease with which the coding rubric could be applied, (2) problem areas that needed further clarification, (3) the trivial categories that could be eliminated or integrated with other categories, (4) the extensive categories that could be further refined to make important distinctions, and (5) the overall coverage of the coding rubric. Not surprisingly, several iterations were necessary before the coding rubric was finalized. In the following section, for ease of illustration, reliability estimates are presented only for a single category.
Statistical Procedures
Very often, coding schemes follow a binomial distribution. That is, coders indicate whether a particular theme either is or is not present in the data. When two or more individuals code data to identify
such themes and patterns, the reliability of the coders’ efforts can be determined, typically by coefficients of agreement. This type of estimate can be used as a measure that objectively permits a researcher to substantiate that his or her coding scheme is replicable. Most estimators for gauging the reliability of continuous agreement data predominantly evolved from psychometric theory (Cohen, 1968; Lord & Novick, 1968; Gulliksen, 1950; Rozeboom, 1966). Similar methods for binomial agreement data shortly followed (Cohen, 1960; Lord & Novick, 1968). Newer forms of these estimators, called binomial intraclass correlation coefficients (ICC), were later developed to handle more explicit patterns in agreement data (Fleiss & Cuzick, 1979; Kleinman, 1973; Lipsitz, Laird, & Brennan, 1994; Mak, 1988; Nelder & Pregibon, 1987; Smith, 1983; Tamura & Young, 1987; Yamamoto & Yanagimoto, 1992). In this paper, four methods that can be utilized to assess the reliability of binomial coded agreement data are presented. These estimators are the kappa statistic (κ), the weighted kappa statistic (κW), the ANOVA binary ICC, and the Kuder-Richardson 20 (KR-20). The kappa statistic was one of the first statistics developed for assessing the reliability of binomial data between two or more coders (Cohen, 1960; Fleiss, 1971). A modified version of this statistic introduced the use of numerical weights. This statistic allows the user to apply different probability weights to cells in a contingency table (Fleiss, Cohen, & Everitt, 1969) in order to apply different levels of importance to various coding frequencies. The ANOVA binary ICC is based on the mean squares from an analysis of variance (ANOVA) model modified for binomial data (Elston, Hill, &
Smith, 1977). The last estimator was developed by Kuder and Richardson (1937), and is commonly known as KR-20 or KR (20), because it was the 20th numbered formula in their seminal article. This estimator is based on the ratios of agreement to the total discrete variance. These four reliability statistics are functions of i x j contingency tables, also known as cross-tabulation tables. The current paper will illustrate the use of these estimators for a study dataset that comprises the binomial coding patterns of two investigators. Because these coding patterns are from two coders and the coded responses are binomial (i.e., a theme either is or is not present in a given interview response), the contingency table has two rows (i = 2) and two columns (j = 2). The layout of this table is provided in Table 1. The first cell, denoted (i1 = Present, j1 = Present), of this table consists of the total frequency of cases where Coder 1 and Coder 2 both agree that a theme is present in the participant interview responses. The second cell, denoted (i1 = Present, j2 = Not Present), of this table consists of the total frequency of cases where Coder 2 feels that a theme is present in the interview responses, and Coder 1 does not agree with this assessment. The third cell, denoted (i2 = Not Present, j1 = Present), of this table consists of the total frequency of cases where Coder 1 feels that a theme is present, and Coder 2 does not agree with this assessment. The fourth cell, denoted (i2 = Not Present, j2 = Not Present), of this table consists of the total frequency of cases where both Coder 1 and Coder 2 agree that a theme is not present in the interview responses (Soeken & Prescott, 1986).
Table 1
General Layout of Binomial Coder Agreement Patterns for Qualitative Data

                                       Coder 1
  Coder 2                    Theme Present (j1)     Theme Not Present (j2)
  Theme Present (i1)         Cell11                 Cell21
  Theme Not Present (i2)     Cell12                 Cell22
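As a minimal illustration of how such a table can be tallied from two coders' binary codes (this sketch is not from the paper; the function and the example data are invented for illustration):

```python
import numpy as np

def agreement_table(coder1, coder2):
    """2 x 2 frequency table of present/absent codes from two coders.

    Rows follow Coder 2 and columns follow Coder 1, matching Table 1:
    [[both present,  Coder 2 only],
     [Coder 1 only,  both absent ]]
    """
    c1 = np.asarray(coder1, dtype=bool)
    c2 = np.asarray(coder2, dtype=bool)
    return np.array([
        [np.sum(c2 & c1),  np.sum(c2 & ~c1)],
        [np.sum(~c2 & c1), np.sum(~c2 & ~c1)],
    ])

# Example: 10 codable units coded for one theme by each coder (1 = present, 0 = absent)
coder1 = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
coder2 = [1, 0, 0, 1, 0, 0, 1, 1, 0, 1]
print(agreement_table(coder1, coder2))
```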
Participants
Interview data were collected from 28 professors and 528 undergraduate students randomly selected from a university population and then transcribed. The binomial coding agreement patterns for these two groups of interview participants are provided in Table 2 and Table 3. For the group of professor and student interview participants, the coders agreed that one professor and 500 students provided a response that pertained to overall satisfaction with university facilities. Coder 1 felt that an additional seven professors and two students made a response pertinent to overall satisfaction, whereas Coder 2 did not feel that the responses from these professors and students pertained to the theme of interest. Coder 2 felt that one professor and one student made a response pertinent to overall satisfaction, whereas Coder 1 did not feel that the responses from this professor and student pertained to the theme of interest. Coder 1 and Coder 2 agreed that responses from the final 19 professors and 25 students did not pertain to the topic of interest.
Table 2
Binomial Coder Agreement Patterns for Professor Interview Participants

                                       Coder 1
  Coder 2                    Theme Present (j1)     Theme Not Present (j2)
  Theme Present (i1)         1                      1
  Theme Not Present (i2)     7                      19

Table 3
Binomial Coder Agreement Patterns for Student Interview Participants

                                       Coder 1
  Coder 2                    Theme Present (j1)     Theme Not Present (j2)
  Theme Present (i1)         500                    1
  Theme Not Present (i2)     2                      25

Four Estimators for Calculating the Reliability of Qualitative Data

Kappa
According to Brennan and Hays (1992), the κ statistic “determines the extent of agreement between two or more judges exceeding that which would be expected purely by chance” (p. xx). This statistic is based on the observed and expected level of agreement between two or more raters with two or more levels. The observed level of agreement (po) equals the frequency of records where both coders agree that a theme is present plus the frequency of records where both coders agree that a theme is not present, divided by the total number of ratings. The expected level of agreement (pe) equals the summation of the cross products of the marginal probabilities. In other words, this is the expected rate of agreement by random chance alone. The kappa statistic (κ) then equals (po - pe)/(1 - pe). The traditional formulae for po and pe are

p_o = \sum_{i=1}^{c} p_{ii} \qquad \text{and} \qquad p_e = \sum_{i=1}^{c} p_{i.}\, p_{.i},

where c denotes the number of coding categories (the rows and columns of the table), p_{ii} denotes the proportion of cases falling in the ith agreement (diagonal) cell, p_{i.} denotes the marginal proportion for the ith row, and p_{.i} denotes the marginal proportion for the ith column (Fleiss, 1971; Soeken & Prescott, 1986). These formulae are illustrated in Table 4.
Table 4. 2 x 2 Contingency Table for the Kappa Statistic

                      Coder 1
Coder 2               Theme present          Theme not present      Marginal row probabilities (pi.)
Theme present         c11                    c21                    p1. = (c11 + c21)/N
Theme not present     c12                    c22                    p2. = (c12 + c22)/N
Marginal column       p.1 = (c11 + c12)/N    p.2 = (c21 + c22)/N    N = c11 + c21 + c12 + c22
probabilities (p.j)

In this notation, p_o = (c11 + c22)/N and p_e = p1. p.1 + p2. p.2.
Estimates from professor interview participants for calculating the kappa statistic are provided in Table 5. The observed level of agreement for professors is (1 + 19)/556 = 0.0360. The expected level of agreement for professors is 0.0036(0.0144) + 0.0468(0.0360) = 0.0017.

Table 5. Estimates from Professor Interview Participants for Calculating the Kappa Statistic

                      Coder 1
Coder 2               Theme present            Theme not present       Marginal row probabilities (pi.)
Theme present         1                        1                       p1. = 2/556 = 0.0036
Theme not present     7                        19                      p2. = 26/556 = 0.0468
Marginal column       p.1 = 8/556 = 0.0144     p.2 = 20/556 = 0.0360   N = 28 + 528 = 556
probabilities (p.j)
Estimates from student interview participants for calculating the kappa statistic are provided in Table 6. The observed level of agreement for students is (500 + 25)/556 = 0.9442. The expected level of agreement for students is 0.9011(0.9029) + 0.0486(0.0468) = 0.8160.
Table 6. Estimates from Student Interview Participants for Calculating the Kappa Statistic

                      Coder 1
Coder 2               Theme present             Theme not present       Marginal row probabilities (pi.)
Theme present         500                       1                       p1. = 501/556 = 0.9011
Theme not present     2                         25                      p2. = 27/556 = 0.0486
Marginal column       p.1 = 502/556 = 0.9029    p.2 = 26/556 = 0.0468   N = 556
probabilities (p.j)
The total observed level of agreement for the professor and student interview groups is po = 0.0360 + 0.9442 = 0.9802. The total expected level of agreement for the two groups is pe = 0.0017 + 0.8160 = 0.8177. The kappa statistic therefore equals κ = (0.9802 - 0.8177)/(1 - 0.8177) = 0.891; that is, the level of agreement between the two coders is 0.891 beyond that which would be expected purely by chance.
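As a computational check (not part of the original article), the following Python sketch reproduces the paper's kappa by computing each group's observed and chance agreement with the grand total N in the denominator, as in Tables 5 and 6, and summing the group contributions:

```python
# Minimal sketch of the paper's kappa computation: each group's observed and chance
# agreement is computed over the grand total N and the group contributions are summed.
# Rows = Coder 2, columns = Coder 1; counts taken from Tables 2 and 3.
groups = {
    "professors": [[1, 1], [7, 19]],
    "students":   [[500, 1], [2, 25]],
}
N = sum(sum(sum(row) for row in t) for t in groups.values())          # 556

p_o = p_e = 0.0
for table in groups.values():
    p_o += sum(table[i][i] for i in range(2)) / N                      # diagonal agreement
    row = [sum(table[i]) / N for i in range(2)]                        # p_i. over grand N
    col = [sum(table[i][j] for i in range(2)) / N for j in range(2)]   # p_.j over grand N
    p_e += sum(row[i] * col[i] for i in range(2))                      # chance agreement

kappa = (p_o - p_e) / (1 - p_e)
# ~0.9802, ~0.8176, ~0.892 (the text reports 0.891 from rounded intermediate values)
print(round(p_o, 4), round(p_e, 4), round(kappa, 3))
```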
Weighted Kappa

The reliability coefficient κW has the same interpretation as the kappa statistic, κ, but the researcher can differentially weight each cell to reflect varying levels of importance. According to Cohen (1968), κW is "the proportion of weighted agreement corrected for chance, to be used when different kinds of disagreement are to be differentially weighted in the agreement index" (p. xx). As an example, the frequencies of coding patterns where both raters agree that a theme is present can be given a larger weight than patterns where both raters
agree that a theme is not present. The same logic can be applied where the coders disagree on the presence of a theme in participant responses. The weighted observed level of agreement (pow) equals the frequency of records where both coders agree that a theme is present times one weight, plus the frequency of records where both coders agree that a theme is not present times another weight, divided by the total number of ratings. The weighted expected level of agreement (pew) equals the summation of the cross products of the marginal probabilities, where each cell in the contingency table has its own weight. The weighted kappa statistic then equals κW = (pow - pew)/(1 - pew). The traditional formulae for pow and pew are

p_{ow} = \sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}\, p_{ij} \quad \text{and} \quad p_{ew} = \sum_{i=1}^{c}\sum_{j=1}^{c} w_{ij}\, p_{i.}\, p_{.j},

where c denotes the number of coding categories, i denotes the ith row, j denotes the jth column, and wij denotes the (i, j)th cell weight (Fleiss, Cohen, & Everitt, 1969; Everitt, 1968). These formulae are illustrated in Table 7.
Table 7. 2 x 2 Contingency Table for the Weighted Kappa Statistic

                      Coder 1
Coder 2               Theme present               Theme not present           Marginal row probabilities (pi.)
Theme present         w11c11                      w21c21                      p1. = (w11c11 + w21c21)/N
Theme not present     w12c12                      w22c22                      p2. = (w12c12 + w22c22)/N
Marginal column       p.1 = (w11c11 + w12c12)/N   p.2 = (w21c21 + w22c22)/N   N = c11 + c21 + c12 + c22
probabilities (p.j)

In this notation, the weighted observed and expected levels of agreement used in Tables 9 and 10 are p_ow = (w11c11 + w22c22)/N and p_ew = p1. p.1 + p2. p.2, where the marginals are computed from the weighted cell values
. Karlin, Cameron, and Williams (1981) provided three methods for weighting probabilities as applied to the calculation of a kappa statistic. The first method equally weights each pair of observations; this weight is calculated as w_i = n_i/N, where n_i is the sample size of each cell and N is the sum of the sample sizes across all cells of the contingency table. The second method equally weights each group (e.g., undergraduate students and professors) irrespective of its size; these weights can be calculated as w_i = 1/[k n_i(n_i - 1)], where k is the number of groups (e.g., k = 2). The last method weights each cell according to its sample size; the formula for this weighting option is w_i = 1/[N(n_i - 1)]. There is no single standard for applying probability weights to each cell in a contingency table. For this study, the probability weights used are provided in Table 8. In the first row and first column, the probability weight is 0.80; this weight was chosen arbitrarily to reflect the overall importance of both coders agreeing that a theme is present. In the second row and first column, the probability weight is 0.10, and in the first row and second column, the probability weight is 0.09; these two weights were used to reduce the impact of the coders' differing levels of experience in qualitative research. In the second row and second column, the probability weight is 0.01; this weight was employed to reduce the effect of cases in which a theme is absent from the interview data. A small computational sketch of these weighting options follows.
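For illustration only (not part of the original paper), the three weighting formulas quoted above can be evaluated for the two groups in this study; the printed values are not used elsewhere in the text:

```python
# Minimal sketch of the three group-weighting options described by
# Karlin, Cameron, and Williams (1981), applied to the professor (n = 28)
# and student (n = 528) groups; values are illustrative only.
n = {"professors": 28, "students": 528}
N = sum(n.values())          # 556
k = len(n)                   # 2 groups

w_equal_pairs  = {g: ni / N for g, ni in n.items()}                     # w_i = n_i / N
w_equal_groups = {g: 1.0 / (k * ni * (ni - 1)) for g, ni in n.items()}  # w_i = 1 / (k n_i (n_i - 1))
w_by_cell_size = {g: 1.0 / (N * (ni - 1)) for g, ni in n.items()}       # w_i = 1 / (N (n_i - 1))

for name, w in [("equal pairs", w_equal_pairs),
                ("equal groups", w_equal_groups),
                ("cell size", w_by_cell_size)]:
    print(name, {g: round(v, 6) for g, v in w.items()})
```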
Table 8. Probability Weights on Binomial Coder Agreement Patterns for Professor and Student Interview Participants

                              Coder 1
Coder 2                       Theme Present (j1)    Theme Not Present (j2)
Theme Present (i1)            0.80                  0.09
Theme Not Present (i2)        0.10                  0.01
Estimates from professor interview participants for calculating the weighted kappa statistic are provided in Table 9. The observed level of agreement for professors is [0.8(1)+0.01(19)]/556 = 0.0018. The expected level of agreement for professors is 0.0016(0.0027) + 0.0016(0.0005) = 0.00001.
Table 9. Estimates from Professor Interview Participants for Calculating the Weighted Kappa Statistic

                      Coder 1
Coder 2               Theme present             Theme not present         Marginal row probabilities (pi.)
Theme present         0.8(1) = 0.80             0.09(1) = 0.09            p1. = 0.89/556 = 0.0016
Theme not present     0.1(7) = 0.70             0.01(19) = 0.19           p2. = 0.89/556 = 0.0016
Marginal column       p.1 = 1.5/556 = 0.0027    p.2 = 0.28/556 = 0.0005   N = 28 + 528 = 556
probabilities (p.j)
Estimates from student interview participants for calculating the weighted kappa statistic are provided in Table 10. The observed level of agreement for students is [0.8(500) + 0.01(25)]/556 = 0.7199. The expected level of agreement for students is 0.7196(0.7198) + 0.0008(0.0006) = 0.5180.
Table 10. Estimates from Student Interview Participants for Calculating the Weighted Kappa Statistic

                      Coder 1
Coder 2               Theme present              Theme not present         Marginal row probabilities (pi.)
Theme present         0.8(500) = 400             0.09(1) = 0.09            p1. = 400.09/556 = 0.7196
Theme not present     0.1(2) = 0.20              0.01(25) = 0.25           p2. = 0.45/556 = 0.0008
Marginal column       p.1 = 400.2/556 = 0.7198   p.2 = 0.34/556 = 0.0006   N = 28 + 528 = 556
probabilities (p.j)
The total weighted observed level of agreement for the professor and student interview groups is pow = 0.0018 + 0.7199 = 0.7217. The total weighted expected level of agreement for the two groups is pew = 0.00001 + 0.5180 = 0.5181. The weighted kappa statistic therefore equals κW = (0.7217 - 0.5181)/(1 - 0.5181) = 0.423; that is, the level of agreement between the two coders is 0.423 beyond that which would be expected purely by chance after applying importance weights to each cell. This reliability statistic is notably smaller than the unweighted kappa statistic because of the down-weighting of the many cases where both coders agreed that the theme is not present in the interview responses.
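The weighted result can likewise be checked with a short Python sketch (not part of the original analysis) that applies the Table 8 weights to the cell counts and weighted marginals in the way Tables 9 and 10 do:

```python
# Minimal sketch of the paper's weighted kappa: diagonal cells give the weighted
# observed term, and weighted row/column marginals (over the grand N) give the
# chance term; group contributions are summed as in the unweighted case.
weights = [[0.80, 0.09],            # row = Coder 2 present; cols = Coder 1 present / absent
           [0.10, 0.01]]            # row = Coder 2 absent
groups = [[[1, 1], [7, 19]],        # professors (rows = Coder 2, cols = Coder 1)
          [[500, 1], [2, 25]]]      # students
N = 556

p_ow = p_ew = 0.0
for t in groups:
    w = [[weights[i][j] * t[i][j] for j in range(2)] for i in range(2)]   # weighted counts
    p_ow += (w[0][0] + w[1][1]) / N                                       # weighted agreement
    row = [sum(w[i]) / N for i in range(2)]
    col = [sum(w[i][j] for i in range(2)) / N for j in range(2)]
    p_ew += row[0] * col[0] + row[1] * col[1]

kappa_w = (p_ow - p_ew) / (1 - p_ew)
print(round(kappa_w, 3))   # ~0.423
```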
ANOVA Binary ICC

From the writings of Shrout and Fleiss (1979), the currently available ANOVA binary ICC that is appropriate for the current data set is based on what they refer to as ICC(3,1). More specifically, this version of the ICC is based on within and between mean squares for two or more coding groups/categories from an analysis of variance model modified for binary response variables by Elston (1977). This reliability statistic measures the consistency of the two ratings (Shrout & Fleiss, 1979) and is appropriate when two or more raters rate the same interview participants on some item of interest. ICC(3,1) assumes that the raters are fixed; that is, the same raters are used to code multiple sets of data. ICC(2,1), which assumes the coders are randomly selected from a larger population of raters (Shrout & Fleiss, 1979), would otherwise be recommended but is not currently available for binomial response data. The traditional formulae for the within and between mean squares, along with an adjusted sample size estimate, are provided in Table 11. In these formulae, k denotes the total number of groups or categories, Yi denotes the frequency of agreements (both coders indicate a theme is present, or both coders indicate a theme is not present) between coders for the ith group or category, ni is the total sample size for the ith group or category, and N is the total sample size across all groups or categories. Using these estimates, the reliability estimate equals

\hat{\rho}_{AOV} = \frac{MS_B - MS_W}{MS_B + (n_0 - 1)\, MS_W}

(Elston, Hill, & Smith, 1977; Ridout, Demétrio, & Firth, 1999). Estimates from professor and student interview participants for calculating the ANOVA binary ICC are provided in Table 11. Given that k = 2 and N = 556, the adjusted sample size equals 54.5827. The within and between mean squares equal 0.0157 and 2.0854, respectively. Using these estimates, the ANOVA binary ICC equals (2.0854 - 0.0157)/[2.0854 + (54.5827 - 1)(0.0157)] = 0.714, which denotes the consistency of coding between the two coders on the professor and student interview responses.
Table 11. Formulae and Estimates from Professor and Student Interview Participants for Calculating the ANOVA Binary ICC

Mean squares within (MSW):
MS_W = \frac{1}{N-k}\left[\sum_{i=1}^{k} Y_i - \sum_{i=1}^{k}\frac{Y_i^2}{n_i}\right] = \frac{1}{556-2}\left[545 - 536.303\right] = 0.0157

Mean squares between (MSB):
MS_B = \frac{1}{k-1}\left[\sum_{i=1}^{k}\frac{Y_i^2}{n_i} - \frac{1}{N}\left(\sum_{i=1}^{k} Y_i\right)^2\right] = \frac{1}{2-1}\left[536.303 - \frac{545^2}{556}\right] = 2.0854

Adjusted sample size (n0):
n_0 = \frac{1}{k-1}\left[N - \frac{1}{N}\sum_{i=1}^{k} n_i^2\right] = \frac{1}{2-1}\left[556 - \frac{1}{556}(528^2 + 28^2)\right] = 54.5827

Note: ΣYi denotes the total number of cases where both coders indicate that a theme either is or is not present in a given response.
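The Table 11 quantities can be sketched in a few lines of Python (not from the original article). Note that the adjusted sample size evaluates to roughly 53.2 when the printed formula is applied to n = 28 and n = 528, slightly below the 54.5827 reported in the text, so the resulting ICC lands near 0.713 rather than 0.714:

```python
# Minimal sketch of the ANOVA binary ICC from Table 11. Y_i is the number of
# agreements (both coders present or both absent) in group i; n_i is the group size.
Y = [20, 525]        # professors: 1 + 19; students: 500 + 25
n = [28, 528]
k = len(Y)
N = sum(n)           # 556

sum_Y = sum(Y)                                          # 545
sum_Y2_over_n = sum(y * y / m for y, m in zip(Y, n))    # ~536.303

MS_W = (sum_Y - sum_Y2_over_n) / (N - k)                # ~0.0157
MS_B = (sum_Y2_over_n - sum_Y ** 2 / N) / (k - 1)       # ~2.085
n0 = (N - sum(m * m for m in n) / N) / (k - 1)          # ~53.2 (54.5827 in the text)

icc = (MS_B - MS_W) / (MS_B + (n0 - 1) * MS_W)          # ~0.713
print(round(MS_W, 4), round(MS_B, 4), round(n0, 2), round(icc, 3))
```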
Kuder-Richardson 20

In their landmark article, Kuder and Richardson (1937) presented the derivation of the KR-20 statistic, a coefficient that they used to determine the reliability of test items. This estimator is a function of the sample size, the summation of item variances, and the total variance. Two observations about these formulae require further inquiry. First, the authors do not appear to discuss the distributional requirements of the data in relation to the calculation of the correlation rii, possibly because the method was developed during the infancy of mathematical statistics. This vagueness has led to some incorrect calculations of the KR-20. Crocker and Algina (1986) present examples of the calculation of the KR-20 in their Table 7.2 based on data from their Table 7.1 (pp. 136-140). In Table 7.1, the correlation between the two split-halves is presented as ρ̂_AB = 0.34. It is not indicated that this statistic is the Pearson correlation. This is problematic
because this statistic assumes that the two random variables are continuous, when in actuality they are discrete. An appropriate statistic is Kendall's τc, and this correlation equals 0.35. As can be seen, the correlation, and with it the KR-20, may be notably underestimated if the incorrect distribution is assumed. For the remainder of this paper, the Pearson correlation is replaced with the Kendall τc correlation. Second, Kuder and Richardson (1937) present formulae for the calculation of σt² and rii that are not mutually exclusive. This lack of exclusiveness has caused some confusion about the appropriate calculation of the total variance σt². Lord and Novick (1968) indicated that this statistic is equal to coefficient α (continuous) under certain circumstances, and Crocker and Algina (1986) elaborated on this statement by indicating that "This formula is identical to coefficient alpha with the substitution of piqi for σ̂i²" (p. 139). This is unfortunately incomplete. Not only must this substitution be made for the numerator variances; the denominator variance must also be adjusted in the same manner. That is, if the underlying distribution of the data is binomial, all estimators should be based on the level of measurement appropriate to that distribution. Otherwise, the KR-20 formula will be based on a ratio of a discrete variance to a continuous variance, and the resulting total variance will be notably to substantially inflated. For the current paper, the KR-20 will be a function of a total variance based on the discrete level of measurement. This variance will equal the summation of the main and off diagonals of a variance-covariance matrix. These calculations are further detailed in the next section.
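The paper estimates Kendall's τc with SAS PROC CORR; as an illustrative alternative (hypothetical data, not the study's), recent versions of SciPy expose the same statistic:

```python
# Illustrative sketch only: Kendall's tau-c for two binary coding vectors in Python.
# Requires SciPy >= 1.7 for variant="c". The vectors below are hypothetical.
from scipy.stats import kendalltau

coder1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 1 = theme present, 0 = theme absent
coder2 = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]

tau_c, p_value = kendalltau(coder1, coder2, variant="c")
print(round(tau_c, 3), round(p_value, 3))
```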
The KR-20 will be computed using the formula

KR\text{-}20 = \frac{N}{N-1}\left[1 - \frac{1}{\sigma_T^2}\sum_{i=1}^{k}\frac{Y_i}{n_i}\left(1 - \frac{Y_i}{n_i}\right)\right],

where k denotes the total number of groups or categories, Yi denotes the number of agreements between coders for the ith group or category, ni is the total sample size for the ith group or category, and N is the total sample size across all groups or categories (Lord & Novick, 1968). The total variance (σT²) for coder agreement patterns equals the summation of the elements of a variance-covariance matrix for binomial data; that is, σ1² + σ2² + 2COV(X1, X2) = σ1² + σ2² + 2ρ12σ1σ2 (Stapleton, 1995). The variance-covariance matrix takes the general form

\Sigma = \begin{bmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \rho_{21}\sigma_2\sigma_1 & \sigma_2^2 & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1}\sigma_n\sigma_1 & \cdots & \cdots & \sigma_n^2 \end{bmatrix}

(Kim & Timm, 2007), and reduces to

\Sigma = \begin{bmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{21}\sigma_2\sigma_1 & \sigma_2^2 \end{bmatrix}

for a coding scheme comprised of two raters. In this matrix, the variances of agreement (σ1², σ2²) for the ith group or category should be based on discrete expectations (Hogg, McKean, & Craig, 2004). This variance equals the second moment minus the square of the first moment; that is, E(X²) - [E(X)]² (Ross, 1997). For continuous data, E(X²) = ∫ x² f(x) dx and E(X) = ∫ x f(x) dx, where f(x) denotes the probability density function (pdf) for continuous data. For the normal pdf, for example, E(X) = μ and E(X²) - [E(X)]² = σ². For discrete data, E(X²) = Σ xi² Prob(X = xi) and E(X) = Σ xi Prob(X = xi), where Prob(X = xi) denotes the probability function for discrete data (Hogg & Craig, 1995). For the binomial distribution, E(X) = np and E(X²) - [E(X)]² = np(1 - p) (Efron & Tibshirani, 1993). If a discrete distribution cannot be assumed or is unknown, it is most appropriate to use the distribution-free expectation (Hettmansperger & McKean, 1998); only basic algebra is needed to solve for E(X²) and E(X). For this last scenario it is also important to note that if the underlying distribution is discrete, methods that assume continuity for calculating E(X²) and E(X) should not be used, because the standard error can be substantially inflated, reducing the accuracy of statistical inference (Bartoszynski & Niewiadomska-Bugaj, 1996). As with the calculation of E(X²) and E(X), the distribution of the data must also be considered in the calculation of correlations; otherwise, standard errors will be inflated. For data that take the form of either the presence or the absence of a theme, which clearly have a discrete distribution, the correlation should be based on distributions suitable for this type of data. In this paper, the correlation ρ12 for agreement patterns between the coders is Kendall's τc (Bonett & Wright, 2000). This correlation can be readily estimated using the PROC CORR procedure in the statistical software package SAS. Estimates for calculating the KR-20 based on coder agreement patterns for the professor and student interview groups are provided in Table 12. Letting x2 = 2 for non-agreed responses, the variance is 0.816 and 0.023, respectively, for the professor and undergraduate student groups. The Kendall τc correlation equals 0.881. Using these estimates, the covariance between the groups equals 0.121. The total variance then equals 1.081. The final component of the KR-20 formula is the proportion of agreement times one minus this proportion (i.e., pi(1 - pi)) for each of the groups. This estimate for the professor and undergraduate student interview groups equals 0.204 and 0.006, respectively, and the sum of these values is 0.210. The KR-20 reliability estimate thus equals [556/(556 - 1)][1 - (0.210/1.081)] = 0.807, which is the reliability of coder agreement across the professor and student interview responses on the theme of interest.
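The KR-20 arithmetic above can be reproduced with the following Python sketch (not part of the original article), which takes the reported Kendall τc of 0.881 as given and codes agreement as 0 and non-agreement as 2, as described in the text:

```python
# Minimal sketch reproducing the paper's KR-20 arithmetic from the Table 12 quantities.
import math

N = 556
Y = {"professors": 20, "students": 525}      # agreements per group
n = {"professors": 28, "students": 528}      # group sizes

var = {}
for g in Y:                                  # discrete variance with values 0 and 2
    p_disagree = (n[g] - Y[g]) / n[g]
    var[g] = 4 * p_disagree - (2 * p_disagree) ** 2   # ~0.816 (prof.), ~0.023 (stud.)

tau_c = 0.881                                # reported Kendall tau-c between the groups
cov = tau_c * math.sqrt(var["professors"] * var["students"])   # ~0.12 (0.121 in the text)
total_var = sum(var.values()) + 2 * cov                        # ~1.08

sum_pq = sum((Y[g] / n[g]) * (1 - Y[g] / n[g]) for g in Y)     # ~0.210
kr20 = (N / (N - 1)) * (1 - sum_pq / total_var)
print(round(kr20, 3))                                          # ~0.807
```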
Table 12. Estimates from Professor and Student Interview Participants for Calculating the KR-20

Estimate                Professor Group                         Student Group
Individual variances    0.816                                   0.023
Kendall τc correlation  0.881
Covariance              (0.881)(0.816)^(1/2)(0.023)^(1/2) = 0.121
Total variance          0.816 + 0.023 + 2(0.121) = 1.081
pi(1 - pi)              0.714(1 - 0.714) = 0.204                0.994(1 - 0.994) = 0.006
Σ pi(1 - pi)            0.210

Discussion

This paper presented four quantitative methods for gauging the interrater reliability of qualitative findings that follow a binomial distribution (theme is present, theme is absent). The κ statistic is a measure of observed agreement between two or more coders beyond the agreement expected by chance. The κW statistic has the same interpretation as the kappa statistic but permits differential weighting of the cell frequencies that reflect patterns of coder agreement. The ANOVA binary ICC measures the degree to which two or more ratings are consistent. The KR-20 statistic is a reliability estimator based on a ratio of variances. That said, it is important to note that the reliability of binomial coding patterns is invalid if it is based on agreement statistics that assume continuous data (Maclure & Willett, 1987). Some researchers have developed tools for interpreting reliability coefficients but do not provide guidelines for determining the sufficiency of such statistics. According to Landis and Koch (1977), coefficients of 0.41-0.60, 0.61-0.80, and 0.81-1.00 reflect "Moderate," "Substantial," and "Almost Perfect" agreement, in that order. George and Mallery (2003) indicate that reliability coefficients of 0.9-1.0 are
"Excellent," of 0.8-0.9 are "Good," of 0.7-0.8 are "Acceptable," of 0.6-0.7 are "Questionable," of 0.5-0.6 are "Poor," and less than 0.5 are "Unacceptable," with coefficients of at least 0.8 being a researcher's target. According to these tools, the obtained κ of 0.891 demonstrates "Almost Perfect" to "Good" agreement between the coders. The κW statistic of 0.423 demonstrates "Moderate" to "Unacceptable" agreement between the coders. The obtained ANOVA ICC of 0.714 demonstrates "Substantial" to "Acceptable" agreement between the coders. Last, the obtained KR-20 of 0.807 demonstrates "Substantial" to "Good" agreement between the coders. The question that follows from these findings is "Are these reliability estimates sufficient?" The answer depends on the focus of the study, the complexity of the theme(s) under investigation, and the comfort level of the researcher. The more complicated the topic being investigated, the lower the proportion of observed agreement between the coders may be. According to Nunnally (1978), Cascio (1991), and Schmitt (1996), reliabilities of at least 0.70 are typically sufficient for use. The κ statistic, ANOVA ICC, and KR-20 meet this cutoff, demonstrating acceptable reliability coefficients.
What happens if the researcher has an acceptable level of reliability in mind but does not meet that requirement? What methods should be employed in this situation? If a desired reliability coefficient is not achieved, it is recommended that the coders revisit their coding decisions on the patterns where they disagreed about the presence of themes in the binomial data (e.g., interview responses). After the coders revisit their coding decisions, the reliability coefficient is re-estimated. This process is repeated until the desired reliability coefficient is achieved. Although the process may seem tedious, confidence in the coders' identification of themes increases, which improves the interpretability of the data.
Future Research

Three areas of research are recommended for furthering the use of reliability estimators for discrete coding patterns of binomial responses (e.g., qualitative interview data). The current paper presented estimators that can be used to gauge the reliability of agreement patterns within a theme. Reliability estimators applicable across themes should be further developed and investigated; this would allow researchers to determine the reliability of a grounded theory as a whole, for example, as opposed to a component of the theory. Sample size estimation methods should also be further developed for reliability estimators; at present they are limited to the κ statistic (Bonett, 2002; Feldt & Ankenmann, 1998). Sample size estimation would inform the researcher, in the example of the current paper, of how many interviews should be conducted in order to achieve a desired reliability coefficient on the coded qualitative interview data with a certain likelihood, prior to the initiation of data collection. The current study simulated coder agreement data that follow a binomial probability density function. Further investigation should be conducted to determine whether more appropriate discrete distributions exist for modeling agreement data; possible densities include the geometric, negative binomial, beta-binomial, and Poisson, for example. This development could lead to better estimators of reliability coefficients (e.g., for the investigation of 'rare' events).
References

Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3), 597-606.
Bartoszynski, R., & Niewiadomska-Bugaj, M. (1996). Probability and statistical inference. New York, NY: John Wiley.
Benaquisto, L. (2008). Axial coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 51-52). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Coding frame. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (pp. 88-89). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Open coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 581-582). Thousand Oaks, CA: Sage.
Benaquisto, L. (2008). Selective coding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods. Thousand Oaks, CA: Sage.
Bonett, D. G. (2002). Sample size requirements for testing and estimating coefficient alpha. Journal of Educational and Behavioral Statistics, 27, 335-340.
Bonett, D. G., & Wright, T. A. (2000). Sample size requirements for estimating Pearson, Kendall, and Spearman correlations. Psychometrika, 65, 23-28.
Brodsky, A. E. (2008). Researcher as instrument. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, p. 766). Thousand Oaks, CA: Sage.
Burla, L., Knierim, B., Barth, J., Liewald, K., Duetz, M., & Abel, T. (2008). From text to codings: Intercoder reliability assessment in qualitative content analysis. Nursing Research, 57, 113-117.
Cascio, W. F. (1991). Applied psychology in personnel management (4th ed.). Englewood Cliffs, NJ: Prentice-Hall International.
Cheek, J. (2008). Funding. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 360-363). Thousand Oaks, CA: Sage.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Coryn, C. L. S. (2007). The holy trinity of methodological rigor: A skeptical view. Journal of MultiDisciplinary Evaluation, 4(7), 26-31.
Creswell, J. W. (2007). Qualitative inquiry & research design: Choosing among five approaches (2nd ed.). Thousand Oaks, CA: Sage.
Crocker, L., & Algina, J. (1986). Introduction to classical & modern test theory. Fort Worth, TX: Holt, Rinehart, & Winston.
Davis, C. S. (2008). Hypothesis. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 408-409). Thousand Oaks, CA: Sage.
Dillon, W. R., & Mulani, N. (1984). A probabilistic latent class model for assessing inter-judge reliability. Multivariate Behavioral Research, 19, 438-458.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall/CRC.
Elston, R. C., Hill, W. G., & Smith, C. (1977). Query: Estimating "heritability" of a dichotomous trait. Biometrics, 33, 231-236.
Everitt, B. S. (1968). Moments of the statistics kappa and weighted kappa. The British Journal of Mathematical and Statistical Psychology, 21, 97-103.
Feldt, L. S., & Ankenmann, R. D. (1998). Appropriate sample size for comparing alpha reliabilities. Applied Psychological Measurement, 22, 170-178.
Firmin, M. W. (2008). Replication. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 754-755). Thousand Oaks, CA: Sage.
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Fleiss, J. L., Cohen, J., & Everitt, B. S. (1969). Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 72, 323-327.
Fleiss, J. L., & Cuzick, J. (1979). The reliability of dichotomous judgments: Unequal numbers of judges per subject. Applied Psychological Measurement, 3, 537-542.
George, D., & Mallery, P. (2003). SPSS for Windows step by step: A simple guide and reference, 11.0 update (4th ed.). Boston, MA: Allyn & Bacon.
Given, L. M., & Saumure, K. (2008). Trustworthiness. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 895-896). Thousand Oaks, CA: Sage.
Golafshani, N. (2003). Understanding reliability and validity in qualitative research. The Qualitative Report, 8(4), 597-607.
Greene, J. C. (2007). Mixed methods in social inquiry. Thousand Oaks, CA: Sage.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Hettmansperger, T. P., & McKean, J. (1998). Kendall's library of statistics 5: Robust nonparametric statistical models. London: Arnold.
Hogg, R. V., & Craig, A. T. (1995). Introduction to mathematical statistics (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Hogg, R. V., McKean, J. W., & Craig, A. T. (2004). Introduction to mathematical statistics (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Hopkins, K. D. (1998). Educational and psychological measurement and evaluation (8th ed.). Boston, MA: Allyn and Bacon.
Jensen, D. (2008). Confirmability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, p. 112). Thousand Oaks, CA: Sage.
Jensen, D. (2008). Credibility. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 138-139). Thousand Oaks, CA: Sage.
Jensen, D. (2008). Dependability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 208-209). Thousand Oaks, CA: Sage.
Karlin, S., Cameron, P. E., & Williams, P. (1981). Sibling and parent-offspring correlation with variable family age. Proceedings of the National Academy of Sciences, USA, 78, 2664-2668.
Kim, K., & Timm, N. (2007). Univariate and multivariate general linear models: Theory and applications with SAS (2nd ed.). New York, NY: Chapman & Hall/CRC.
Kleinman, J. C. (1973). Proportions with extraneous variance: Single and independent samples. Journal of the American Statistical Association, 68, 46-54.
Krippendorf, K. (2004). Content analysis: An introduction to its methodology (2nd ed.). Thousand Oaks, CA: Sage.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.
Landis, J. R., & Koch, G. C. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Newbury Park, CA: Sage.
Lipsitz, S. R., Laird, N. M., & Brennan, T. A. (1994). Simple moment estimates of the κ-coefficient and its variance. Applied Statistics, 43, 309-323.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Maclure, M., & Willett, W. C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126, 161-169.
Magee, B. (1985). Popper. London: Routledge Falmer.
Mak, T. K. (1988). Analyzing intraclass correlation for dichotomous variables. Applied Statistics, 37, 344-352.
Marshall, C., & Rossman, G. B. (2006). Designing qualitative research (4th ed.). Thousand Oaks, CA: Sage.
Maxwell, A. E. (1977). Coefficients of agreement between observers and their interpretation. British Journal of Psychiatry, 130, 79-83.
Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis: An expanded sourcebook (2nd ed.). Thousand Oaks, CA: Sage.
Miller, P. (2008). Reliability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 753-754). Thousand Oaks, CA: Sage.
Mitchell, S. K. (1979). Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychological Bulletin, 86, 376-390.
Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification strategies for establishing reliability and validity in qualitative research. International Journal of Qualitative Methods, 1(2), 13-22.
Nelder, J. A., & Pregibon, D. (1987). An extended quasi-likelihood function. Biometrika, 74, 221-232.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Paley, J. (2008). Positivism. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 646-650). Thousand Oaks, CA: Sage.
Ridout, M. S., Demétrio, C. G. B., & Firth, D. (1999). Estimating intraclass correlations for binary data. Biometrics, 55, 137-148.
Ross, S. (1997). A first course in probability (5th ed.). Upper Saddle River, NJ: Prentice Hall.
Rozeboom, W. W. (1966). Foundations of the theory of prediction. Homewood, IL: Dorsey.
Saumure, K., & Given, L. M. (2008). Rigor in qualitative research. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 2, pp. 795-796). Thousand Oaks, CA: Sage.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 81-84.
Seale, C. (1999). Quality in qualitative research. Qualitative Inquiry, 5(4), 465-478.
Smith, D. M. (1983). Algorithm AS189: Maximum likelihood estimation of the parameters of the beta binomial distribution. Applied Statistics, 32, 196-204.
Soeken, K. L., & Prescott, P. A. (1986). Issues in the use of kappa to estimate reliability. Medical Care, 24, 733-741.
Stapleton, J. H. (1995). Linear statistical models. New York, NY: John Wiley & Sons.
Stenbacka, C. (2001). Qualitative research requires quality concepts of its own. Management Decision, 39(7), 551-555.
Tamura, R. N., & Young, S. S. (1987). A stabilized moment estimator for the beta-binomial distribution. Biometrics, 43, 813-824.
van den Hoonaard, W. C. (2008). Inter- and intracoder reliability. In L. M. Given (Ed.), The Sage encyclopedia of qualitative research methods (Vol. 1, pp. 445-446). Thousand Oaks, CA: Sage.
Yamamoto, E., & Yanagimoto, T. (1992). Moment estimators for the binomial distribution. Journal of Applied Statistics, 19, 273-283.
Iacobucci, Dawn (Ed.) (2001), Journal of Consumer Psychology's Special Issue on Methodological and Statistical Concerns of the Experimental Behavioral Researcher, 10(1&2), Mahwah, NJ: Lawrence Erlbaum Associates, 71-73. Copyright 2001, Lawrence Erlbaum Associates, Inc.
IV. INTERRATER RELIABILITY ASSESSMENT IN CONTENT ANALYSIS

What is the best way to assess reliability in content analysis? Is percentage agreement between judges best (NO!)? Or, stated in a slightly different manner from another researcher: There are several tests that give indexes of rater agreement for nominal data and some other tests or coefficients that give indexes of interrater reliability for metric scale data. For my data based on metric scales, I have established rater reliability using the intraclass correlation coefficient, but I also want to look at interrater agreement (for two raters). What appropriate test is there for this? I have hunted around but cannot find anything. I have thought that a simple percentage of agreement (i.e., 1 point difference using a 10-point scale is 10% disagreement) adjusted for the amount of variance for each question may be suitable.
Professor Kent Grayson
London Business School

Kolbe and Burnett (1991) offered a nice (and pretty damning) critique of the quality of content analysis in consumer research. They highlight a number of criticisms, one of which is this concern about percentage agreement as a basis for judging the quality of content analysis. The basic concern is that percentages do not take into account the likelihood of chance agreement between raters. Chance is likely to inflate agreement percentages in all cases, but especially with two coders and low degrees of freedom on each coding choice (i.e., few coding categories). That is, if Coder A and Coder B have to decide yes-no whether a coding unit has property X, then mere chance will have them agreeing at least 50% of the time (i.e., in a 2 x 2 table with codes randomly distributed, 25% in each cell, there would be 50% of the scores already randomly along the diagonal, which would represent spurious apparent agreement). Several scholars have offered statistics that try to correct for chance agreement. The one that I have been using lately is "Krippendorff's alpha" (Krippendorff, 1980), which he described in his chapter on reliability. I use it because the math seems intuitive; it seems to be roughly based on the observed and expected logic that underlies chi-square.
Alternatively, Hughes and Garrett (1990) outlined a number of different options (including Krippendorff's, 1980) and then offered their own solution based on a generalizability theory approach. Rust and Cooil (1994) took a "proportional reduction in loss" approach and provided a general framework for reliability indexes for quantitative and qualitative data.
Professor Roland Rust
Vanderbilt University

Recent work in psychometrics (Cooil & Rust, 1994, 1995; Rust & Cooil, 1994) has set forth the concept of proportional reduction of loss (PRL) as a general criterion of reliability that subsumes both the categorical case and the metric case. This criterion considers the expected loss to a researcher from wrong decisions, and it turns out to include some popularly used methods (e.g., Cronbach's alpha: Cronbach, 1951; generalizability theory: Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Perreault and Leigh's, 1989, measure) as special cases. Simply using percentage of agreement between judges is not so good because some agreement is sure to occur, if only by chance, and the fewer the number of categories, the more random agreement is likely to occur, thus making the reliability appear better than it really is. This random agreement was the conceptual basis of Cohen's kappa (Cohen, 1960), a popular measure that is not a PRL measure and has other bad properties (e.g., overconservatism and, under some conditions, inability to reach one, even if there is perfect agreement).

Editor: For the discussion that follows, imagine a concrete example. Perhaps two experts have been asked to code whether 60 print advertisements are "emotional/imagery," "rational/informative," or "mixed/ambiguous" in appeal. The ratings table that follows depicts the coding scheme with the three categories; let us call the number of coding categories "c." The rows and columns represent the codes assigned to the advertisements by the two independent judges. For example, n21 represents one kind of disagreement: the number of ads that Rater 1 judged as rational that Rater 2 thought emotional. (The notation n1+ represents the sum for the first row, having aggregated over the columns.) If the raters agreed completely,
all ads would fall into the n11, n22, or n33 cells, with zeros off the main diagonal.

                 Rater 2
Rater 1          emotional    rational    mixed    row sums
emotional        n11          n12         n13      n1+
rational         n21          n22         n23      n2+
mixed            n31          n32         n33      n3+
column sums      n+1          n+2         n+3      n++ (e.g., 60)
Cohen (1960) proposed κ, kappa, the "coefficient of agreement," drawing the analogy (pp. 37-38) between interrater agreement and item reliability as pursuits of evaluating the quality of data. His index was intended as an improvement on the simple (he says "primitive") computation of percentage agreement. Percentage agreement is computed as the sum of the diagonal (agreed-on) ratings divided by the number of units being coded: (n11 + n22 + n33)/n++. Cohen stated, "It takes relatively little in the way of sophistication to appreciate the inadequacy of this solution" (p. 38). The problem to which he speaks is that this index does not correct for the fact that there will be some agreement simply due to chance. Hughes and Garrett (1990) and Kolbe and Burnett (1991), respectively, reported 65% and 32% of the articles they reviewed as relying on percentage agreement as the primary index of interrater reliability. Thus, although Cohen's criticism was clear 40 years ago, these reviews, only 10 years ago, suggest that the issues and solutions still have not permeated the social sciences. Cohen (1960, pp. 38-39) also criticized the use of the chi-square test of association in this application, because the requirement of agreement is more stringent. That is, agreement requires all non-zero frequencies to be along the diagonal; association could have all frequencies concentrated in off-diagonal cells. Hence, Cohen (1960) proposed kappa:

\kappa = \frac{p_a - p_c}{1 - p_c},
where p_a is the proportion of agreed-on judgments (in our example, p_a = (n11 + n22 + n33)/n++). The term p_c is the proportion of agreements one would expect by chance: p_c = (e11 + e22 + e33)/n++, where e_ii = (n_i+/n++)(n_+i/n++)(n++); for example, the expected number of (2, 2) (rational code) agreements would be e22 = (n2+/n++)(n+2/n++)(n++), just as you would compute expected frequencies in a chi-square test of independence in a two-way table, except that here we care only about the diagonal entries in the matrix. Cohen also provided his equation in terms of frequencies rather than proportions and an equation for an approximate standard error for the index. Researchers have criticized the kappa index for some of its properties and proposed extensions (e.g., Brennan & Prediger, 1981; Fleiss, 1971; Hubert, 1977; Kaye, 1980; Kraemer, 1980; Tanner & Young, 1985). To be fair, Cohen (1960, p. 42) anticipated some of these qualities (e.g., that the upper bound for kappa can be less than 1.0, depending on the marginal distributions), and so he provided an equation to determine the maximum kappa one might achieve. If Cohen's (1960) kappa has some problems, what might serve as a superior index? Perreault and Leigh (1989) reasoned through expected levels of chance agreement in a way that did not depend on the marginal frequencies. They defined an "index of reliability," I_r, as follows (p. 141):
I_r = \sqrt{\left(p_a - \frac{1}{c}\right)\left(\frac{c}{c-1}\right)}

when p_a > (1/c). If p_a < (1/c), I_r is set to zero. (Recall that c is the number of coding categories as defined previously.) They also provide an estimated standard error (p. 143):

s_{I_r} = \sqrt{\frac{I_r(1 - I_r)}{n_{++}}},
which is important, because when the condition holds that I_r x n++ > 5, these two statistics may be used in conjunction to form a confidence interval, in essence a test of the significance of the reliability index: I_r +/- (1.96)s_I. Rust and Cooil (1994) is another level of achievement, extending Perreault and Leigh's (1989) index to the situation of three or more raters and creating a framework that subsumes quantitative and qualitative indexes of reliability (e.g., coefficient alpha for rating scales and interrater agreement for categorical coding). Hughes and Garrett (1990) used generalizability theory, which is based on a random-effects, analysis-of-variance-like modeling approach to apportion variance due to rater, stimuli, coding conditions, and so on. (Hughes & Garrett also criticized the use of intraclass correlation coefficients as insensitive to differences between coders due to mean or variance; p. 187.) Ubersax (1988) attempted to simultaneously estimate reliability and validity from coding judgments using a latent class approach, which is prevalent in marketing. In conclusion, perhaps we can at least agree to finally banish the simple percentage agreement as an acceptable index of interrater reliability. In terms of an index suited for general endorsement, Perreault and Leigh's (1989) index (discussed earlier) would seem to fit many research circumstances (e.g.,
two raters). Furthermore, it appears sufficiently straightforward that one could compute the index without a mathematically induced coronary.
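As an illustration (not part of the original commentary), Perreault and Leigh's index and its approximate standard error, as quoted above, can be computed in a few lines of Python; the counts below are hypothetical (50 of 60 ads coded identically across c = 3 categories):

```python
# Illustrative sketch of Perreault and Leigh's (1989) reliability index I_r and its
# approximate standard error, per the formulas quoted above. Sample counts are hypothetical.
import math

def perreault_leigh(n_agree, n_total, c):
    p_a = n_agree / n_total
    if p_a <= 1.0 / c:
        return 0.0, 0.0                                # index is set to zero when p_a <= 1/c
    i_r = math.sqrt((p_a - 1.0 / c) * c / (c - 1))
    se = math.sqrt(i_r * (1 - i_r) / n_total)
    return i_r, se

i_r, se = perreault_leigh(50, 60, 3)                   # ~0.866 and ~0.044
low, high = i_r - 1.96 * se, i_r + 1.96 * se           # usable when I_r * n > 5
print(round(i_r, 3), round(se, 3), (round(low, 3), round(high, 3)))
```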
REFERENCES

Brennan, Robert L., & Prediger, Dale J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41, 687-699.
Cohen, Jacob. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cooil, Bruce, & Rust, Roland T. (1994). Reliability and expected loss: A unifying principle. Psychometrika, 59, 203-216.
Cooil, Bruce, & Rust, Roland T. (1995). General estimators for the reliability of qualitative data. Psychometrika, 60, 199-220.
Cronbach, Lee J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Cronbach, Lee J., Gleser, Goldine C., Nanda, Harinder, & Rajaratnam, Nageswari. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Fleiss, Joseph L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76, 378-382.
Hubert, Lawrence. (1977). Kappa revisited. Psychological Bulletin, 84, 289-297.
Hughes, Marie Adele, & Garrett, Dennis E. (1990). Intercoder reliability estimation approaches in marketing: A generalizability theory framework for quantitative data. Journal of Marketing Research, 27, 185-195.
Kaye, Kenneth. (1980). Estimating false alarms and missed events from interobserver agreement: A rationale. Psychological Bulletin, 88, 458-468.
Kolbe, Richard H., & Burnett, Melissa S. (1991). Content-analysis research: An examination of applications with directives for improving research reliability and objectivity. Journal of Consumer Research, 18, 243-250.
Kraemer, Helena Chmura. (1980). Extension of the kappa coefficient. Biometrics, 36, 207-216.
Krippendorff, Klaus. (1980). Content analysis: An introduction to its methodology. Newbury Park, CA: Sage.
Perreault, William D., Jr., & Leigh, Laurence E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.
Rust, Roland T., & Cooil, Bruce. (1994). Reliability measures for qualitative data: Theory and implications. Journal of Marketing Research, 31, 1-14.
Tanner, Martin A., & Young, Michael A. (1985). Modeling agreement among raters. Journal of the American Statistical Association, 80(389), 175-180.
Ubersax, John S. (1988). Validity inferences from inter-observer agreement. Psychological Bulletin, 104, 405-416.
Literature Review of Inter-rater Reliability

Inter-rater reliability, simply defined, is the extent to which information is collected in a consistent manner (Keyton, et al., 2004). That is, are the information-collecting mechanism and the procedures being used to collect the information solid enough that the same results can repeatedly be obtained? This should not be left to chance, either. Having a good measure of inter-rater reliability (combined with solid survey/interview construction procedures) allows project managers to state with confidence that they can rely on the information they have collected. Statistical measures of inter-rater reliability are used to provide evidence that the similar answers collected reflect more than simple chance (Krippendorf, 2004a). Inter-rater reliability also alerts project managers to problems that may occur in the research process (Capwell, 1997; Keyton, et al., 2004; Krippendorf, 2004a, b; Neuendorf, 2002). These problems include poorly executed coding procedures in qualitative surveys/interviews (such as a poor coding scheme, inadequate coder training, coder fatigue, or the presence of a rogue coder, all examined in a later section of this literature review) as well as problems regarding poor survey/interview administration (facilitators rushing the process, mistakes on the part of those recording answers, the presence of a rogue administrator) or design (see the Survey Methods, Interview/Re-Interview Methods, or Interview/Re-Interview Design literature reviews). From the potential problems listed here alone, it is evident that measuring inter-rater reliability is important in the interview and re-interview process.
Preparing qualitative/open-ended data for inter-rater reliability checks

If closed data were not collected for the survey/interview, then the data will have to be coded before they are analyzed for inter-rater reliability. Even if closed data were collected, coding may still be important because in many cases closed-ended data have a large number of possible values. A common consideration, the YES/NO priority (Green, 2004), requires answers to be placed into yes or no paradigms as a simple data-coding mechanism for determining inter-rater reliability. For instance, it would be difficult to determine inter-rater reliability for information such as birthdates. Instead of comparing the birthdates themselves, then, it can be determined whether the two data collections netted the same result. If so, then YES can be recorded for each respective survey. If not, then YES should be recorded for one survey and NO for the other (do not enter NO for both, as that would indicate agreement). While placing qualitative data into a YES/NO priority could be a working method for the information collected in the ConQIR Consortium, given the high likelihood that interview data will match, the forced categorical separation is not considered to be the best available practice and could prove faulty in accepting or rejecting hypotheses (or for applying analyzed data toward other functions). It should, however, be sufficient for evaluating whether reliable survey data are being obtained for agency use. For best results, the survey design should be created with reliability checks in mind, employing either a YES/NO choice option (this is different from what is reviewed above; a YES/NO option would include questions like, "Were you born before July 13, 1979?" where the participant would have to answer yes or no) or a Likert-scale type mechanism. See the Interview/Re-Interview Design literature review for more details. A brief sketch of the YES/NO recoding appears below.
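A minimal Python sketch of that YES/NO recoding, using hypothetical field names and values rather than actual ConQIR data:

```python
# Illustrative sketch of the YES/NO recoding described above: each field from two data
# collections is recoded as YES on both surveys when the recorded values match, and as
# YES on one and NO on the other when they do not. Field names and values are hypothetical.
collection_1 = {"birthdate": "1979-07-13", "zip": "66045", "employer": "Acme"}
collection_2 = {"birthdate": "1979-07-13", "zip": "66044", "employer": "Acme"}

recoded_1, recoded_2 = {}, {}
for field in collection_1:
    if collection_1[field] == collection_2[field]:
        recoded_1[field] = recoded_2[field] = "YES"          # matching values: YES on both
    else:
        recoded_1[field], recoded_2[field] = "YES", "NO"     # mismatch: YES on one, NO on the other

print(recoded_1)
print(recoded_2)
```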
How to compute inter-rater reliability

Fortunately, computing inter-rater reliability is a relatively easy process involving a simple mathematical formula based on a complicated statistical proof (Keyton, et al., 2004). In the case of qualitative studies, where survey or interview questions are open-ended, some sort of coding scheme will need to be put into place before using this formula (Friedman, et al., 2003; Keyton, et al., 2004). For closed-ended surveys or interviews where participants are forced to choose one option, the collected data are immediately ready for inter-rater checks (although quantitative checks often produce lower reliability scores, especially when a Likert scale is used) (Friedman, et al., 2003). To compute inter-rater reliability in quantitative studies (where closed-answer question data are collected using a Likert scale, a series of options, or yes/no answers), follow these steps to determine Cohen's kappa (1960), a statistical measure of inter-rater reliability:

1. Arrange the responses from the two different surveys/interviews into a contingency table. This means you will create a table that demonstrates, essentially, how many of the answers agreed and how many answers disagreed (and even how much they disagreed). For example, if two different survey/interview administrators asked ten yes or no questions, their answers would first be laid out and observed:
Question number    1   2   3   4   5   6   7   8   9   10
Interviewer #1     Y   N   Y   N   Y   Y   Y   Y   Y   Y
Interviewer #2     Y   N   Y   N   Y   Y   Y   N   Y   N

From this data, a contingency table would be created:

                          RATER #1 (going across)
RATER #2 (going down)     YES    NO
YES                       6      0
NO                        2      2
Notice that the number six (6) is entered in the first column because when looking at the answers there were six times when both interviewers found a YES answer to the same question. Accordingly, they are placed where the two YES answers overlap in the table (with the YES going across the top of the table representing Rater/Interviewer #1 and the YES going down the left side of the table representing Rater/Interviewer #2). A zero (0) is entered in the second column in the first row because for that particular intersection in the table there were no occurrences (that is, Interviewer/Rater #1 never found a NO answer when Interviewer/Rater #2 found a YES). The number two (2) is entered in the first column of the second row since Interviewer/Rater #1 found a YES answer two times when Interviewer/Rater #2 found a NO; and a two (2) is entered in the second column of the second row because both Interviewer/Rater #1 and Interviewer/Rater #2 found NO answers to the same question two different times. NOTE: It is important to consider that the above table is for a YES/NO type survey. If a different number of answers are available for the questions in a survey, then the number of answers should be taken into consideration in creating the table. For instance, if a five
point Likert-type scale were used in a survey/interview, then the table would have five rows and five columns (and all answers would be placed into the table accordingly).

2. Sum the row and column totals for the items. To find the sum for the first row in the previous example, the number six would be added to the number zero for a first-row total of six. The number two would be added to the number two for a second-row total of four. Then the columns would be added. The first column would find six being added to two for a total of eight, and the second column would find zero being added to two for a total of two.

3. Add the respective sums from step two together. For the running example, six (first-row total) would be added to four (second-row total) for a row total of ten (10). Eight (first-column total) would be added to two (second-column total) for a column total of ten (10). At this point, it can be determined whether the data have been entered and computed correctly by whether or not the row total matches the column total. In the case of this example, the data seem to be in order since both the row and column totals equal ten.

4. Add all of the agreement cells from the contingency table together. In the running example, this would lead to six being added to two for a total of eight, because there were six times where the YES answers matched from both interviewers/raters (as designated by the first column in the first row) and two times where the NO answers matched from both interviewers/raters (as designated by the second column in the second row). The sum of agreement, then, and the answer to this step, would be eight (8). The agreement cells will always appear in a diagonal pattern across the chart; so, for instance, if participants had five possibilities for answers, then there would be five cells going across and down the chart in a diagonal pattern to be added. NOTE: At this point simple agreement can be computed by dividing the answer in step four by the total from step three. In the case of this example, that would lead to eight being divided by ten for a result of 0.8. This number would be rejected by many researchers, however, since it does not take into account the probability that some of these agreements in answers could have occurred by chance. That is why the rest of the steps must be followed to determine a more accurate assessment of inter-rater reliability.

5. Compute the expected frequency for each of the agreement cells appearing in the diagonal pattern going across the chart. To do this, find the row total for the first agreement cell (row one, column one) and multiply that by the column total for the same cell. Divide this by the total number possible for all answers (this is the row/column total from step three). So, for this example, first the cell containing the number six would be located (since it is the first agreement cell, located in row one, column one), the column and row totals would be multiplied by each other (these were found in step two), and the product would then be divided by the total: 6 x 8 = 48; 48/10 = 4.8. The next diagonal cell (one over to the right and one down) is then computed: 2 x 4 = 8; 8/10 = 0.8. Since this is the final cell in the diagonal, this is the final computation that needs to be made in this step for the sample problem; however, if more answers were possible, then the step would be repeated as many times as there are answers. For a five-point Likert scale, for instance, the process would be repeated for the five agreement cells going across the chart diagonally in order to consider how those answers matched up and provide a full account of inter-rater reliability.

6. Add all of the expected frequencies found in step five together. This represents the expected frequency of agreement by chance. For the example used in this literature review, that would be 4.8 + 0.8 for a sum of 5.6. For a five-point Likert scale, all five of the totals found in step five would be added together.

7. Compute kappa. To do this, take the answer from step four and subtract the answer from step six. Set the result of that computation aside. Then take the total number of items from the survey/interview and subtract the answer from step six. After this has been completed, take the first computation from this step (the one that was set aside) and divide it by the second computation from this step. The resulting value is kappa. For the running example that has been provided in this literature review, it would look like this: 8 - 5.6 = 2.4; 10 - 5.6 = 4.4; 2.4/4.4 = 0.545. (A short computational check of this worked example appears at the end of this review, after the next section.)
8. Determine whether the reliability rate is satisfactory. If kappa is 0.7 or higher, then the inter-rater reliability rate is generally considered satisfactory (CITE); if not, it is often rejected. (A short computational sketch of steps 1 through 7 appears at the end of this section.)

What to do if inter-rater reliability is not at an appropriate level

Unfortunately, if inter-rater reliability is not at the appropriate level (generally 0.7), then it is often recommended that the data be thrown out (Krippendorf, 2004a). In cases such as these, it is often wise to administer an additional data collection so that a third set of information can be compared to the other collected data (and calculated against both in order to determine whether an acceptable inter-rater reliability level has been achieved with either of the previous data collection attempts). If many cases of inter-rater issues are occurring, then the data from these cases can often be examined in order to determine what the problem may be (Keyton et al., 2004). If data have been prepared for inter-rater checks from qualitative collection measures, for instance, the coding scheme used to prepare the data may be examined. It may also be helpful to check with the person who coded the data to make sure they understood the coding procedure (Keyton et al., 2004). This inquiry can also include questions about whether they became fatigued during the coding process (those coding large sets of information often tend to make more mistakes) and whether or not they agree with the process selected for coding (Keyton et al., 2004). In some cases a rogue coder may be the culprit for the failure to achieve inter-rater reliability (Neuendorf, 2002). Rogue coders are coders who disapprove of the methods used for analyzing the data and who assert their own coding paradigms. Facilitators of projects may also be to blame for low inter-rater reliability, especially if they have rushed the process (causing hasty coding), required one individual to code a large amount of information (leading to fatigue), or tampered with the data (Keyton et al., 2004).
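For readers who wish to automate the procedure described above, the following is a minimal Python sketch of steps 1 through 7. It is not part of the original literature review; the function name is the editor's, and the example table reproduces the running example (six matching YES answers, two matching NO answers, and two disagreements).

    def cohens_kappa(table):
        """Compute Cohen's kappa from a square contingency table.

        `table` is a list of rows; cell [i][j] counts how often rater 1
        chose category i while rater 2 chose category j.
        """
        n = sum(sum(row) for row in table)                        # total number of items (step 3)
        observed = sum(table[i][i] for i in range(len(table)))    # sum of the agreement cells (step 4)
        row_totals = [sum(row) for row in table]
        col_totals = [sum(col) for col in zip(*table)]
        # expected chance agreement for each diagonal cell (step 5), summed (step 6)
        expected = sum(row_totals[i] * col_totals[i] / n for i in range(len(table)))
        return (observed - expected) / (n - expected)             # step 7

    # Running example: 6 matching YES answers, 2 matching NO answers,
    # and 2 cases where one rater answered YES and the other NO.
    table = [[6, 0],
             [2, 2]]
    print(round(cohens_kappa(table), 3))  # prints 0.545, matching the worked example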
References

Capwell, A. (1997). Chick flicks: An analysis of self-disclosure in friendships. Cleveland: Cleveland State.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Friedman, P. G., Chidester, P. J., Kidd, M. A., Lewis, J. L., Manning, J. M., Morris, T. M., Pilgram, M. D., Richards, K., Menzie, K., & Bell, J. (2003). Analysis of ethnographic interview research procedures in communication studies: Prevailing norms and exciting innovations. National Communication Association, Miami, FL.

Green, B. (2004). Personal construct psychology and content analysis. Personal Construct Theory and Practice, 1, 82-91.

Keyton, J., King, T., Mabachi, N. M., Manning, J., Leonard, L. L., & Schill, D. (2004). Content analysis procedure book. Lawrence, KS: University of Kansas.

Krippendorf, K. (2004a). Content analysis: An introduction to its methodology. Thousand Oaks, CA: Sage.

Krippendorf, K. (2004b). Reliability in content analysis: Some common misconceptions and recommendations. Human Communication Research, 30, 411-433.

Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
The Qualitative Report Volume 10 Number 3 September 2005 439-462 http://www.nova.edu/ssss/QR/QR10-3/marques.pdf
The Application of Interrater Reliability as a Solidification Instrument in a Phenomenological Study Joan F. Marques Woodbury University, Burbank, California
Chester McCall Pepperdine University, Malibu, California
Interrater reliability has thus far not been a common application in phenomenological studies. However, once the suggestion was brought up by a team of supervising professors during the preliminary orals of a phenomenological study, this verification tool turned out to be vital to the credibility of this type of inquiry, in which the researcher is perceived as the main instrument and in which bias may, hence, be difficult to eliminate. With creativity and the appropriate calculation approach, the researcher of the qualitative study reviewed here managed to apply this verification tool and found that the establishment of interrater reliability served as a great solidification of the research findings. Key Words: Phenomenology, Interrater Reliability, Applicability, Bias Reduction, Qualitative Study, Research Findings, and Study Solidification
Introduction This paper intends to serve as support for the assertion that interrater reliability should not merely be limited to being a verification tool for quantitative research, but that it should be applied as a solidification strategy in qualitative analysis as well. This should be applied particularly in a phenomenological study, where the researcher is considered the main instrument and where, for that reason, the elimination of bias may be more difficult than in other study types. A “verification tool,” as interrater reliability is often referred to in quantitative studies, is generally perceived as a means of verifying coherence in the understanding of a certain topic, while the term “solidification strategy,” as referred to in this case of a qualitative study, reaches even further: Not just as a means of verifying coherence in understanding, but at the same time a method of strengthening the findings of the entire qualitative study. The following provides clarification of the distinction in using interrater reliability as a verification tool in quantitative studies versus using this test as a solidification tool in qualitative studies. Quantitative studies, which are traditionally regarded as more scientifically based than qualitative studies, mainly apply interrater reliability as a percentage-based agreement in findings that are usually fairly straightforward in their interpretability. The interraters in a quantitative study are not necessarily required to engage deeply into the material in order to obtain an
understanding of the study's findings for rating purposes. The findings are usually obvious and require only a brief review from the interraters in order for them to state their interpretations. The entire process can be very concise and easily understandable among the interraters, due to the predominantly numerical nature of the quantitative findings.

However, in a qualitative study the findings are usually not represented in plain numbers. This type of study is regarded as less scientific, and its findings are perceived in a more imponderable light. Applying interrater reliability in such a study requires the interraters to engage in attentive reading of the material, which then needs to be interpreted, while at the same time the interraters are expected to display a similar or basic understanding of the topic. Because qualitative studies are thus far not unanimously considered scientifically sophisticated, interrater reliability is used in these studies as more than just a verification tool. It is seen, rather, as a solidification tool that can contribute to the quality of these types of studies and the level of seriousness with which they will be considered in the future. As explained earlier, the researcher is usually considered the instrument in a qualitative study. By using interrater reliability as a solidification tool, the interraters could become true validators of the findings of the qualitative study, thereby elevating the level of believability and generalizability of the outcomes of this type of study.

As a clarification to the above: as the "instrument" in the study, the researcher can easily fall into the trap of having his or her bias influence the study's findings. This may happen even though the study guidelines assume that he or she will dispose of all preconceived opinions before immersing himself or herself into the research. Hence, the act of involving independent interraters, who have no prior connection with the study, in the analysis of the obtained data will provide substantiation of the "instrument" and significantly reduce the chance of bias influencing the outcome. Regarding the "generalizability" enhancement, Myers (2000) asserts,

Despite the many positive aspects of qualitative research, [these] studies continue to be criticized for their lack of objectivity and generalizability. The word 'generalizability' is defined as the degree to which the findings can be generalized from the study sample to the entire population. (¶ 9)

Myers continues that "the goal of a study may be to focus on a selected contemporary phenomenon […] where in-depth descriptions would be an essential component of the process" (¶ 9). This author subsequently suggests that, "in such situations, small qualitative studies can gain a more personal understanding of the phenomenon and the results can potentially contribute valuable knowledge to the community" (¶ 9). It is exactly for this purpose, the potential contribution of valuable knowledge to the community, that the researcher mentioned the elevation of generalizability in qualitative studies through the application of interrater reliability as a solidification and thus bias-reducing tool.
Before immersing into specifics it might be appropriate to explain that there are two main prerequisites considered when applying interrater reliability to qualitative research: (1) The data to be reviewed by the interraters should only be a segment of the total amount, since data in qualitative studies are usually rather substantial and interraters usually only have limited time and (2) It needs to be understood that there may be different configurations in the packaging of the themes, as listed by the various interraters, so that the researcher will need to review the context in which these themes are listed in order to determine their correspondence (Armstrong, Gosling, Weinman, & Marteau, 1997). It may also be important to emphasize here that most definitions and explanations about the use of interrater reliability to date are mainly applicable to the quantitative field, which suggests that the application of this solidification strategy in the qualitative area still needs significant review and subsequent formulation regarding its possible applicability. This paper will first explain the two main terms to be used, namely “interrater reliability” and “phenomenology,” after which the application of interrater reliability in a phenomenological study will be discussed. The phenomenological study that will be used for analysis in this paper is one that was conducted to establish a broadly acceptable definition of spirituality in the workplace. In this study the researcher interviewed six selected participants in order to obtain a listing of the vital themes of spirituality in the workplace. This process was executed as follows: First, the researcher formulated the criteria, which each participant should meet. Subsequently, she identified the participants. The six participants were selected through a snowball sampling process: Two participants referred two other participants who each referred to yet another eligible person. The researcher interviewed each participant in a similar way, using an interview protocol that was validated on its content by two recognized authors on the research topic, Drs. Ian Mitroff and Judi Neal. • Ian Mitroff is “distinguished professor of business policy and founder of the USC Center for Crisis Management at the Marshall School of Business, University of Southern California, Los Angeles. (Ian I. Mitroff, 2005, ¶ 1). He has published “over two hundred and fifty articles and twenty-one books of which his most recent are Smart Thinking for Difficult Times: The Art of Making Wise Decisions, A Spiritual Audit of Corporate America, and Managing Crises Before They Happen (Ian I. Mitroff, ¶ 4). • Judi Neal is the founder of the Association for Spirit at Work and the author of several books and “numerous academic journal articles on spirituality in the workplace” (Association for Spirit at Work, 2005, ¶ 10-11). She has also established her authority in the field of spirituality in the workplace in her position of “executive director of The Center for Spirit at Work at the University of New Haven, […] a membership organization and clearinghouse that supports personal and organizational transformation through coaching, education, research, speaking, and publications” (School of Business at the University of New Haven, 2005, ¶ 2). After transcribing the six interviews the researcher developed a horizonalization table; all six answers to each question were listed horizontally. 
She subsequently eliminated redundancies in the answers and clustered the themes that emerged from this
process, which in phenomenological terms is referred to as “phenomenological reduction.” This process was fairly easy, as the majority of questions in the interview protocol were worded in such a way that they solicited enumerations of topical phenomena from the participants. To clarify this with an example one of the questions was “What are some words that you consider to be crucial to a spiritual workplace?” This question solicited a listing of words that the participants considered identifiable with a spiritual workplace. From six listings of words, received from six participants, it was relatively uncomplicated to distinguish overlapping words and eliminate them. Hence, phenomenological reduction is much easier to execute these types of answers when compared to answers provided in essay-form. This, then, is how the “themes” emerged. To provide the reader with even more clarification regarding the question formulations, the interview protocol that was used in this study is included as an appendix (see Appendix A). Interrater Reliability Interrater reliability is the extent to which two or more individuals (coders or raters) agree. Although widely used in quantitative analyses, this verification strategy has been practically barred from qualitative studies since the 1980’s because “a number of leading qualitative researchers argued that reliability and validity were terms pertaining to the quantitative paradigm and were not pertinent to qualitative inquiry” (Morse, Barrett, Mayan, Olson, & Spiers, 2002, p. 1). “Interrater reliability addresses the consistency of the implementation of a rating system” (Colorado State University, 1997, ¶ 1). The CSU on-line site further clarifies interrater reliability as follows: A test of interrater reliability would be the following scenario: Two or more researchers are observing a high school classroom. The class is discussing a movie that they have just viewed as a group. The researchers have a sliding rating scale (1 being most positive, 5 being most negative) with which they are rating the student's oral responses. Interrater reliability assesses the consistency of how the rating system is implemented. For example, if one researcher gives a "1" to a student response, while another researcher gives a "5," obviously the interrater reliability would be inconsistent. Interrater reliability is dependent upon the ability of two or more individuals to be consistent. Training, education and monitoring skills can enhance interrater reliability. (¶ 2) Tashakkori and Teddlie (1998) refer to this type of reliability as “interjudge” or “interobserver,” describing it as the degree to which ratings of two or more raters or observations of two or more observers are consistent with each other. According to these authors, interrater reliability can be determined by calculating the correlation between a set of ratings done by two raters ranking an attribute in a group of individuals. Tashakkori and Teddlie continue “for qualitative observations, interrater reliability is determined by evaluating the degree of agreement of two observers observing the same phenomena in the same setting” (p. 85).
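As an illustration of the correlation-based approach that Tashakkori and Teddlie describe, the short Python sketch below computes a Pearson correlation between two raters' ratings of the same ten individuals. The ratings are hypothetical, and the sketch is an editorial illustration rather than material from either cited source.

    from statistics import mean

    def pearson_r(x, y):
        """Pearson correlation between two equal-length lists of ratings."""
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        var_x = sum((a - mx) ** 2 for a in x)
        var_y = sum((b - my) ** 2 for b in y)
        return cov / (var_x * var_y) ** 0.5

    # Hypothetical ratings of the same ten individuals by two raters
    # (1 = most positive, 5 = most negative).
    rater_1 = [1, 2, 2, 3, 5, 4, 1, 2, 3, 4]
    rater_2 = [1, 2, 3, 3, 5, 4, 2, 2, 3, 5]
    print(round(pearson_r(rater_1, rater_2), 2))  # a high value indicates consistent raters

A high positive correlation would indicate that the two raters order the individuals consistently, which is the sense of interrater reliability described by Tashakkori and Teddlie above.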
In the past several years interrater reliability has rarely been used as a verification tool in qualitative studies. A variety of new criteria were introduced for the assurance of credibility in these research types instead. According to Morse et al. (2002), this was particularly the case in the United States. The main argument against using verification tools with the stringency of interrater reliability in qualitative research has, so far, been that “expecting another researcher to have the same insights from a limited data base is unrealistic” (Armstrong et al., 1997, p. 598). Many of the researchers that oppose the use of interrater reliability in qualitative analysis argue that it is practically impossible to obtain consistency in qualitative data analysis because “a qualitative account cannot be held to represent the social world, rather it ‘evokes’ it, which means, presumably, that different researchers would offer different evocations” (Armstrong et al., p. 598). On the other hand, there are qualitative researchers who maintain that responsibility for reliability and validity should be reclaimed in qualitative studies, through the implementation of verification strategies that are integral and self-correcting during the conduct of inquiry itself (Morse et al., 2002). These researchers claim that the currently used verification tools for qualitative research are more of an evaluative (post hoc) than of a constructive (during the process) nature (Morse et al.), which leaves room for assumptions “that qualitative research must therefore be unreliable and invalid, lacking in rigor, and unscientific” (Morse et al., p. 4). These investigators further explain that post-hoc evaluation does “little to identify the quality of [research] decisions, the rationale behind those decisions, or the responsiveness and sensitivity of the investigator to data” (Morse et al., p. 7) and can therefore not be considered a verification strategy. The above-mentioned researchers emphasize that the currently used post-hoc procedures may very well evaluate rigor but do not ensure it (Morse et al.). The concerns addressed by Morse et al. (2002) above about verification tools in qualitative research being more of an evaluative nature (post hoc) than of a constructive (during the process) nature can be omitted by utilizing interrater reliability as it was applied to this study, which is, right after the initial attainment of themes by the researcher yet before formulating conclusions based on the themes registered. This method of verifying the study’s findings represents a constructive way (during the process) of measuring the consistency in the interpretation of the findings rather than an evaluative (post hoc) way. It therefore avoids the problem of concluding insufficient consistency in the interpretations after the study has been completed and it leaves room for the researcher to further substantiate the study before it is too late. The substantiation can happen in various ways. For instance, this might be done by seeking additional study participants, adding their answers to the material to be reviewed, performing a new cycle of phenomenological reduction, or resubmitting the package of text to the interraters for another round of theme listing. As suggested on the Colorado State University (CSU) website (1997) interrater reliability should preferably be established outside of the context of the measurement in your study. 
This source claims that interrater reliability should preferably be executed as a side study or pilot study. The suggestion of executing interrater reliability as a side study corresponds with the above-presented perspective from Morse et al. (2002) that verification tools should not be executed post-hoc, but constructively during the execution of the study. As stated before, the results from establishing interrater reliability as a “side study” at a critical point during the execution of the main study (see
explanation above) will enable the researcher, in case of insufficient consistency between the interraters, to perform some additional research in order to obtain greater consensus. In the opinion of the researcher of this study, the second option suggested by CSU, using interrater reliability as a “pilot study”, would mainly establish consistency in the understandability of the instrument. In this case such would be the interview protocol to be used in the research, since there would not be any findings to be evaluated at that time. However, the researcher perceives no difference between this interpretation of interrater reliability and the content validation here applied to the interview protocol by Mitroff and Neal. The researcher further questions the value of such a measurement without the additional review of study findings, or a part thereof. For this reason, the researcher decided that interrater reliability in this qualitative study would deliver optimal value if performed on critical parts of the study findings. This, then, is what was implemented in the here reviewed case. Phenomenology A phenomenological study entails the research of a phenomenon by obtaining authorities’ verbal descriptions based on their perceptions of this phenomenon: aiming to find common themes or elements that comprise the phenomenon. The study is intended to discover and describe the elements (texture) and the underlying factors (structure) that comprise the experience of the researched phenomenon. Phenomenology is regarded as one of the frequently used traditions in qualitative studies. According to Creswell (1998) a phenomenological study describes the meaning of the lived experiences for several individuals about a concept or the phenomenon. Blodgett-McDeavitt (1997) presents the following definition, Phenomenology is a research design used to study deep human experience. Not used to create new judgments or find new theories, phenomenology reduces rich descriptions of human experience to underlying, common themes, resulting in a short description in which every word accurately depicts the phenomenon as experienced by coresearchers. (¶ 10) Creswell suggests for a phenomenological study the process of collecting information should involve primarily in-depth interviews with as many as 10 individuals. According to Creswell, “Dukes recommends studying 3 to 10 subjects, and the Riemen study included 10. The important point is to describe the meaning of a small number of individuals who have experienced the phenomenon” (p. 122). Given these recommendations, the researcher of the phenomenological study described here chose to interview a number of participants between 3 and 10 and ended up with the voluntary choice of 6. Creswell (1998) describes the procedure that is followed in a phenomenological approach to be undertaken: In a natural setting where the researcher is an instrument of data collection who gathers words or pictures, analyzes them inductively, focuses on the
meaning of participants, and describes a process that is expressive and persuasive in language. (p. 14)

As in all qualitative studies, the researcher who engages in the phenomenological approach should realize that "phenomenology is an influential and complex philosophic tradition" (Van Manen, 2002a, ¶1) as well as "a human science method" (Van Manen, 2002a, ¶2), which "draws on many types and sources of meaning" (Van Manen, 2002b, ¶1). Creswell (1998) presents the procedure in a phenomenological study as follows:

1. The researcher begins [the study] with a full description of his or her own experience of the phenomenon (p. 147).
2. The researcher then finds statements (in the interviews) about how individuals are experiencing the topic, lists out these significant statements (horizonalization of the data) and treats each statement as having equal worth, and works to develop a list of nonrepetitive, nonoverlapping statements (p. 147).
3. These statements are then grouped into "meaning units": the researcher lists these units, and he or she writes a description of the "textures" (textural description) of the experience - what happened - including verbatim examples (p. 150).
4. The researcher next reflects on his or her own description and uses imaginative variation or structural description, seeking all possible meanings and divergent perspectives, varying the frames of reference about the phenomenon, and constructing a description of how the phenomenon was experienced (p. 150).
5. The researcher then constructs an overall description of the meaning and the essence of the experience (p. 150).
6. This process is followed first for the researcher's account of the experience and then for that of each participant. After this, a "composite" description is written (p. 150).
Based on the above-presented explanations and their subsequent incorporation in a study on workplace spirituality, the researcher developed the following model (Figure 1), which may serve as an example of a possible phenomenological process with incorporation of interrater reliability as a constructive solidification tool.
Figure 1. Research process in picture. [The figure diagrams the study's flow: six interviewees (A through F) feed into a horizonalization table, followed by phenomenological reduction and meaning clusters; the resulting emergent themes are reviewed by Interrater 1 and Interrater 2 and grouped into internal, external, integrated external/internal, leadership-imposed, and employee-imposed aspects. These lead to the textural and structural description of the meaning of this phenomenon, possible structural meanings of the experience, a definition of spirituality in the workplace, underlying themes and contexts, precipitating factors, invariant themes, implications of findings, and recommendations for individuals and organizations.]
In the here-discussed phenomenological study, which aimed to establish a broadly acceptable definition of spirituality in the workplace and therefore sought to obtain vital themes that would be applicable in such a work environment, the researcher considered the application of interrater reliability most appropriate at the time when the phenomenological reduction was completed. The meaning clusters also had been formed. Since the most important research findings would be derived from the emergent themes, this seemed to be the most crucial as well as the most applicable part for soliciting interrater reliability. However, the researcher did not submit any pre-classified information to the interraters, but instead provided them the entirety of raw transcribed data with highlights of 3 topical questions from which common themes needed to be derived. In other words, the researcher first performed phenomenological reduction, concluded which questions provided the largest numbers of theme listings, and then submitted the raw version of the answers to these questions to the interraters to find out whether they would come up with a decent amount of similar theme findings. This process will be explained in more detail later in the paper. Blodgett-McDeavitt (1997) cites one of the prominent researchers in phenomenology, Moustakas, in a presentation of the four main steps of phenomenological processes: epoche, reduction, imaginative variation, and synthesis of composite textural and composite structural descriptions. The way Moustakas’ steps can
be considered to correspond with the earlier presented procedure, as formulated by Creswell, is as follows: "epoche" (the process of bracketing the researcher's previous knowledge of the topic) happens when the researcher describes his or her own experiences of the phenomenon and thereby symbolically "empties" his or her mind (see Creswell step 1); "reduction" occurs when the researcher finds nonrepetitive, nonoverlapping statements and groups them into meaning units (Creswell steps 2 and 3); "imaginative variation" takes place when the researcher engages in reflection (Creswell step 4); and "synthesis" is applied when the researcher constructs an overall description and formulates his or her own accounts as well as those of the participants (Creswell steps 5 and 6). Elaborating on the interpretation of epoche, Blodgett-McDeavitt (1997) explains,

Epoche clears the way for a researcher to comprehend new insights into human experience. A researcher experienced in phenomenological processes becomes able to see data from new, naive perspective from which fuller, richer, more authentic descriptions may be rendered. Bracketing biases is stressed in qualitative research as a whole, but the study of and mastery of epoche informs how the phenomenological researcher engages in life itself. (p. 3)

Although epoche may be considered an effective way for the experienced phenomenologist to empty him- or herself and subsequently see the obtained data from a naive perspective, the chance is that bias is still very present for the less experienced investigator. The inclusion of interrater reliability as a bias reduction tool could therefore lead to significant quality enhancement of the study's findings (as will be discussed below).

Using Interrater Reliability in a Phenomenological Study

Interrater reliability has thus far not been a common application in phenomenological studies. However, once the suggestion was brought up by a team of supervising professors to apply it to the listing of vital themes in a spiritual workplace, the utilization of this constructive verification tool turned into an interesting challenge and, at the same time, required a high level of creativity from the researcher in charge. Because of the uncommonness of using this verification strategy in a qualitative study, especially a phenomenology, where the researcher is highly involved in the formulation of the research findings, it was fairly difficult to determine the applicability and positioning of this tool in the study. It was even more complicated to formulate the appropriate approach for calculating this rate, since there were various possible ways of computing it.

The first step for the researcher in this study was to find a workable definition for this verification tool. It was rather obvious that the application of this solidification strategy to the typically massive amount of descriptive data of a phenomenology would have to differ significantly from the way this tool is generally used in quantitative analysis, where kappa coefficients are the common way to go. After in-depth source reviews, the researcher concluded that there was no established consistency to date in defining interrater reliability, since the appropriateness of its outcome depends on
the purpose it is used for. Isaac and Michael (1997) illuminate this by stating that "there are various ways of calculating interrater reliability, and that different levels of determining the reliability coefficient take account of different sources of error" (p. 134). McMillan and Schumacher (2001) elaborate on the inconsistency issue by explaining that researchers often ask how high a correlation should be for it to indicate satisfactory reliability. McMillan and Schumacher conclude that this question is not answered easily. According to them, it depends on the type of instrument (personality questionnaires generally have lower reliability than achievement tests), the purpose of the study (whether it is exploratory research or research that leads to important decisions), and whether groups or individuals are affected by the results (since action affecting individuals requires a higher correlation than action affecting groups).

Aside from the above-presented statements about the divergence in opinions with regard to the appropriate correlation coefficient to be used, as well as the proper methods of applying interrater reliability, it is also a fact that most or all of these discussions pertain to the quantitative field. This suggests that there is still intense review and formulation needed in order to determine the applicability of interrater reliability in qualitative analyses, and that every researcher who takes on the challenge of applying this solidification strategy in his or her qualitative study will therefore be a pioneer.

The next step for the researcher of this phenomenological study was attempting to find the appropriate degree of coherence that should exist in the establishment of interrater reliability. It was the intention of the researcher to use a generally agreed-upon percentage, if one existed, as a guideline in her study. However, after assessing multiple electronic (online) and written sources regarding the application of interrater reliability in various research disciplines, the researcher did not succeed in finding a consistent percentage for the use of this solidification strategy. Sources included Isaac and Michael's (1997) Handbook in Research and Evaluation, Tashakkori and Teddlie's (1998) Mixed Methodology, and McMillan and Schumacher's (2001) Research in Education; Proquest's extensive article and paper database as well as its digital dissertations file; and other common search engines such as "Google." Consequently, this researcher presents the following illustrations of the observed basic inconsistency in applying interrater reliability, as she perceived it throughout a variety of studies, which were not necessarily qualitative in nature.

1. Mott, Etsler, and Drumgold (2003) presented the following reasoning for their interrater reliability findings in their study, Applying an Analytic Writing Rubric to Children's Hypermedia "Narratives," a comparative approach to the examination of the technical qualities of a pen-and-paper writing assessment for elementary students' hypermedia-created products:

Pearson correlations averaged across 10 pairs of raters found acceptable interrater reliability for four of the five subscales. For the four subscales, theme, character, setting, plot and communication, the r values were .59, .55, .49, .50 and .50, respectively. (Mott, Etsler, & Drumgold, 2003, ¶1)
2. Butler and Strayer (1998) assert the following in their online-presented research document, administered by Stanford University and titled The Many Faces of Empathy. Acceptable interrater reliability was established across both dialogues and monologues for all of the verbal behaviors coded. The Pearson correlations for each variable, as rated by two independent raters, are as follows: Average intimacy of disclosure, r =.94, t (8) = 7.79 p < .05; Focused empathy, r =.78, t (14) = 4.66 p < .05; and Shared Affect, r =.85, t (27) = 8.38, p < .05 (¶1). 3. Srebnik, Uehara, Smukler, Russo, Comtois, and Snowden (2002) approach interrater reliability in their study on Psychometric Properties and Utility of the Problem Severity Summary for Adults with Serious Mental Illness as follows: “Interrater reliability: A priori, we interpreted the intraclass correlations in the following manner: .60 or greater, strong; .40 to .59, moderate; and less than .40, weak ” (¶15). Through multiple reviews of accepted reliability rates in various studies, this researcher finally concluded that the acceptance rate for interrater reliability varies between 50% and 90%, depending on the considerations mentioned above in the citation of McMillan and Schumacher (1997). The researcher did not succeed in finding a fixed percentage for interrater reliability in general and definitely not for phenomenological research. She contacted the guiding committee of this study to agree upon a usable rate. The researcher found that in the phenomenological studies she reviewed through the Proquest digital dissertation database, interrater reliability had not been applied, although she did find a master’s thesis from the Trinity Western University that briefly mentioned the issue of using reliability in a phenomenological study by stating Phenomenological research must concern itself with reliability for its results to have applied meaning. Specifically, reliability is concerned with the ability of objective, outside persons to classify meaning units with the appropriate primary themes. A high degree of agreement between two independent judges will indicate a high level of reliability in classifying the categories. Generally, a level of 80 percent agreement indicates an acceptable level of reliability. (Graham, 2001, p. 66) Graham (2001) then states “the percent agreement between researcher and the student [the external judge] was 78 percent” (p. 67). However, in the explanation afterwards it becomes apparent that this percentage was not obtained by comparing the findings from two independent judges aside from the researcher, but by comparing the findings from the researcher to one external rater. Considering the fact that the researcher in a phenomenological study always ends up with an abundance of themes on his or her list (since he or she manages the entirety of the data, while the external rater only reviews a limited part of the data) calculating a score as high as 78% should not be difficult to obtain depending on the calculation method (as will be demonstrated later in this paper). The citation Graham used as a guideline in his thesis referred to the agreement between
two independent judges and not to the agreement between one independent judge and the researcher. The researcher of the here-discussed phenomenological study on spirituality in the workplace also learned that the application of this solidification tool in qualitative studies has been a subject of ongoing discussion (without resolution) in recent years, which may explain the lack of information and consistent guidelines currently available. The guiding committee for this particular research agreed upon an acceptable interrater reliability of two thirds, or 66.7% at the time of the suggestion for applying this solidification tool. The choice for 66.7% was based on the fact that, in this team, there were quantitative as well as qualitative oriented authorities, who after thorough discussion came to the conclusion that there were variable acceptable rates for interrater reliability in use. The team also considered the nature of the study and the multiinterpretability of the themes to be listed and subsequently decided the following: Given the study type and the fact that the interraters would only review part of the data, it should be understood that a correspondence percentage higher than 66.7% between two external raters might be hard to attain. This correspondence percentage becomes even harder to achieve if one considers that there might also be such a high number of themes to be listed, even in the limited data provided, that one rater could list entirely different themes than the other, without necessarily having a different understanding of the text; The researcher subsequently performed the following measuring procedure: 1. The data gained for the purpose of this study were first transcribed and saved. This was done by obtaining a listing of the vital themes applicable to a spiritual workplace and consisted of interviews taken with a pre-validated interview protocol from 6 participants. 2. Since one of the essential procedures in phenomenology is to find common themes in participants’ statements, the transcribed raw data were presented to two pre-identified interraters. The interraters were both university professors and administrators, each with an interest in spirituality in the workplace and, expectedly, with a fairly compatible level of comprehensive ability. These individuals were approached by the researcher and, after their approval for participation, separately visited for an instructional session. During this session, the researcher handed each interrater a form she had developed, in which the interrater could list the themes he found when reviewing the 6 answers to each of the three selected questions. Each interrater was thoroughly instructed with regards to the philosophy behind being an interrater, as well as with regards to the vitality of detecting themes that were common (either through direct wording or interpretative formulation by the 6 participants). The interraters, although acquainted with each other, were not aware of each other’s assignment as an interrater. The researcher chose this option to guarantee maximal individual interpretation and eliminate mutual influence. The interraters were thus presented with the request to list all the common themes they could detect from the answers to three particular interview questions. For this procedure, the researcher made sure to select those questions that solicited a listing of words and phrases from the participants. 
The reason for selecting these questions and their answers was to provide the interraters with a fairly clear and obvious overview of possible themes to choose from.
3. The interraters were asked to list the common themes per highlighted question on a form that the researcher developed for this purpose and enclosed in the data package. Each interrater thus had to produce three lists of common themes: one for each highlighted topical question. The highlighted questions in each of the six interviews were: (1) What are some words that you consider to be crucial to a spiritual workplace? (2) If a worker was operating at his or her highest level of spiritual awareness, what would he or she actually do? and (3) If an organization is consciously attempting to nurture spirituality in the workplace, what will be present? One reason for selecting these particular responses was that the questions that preceded these answers asked for a listing of words from the interviewees, which could easily be translated into themes. Another important reason was that these were also the questions from which the researcher derived most of the themes she listed. However, the researcher did not share any of the classifications she had developed with the interraters, but had them list their themes individually instead in order to be able to compare their findings with hers. 4. The purpose of having the interraters list these common themes was to distinguish the level of coordinating interpretations between the findings of both interraters, as well as the level of coordinating interpretations between the interraters’ findings and those of the researcher. The computation methods that the researcher applied in this study will be explained further in this paper. 5. After the forms were filled out and received from the interraters, the researcher compared their findings to each other and subsequently to her own. Interrater reliability would be established, as recommended by the dissertation committee for this particular study, if at least 66.7% (2/3) agreement was found between interraters and between interraters’ and researcher’s findings. Since the researcher serves as the main instrument in a phenomenological study, and even more because this researcher first extracted themes from the entire interviews, her list was much more extensive than those of the interraters who only reviewed answers to a selected number of questions. It may therefore not be very surprising that there was 100% agreement between the limited numbers of themes submitted by the interraters and the abundance of themes found by the researcher. In other words, all themes of interrater 1 and all themes of interrater 2 were included in the theme-list of the researcher. It is for this reason that the agreement between the researcher’s findings and the interraters’ findings was not used as a measuring scale in the determination of the interrater reliability percentage. A complication occurred when the researcher found that the interraters did not return an equal amount of common themes per question. This could happen because the researcher omitted setting a mandatory amount of themes to be submitted. In other words, the researcher did not set a fixed number of themes for the interraters to come up with, but rather left it up to them to find as many themes they considered vital in the text provided. 
The reason for refraining from limiting the interraters to a predetermined number of themes was because the researcher feared that a restriction could prompt random choices by each interrater among a possible abundance of available themes, ultimately leading to entirely divergent lists and an unrealistic conclusion of low or no interrater reliability.
To clarify the researcher's considerations, a simple example would be if there were a total of 100 obvious themes to choose from and the researcher required the submission of only 15 themes per interrater; there would be no guarantee which part of the 100 available themes each interrater would choose. It could very well be that interrater 1 would select the first 15 themes encountered, while interrater 2 would choose the last 15. If this were the case there would be zero percent interrater reliability, even though the interraters may have actually had a perfect common understanding of the topic. Therefore, the researcher decided to simply ask each interrater to list as many common themes as he could find among the highlighted statements from the 6 participants. It may also be appropriate to stress here that the researcher explained well in advance to the raters what the purpose of the study was, so there would be no confusion with regard to the understanding of what exactly was considered to be a "theme." Dealing with the problem of establishing interrater reliability with an unequal number of submissions from the interraters was thus another interesting challenge. Before illustrating how the researcher calculated interrater reliability for this particular case, note the following information:

• Interrater 1 (I1) submitted a total of 13 detected themes for the selected questions.
• Interrater 2 (I2) submitted a total of 17 detected themes for the selected questions.
• The researcher listed a total of 27 detected themes for the selected questions.
Between both interraters there were 10 common themes found. The agreement was determined on two counts: (1) On the basis of exact listing, which was the case with 7 of these 10 themes and (2) on the basis of similar interpretability, such as “giving to others” and “contributing”; “encouraging” and “motivating”; “aesthetically pleasing workplace”; and “beauty” of which the latter was mentioned in the context of a nice environment. The researcher color-coded the themes that corresponded with the two interraters (yellow) and subsequently color-coded the additional themes that she shared with either interrater (green for additional corresponding themes between the researcher and interrater 1 and blue for additional corresponding themes between the researcher and interrater 2). All of the corresponding themes between both interraters (the yellow category) were also on the list of the researcher and therefore also colored yellow on her list. Before discussing the calculation methods reviewed by this researcher about spirituality in the workplace, it may be useful to clarify that phenomenology is a very divergent and complicated study type, entailing various sub-disciplines and oftentimes described as “the study of essences, including the essence of perception and of consciousness” (Scott, 2002, ¶1). In his presentation of Merleau-Ponty’s Phenomenology of Perception Scott explains, “phenomenology is a method of describing the nature of our perceptual contact with the world. Phenomenology is concerned with providing a direct description of human experience” (¶1). This may clarify to the reader that the phenomenological researcher is aware that reality is a subjective phenomenon, interpretable in many different ways. Based on this conviction, this researcher did not make any pre-judgments on the quality of the various calculation methods presented below, but merely utilized them on the basis of their perceived applicability to this study type.
The researcher came across various possible methods for calculating interrater reliability, described below.

Calculation Method 1

Various electronic sources, among them a website from Richmond University (n.d.), mention the percent agreement between two or more raters as the easy way to calculate interrater reliability. In this case, reliability would be calculated as: (Total # agreements) / (Total # observations) x 100. In the case of this study, the outcome would be: 20/30 x 100 = 66.7%, whereby 20 equals the added number of agreements from both interraters (10 + 10) and 30 equals the added number of observations from both interraters (13 + 17). The recommendation from Posner, Sampson, Ward, and Cheney (1990), that interrater reliability R = (number of agreements) / (number of agreements + number of disagreements), also leads to the same outcome. This calculation would be executed as follows: 20 / (20 + 10) = 2/3 = 66.7%.

Various authors recommend the "confusion matrix," which is a standard classification matrix, as a valid option for calculating interrater reliability. A confusion matrix, according to Hamilton, Gurak, Findlater, and Olive (2003), "contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix" (¶1). According to these authors, the meaning of the entries in the confusion matrix should be specified as they pertain to the context of the study. In this study the following meanings will be ascribed to the various entries: a is the number of agreeing themes that Interrater 1 listed in comparison with Interrater 2; b is the number of disagreeing themes that Interrater 1 listed in comparison with Interrater 2; c is the number of disagreeing themes that Interrater 2 listed in comparison with Interrater 1; and d is the total number of disagreeing themes that both interraters listed. The confusion matrix that Hamilton et al. (2003) present is similar to the one displayed in Table 1. However, this researcher has specified the entries as recommended by these authors for the purpose of this study.

Table 1
Confusion Matrix 1

                               Interrater 1
                          Agree        Disagree
Interrater 2   Agree        a              b
               Disagree     c              d
Hamilton et al. (2003) subsequently present a number of equations relevant to their specific study. The researcher of this study substituted the actual values pertaining to this particular study in the authors’ equations and came to some interesting findings: 1. The rate that these authors label as “the accuracy rate” (AC), named this way because it measures the proportion of the total number of findings from Interrater 1 -- the one with the lowest number of themes submitted -- that are “accurate.” In this case
"accurate" means in agreement with the submissions of Interrater 2 (adopted from Hamilton et al., 2003, ¶5, and modified toward the values used in this particular study), is calculated as seen below.

AC = (a + d) / (a + b + c + d) = (10 + 10) / (10 + 3 + 7 + 10) = 20/30 = 66.7%
2. The rate these authors label "the true agreement rate" (TA): the title of this rate has been modified by substituting the names of the values applicable in this particular study. The true agreement rate was named this way because it measures the proportion of agreed-upon themes (10) out of the entire number of themes submitted by Interrater 1, the one with the lowest number of submissions (adopted from Hamilton et al., 2003, ¶8, and modified toward the values used in this particular study). It is calculated as seen below.

TA = a / (a + b) = 10 / (10 + 3) = 10/13 = 76.9%
Dr. Brian Dyre (2003), associate professor of experimental psychology at the University of Idaho, also uses the confusion matrix for determining interrater reliability. Dyre recommends the following computation under the heading Establishing Reliable Measures for Non-Experimental Research. As mentioned above, this researcher inserted the values derived from the interrater reliability test for this particular study about spirituality in the workplace into the recommended columns and rows, presented below as Table 2. The interraters are referred to as R1 and R2.

Table 2
Confusion Matrix 2 with Substitution of Actual Values

                               R1
                    Agree    Disagree         Total
R2     Agree          10         3              13
       Disagree        7        10 (= 3 + 7)    17
       Total          17        13              30

According to Dyre (2003), interrater reliability = [(Number of agreeing themes) + (Number of disagreeing themes)] / (Total number of observed themes) = (10 + 10) / 30 = 2/3 = 66.7%, which is similar to the earlier discussed accuracy rate (AC) from Hamilton et al. (2003).
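The arithmetic of Calculation Method 1 can be summarized in a few lines of Python. The sketch below is an editorial illustration rather than material from the cited sources; it simply reproduces the values reported above (a = 10, b = 3, c = 7, d = 10), using the entry definitions given for the confusion matrix.

    # Confusion-matrix entries as defined above for this study.
    a = 10       # themes on which both interraters agree
    b = 3        # disagreeing themes listed by Interrater 1 (13 - 10)
    c = 7        # disagreeing themes listed by Interrater 2 (17 - 10)
    d = b + c    # total number of disagreeing themes listed by both interraters

    # Simple percent agreement: (total agreements) / (total observations) x 100.
    agreements = 10 + 10            # the 10 common themes, counted once per interrater
    observations = 13 + 17          # all themes submitted by both interraters
    percent_agreement = agreements / observations * 100        # 66.7

    # Rates in the style of Hamilton et al. as reported above.
    accuracy_rate = (a + d) / (a + b + c + d) * 100             # 66.7
    true_agreement_rate = a / (a + b) * 100                     # 76.9

    print(round(percent_agreement, 1), round(accuracy_rate, 1), round(true_agreement_rate, 1))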
Calculation Method 2

Since the interraters did not submit an equal number of observations, as is the general practice in interrater reliability measures, the above-calculated rate of 66.7% can be disputed. Although the researcher did not manage to find any written source on which to base the following computation, she considered it logical that, in the case of unequal submissions, the lowest number of findings submitted from similar data by any of the two or more interraters used in a study should be used as the denominator in measuring the level of agreement. Based on this observation, interrater reliability would be: (Number of common themes) / (Lowest number of submissions) x 100 = 10/13 x 100% = 76.9%.

Rationale for this calculation: if the numbers of submissions by both interraters had varied even more, say 13 for interrater 1 versus 30 for interrater 2, interrater reliability could not be established even if all 13 themes submitted by interrater 1 were also on the list of interrater 2. With the calculations as presented under calculation method 1, the outcome would then be: (13 + 13) / (30 + 13) = 26/43 = 60.5%, whereby 13 would be the number of agreements and 43 the total number of observations. This does not correspond at all with the logical conclusion that a total level of agreement from one interrater's list onto the other should equal 100%. If, therefore, the "rational" justification of calculation method 2 is accepted, then interrater reliability is 76.9%, which exceeds the minimum agreed-upon rate of 66.7%. Expanding on this reasoning, further comparison leads to the following findings: all 13 listed themes from interrater 1 (13/13 x 100% = 100%) were on the researcher's list, and 16 of the 17 themes on interrater 2's list (16/17 x 100% = 94.1%) were also on the list of the researcher. These calculations are based on calculation method 2. The researcher found it interesting that the percentage of 76.9 between both interraters was also reached in the true agreement rate (TA) as presented earlier by Hamilton et al. (2003).

Calculation Method 3

Elaborating on Hamilton et al.'s (2003) true agreement rate (TA), which is the proportion of corresponding themes identified between both interraters: it is calculated using the equation TA = a / (a + b), whereby "a" equals the number of corresponding themes between both interraters and "b" equals the number of non-corresponding themes as submitted by the interrater with the lowest number of themes. The researcher found it interesting to examine the calculated outcomes in the case that the names of the two interraters had been placed differently in the confusion matrix. When exchanging the interraters' places in the matrix, the outcome of this rate turned out to be different, since the value substituted for "b" now became the number of non-corresponding themes as submitted by the interrater with the highest number of themes. In fact, the new computation led to an unfavorable, but also unrealistic, interrater reliability rate of 58.8%. The "unrealistic" reference lies in the fact that the interrater reliability rate, in the case of the above-mentioned substitution, turns out extremely low as the submission numbers of the two interraters differ to an increasing degree. In such a case, it does not even matter anymore whether the two interraters have full correspondence as far as the submissions
of the lowest submitter: the interrater reliability percentage, which is supposed to reflect the common understanding of both interraters, will still decrease to almost zero. To illustrate this assertion, the confusion matrix is presented in Table 3 with the names of the interraters switched.

Table 3. Confusion Matrix with Names of Interraters Switched

                              Interrater 2
                              Agree       Disagree
Interrater 1    Agree           a             b
                Disagree        c             d
With this exchange, the outcome for TA changes significantly:

1. The rate that these authors label "the accuracy rate" (AC) remains the same: AC = (a + d) / (a + b + c + d) = (10 + 10) / (10 + 3 + 7 + 10) = 20/30 = 66.7%.
2. The true agreement rate (with the labels replaced by the values applicable in this study) is now calculated as follows: TA = a / (a + b) = 10 / (10 + 7) = 10/17 = 58.8%.

In this study, TA rationally presented a rate of 76.9%, which was higher than the minimum requirement of 66.7% that applied under calculation methods 1 and 2. The new true agreement rate, on the other hand, demonstrates that the less logical procedure of exchanging the interraters' positions, so that the highest rather than the lowest number of submissions is used as the denominator (see the first part of calculation method 3), delivered a percentage below the minimum requirement. As a reminder of the irrationality of using the highest number of submissions as the denominator, the reader may recall the example given under the rationale for calculation method 2, in which the numbers of submissions diverged considerably (30 vs. 13). It is the researcher's opinion that the newly suggested "computation of moderation" would lead to the following outcome for the true agreement reliability rate (TAR):

TAR = (TA1 + TA2) / 2 = (76.9% + 58.8%) / 2 = 135.7% / 2 = 67.9%
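To make the arithmetic of the three calculation methods easy to check, the following minimal Python sketch reproduces the rates reported above from the raw counts of 10 common themes and 13 versus 17 submissions. The sketch is not part of the original study; the choice of Python, the variable names, and the use of min() to express the "lowest submitter" rule are the editor's assumptions.

```python
# Counts reported in the study.
common = 10      # themes appearing on both interraters' lists (cell a)
n_rater1 = 13    # themes submitted by interrater 1 (the lowest submitter)
n_rater2 = 17    # themes submitted by interrater 2 (the highest submitter)

# Non-corresponding themes for each interrater.
b = n_rater1 - common    # 3
c = n_rater2 - common    # 7

# Calculation method 1 / Hamilton et al.'s accuracy rate (AC).
# Per the figures above, both the agree-agree cell (a) and the
# disagree-disagree cell (d) take the value of the common-theme count.
ac = (common + common) / (common + b + c + common)    # 20/30 = 66.7%

# Calculation method 2: common themes over the lowest number of submissions.
method2 = common / min(n_rater1, n_rater2)            # 10/13 = 76.9%

# Calculation method 3: true agreement rate (TA) for both placements of the
# interraters in the confusion matrix, then the averaged TAR proposed above.
ta_lowest = common / (common + b)                     # 10/13 = 76.9%
ta_highest = common / (common + c)                    # 10/17 = 58.8%
tar = (ta_lowest + ta_highest) / 2                    # 67.9%

# The hypothetical 13-versus-30 example from the rationale: even with all 13
# themes of the smaller list matched, method 1 stays below 66.7% while
# method 2 reaches 100%.
hyp_method1 = (13 + 13) / (30 + 13)                   # 60.5%
hyp_method2 = 13 / min(13, 30)                        # 100%

for name, value in [("AC / method 1", ac), ("method 2", method2),
                    ("TA (lowest denominator)", ta_lowest),
                    ("TA (highest denominator)", ta_highest), ("TAR", tar),
                    ("hypothetical method 1", hyp_method1),
                    ("hypothetical method 2", hyp_method2)]:
    print(f"{name}: {value:.1%}")
```

Running the sketch prints 66.7%, 76.9%, 76.9%, 58.8%, 67.9%, 60.5%, and 100.0%, matching the figures discussed under calculation methods 1 through 3.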
It was the researcher’s conclusion that whether the reader considers calculation method 1, calculation method 2, or calculation method 3 as the most appropriate one for this particular study, all three methods demonstrated that there was sufficient common
understanding and interpretation of the essence of the interviewees' declarations, as they all resulted in outcomes equal to, or greater than, 66.7%. Hence, interrater reliability could be considered established for this study.

Recommendations

1. The researcher of this study has found that although interraters in a phenomenological study, and presumably in qualitative studies generally, can very well select themes with a similar understanding of the essentials in the data, she also found that there are three major attention points to address in order to enhance the success rate and swiftness of the process:
(a) The data to be reviewed by the interraters should be only a segment of the total amount, since data in qualitative studies are usually rather substantial and interraters usually have only limited time.
(b) The researcher will need to understand that different configurations are possible in the packaging of the themes listed by the various interraters, so that he or she will need to review the context in which the themes are listed in order to determine their correspondence (Armstrong et al., 1997). In this paper the researcher gave examples of themes that could be considered similar although they were "packaged" differently by the interraters, such as "giving to others" and "contributing," "encouraging" and "motivating," and "aesthetically pleasing workplace" and "beauty," the latter of which was mentioned in the context of a nice environment.
(c) In order to obtain results with similar depth from all raters, the researcher should set standards for the number of observations to be listed by the interraters as well as the time allotted to them. Because these confines were not specified to the interraters in this study, the level of input diverged: one interrater spent only two days listing the words and came up with a total of 13 themes, whereas the other spent approximately one week preparing his list and consequently produced a more detailed list of 17 themes. Although the majority of themes were congruent between the two interraters (there were 10 common themes between the lists), the calculation of interrater reliability was complicated by the unequal numbers of submissions; the calculation methods discussed here assume equal numbers of submissions by the interraters. The officially recognized reliability rate of 66.7% for this study is therefore lower than it would have been had both interraters been limited to a prespecified number of themes. If, for example, both interraters had been required to select 15 themes within an equal time span of, say, one week, the puzzle over using either the lowest or the highest number of submissions as the denominator would be resolved, because there would be only one denominator as well as an equal level of input from both interraters. If, in that case, the interraters came up with 12 common themes out of 15, the interrater reliability rate could easily be calculated as 12/15 = .8 = 80%. Even with only 10 common themes out of a required submission of 15, the rate would still meet the minimum requirement: 10/15 = .67 = 66.7%. This may be valuable advice for future applications of this tool to qualitative studies.
2. The solicited number of submissions from the interraters should be set as high as possible, especially if there is a multiplicity of themes to choose from. If the solicited number is kept too low, two raters may have a perfectly similar understanding of the text yet submit
different themes, which may erroneously suggest that there was not enough coherence in the raters' perceptions and, thus, insufficient interrater reliability.
3. The interraters should have at least a reasonable degree of similarity in intelligence, background, and interest level in the topic in order to ensure a decent degree of interpretative coherence. It would further be advisable to attune the educational and interest level of the interraters to the target group of the study, so that readers can find a greater level of recognition in the study topic as well as the findings.

Conclusion

As mentioned previously, interrater reliability is not a commonly used tool in phenomenological studies. Of the eight phenomenology dissertations that this researcher reviewed before embarking on her own experiential journey, none applied this instrument of control and solidification. This is possibly attributable to the assertion by various qualitatively oriented scholars in past years that it is difficult to obtain consistency in qualitative data analysis and interpretation (Armstrong et al., 1997). These scholars instead introduced a variety of "new criteria for determining reliability and validity, and hence ensuring rigor, in qualitative inquiry" (Morse et al., 2002, p. 2). Unfortunately, the majority of these criteria are either of a "post hoc" (evaluative) nature, meaning that they are applied after the study has been executed, when correction is no longer possible, or of a non-rigorous nature, such as member checks, which merely serve as a confirmation tool for the study participants regarding the authenticity of the raw data they provided but have nothing to do with the data analysis procedures (Morse et al.). However, having been asked by the guiding committee of a phenomenological study on spirituality in the workplace to apply this tool as an enhancement of the reliability of the findings as well as a bias reduction mechanism, the researcher found that establishing interrater reliability, or interrater agreement, was a major solidification of the themes ultimately listed as the most significant ones in this study.

It is the researcher's opinion that the process of interrater reliability should be applied more often to phenomenological studies in order to give them a more scientifically recognizable basis. It is still a general perception that qualitative study, the category to which phenomenology belongs, is less scientifically grounded than quantitative study. This perception is supported by arguments from various scholars that different reviewers cannot coherently analyze a single package of qualitative data. However, the researcher of this particular study found that the interraters, given the prerequisite of a certain minimal similarity in educational and cultural background as well as interest, could very well select themes with a similar understanding of the essentials in the data. This conclusion is shared with Armstrong et al. (1997), who came to similar findings in an empirical study in which they examined the extent to which various external raters could detect themes from the same data and demonstrate similar interpretations. The two main prerequisites presented by Armstrong et al., data limitation and contextual interpretability, were similar to those of the researcher in this phenomenological study.
These prerequisites were presented in this paper in the recommendations section.
An interesting lesson from this experience for the researcher was that the number of observations to be listed by the interraters, as well as the time allotted to them, should preferably be kept synchronous. At the same time, one might set the number of submissions as high as possible, because of the risk that interraters will select too widely varied themes when many themes are available. This may happen in spite of perfect common understanding between interraters and may therefore wrongly suggest that there is not enough consistency in comprehension between the raters and, thus, no interrater reliability. The justifications for this argument are also presented in the recommendations section of this paper.

References

Armstrong, D., Gosling, A., Weinman, J., & Marteau, T. (1997). The place of inter-rater reliability in qualitative research: An empirical study. Sociology, 31(3), 597-606.
Association for Spirit at Work. (2005). The professional association for people involved with spirituality in the workplace. Retrieved February 20, 2005, from http://www.spiritatwork.com/aboutSAW/profile_JudiNeal.htm
Blodgett-McDeavitt, C. (1997, October). Meaning of participating in technology training: A phenomenology. Paper presented at the meeting of the Midwest Research-to-Practice Conference in Adult, Continuing and Community Education, Michigan State University, East Lansing, MI. Retrieved January 25, 2003, from http://www.iupui.edu/~adulted/mwr2p/prior/blodgett.htm
Butler, E. A., & Strayer, J. (1998). The many faces of empathy. Poster presented at the annual meeting of the Canadian Psychological Association, Edmonton, Alberta, Canada.
Colorado State University. (1997). Interrater reliability. Retrieved April 8, 2003, from http://writing.colostate.edu/guides/research/relval/com2a5.cfm
Creswell, J. (1998). Qualitative inquiry and research design: Choosing among five traditions. Thousand Oaks, CA: Sage.
Dyre, B. (2003, May 6). Dr. Brian Dyre's pages. Retrieved November 12, 2003, from http://129.101.156.107/brian/218%20Lecture%20Slides/L10%20research%20designs.pdf
A phenomenological study of quest-oriented religion. Retrieved September 5, 2004, from http://www.twu.ca/cpsy/Documents/Theses/Matt%20Thesis.pdf
Hamilton, H., Gurak, E., Findlater, L., & Olive, W. (2003, February 7). The confusion matrix. Retrieved November 16, 2003, from http://www2.cs.uregina.ca/~hamilton/courses/831/notes/confusion_matrix/confusion_matrix.html
Isaac, S., & Michael, W. (1997). Handbook in research and evaluation (Vol. 3). San Diego, CA: Edits.
McMillan, J., & Schumacher, S. (2001). Research in education (5th ed.). New York: Longman.
Ian I. Mitroff. (2005). Retrieved February 20, 2005, from the University of Southern California Marshall School of Business web site: http://www.marshall.usc.edu/web/MOR.cfm?doc_id=3055
Morse, J. M., Barrett, M., Mayan, M., Olson, K., & Spiers, J. (2002). Verification strategies for establishing reliability and validity in qualitative research. International Journal of Qualitative Methods, 1(2), 1-19.
Mott, M. S., Etsler, C., & Drumgold, D. (2003). Applying an analytic writing rubric to children's hypermedia "narratives". Early Childhood Research & Practice, 5(1). Retrieved September 25, 2003, from http://ecrp.uiuc.edu/v5n1/mott.html
Myers, M. (2000, March). Qualitative research and the generalizability question: Standing firm with Proteus. The Qualitative Report, 4(3/4). Retrieved March 10, 2005, from http://www.nova.edu/ssss/QR/QR4-3/myers.html
Posner, K. L., Sampson, P. D., Ward, R. J., & Cheney, F. W. (1990, September). Measuring interrater reliability among multiple raters: An example of methods for nominal data. Retrieved November 13, 2003, from http://schatz.sju.edu/multivar/reliab/interrater.html
Richmond University. (n.d.). Interrater reliability. Retrieved November 13, 2003, from http://www.richmond.edu/~pli/psy200_old/measure/interrater.html
School of Business at the University of New Haven. (2005). Judi Neal, Associate Professor. Retrieved February 20, 2005, from http://www.newhaven.edu/faculty/neal/
Scott, A. (2002). Merleau-Ponty's phenomenology of perception. Retrieved September 5, 2004, from http://www.angelfire.com/md2/timewarp/merleauponty.html
Srebnik, D. S., Uehara, E., Smukler, M., Russo, J. E., Comtois, K. A., & Snowden, M. (2002, August). Psychometric properties and utility of the problem severity summary for adults with serious mental illness. Psychiatric Services, 53, 1010-1017. Retrieved March 4, 2005, from http://ps.psychiatryonline.org/cgi/content/full/53/8/1010
Tashakkori, A., & Teddlie, C. (1998). Mixed methodology (Vol. 46). Thousand Oaks, CA: Sage.
Van Manen, M. (2002a). Phenomenological inquiry. Retrieved September 4, 2004, from http://www.phenomenologyonline.com/inquiry/1.html
Van Manen, M. (2002b). Sources of meaning. Retrieved September 4, 2004, from http://www.phenomenologyonline.com/inquiry/49.html

Appendix A
Interview Protocol

Project: Spirituality in the Workplace: Establishing a Broadly Acceptable Definition of this Phenomenon
Time of interview:
Date:
Place:
Interviewer:
Interviewee:
Position of interviewee:
To the interviewee: Thank you for participating in this study and for committing your time and effort. I value the unique perspective and contribution that you will make to this study. My study aims to establish a broadly acceptable definition of "spirituality in the workplace" by exploring the experiences and perceptions of a small group of recognized interviewees who have had significant exposure to the phenomenon, through either practical or theoretical experience. You are one of these icons identified. You will be asked for your personal definitions and perceived essentials (meanings, thoughts, and backgrounds) regarding spirituality in the workplace. I am looking for accurate and comprehensive portrayals of what these essentials are like for you: your thoughts, feelings, insights, and recollections that might illustrate your statements. Your participation will hopefully help me understand the essential elements of "spirituality in the workplace."

Questions

1. Definition of Spirituality in the Workplace
1.1 How would you describe spirituality in the workplace?
1.2 What are some words that you consider to be crucial to a spiritual workplace?
1.3 Do you consider these words applicable to all work environments that meet your personal standards of a spiritual workplace?
1.4 What is essential for the experience of a spiritual workplace?
2. Possible structural meanings of experiencing spirituality in the workplace
2.1 If a worker was operating at his or her highest level of spiritual awareness, what would he or she actually do?
2.2 If a worker was operating at his or her highest level of spiritual awareness, what would he or she not do?
2.3 What is easy about living in alignment with spiritual values in the workplace?
2.4 What is difficult about living in alignment with spiritual values in the workplace?
3. Underlying themes and contexts for the experience of a spiritual workplace
3.1 If an organization is consciously attempting to nurture spirituality in the workplace, what will be present?
3.2 If an organization is consciously attempting to nurture spirituality in the workplace, what will be absent?
4. General structures that precipitate feelings and thoughts about the experience of spirituality in the workplace
4.1 What are some of the organizational reasons that could influence the transformation from a workplace that does not consciously attempt to nurture spirituality and the human spirit to one that does?
4.2 From the employee's perspective, what are some of the reasons to transform from a worker who does not attempt to live and work with spiritual values and practices to one who does?
5. Conclusion
Would you like to add, modify, or delete anything significant from the interview that would give a better or fuller understanding toward the establishment of a broadly acceptable definition of "spirituality in the workplace"?

Thank you very much for your participation.
Author Note

Joan Marques was born in Suriname, South America, where she made a career in advertising, public relations, and program hosting. She founded and managed an advertising and P.R. company as well as a foundation for women's awareness issues. In 1998 she immigrated to California and embarked upon a journey of continuing education and inspiration. She holds a Bachelor's degree in Business Economics from M.O.C. in Suriname, a Master's degree in Business Administration from Woodbury University, and a Doctorate in Organizational Leadership from Pepperdine University. Her recently completed dissertation centered on the topic of "spirituality in the workplace." Dr. Marques is currently affiliated with Woodbury University as an instructor of Business & Management. She has authored a wide variety of articles pertaining to workplace contentment for audiences on different continents of the globe. Joan Marques, 712 Elliot Drive # B, Burbank, CA 91504; E-mail:
[email protected]; Telephone: (818) 845 3063

Chester H. McCall, Jr., Ph.D., entered Pepperdine University after 20 years of consulting experience in such fields as education, health care, and urban transportation. He has served as a consultant to the Research Division of the National Education Association, several school districts, and several emergency health care programs, providing survey research, systems evaluation, and analysis expertise. He is the author of two introductory texts in statistics and more than 25 articles, and he has served on the faculty of The George Washington University. At Pepperdine, he teaches courses in data analysis, research methods, and a comprehensive exam seminar, and also serves as chair for numerous dissertations. E-mail:
[email protected]

Copyright 2005: Joan F. Marques, Chester McCall, and Nova Southeastern University

Article Citation
Marques, J. F. (2005). The application of interrater reliability as a solidification instrument in a phenomenological study. The Qualitative Report, 10(3), 439-462. Retrieved [Insert date], from http://www.nova.edu/ssss/QR/QR10-4/marques.pdf
The place of inter-rater reliability in qualitative research: an empirical study
Sociology, August 1997, 31(3), 597-606
by David Armstrong, Ann Gosling, Josh Weinman and Theresa Marteau

Assessing inter-rater reliability, whereby data are independently coded and the codings compared for agreement, is a recognised process in quantitative research. However, its applicability to qualitative research is less clear: should researchers be expected to identify the same codes or themes in a transcript, or should they be expected to produce different accounts? Some qualitative researchers argue that assessing inter-rater reliability is an important method for ensuring rigour, others that it is unimportant; and yet it has never been formally examined in an empirical qualitative study. Accordingly, to explore the degree of inter-rater reliability that might be expected, six researchers were asked to identify themes in the same focus group transcript. The results showed close agreement on the basic themes but each analyst 'packaged' the themes differently. Key words: inter-rater reliability, qualitative research, research methods.

© COPYRIGHT 1997 British Sociological Association Publication Ltd. (BSA)

Reliability and validity are fundamental concerns of the quantitative researcher but seem to have an uncertain place in the repertoire of the qualitative methodologist. Indeed, for some researchers the problem has apparently disappeared: as Denzin and Lincoln have observed, 'Terms such as credibility, transferability, dependability and confirmability replace the usual positivist criteria of internal and external validity, reliability and objectivity' (1994:14). Nevertheless, the ghost of reliability and validity continues to haunt qualitative methodology and different researchers in the field have approached the problem in a number of different ways.

One strategy for addressing these concepts is that of 'triangulation'. This device, it is claimed, follows from navigation science and the techniques deployed by surveyors to establish the accuracy of a particular point (though it bears remarkable similarities to the psychometric concepts of convergent and construct validity). In this way, it is argued, diverse confirmatory instances in qualitative research lend weight to findings. Denzin (1978) suggested that triangulation can involve a variety of data sources; multiple theoretical perspectives to interpret a single set of data; multiple methodologies to study a single problem; and several different researchers or evaluators. This latter form of triangulation implies that the difference between researchers can be used as a method for promoting better understanding. But what role is there for the more traditional concept of reliability? Should the consistency of researchers' interpretations, rather than their differences, be used as a support for the status of any findings?

In general, qualitative methodologies do not make explicit use of the concept of inter-rater reliability to establish the consistency of findings from an analysis conducted by two or more researchers. However, the concept emerges implicitly in descriptions of procedures for carrying out the analysis of qualitative data. The frequent stress on an analysis being better conducted as a group activity suggests that results will be improved if one view is tempered by another. Waitzkin described meeting with two research assistants to discuss and negotiate agreements and disagreements about coding in a process described as 'hashing out' (1991:69). Another example is afforded by Olesen and her colleagues (1994), who described how they (together with their graduate students, a standard resource in these reports) 'debriefed' and 'brainstormed' to pull out first-order statements from respondents' accounts and agree them. Indeed, in commenting on Olesen and her colleagues' work, Bryman and Burgess (1994) wondered whether members of teams should produce separate analyses and then resolve any discrepancies, or whether joint meetings should generate a single, definitive coded set of materials.

Qualitative methodologists are keen on stressing the transparency of their technique, for example, in carefully documenting all steps, presumably so that they can be 'checked' by another researcher: 'by keeping all collected data in well-organized, retrievable form, researchers can make them available easily if the findings are challenged or if another researcher wants to reanalyze the data' (Marshall and Rossman 1989:146). Although there is no formal description of how any reanalysis of data might be used, there is clearly an assumption that comparison with the original findings can be used to reject, or sustain, any challenge to the original interpretations. In other words, there is an implicit notion of reliability within the call for transparency of technique.

Unusually for a literature that is so opaque about the importance of independent analyses of a single dataset,
Mays and Pope explicitly use the term 'reliability' and, moreover, claim that it is a significant criterion for assessing the value of a piece of qualitative research: 'the analysis of qualitative data can be enhanced by organising an independent assessment of transcripts by additional skilled qualitative researchers and comparing agreement between the raters' (1995:110). This approach, they claim, was used by Daly et al. (1992) in a study of clinical encounters between cardiologists and their patients when the transcripts were analysed by the principal researcher and 'an independent panel', and the level of agreement assessed. However, ironically, the procedure described by Daly et al. was actually one of ascribing quantitative weights to pregiven 'variables' which were then subjected to statistical analysis (1992:204).

A contrary position is taken by Morse, who argues that the use of 'external raters' is more suited to quantitative research; expecting another researcher to have the same 'insights' from a limited data base is unrealistic: 'No-one takes a second reader to the library to check that indeed he or she is interpreting the original sources correctly, so why does anyone need a reliability checker for his or her data?' (Morse 1994:231). This latter position is taken further by those so-called 'post-modernist' qualitative researchers (Vidich and Lyman 1994) who would challenge the whole notion of consistency in analysing data. The researcher's analysis bears no direct correspondence with any underlying 'reality' and different researchers would be expected to offer different accounts as reality itself (if indeed it can be accessed) is characterised by multiplicity. For example, Tyler (1986) claims that a qualitative account cannot be held to 'represent' the social world, rather it 'evokes' it - which means, presumably, that different researchers would offer different evocations. Hammersley (1991) by contrast argues that this position risks privileging the rhetorical over the 'scientific' and argues that quality of argument and use of evidence should remain the arbiters of qualitative accounts; in other words, a place remains for some sort of correspondence between the description and reality that would allow a role for 'consistency'. Presumably this latter position would be supported by most qualitative researchers, particularly those drawing inspiration from Glaser and Strauss's seminal text, which claimed that the virtue of inductive processes was that they ensured that theory was 'closely related to the daily realities (what is actually going on) of substantive areas' (1967:239).

In summary, the debates within qualitative methodology on the place of the traditional concept of reliability (and validity) remain confused. On the one hand are those researchers such as Mays and Pope who believe reliability should be a benchmark for judging qualitative research; and, more commonly, those who reject the term but allow the concept to creep into their work. On the other hand are those who adopt such a relativist position that issues of consistency are meaningless, as all accounts have some 'validity' whatever their claims. A theoretical resolution of these divergent positions is impossible as their core ontological assumptions are so different. Yet this still leaves a simple empirical question: do qualitative researchers actually show consistency in their accounts? The answer to this question may not resolve the methodological confusion but it may clarify the nature of the debate. If accounts do diverge, then for the modernists there is a methodological problem and for the postmodernists a confirmation of diversity; if accounts are similar, the modernists' search for measures of consistency is reinforced and the postmodernists need to recognise that accounts do not necessarily recognise the multiple character of reality.

The purpose of the study was to see the extent to which researchers show consistency in their accounts and involved asking a number of qualitative researchers to identify themes in the same data set. These accounts were then themselves subjected to analysis to identify the degree of concordance between them.

Method

As part of a wider study of the relationship between perceptions of disability and genetic screening, a number of focus groups were organised. One of these focus groups consisted of adults with cystic fibrosis (CF), a genetic disorder affecting the secretory tissues of the body, particularly the lung. Not only might these adults with cystic fibrosis have particular views of disability, but theirs was a condition for which widespread genetic screening was being advocated. The aim of such a screening programme was to identify 'carriers' of the gene so that their reproductive decisions might be influenced to prevent the birth of children with the disorder. The focus group was invited to discuss the topic of genetic screening. The session was introduced with a brief summary of what screening techniques were currently available, and then discussion from the group on views of genetic screening was invited and facilitated. The ensuing discussion was tape recorded and transcribed.

Six experienced qualitative investigators in Britain and the United States who had some interest in this area of work were approached and asked if they would 'analyse' the transcript and prepare an independent report on it, identifying, and where possible rank ordering, the main themes emerging from the discussion (with a maximum of five themes). The analysts were offered a fee for this work.
The place of inter-ra ter relia bility in qua lita tive resea rch: a n empirica l study. The choice of method for examining the six reports was made on pragmatic grounds. One method, consistent with the general approach, would have been to ask a further six researchers to write reports on the degree of consistency that they perceived in the initial accounts. But then, these accounts themselves would have needed yet further researchers to be recruited for another assessment, and so on. At some point a ’final’ judgement of consistency needed to be made and it was thought that this could just as easily be made on the first set of reports. Accordingly, one of the authors (DA) scrutinised all six reports and deliberately did not read the original focus group transcript. The approach involved listing the themes that were identified by the six researchers and making judgements from the background justification whether or not there were similarities and differences between them.
context that gave it coherence. At its simplest this can be illustrated by the way that the theme of the relative invisibility of genetic disorders as forms of disability was handled. All six analysts agreed that it was an important theme and in those instances when the analysts attempted a ranking, most placed it first. For example, according to the third rarer: The visibility of the disability is the single most important element in its representation. [R3] But while all analysts identified an invisibility theme, all also expressed it as a comparative phenomenon: traditional disability is visible while CF is invisible. The stereotypes of the disabled person in the wheelchair;
https://www.academia.edu/458025/The_place_of_inter-rater_reliability_in_qualitative_research_an_empirical_study?login=amrit315@gmail.com&email_was_tak…
2/6
5/10/2014
The place of inter-rater reliability in qualitative research: an empirical study | David Armstrong - Academia.edu
similarities and differences between them. Results
The focus group interview with the adults with cystic fibrosis was transcribed into a document 13,500 words long and sent to the six designated researchers. All six researchers returned reports. Five of the reports, as requested, described themes: four analysts identified five each, the other four. The sixth analysts returned a lengthy and discursive report that commented extensively on the dynamics of the focus group, but then discussed a number of more thematic issues. Although not explicitly described, five themes could be abstracted from this text. In broad outline, the six analysts did identify similar themes but there were significant differences in the way they were ’packaged’. These differences can be illustrated by examining four different themes that the researchers identified in the transcript, namely, ’visibility’, ’ignorance’, ’health service provision’ and ’genetic screening’. Visibility. All six analysts identified a similar constellation of themes around such issues as the relative invisibility of genetic disorders, people’s ignorance, the eugenic debate and health care choices. However, analysts frequently differed in the actual label they applied to the theme. For example, while ’misperceptions of the disabled’, ’relative deprivation in relation to visibly disabled’, and ’images of disability’ were worded differently, it was clear from the accompanying description that they all related to the same phenomenon, namely the fact that the general public were prepared to identify - and give consideration to - disability that was overt, whereas genetic disorders such as CF were more hidden and less likely to elicit a sympathetic response. Further, although each theme was given a label it was more than a simple descriptor; the theme was placed in a
The stereotypes of the disabled person in the wheelchair; the contrast between visible, e.g. gross physical, and invisible, e.g. specific genetic, disabilities; and the special problems posed by the general invisibility of so many genetic disabilities. [R2]
In short, the theme was contextualised to make it coherent, and give it meaning. Perhaps because the invisibility theme came with an implicit package of a contrast with traditional images of deviance, there was general agreement on the theme and its meaning across all the analysts. Even so, the theme of invisibility was also used by some analysts as a vehicle for other issues that they thought were related: a link with stigma was mentioned by two analysts; another pointed out the difficulty of managing invisibility by CF sufferers. Ignorance. Whereas the theme of invisibility had a clear referent of visibility against which there could be general consensus, other themes offered fewer such ’natural’ backdrops. Thus, the theme of people’s ignorance about genetic matters was picked up by five of the six analysts, but presented in different ways. Only one analyst expressed it as a basic theme while others chose to link ignorance with other issues to make a broader theme. One linked it explicitly with the need for education. The main attitudes expressed were of great concern at the low levels of public awareness and understanding of disability, and of great concern that more educational effort should be put into putting this right. [R2] Three other analysts tied the population’s ignorance to the eugenic threat. For example: Ignorance and fear about genetic disorders and screening, and the future outcomes for society. The group saw the public as associating genetic technologies with Hitler, eugenics, and sex selection, and confusing minor gene