The Challenge of Intercoder Agreement in Qualitative Inquiry

Judith Harris, Ph.D.
Department of Curriculum and Instruction
406 Sanchez Building
University of Texas at Austin
Austin, TX 78712-1294
512-471-5211
[email protected]

Jeffry Pryor, M. Ed.
Power Computing Corporation
[email protected]

Sharon Adams, M. A.
Southwestern Educational Development Lab (SEDL)
[email protected]

Biographical Statement: Judi Harris is a faculty member in Curriculum and Instruction at the University of Texas at Austin, where she teaches courses in constructivist inquiry and instructional design. Her research and service focus upon the design of curriculum-based telecollaboration/teleresearch and professional development in educational telecomputing. Jeff Pryor works in Marketing at Power Computing Corporation, and is interested in effective models for user support in personal computing. Sharon Adams is a researcher with the Southwestern Educational Development Lab, and is interested in the intersections between information science and instructional design.

Author’s Note: Special thanks to Greg Jones, who collaborated with me to begin the work that inspired this investigation. Thanks also to the members of our writing-for-publication group: Bonnie Elliott, Colleen Fairbanks, Elissa Fineman, Rob Linne, Sarah McCarthey, Julia Meritt, and Jo Worthy, who offered helpful suggestions in response to an early draft of this work.
Abstract

This article documents our questioning and active exploration of the appropriateness of intercoder agreement in interpretive, document-based inquiry. Two differing methods for, and results of, analysis of the same data set are compared to determine which procedure provided greater credibility and methodological congruence. The data segments that were twice analyzed for perceived functions were electronic mail messages exchanged among elementary, middle-level, and secondary students and teachers, plus volunteer subject matter experts, all engaged in curriculum-related telementoring. A collaborative coding process, seen as a logical extension of peer debriefing, grew out of this pragmatically grounded exploration of issues traditionally conceptualized as “reliability.” The process is shared here, along with our reasons for recommending its use.
The Challenge of Intercoder Agreement in Qualitative Inquiry

Does intercoder agreement (also known as interrater/intercoder reliability and interjudge/interobserver/interscorer agreement) have a legitimate place in nonpositivistic research? If so, in which inquiry contexts is it appropriately used, and how should it be employed for data analysis and interpreted for research results? We explored one set of answers to these questions, creating and testing a collaborative method for document analysis while working with texts generated as electronic mail messages.

Reliability and Validity

Generally defined, reliability and validity refer, respectively, to the consistency and meaningfulness of research results (Sykes, 1990). Kirk and Miller (1986) differentiate between the two as follows: "Reliability is the degree to which the finding is independent of accidental circumstances of the research, and validity is the degree to which the finding is interpreted in a correct way" (p. 20). These authors (and others, e.g., Potter, 1996, p. 262) assert that reliability and validity are not symmetrical: perfect reliability can be achieved without validity (such as in the case of a defective measuring instrument), but perfect validity (which is often seen as theoretically unattainable) necessarily incorporates reliable data generation and categorization.

Reliability and validity in qualitative research. Goodwin and Goodwin (1984) identified four views that qualitative researchers commonly hold concerning reliability and validity. First, some investigators, especially anthropologists and ethnographers, claim that measurement of reliability and validity is irrelevant, because the researcher and the instrument are one and the same. Marshall & Rossman (1995) expand this argument from the researcher as instrument to the implications of a social constructionist world view:
    Positivist notions of reliability assume an unchanging universe where inquiry could, quite logically, be replicated. This assumption of an unchanging social world is in direct contrast to the qualitative/interpretive assumption that the social world is always being constructed, and the concept of replication is itself problematic. (p. 145)

Another position held by qualitative inquirers views validity as essential, but interobserver agreement as unnecessary (e.g., Bogdan & Biklen, 1992). Goodwin and Goodwin labeled this position as "curious" (p. 416), since, according to traditional definitions, reliability is a necessary component of validity. A third position says that reliability and validity are relevant to qualitative research, but that "empirical estimation is difficult to impossible" (p. 416) because qualitative methods are based on personal experience. A final view asserts that both reliability and validity are relevant to qualitative investigation, and can be directly assessed and addressed. Authors expressing this view (e.g., Lincoln & Guba, 1985) often emphasize validity over reliability, and reinterpret traditional definitions of these notions so that they can accommodate nontraditional research paradigms. Schwandt (1997) summarizes these four differing perspectives by saying that reliability “is an epistemic criterion thought to be necessary but not sufficient for establishing the truth of an account or interpretation of a social phenomenon” (p. 137).

Goodwin and Goodwin (1984) identify four different types of reliability that are appropriate for use in qualitative research. These correlate, to some extent, with the three types of reliability identified by Kirk and Miller (1986):

• interobserver, interinterviewer, interrecorder, interanalyst reliability (Goodwin & Goodwin, 1984) or synchronic reliability (Kirk & Miller, 1986) - amount of agreement between independent data collectors
• intraobserver, intrainterviewer, intrarecorder, intra-analyst reliability (Goodwin & Goodwin, 1984) or diachronic reliability (Kirk & Miller, 1986) - consistency over time of each researcher's data collection and analysis procedures

• stability (Goodwin & Goodwin, 1984) or quixotic reliability (Kirk & Miller, 1986) - consistency of informants' behavior or perspectives expressed over time

• internal consistency (Goodwin & Goodwin, 1984) - homogeneity of data collected and classified over time

The first, which we will call intercoder agreement, is the focus of this paper. Goodwin and Goodwin identify two phases of research during which interobserver reliability can be measured: data collection and data analysis. During data collection, interobserver reliability is demonstrated when two or more researchers' independently-collected data mirror each other. During data analysis, it is demonstrated when independently functioning researchers agree upon data segments to be coded, categories to be used, the placing of data segments into the same categories, and the interpretations drawn from examination of classified data segments. The authors go on to say that autonomous researchers are not usually expected to agree completely, because it is generally accepted that each has a unique perspective upon the context for inquiry, but that they should record essentially the same events and classify their data in a congruent manner.

Goodwin and Goodwin (1984) assert that "given the fairly obvious need to estimate this type of reliability, it is surprising that very few examples of its use can be found in published accounts of qualitative evaluations" (p. 421). Why is this so? According to Sykes (1990), it is due to the nature of qualitative research methods: "that their inherent characteristics (their flexibility and the absence of rigid experimental control) are not conducive to replicability" (p. 309). Quantitative
notions of reliability (especially intercoder agreement) may not have a place in the paradigmatic context of nonpositivistic qualitative research. Differences among results of replicated studies completed by different investigators should be expected, given the dynamic and inductive ways in which qualitative researchers generate and interpret data. Indeed, researchers' tacit knowledge of domains for inquiry is considered by some (e.g., Lincoln & Guba, 1985) to be an invaluable component in the interpretation of naturalistic data. In addition, the unique compilations of experience and reflection that produce such knowledge are not replicable across researchers. Perhaps it is not desirable to strive for inter-researcher reliability of any type. Instead, as Sykes (1990) and others (e.g., Erlandson, Harris, Skipper, & Allen, 1993) have suggested, the qualitative researcher should provide a complete audit trail that documents how data were generated and analyzed, including all notes, documents, analysis materials, and a comprehensive investigator's journal, which chronicles all decisions made, events that occurred, and questions that arose during the research process. Careful examination of this trail should reveal whether the investigator has consistently based his/her interpretations upon the data generated, rather than upon preexisting assumptions or erroneous (informant-absent) interpretation.

Demonstrating Reliability

Positivistic demonstration. Traditionally, intercoder agreement is reported according to the percentage of agreement between two or more judges' independent classifications of data segments, often using an existing set of segment labels. Although this number is easy to calculate and understand, it does not account for agreement between judges due to chance, especially when the number of possible categorical classifications is small and judges' labeling is particularly discrepant. More accurate estimations of the interrater reliability of nominally categorized data are most often
made using Cohen's kappa (1960; 1968), which accounts for chance agreement by recommending the following calculation:

    K = (Fo - Fc) / (N - Fc)

where:

    N  = the total number of labels assigned by each judge,
    Fo = the number of labels on which the judges agree, and
    Fc = the number of labels that can be expected to agree by chance.
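To make the calculation concrete, the following minimal sketch (in Python, using invented labels rather than project data) computes both the simple percentage of agreement and kappa for two judges; Fc is estimated from the judges' marginal label counts, as described below.

```python
# A minimal sketch: percent agreement and Cohen's kappa for two judges'
# nominal codes. The labels below are invented, not Emissary data.
from collections import Counter

judge_1 = ["Content", "Salutation", "Content", "Planning", "Thanking", "Content"]
judge_2 = ["Content", "Salutation", "Feedback", "Planning", "Thanking", "Content"]

N = len(judge_1)                                    # total labels per judge
Fo = sum(a == b for a, b in zip(judge_1, judge_2))  # labels the judges agree on
percent_agreement = Fo / N

# Chance agreement (Fc): multiply the two judges' marginal counts for each
# category, sum across categories, and divide by the total number of labels.
marginals_1, marginals_2 = Counter(judge_1), Counter(judge_2)
Fc = sum(marginals_1[c] * marginals_2[c] for c in marginals_1) / N

kappa = (Fo - Fc) / (N - Fc)
print(f"percent agreement = {percent_agreement:.2f}, kappa = {kappa:.2f}")
# -> percent agreement = 0.83, kappa = 0.78
```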
Cohen suggested that Fc could be determined by examining each judge's distribution of data labels across categories: the two judges' marginal totals for each category are multiplied together, summed across categories, and divided by the total number of labels. Calculated in this way, a score of 1.0 indicates perfect agreement among judges, and 0.0 indicates no agreement other than what would be expected by chance. A negative value indicates disagreement beyond what would be expected by chance. Typically, values greater than .65 or .7 are considered to be evidence of acceptable levels of interjudge agreement when using this formula.

Cohen's formula has been criticized as being overly conservative and therefore difficult to compare with other reliability indices (Perreault & Leigh, 1989), arbitrary in its determination of the amount of agreement attributable to chance (Uebersax, 1988), and, most importantly, not applicable for research in which the distribution of categorized data segments is not known a priori (Perreault & Leigh, 1989; Uebersax, 1988), as is the case with most nonpositivistic studies. Other, more complex formulae have been proposed (e.g., Koslowsky & Bailit, 1975; Frick & Semmel, 1978), but have been criticized in similar ways. For, as Perreault & Leigh (1989) assert:

    Disagreements between judges on how to code some observations forces attention to the underlying problem: because the data are only at the nominal scale level, one cannot simply "average" different codes (with errors of
    measurement) from different judges to arrive at a composite measure for use in subsequent analysis. Thus, even if reliability is in general high, any lack of reliability--as reflected in disagreement among judges--poses a problem. (p. 137)

These authors have derived what they consider to be a more accurate way to calculate intercoder agreement for positivistic qualitative data analysis by estimating the expected level of interjudge agreement, given a "true (population) level of reliability" (p. 140), which is represented in their equation but is unknown.

Nonpositivistic demonstration. Nonpositivistic, constructivist perspectives on qualitative research call for an expansion of the idea of reliability (stability, consistency, and predictability) to accommodate the natural occurrence of change in the context being studied, and in the perspective of the researcher, which evolves as s/he learns about the context through experience and reflection. Lincoln & Guba (1985) and Erlandson, et al. (1993), for example, suggest demonstrating dependability by providing evidence of the stability of research results. If the same (or similar) informants were involved in a study of the same (or a similar) context by different researchers, would the results be the same as or similar to the original findings? Naturalistic researchers (e.g., Erlandson, et al., 1993) suggest that study results are dependable if any variance (which is to be expected, over time) can be "tracked" to specified researcher error, shifts in informants' perspectives, better researcher insights, etc.

Two methods are suggested for establishing dependability. "Stepwise replication" (Lincoln & Guba, 1985, p. 317) involves splitting a research team in two, with each subgroup studying the same context somewhat independently, while meeting frequently as one research group to make sure that emerging foci for investigation are similar. This method was not recommended by the authors, due to its time-consuming and cumbersome nature. Instead, the "inquiry audit" (Lincoln & Guba, 1985, p. 317) or
"dependability audit" (Erlandson, et al., 1993, p. 34), in which a post-investigation, external audit is enacted using materials collected during the study and comprising a complete audit trail, similar to the procedure mentioned by Sykes (1990) above, is suggested for use. A report of the results of this process can demonstrate to the eventual reader of the study’s report that the interpretations made by the researchers were firmly grounded in the expressed perspectives of the informants. It should be noted, however, that this naturalistic reinterpretation of traditional notions of reliability does not (and should not) call for proof that two different researchers will categorize the same data segments independently and similarly. Such a scenario would probably not occur within the realm of constructivist inquiry, since data are understood to be generated as a collaborative act between researcher and informant, rather than collected by the researcher with minimal influence upon them. Therefore, the specific attributes of data generated in each naturalistic investigation would be unique. Instead, the assertion is of a more general nature: that two credible researchers or research teams studying the same or similar contexts will generate consistent overall result patterns, and any variance between result sets will be traceable to documented changes in informants and/or researchers. Objectivity? Reliability and validity are traditionally seen as componential aspects of objectivity, which is "the simultaneous realization of as much reliability and validity as possible" (Kirk & Miller, 1986, p. 20). Objectivity, demonstrated in part by independent interobserver/coder agreement, is then contrasted with subjectivity, which reflects individual, and often idiosyncratic, experience and interpretation. Objectivity is assumed by these methodologists to be the methodological goal of all "good research," regardless of paradigm:
    Objectivity, though the term has been taken by some to suggest a naive and inhumane version of vulgar positivism, is the essential basis of all good research. Without it, the only reason the reader of the research might have for accepting the conclusions of the investigator would be an authoritarian respect for the person of the author. (Kirk & Miller, 1986, p. 20)

In this view, objectivity is demonstrated when more than one researcher reaches a similar interpretation independently; this interpretation is seen to be closer to truth than a conclusion that a lone researcher "subjectively" (read: inaccurately) draws. This view of objectivity places emphasis upon the characteristics of the researcher and the data analysis method.

In a nonpositivistic paradigm (Guba & Lincoln, 1994), the very idea of objectivity is impossible at worst and improbable at best, given the understanding of how nonpositivistic research data come to be. Instead, naturalistic and constructivist investigators turn their attention to the data themselves, and ask, "Are they confirmable?" (Lincoln & Guba, 1985). In other words, can they be tracked to their sources, and is it clear how they logically contribute to coherent and mutually corroborating interpretive statements about the context being explored? (Erlandson, et al., 1993)

One indicator of reliable, valid, and objective research in the traditional view, then, would be intercoder agreement. But considered within a nonpositivistic, constructivist framework, intercoder agreement would merely indicate similarity of researchers' data analysis. In and of themselves, coding similarities would not necessarily add to the trustworthiness of the study’s results. Yet if coding agreement is consciously and collaboratively developed, it becomes part of a peer debriefing process, rather than an objective assessment of the empirical truth of data classifications. Peer debriefing is recommended by constructivist researchers to establish the credibility, or "truth value," of the findings of a study (Lincoln & Guba, 1985; Erlandson, et al., 1993).
Comprehensive, collaborative coding reviews might be worthwhile endeavors, considering the admittedly and unapologetically subjective character of qualitative research.

A Methodological Quandary

What was the genesis of this musing about differing conceptions of reliability, including the question of its appropriateness for qualitative inquiry? We were confronted with a methodological quandary, and experienced an unexpected resolution, while analyzing data that were generated by teachers, students, and subject matter experts exchanging electronic mail in the context of curriculum-related study. These electronic teams were formed and facilitated through an ongoing World Wide Web-based educational project.

The Electronic Emissary (http://www.tapr.org/emissary/), piloted in early 1993 and ongoing with support from the Texas Center for Educational Technology and the JC Penney Corporation, is a "matching service," pairing subject matter expert (SME) volunteers with elementary, middle level, and secondary teachers and students who are studying in the fields of the SMEs' expertise. In doing so, the Emissary helps to establish curriculum-based communication in the form of telementoring, or “use of email or computer conferencing systems to support a mentoring relationship when a face-to-face relationship would be impractical” (O’Neill, Wagner & Gomez, 1996, p. 39). Emissary matches are requested by teachers using either electronic mail or a World Wide Web-accessible database and accompanying, custom-designed selection software (Jones & Harris, 1995). Students in Emissary teams are encouraged to inquire about their curriculum-related topics of interest, which are also the subject matter experts' content specializations. The teachers in Emissary teams work with the SMEs, the students, and university-based "online facilitators"/research assistants to shape this
interaction, helping the teachers to incorporate it into the face-to-face classroom learning environment. In addition to providing these services, the project is an ongoing research effort, examining the nature of adult-child interaction and collaborative, asynchronous teaching and learning in text-based, computer-mediated environments.

The Emissary uses several custom-designed Unix scripts (short programs) to route messages among the participants in each team. These same scripts archive copies of all messages exchanged, and allow the project's technical administrator to manage the system. When messages are sent to an Emissary team's address on the project's server, each incoming text is copied and saved, basic information on the message’s routing and temporal attributes is collected, and then the message is automatically forwarded to the electronic mailboxes of the other team members. (A sketch of this routing-and-archiving logic appears after Table 1 below.) In this way, all messages, separated by team and ordered chronologically, are available for review and analysis by the project’s researchers. This data collection is done with participants' full prior knowledge and consent.

Data analysis. In an analysis of one semester’s Emissary-supported communication between classes of students and experts, a total of 350 messages formed 18 team mail logs, which were used to generate two types of data. The first data set was formed by the Emissary’s automated mail program (Jones & Harris, 1995) and yielded information on the numbers of lines, words, and characters contained in each message. The second data set was generated by researchers analyzing each message for its direction, or "flow," and speech acts, or "functions." A single message typically contained more than one perceived function. Rueda's (1992) 19 functions were tested in initial coding trials for functional comprehensiveness and mutual exclusivity. They were amended and appended to form 21 functions organized into Rueda's original three classes
("Reporting Information," "Requesting Information," and "Other"). These message function categories are shown below in Table 1, with corresponding examples taken from Emissary project data supplied for each coding category. Table 1 Examples of Message Function Classifications
Reporting Information
Content Information • "In principle radio waves could be diffracted just like light, and if put through a prism the different frequencies (colors) could be separated out…"
Procedural Information (content-related "how-to" information) • "Trenchers [recipe]: 1. Dissolve yeast in warm water. 2. Combine ale, yeast, sugar, salt, and egg in a large bowl…"
General Information • "We are also at the end of first quarter..."
Directions (non-content-related "how-to" information) • "No, you may *not* call me Annie! Geesh! Unless you want me to call you Ricki, Mikey, and Nannie? :)"
Personal Information
• "I am in my office M-F 8-5 EST and am reachable there directly by phone or e-mail during those times."
Idea/Opinion/Emotion • "Great flick, even if the 'irresponsible scientist wreaking havoc with nature regardless of consequences' attitude kinda puts my trews in a bunch...Oh well. Its fun."
Resource (book, video, or other resource information) • "I would suggest Holt's 1988 edition of Physical Science, ISBN 0-03-014397-7."
Feedback (non-content-related suggestions, evaluations, etc.) • "In regards to your question, we should send extra copies."
Requesting Information
Content • "Can radio waves be diffracted (like light) or put through some kind of electronic "prism" to separate the waves (again, much like light)?"
Procedural Information (content-related "how-to" information) • "How did they take out the protein that the dinosaurs needed to survive and put it in their food to control the dinosaurs?"
General Information • "Let me know if [the messages] come through (that doesn't really make sense does it?).?"
Directions (non-content-related "how-to" information) • "Can we call you Annie?"
Personal Information • "What kind of sports do you play?"
Idea/Opinion/Emotion • "Here are some ideas. What do you think?"
Resource (book, video, or other resource information) • "Do you have any books that would help?"
Feedback (non-content-related suggestions, evaluations, etc.) • "If the formatting of this text entry needs adjustment, please let me know."
Other Functions
Salutation (greetings and closings, not including signatures) • "Hello Barb -" • "Welcome." • "Hey gang."
Planning (project planning) • "For the first 5 days, I would like to continue exploring light and color, their properties and characteristics, special effects and interrelationships."
Thanking • "Thanks for the great questions!"
Complaining • "The students didn’t like having their spelling and grammar corrected and are unhappy."
Apology • "Please accept my apologies for the delay of this message."
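Returning to the message-handling step described before Table 1: below is a minimal, hypothetical sketch (in Python) of the kind of routing-and-archiving logic the Emissary's Unix scripts are described as performing. The project's actual scripts are not shown here; the addresses, file paths, and local SMTP host are invented for illustration.

```python
#!/usr/bin/env python3
# Hypothetical sketch of routing and archiving an incoming team message:
# the text is copied to the team's chronological archive, basic routing and
# temporal attributes are logged, and the message is forwarded to the other
# team members. Not the Emissary's actual scripts; all names are invented.
import email
import smtplib
import sys
import time


def handle_incoming(raw_text, team_address, other_members, archive_path):
    msg = email.message_from_string(raw_text)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")

    # 1. Archive a copy, kept in chronological order within the team's log.
    with open(archive_path, "a") as log:
        log.write(f"--- archived {stamp} ---\n{raw_text}\n")

    # 2. Record basic routing and temporal attributes for later analysis.
    with open(archive_path + ".index", "a") as index:
        index.write(f"{stamp}\t{msg.get('From')}\t{msg.get('Subject')}\n")

    # 3. Forward the message to the mailboxes of the other team members.
    with smtplib.SMTP("localhost") as smtp:
        smtp.sendmail(team_address, other_members, raw_text)


if __name__ == "__main__":
    # In a mail-alias setup, the incoming message would arrive on stdin.
    handle_incoming(sys.stdin.read(), "team1@emissary.example",
                    ["teacher@example.org", "expert@example.org"],
                    "team1.log")
```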
An initial set of 17 messages was selected for independent analysis by two researchers, and the results of the analyses were compared. Two additional message sets containing 25 messages each were subsequently checked, and coding decisions discussed, to discover and increase intercoder agreement, assuming that this would assure the trustworthiness (Lincoln & Guba, 1985) of the study's results. The message
function categories were revised and agreed upon during this preliminary stage of analysis. Intercoder agreement on independently coded, randomly selected, common segments was first measured at 70%, but with continued peer consultation between the two researchers, rose to 83%. Then the entire message bank (320 messages) was analyzed by one of the researchers, using the message flow and function categories that had been agreed upon during the first three data reviews.

Once these totals were calculated, however, we began to question whether the traditional method for “ensuring” accurate assignment of data labels to data portions (intercoder agreement) was sufficient. Silverman (1993) suggests that when analyzing texts in qualitative research, classification categories “should be used in a standardized way, so that any researcher would categorize in the same way” (p. 148). In the interpretation of medical interviews, Clavarino, Najman, & Silverman (1995) suggest that the reasons for coding discrepancies be explored, and that perhaps a researcher external to the project be called in to determine what constitutes the most appropriate analysis of a data set. We asked ourselves: had we met Silverman’s standard of procedural uniformity? Was what he suggests even possible for our study? More importantly, could we trust the frequencies that resulted from that first analysis, and what their patterns might imply about telementoring in the context of the Electronic Emissary project, if only one researcher did most of the data coding? We also wondered whether using a complete electronic mail message as the coding unit, to which multiple message functions could be assigned, would produce different resulting data patterns than if sentences or sentence fragments were used as “coding chunks.”

Data analysis, once again. Two new researchers joined the project, interested in exploring the roles and communication patterns of online facilitators in Emissary-supported electronic teams. Based on our concerns about data coding by “lone
researchers,” they developed what they believed to be a more thorough method for data analysis. Using the sentence as the coding unit instead of the message, these researchers independently assigned message function(s) from the list of 21 (see Table 1) to each sentence or fragment, then met to compare their labels. When they disagreed about a particular label, they discussed their differing interpretations until both could conscientiously reach an accord. During this collaborative coding/peer debriefing process, they further refined and delineated the conceptual boundaries and implications of each message function category, generating one new addition to the function list (“Update”). They patiently recoded the data set previously analyzed by the “lone researcher” who had established intercoder agreement, as explained earlier in this article. Tables 2 and 3 show the results of both message function analyses. Results printed in boldface are those for which there was at least a 10% difference between coding results. First, however, the sketch below illustrates the label-comparison step that drove the debriefing discussions.
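The sketch uses invented sentence identifiers and labels (not Emissary data) to show how two coders' independent, sentence-level assignments might be lined up so that each disagreement becomes an agenda item for the consensus discussion described above.

```python
# Hypothetical comparison step for collaborative coding: line up two coders'
# sentence-level label sets and flag every disagreement for joint discussion.
# Sentence identifiers and labels below are invented, not Emissary data.

coder_1 = {  # sentence id -> set of message functions assigned
    "s1": {"Salutation"},
    "s2": {"Requesting: Content"},
    "s3": {"Reporting: Idea/Opinion/Emotion"},
}
coder_2 = {
    "s1": {"Salutation"},
    "s2": {"Requesting: Content", "Requesting: Resource"},
    "s3": {"Reporting: Personal Information"},
}

# Sentences on which the two label sets differ become debriefing agenda items.
to_debrief = {sid: (coder_1[sid], coder_2[sid])
              for sid in coder_1
              if coder_1[sid] != coder_2[sid]}

for sid, (labels_1, labels_2) in sorted(to_debrief.items()):
    print(f"{sid}: coder 1 {sorted(labels_1)} vs. coder 2 {sorted(labels_2)}")
# The consensus label set negotiated for each flagged sentence (here s2 and
# s3) is what enters the final frequency counts.
```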
Table 2
Frequencies of Message Functions by General Classification

                      Number of Occurrences           Percent of Total Occurrences
Function Class        Analysis #1    Analysis #2      Analysis #1    Analysis #2
Reporting             291            794              41%            56%
Requesting            158            216              22%            15%
Other                 254            406              36%            29%
Further breakdown of these figures by specific message function is displayed in Table 3.
Table 3
Frequencies of Message Functions by Specific Classification

Reporting Information

                            Percent of Total Occurrences
Function Class              Analysis #1    Analysis #2
Content                     12%            44%
Directions                  12%            02%
Feedback                    02%            N/A
General Information         20%            34%
Idea/Opinion/Emotion        23%            62%
Personal Information        24%            26%
Procedure                   03%            05%
Resources                   03%            15%
Updates                     N/A            08%
Table 3, continued

Requesting Information

                            Percent of Total Occurrences
Function Class              Analysis #1    Analysis #2
Content                     34%            29%
Directions                  22%            03%
Feedback                    09%            N/A
General Information         10%            08%
Idea/Opinion/Emotion        01%            22%
Personal Information        13%            06%
Procedural                  07%            02%
Resources                   03%            04%
Updates                     N/A            02%
Table 3, continued

Other Functions

                            Percent of Total Occurrences
Function Class              Analysis #1    Analysis #2
Apology                     08%            09%
Complaining                 02%            02%
Planning                    22%            39%
Salutation                  50%            62%
Thanking                    18%            27%
Clearly, the coding results for the two different methods of analysis are quite different, especially with reference to the reporting of information in electronic mail messages. How could the discrepancies be resolved, and indeed, should they be? One way might be to ask the messages’ authors to collaborate with us in data coding, by telling us post hoc what their internally-held purposes were in sending each message. The logistical implications for that method, especially after significant passage of time, are overwhelming. We also considered whether it would have been better to have asked participants to contribute running commentaries on their purposes for electronic communications as they were engaged in telementoring, but realized that that would have been unnecessarily distracting to the learning process, and probably would have inhibited the nature of their communication. Obviously, a more complete and accurate picture of what was happening among the members of each of these electronic teams would have incorporated analysis of many different types of data (e.g., interview, observation, electronic mail logs), involving the informants as co-creators of the study’s results. Most of the research projects associated with the Emissary do this. Yet, our interest in this particular instance was in what the messages themselves were telling us, if anything.
A Better Match, not a Better Method*

Our methodological decisions, therefore, were based more upon notions of document analysis within an interpretive framework than they were reflective of traditional assumptions about intercoder agreement. Hodder (1994) refers to document analysis as “interpretation of mute evidence” which “endures physically and...can be separated across space and time from its author, producer, or user,” and which has to be interpreted, therefore, “without the benefits of indigenous commentary” (p. 393). Such was the case with our examination of electronic mail communication among students, teachers, and subject matter experts, for reasons described earlier. But which set of message function results, we wondered, was the “best”? Hodder (1994) reminds us that

    ...meaning does not reside in a text but in the writing and reading of it. As the text is reread in different contexts it is given new meanings, often contradictory and always socially embedded. Thus there is no “original” or “true” meaning of a text outside specific historical contexts...There is often a tension between the concrete nature of the written word, its enduring nature, and the continuous potential for rereading meanings in new contexts, undermining the authority of the word. Text and context are in a continual state of tension, each defining and redefining the other, saying and doing things differently through time. (p. 394)

Within a nonpositivistic research paradigm, perhaps both sets of results - two different patterns of meaning created in alternate contexts for analysis - are acceptable. Our decision between the two, then, ultimately had more to do with a match between method and study design (Denzin & Lincoln, 1994) than with an overarching judgment about the efficacy of a specific data analysis procedure.

Current analyses of message functions for data generated by Electronic Emissary project participants, therefore, are being done collaboratively, as was piloted in the coding that led to the second set of results shown in Tables 2 and 3 above. As a research team working on related, but differing, projects, we feel that the credibility (Erlandson, et al., 1993) of our results is significantly enhanced when we extend in-depth peer debriefing to the data coding process. The meanings of the texts that we are analyzing cannot be co-constructed with the documents’ authors, but the documents’ apparent purposes can be co-constructed within the context of a
collaborative research group. We offer this methodological suggestion to other teams engaged in document-based inquiry, with hopes that a call for and example of congruence among research paradigm, perspective, strategy for inquiry, and methods will help investigators to think beyond the notion of “accuracy” in interpretive data coding.
*Thanks go to Bonnie Elliott for creating this phrase.
References

Bogdan, R. C., & Biklen, S. K. (1992). Qualitative research for education: An introduction to theory and methods. Boston: Allyn and Bacon.

Clavarino, A. M., Najman, J. M., & Silverman, D. (1995). The quality of qualitative data: Two strategies for analyzing medical interviews. Qualitative Inquiry, 1, 223-242.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.

Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.

Denzin, N. K., & Lincoln, Y. S. (1994). Introduction: Entering the field of qualitative research. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 1-17). Thousand Oaks, CA: SAGE Publications.

Erlandson, D. A., Harris, E. L., Skipper, B. L., & Allen, S. D. (1993). Doing naturalistic inquiry. Newbury Park, CA: SAGE Publications.

Frick, T., & Semmel, M. I. (1978). Observer agreement and reliabilities of classroom observational measures. Review of Educational Research, 48, 157-184.

Goodwin, L. D., & Goodwin, W. L. (1984). Are validity and reliability "relevant" in qualitative evaluation research? Evaluation & the Health Professions, 7, 413-426.

Guba, E. G., & Lincoln, Y. S. (1994). Competing paradigms in qualitative research. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 105-117). Thousand Oaks, CA: SAGE Publications.
Hodder, I. (1994). The interpretation of documents and material culture. In N. K. Denzin & Y. S. Lincoln (Eds.), Handbook of qualitative research (pp. 393-402). Thousand Oaks, CA: SAGE Publications.

Jones, G., & Harris, J. (1995). The Electronic Emissary: Design and initial trials. In J. Willis, B. Robin, & D. A. Willis (Eds.), Technology and teacher education annual, 1995 (pp. 672-676). Charlottesville, VA: Association for the Advancement of Computing in Education.

Kirk, J., & Miller, M. L. (1986). Reliability and validity in qualitative research. Beverly Hills, CA: SAGE Publications.

Koslowsky, M., & Bailit, H. (1975). A measure of reliability using qualitative data. Educational and Psychological Measurement, 35, 843-846.

Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Beverly Hills, CA: SAGE Publications.

Marshall, C., & Rossman, G. B. (1995). Designing qualitative research (2nd ed.). Thousand Oaks, CA: SAGE Publications.

O’Neill, D. K., Wagner, R., & Gomez, L. M. (1996). Online mentors: Experimenting in science class. Educational Leadership, 55 (3), 39-42.

Perreault, W. D., & Leigh, L. E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26 (2), 135-148.

Potter, W. J. (1996). An analysis of thinking and research about qualitative methods. Mahwah, NJ: Lawrence Erlbaum Associates, Publishers.

Rueda, R. S. (1992). Characteristics of teacher-student discourse in computer-based dialogue journals: A descriptive study. Learning Disability Quarterly, 15, 187-206.
Schwandt, T. A. (1997). Qualitative inquiry: A dictionary of terms. Thousand Oaks, CA: SAGE Publications.

Silverman, D. (1993). Interpreting qualitative data: Methods for analysing talk, text and interaction. Thousand Oaks, CA: SAGE Publications.

Sykes, W. (1990). Validity and reliability in qualitative market research: A review of the literature. Journal of the Market Research Society, 32, 289-328.

Uebersax, J. S. (1988). Validity inferences from interobserver agreement. Psychological Bulletin, 104, 405-416.