EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT

1980, 40

A MICROCOMPUTER BASIC PROGRAM TO CALCULATE THE LEVEL OF AGREEMENT BETWEEN TWO RATERS USING NOMINAL SCALE CLASSIFICATION

LEON D. LARIMER
North Central Kansas Guidance Center
Manhattan, Kansas

MARLEY W. WATKINS
Deer Valley School District
Phoenix, Arizona

A simple percentage measure of agreement among raters using nominal scales may provide misleading reliability information. Scott's Pi and Cohen's Kappa are two chance-corrected statistics which have been widely utilized in assessing interrater agreement. This paper presents a BASIC microcomputer program which calculates these two chance-corrected measures of interrater reliability.

A large proportion of psychological research is characterized by the absence of a well-defined standard against which raters may check the accuracy of their judgments. Under these conditions, the most frequent practice for determining rater accuracy in nominal scale classification is to assess the percentage of concurrent agreement between two independent observers. It has been cogently argued by Hartmann (1977) and Costello (1973), however, that a simple percentage measure of agreement among raters using nominal scales may provide insufficient or misleading information. Chance-corrected solutions to this nominal scale classification problem have been proposed by Scott (1955) and Cohen (1960). Both solutions are defined by the formula (Po - Pc)/(1 - Pc), where Po represents the observed proportion of interrater agreement and Pc the probability of interrater agreement attributable to chance factors alone. In using either statistic, one must satisfy several assumptions: first, the categories used must be mutually exclusive, exhaustive of possible alternatives, and nominal; second,
the cases classified need to be independent; and third, the raters must operate independently in the classification. The utility of a chance-corrected agreement statistic has been demonstrated by Spitzer and Fleiss (1974) in a re-analysis of a number of psychiatric diagnostic studies and by Watkins (1979) in a re-analysis of manuscript reviewer data. Scott's Pi (Scott, 1955) and Cohen's Kappa (Cohen, 1960), although differing somewhat in the manner in which the expected frequencies (Pc) are computed, are both widely used (Krippendorff, 1970). Consequently, several computer programs which calculate Pi (Thornton and Croskey, 1975) and Kappa (Berk and Campbell, 1976; Antonak, 1977) have been published. As these programs are available only in FORTRAN, however, their use is restricted almost exclusively to large mainframe computers.
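To make the distinction concrete, the two chance terms can be written out explicitly. In the notation below (ours, not the authors'), p_ii is the proportion of cases both raters assign to category i, p_i+ the proportion the first rater assigns to category i, and p_+i the proportion the second rater assigns to it:

    P_o = \sum_i p_{ii}, \qquad
    P_c^{(\kappa)} = \sum_i p_{i+} \, p_{+i}, \qquad
    P_c^{(\pi)} = \sum_i \left( \frac{p_{i+} + p_{+i}}{2} \right)^{2}

Thus Kappa bases its chance term on each rater's own marginal distribution, while Pi pools the two raters' marginals before squaring; substituting either Pc into (Po - Pc)/(1 - Pc) yields the corresponding statistic.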

Purpose

The Scott (1955) and Cohen (1960) agreement statistics are useful in a variety of behavioral (Hartmann, 1977) and therapeutic (Flanders, 1967) settings. Programs for their mechanized computation exist for computers which utilize the FORTRAN language. The purpose of the present paper was to present a computer program which makes Pi and Kappa more accessible to behavioral scientists working in environments that do not afford access to a mainframe computer.

General Description

The program is an interactive one written in Applesoft BASIC for the Apple II microcomputer. Residing in 3.2K of RAM, it will accommodate 20 variables for each 16K of user RAM. Fully documented, with variables in mnemonic form, it should be easily adapted to other popular microcomputers. Program input consists of the number of categories used by the raters and the cross-tabulation matrix of ratings. Output consists of Pi, the standard error of Pi, Kappa, the standard error of Kappa, the critical value of Z (Light, 1973) for Kappa, and a statement of the normal curve area (Coons, 1978) for Kappa.
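As an illustration of the computation just described, the following minimal Applesoft-style BASIC sketch reads a K x K cross-tabulation matrix and prints Pi and Kappa. It is not the authors' program listing: the variable names are ours, and the standard errors, critical value of Z, and normal curve area reported by the actual program are omitted.

    100 REM  SKETCH: PI AND KAPPA FROM A K X K CROSS-TABULATION
    110 INPUT "NUMBER OF CATEGORIES? "; K
    120 DIM N(K,K), R(K), C(K)
    130 T = 0
    140 FOR I = 1 TO K : FOR J = 1 TO K
    150 PRINT "COUNT FOR ROW "; I; ", COLUMN "; J;
    160 INPUT N(I,J)
    170 R(I) = R(I) + N(I,J) : C(J) = C(J) + N(I,J) : T = T + N(I,J)
    180 NEXT J : NEXT I
    190 PO = 0 : PK = 0 : PS = 0
    200 REM  PO = OBSERVED AGREEMENT; PK, PS = CHANCE TERMS FOR KAPPA AND PI
    210 FOR I = 1 TO K
    220 PO = PO + N(I,I) / T
    230 PK = PK + R(I) * C(I) / (T * T)
    240 PS = PS + ((R(I) + C(I)) / (2 * T)) ^ 2
    250 NEXT I
    260 PRINT "SCOTT'S PI    = "; (PO - PS) / (1 - PS)
    270 PRINT "COHEN'S KAPPA = "; (PO - PK) / (1 - PK)
    280 END

Because the pooled-marginal chance term (PS) can never be smaller than the product-of-marginals term (PK), Kappa computed in this way is in general at least as large as Pi; the two coincide when the raters' marginal distributions are identical.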
Availability
A listing of the program, a copy of this paper, and a complete set of sample input and output are available, without charge, from Dr. Leon D. Larimer, North Central Kansas Guidance Center, 320 Sunset, Manhattan, Kansas 66502.

REFERENCES

Antonak, R. F. A computer program to compute measures of response agreement for nominal scale data obtained from two judges. Behavior Research Methods and Instrumentation, 1977, 9, 553.

Berk, R. A. and Campbell, K. L. A FORTRAN program for Cohen's kappa coefficient of observer agreement. Behavior Research Methods and Instrumentation, 1976, 8, 396.

Cohen, J. A coefficient of agreement for nominal scales. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1960, 20, 37-46.

Coons, D. F. A concise method for computing normal curve areas using a calculator. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1978, 38, 653-655.

Costello, A. J. The reliability of direct observations. British Psychological Society Bulletin, 1973, 26, 105-108.

Flanders, N. A. The problems of observer training and reliability. In E. J. Amidon and J. B. Hough (Eds.), Interaction Analysis: Theory, Research, and Application. Reading, Mass.: Addison-Wesley, 1967.

Hartmann, D. P. Considerations in the choice of interobserver reliability estimates. Journal of Applied Behavior Analysis, 1977, 10, 103-116.

Krippendorff, K. Bivariate agreement coefficients for reliability of data. In E. F. Borgatta and G. W. Bohrnstedt (Eds.), Sociological Methodology. San Francisco: Jossey-Bass, 1970.

Light, R. J. Issues in the analysis of qualitative data. In R. M. W. Travers (Ed.), Second Handbook of Research on Teaching. Chicago: Rand McNally, 1973.

Scott, W. A. Reliability of content analysis: the case of nominal scale coding. Public Opinion Quarterly, 1955, 19, 321-325.

Spitzer, R. L. and Fleiss, J. L. A re-analysis of the reliability of psychiatric diagnosis. British Journal of Psychiatry, 1974, 125, 341-347.

Thornton, B. W. and Croskey, F. L. A computer program for calculating an index of interobserver reliability from time-series data. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1975, 35, 735-737.

Watkins, M. W. Chance and interrater agreement on manuscripts. American Psychologist, 1979, 34, 796-798.