Copyright 2000 by the American Psychological Association, Inc. 1082-989X/00/$5.00 DOI: 10.1037//1082-989X.5.1.64

Rater Bias in Psychological Research: When Is It a Problem and What Can We Do About It?

William T. Hoyt
Iowa State University

Rater bias is a substantial source of error in psychological research. Bias distorts observed effect sizes beyond the expected level of attenuation due to intrarater error, and the impact of bias is not accurately estimated using conventional methods of correction for attenuation. Using a model based on multivariate generalizability theory, this article illustrates how bias affects research results. The model identifies 4 types of bias that may affect findings in research using observer ratings, including the biases traditionally termed leniency and halo errors. The impact of bias depends on which of 4 classes of rating design is used, and formulas are derived for correcting observed effect sizes for attenuation (due to bias variance) and inflation (due to bias covariance) in each of these classes. The rater bias model suggests procedures for researchers seeking to minimize adverse impact of bias on study findings.

Almost inevitably, research questions involving individual differences (other than readily observable differences such as sex and race) require the use of rating measures to quantify the person variables of interest. It is well known (e.g., Saal, Downey, & Lahey, 1980) that ratings are subject to various forms of bias. Raters may interpret scale items differently or have unique reactions to particular targets, so that the obtained ratings reflect characteristics of the raters to some extent, in addition to reflecting the target characteristics that are of interest. However, the impact of rater bias on study results depends on numerous factors, including the nature of the construct being rated, the extent to which raters are trained to interpret the meaning of these behaviors similarly, and several features of the rating design. Because of this complexity, it may be difficult for researchers to know whether rater bias is likely to be a problem in their data sets or how to consider the likely impact of this source of error on their findings.

This article is intended as a primer on the subject of rater bias. After a brief review of previous theoretical and empirical work, I identify four types of bias most likely to affect psychological research using ratings and introduce both univariate and bivariate models based on generalizability theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972) for estimating the magnitude of bias variance and covariance in ratings. The impact of bias varies as a function of how raters are assigned to both targets and rating dimensions. I discuss the nature of bias in four common classes of rating design, first in schematic form and then quantitatively, and present formulas for correcting observed effect sizes for distortion due to bias variance and covariance. I discuss the implications of the rater bias model for estimating the statistical power of ratings-based research and for interpreting results of studies using ratings, and I conclude with recommendations for minimizing the adverse impact of rater bias on psychological research.

In this article, I use the terms rater and observer interchangeably to refer to the source of rating data. I refer to the object of measurement, sometimes called the ratee in previous theoretical work on ratings, as the target; I do so to encompass ratings of nonperson as well as person targets. The approaches I present are relevant to the study of bias in continuous, but not categorical, rating scales.

William T. Hoyt, Department of Psychology, Iowa State University. I thank Brad Bushman, Carolyn Cutrona, Rick Gibbons, David Lubinski, and Mike McCullough for their comments on previous versions of this article. Correspondence concerning this article should be addressed to William T. Hoyt, who is now at the Department of Counseling Psychology, University of Wisconsin, 321 Education Building, 1000 Bascom Mall, Madison, Wisconsin 53706-1398. Electronic mail may be sent to wthoyt@education.wisc.edu.


What Is Rater Bias?


Rater bias refers to disagreements among raters due to either (a) their differential interpretations of the rating scale or (b) their unique (and divergent) perceptions of individual targets. Ratings of the same target by different observers may diverge for a number of reasons, including differential opportunities to observe target behaviors, disagreement about which target cues to attend to in formulating ratings, and disagreement about how to apply levels of the rating scale to the same observed target cues (Kenny, 1991). Rater bias is a form of method variance (Campbell & Fiske, 1959) and contributes systematic variance in observed scores that is not due to the target. Although method variance can be the focus of some types of investigations (Cronbach, 1995; Kenny, 1994), for most users of ratings, rater bias is of no substantive interest and therefore contributes to error of measurement.

Types of Bias

When multiple observers rate multiple targets on some attribute of interest, bias in ratings can affect (a) the mean of the obtained ratings, (b) the variance of the obtained ratings, or (c) the covariance of the obtained ratings with ratings of another attribute by the same observers. Grade inflation is an example of a form of bias that distorts the mean of obtained ratings. If all raters (teachers) interpret C as a failing grade, despite the fact that according to the rating (grading) scale, C means satisfactory or average, then obtained ratings may reflect an implausible reality in which, like Garrison Keillor's Lake Wobegon, all the children are above average. Grade inflation does not necessarily compromise one important function of grades, which is to distinguish among the academic standings of the targets (students) being rated. Indeed, if the bias is constant across all teachers (and if the scale has enough gradations to accommodate distinctions among pupils in the new, restricted range), then the mean rating will be affected without altering the rank ordering of students that would be obtained in the absence of this constant bias. For this reason, such constant biases are of limited interest in many research applications and will not be considered in this article. More problematic is the possibility that raters have different biases, in which case variance in obtained ratings includes variance due to raters as well as targets, and the true rank ordering among targets is thereby obscured.
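The contrast between a constant bias, which shifts only the mean, and rater-specific biases, which add variance, is easy to demonstrate in a small simulation. The sketch below is illustrative only and is not drawn from this article; the sample sizes, the variances, and the assumption that each student is graded by a single teacher are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n_students, n_teachers = 200, 20
    true_ability = rng.normal(0, 1, n_students)

    # Each student is graded by one teacher, as with course grades (nested design).
    teacher = rng.integers(0, n_teachers, n_students)

    # Case 1: constant bias -- every teacher inflates grades by the same half point.
    grades_constant = true_ability + 0.5 + rng.normal(0, 0.3, n_students)

    # Case 2: rater-specific bias -- teachers differ in leniency, so sigma^2(r) > 0.
    leniency = rng.normal(0, 0.8, n_teachers)
    grades_lenient = true_ability + leniency[teacher] + rng.normal(0, 0.3, n_students)

    # Constant bias shifts the mean but preserves the rank ordering of students;
    # rater-specific bias adds teacher variance that degrades that ordering.
    print("constant bias:       r(true, grade) =",
          round(float(np.corrcoef(true_ability, grades_constant)[0, 1]), 3))
    print("rater-specific bias: r(true, grade) =",
          round(float(np.corrcoef(true_ability, grades_lenient)[0, 1]), 3))

With these values, the first correlation stays near its error-only ceiling, whereas the second drops noticeably: when students are compared across teachers, variance in teacher leniency behaves exactly like added measurement error.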

Biases affecting the variance of the obtained ratings may be rater-specific or dyad-specific. If some teachers continue to treat C as an average or satisfactory grade, whereas others reinterpret C as failing, there is rater-specific bias in the obtained grades. Comparing grades among students having different instructors is no longer straightforward. In fact, to make accurate comparisons, it is necessary to correct obtained ratings for rater-specific bias (Raymond & Viswesvaran, 1993).

A more complex issue concerns dyad-specific bias. If some or all teachers allow student attributes (e.g., attractiveness, penmanship) that are not related to performance to influence their evaluations, obtained grades will be biased to some extent due to teachers' differing impressions of each student. Like rater-specific bias, dyad-specific bias contributes to variance in obtained ratings, making rank orderings among students suspect. Unlike rater-specific bias, dyad-specific bias is resistant to estimation and correction, because the bias contributed by a given rater varies from one target to the next.

When observer bias contributes to variance in obtained ratings, an additional question concerns the extent to which rater- or dyad-specific variance affects correlations among rated variables. If raters provide data on several variables for each target (as when elementary school teachers rate students on multiple dimensions encompassing academic performance and interpersonal conduct), observed correlations among ratings on these variables may differ from the true (unbiased) correlations because of bias covariance. For example, if some or all teachers grade students they like more favorably on each dimension than those they do not like (dyad-specific bias), correlations among the dimensions will be inflated. Also, when all students are not rated by the same teachers and rater-specific biases (i.e., individual tendencies to be lenient or severe in assigning ratings) are consistent across rating dimensions, correlations among the dimensions will be inflated.
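The inflation mechanism can be made concrete with another small simulation. In the hypothetical sketch below, the two rated dimensions are truly uncorrelated, but a shared dyad-specific bias (how much the teacher happens to like each student) enters both ratings, so the observed correlation is inflated well above zero. The variable names and variances are illustrative assumptions, not values from the article.

    import numpy as np

    rng = np.random.default_rng(1)
    n_dyads = 500  # teacher-student (rater-target) pairs

    # True (unbiased) standings on two dimensions, X and Y, uncorrelated here.
    true_x = rng.normal(0, 1, n_dyads)
    true_y = rng.normal(0, 1, n_dyads)

    # Dyad-specific bias (liking) that colors BOTH ratings of a given student,
    # producing positive bias covariance sigma(dX, dY).
    liking = rng.normal(0, 0.8, n_dyads)

    rated_x = true_x + liking + rng.normal(0, 0.3, n_dyads)
    rated_y = true_y + liking + rng.normal(0, 0.3, n_dyads)

    print("true r(X, Y):    ", round(float(np.corrcoef(true_x, true_y)[0, 1]), 3))
    print("observed r(X, Y):", round(float(np.corrcoef(rated_x, rated_y)[0, 1]), 3))

With these (arbitrary) variances, the observed correlation comes out close to .4 even though the true correlation is essentially zero, entirely because the liking component appears in both ratings.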

Terminology

Historically, research on rater bias has focused on two types of errors, called leniency (or severity) errors and halo errors. Starting with Kneeland (1929), most investigators claiming to examine leniency error have focused on bias affecting the mean rating but ignored individual differences in leniency among raters that would affect rating variance (Saal et al., 1980). However, some investigators (e.g., Ford, 1931; J. S. Kane, Bernardin, Villanova, & Peyrefitte, 1995) have operationalized leniency or severity errors in terms of rater differences that are consistent across targets, similar to the rater-specific bias described above. Halo error (Thorndike, 1920) is reflected in artificially high correlations among variables rated by the same observer and is usually attributed to the influence that observers' general impressions of targets have on their ratings of specific attribute dimensions (Lance, LaPointe, & Fisicaro, 1994). Table 1 illustrates the four types of rater bias considered in this article and shows how these variances and covariances relate to traditional terminology in the rater bias literature.
As noted above, when observers rate only a single target attribute, researchers need to be concerned about the extent to which observed variance in ratings is due to bias. Bias variance may be introduced either as a consequence of raters' different interpretations of the rating scale (rater-specific biases) or their unique perceptions of particular targets (dyad-specific biases). Studies of leniency error have sometimes, but by no means always, assessed rater-specific variance (Saal et al., 1980). However, with the exception of generalizability studies of observer ratings (Hoyt & Kerns, 1999), dyad-specific variance in ratings has not been studied by researchers interested in rater bias. To conserve space and to relate components of bias to the generalizability models presented in the next section, I refer to rater-specific bias variance (or rater variance) as σ²(r) and dyad-specific bias variance (or dyadic variance) as σ²(d).

When observers rate multiple target attributes and the investigator is interested in relations (i.e., correlations) among these attributes, both bias variance and bias covariance are a concern. Bias covariance can arise as a result of rater-specific biases that are correlated for the two attributes (e.g., observers who are lenient in rating attribute X are also likely to be lenient in rating attribute Y) or of dyad-specific biases that are correlated for the two attributes (e.g., an observer who has a uniquely favorable impression of a given target on X is also likely to have a uniquely favorable impression of this target on Y). Rater-specific covariance [rater covariance, or σ(rX, rY)] has sparked little interest among researchers interested in rater bias.¹ Dyad-specific covariance [dyadic covariance, or σ(dX, dY)], on the other hand, has been of considerable interest. Halo error is conventionally conceptualized as positive correlations in raters' unique perceptions of a given target on multiple dimensions (i.e., as dyadic covariance), although most studies claiming to assess halo error have not been successful in isolating the illusory or bias-based component of correlations between variables (Cooper, 1981; Murphy, Jako, & Anhalt, 1993).
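To see how these components combine, it helps to write the covariance for a single pair of ratings in this notation. The decomposition below is a sketch consistent with the definitions above, not a formula quoted from the article; it assumes one rating of target t by rater r on each attribute, with target, rater, and dyadic effects mutually independent across components:

    σ(X, Y) = σ(tX, tY) + σ(rX, rY) + σ(dX, dY).

Only the first term, the target covariance, reflects the true association of interest; the second and third terms are the rater and dyadic bias covariances of Table 1. Which of the two bias terms actually distorts an observed correlation depends on the rating design, a point developed later in the article.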
Rater Bias—Univariate Model

Generalizability theory (Cronbach et al., 1972) is a useful analytical framework for studying rater bias (Hoyt & Melby, 1999). Like the analysis of variance (ANOVA) approach developed by Guilford (1954), generalizability theory allows for the simultaneous examination of the impact of multiple sources of error (and their interactions) on rating data. Because generalizability analyses focus on estimation of variance accounted for by effects in the model (rather than testing those effects for statistical significance), they lend themselves to psychometric interpretations, yielding useful information about the relative importance of various sources of error and their impact on rating quality.

As Table 1 indicates, when considering bias in ratings of a single variable, both rater variance and dyadic variance are possible sources of error. This suggests the following component model of ratings
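The following is a reconstruction of that component model in the notation above, consistent with standard two-way generalizability models rather than a verbatim quotation. A rating of target t by rater r can be written as

    X(t, r) = μ + t + r + d,

where μ is the grand mean, t is the target effect (the signal of interest), r is the rater effect (rater-specific bias), and d is the dyadic effect, with which random measurement error is confounded when there is a single rating per target-rater pair. Observed-score variance then decomposes as

    σ²(X) = σ²(t) + σ²(r) + σ²(d).

Under these assumptions, the components for a fully crossed design (all raters rate all targets) can be estimated from the usual two-way random-effects ANOVA mean squares. A minimal sketch; the function name and data are illustrative, not from the article:

    import numpy as np

    def variance_components(X):
        # X: targets x raters matrix, one rating per cell (fully crossed design).
        n_t, n_r = X.shape
        grand = X.mean()
        t_means = X.mean(axis=1)  # target (row) means
        r_means = X.mean(axis=0)  # rater (column) means
        ms_t = n_r * np.sum((t_means - grand) ** 2) / (n_t - 1)  # targets mean square
        ms_r = n_t * np.sum((r_means - grand) ** 2) / (n_r - 1)  # raters mean square
        resid = X - t_means[:, None] - r_means[None, :] + grand  # interaction residuals
        ms_d = np.sum(resid ** 2) / ((n_t - 1) * (n_r - 1))      # residual mean square
        # Expected-mean-square solutions for the two-way random-effects model
        # (estimates can be negative in small samples; often truncated at zero):
        return {"sigma2_t": (ms_t - ms_d) / n_r,  # target variance
                "sigma2_r": (ms_r - ms_d) / n_t,  # rater variance
                "sigma2_d": ms_d}                 # dyadic variance (with error)

    # Example: 20 targets rated by 5 raters on one attribute.
    rng = np.random.default_rng(0)
    print(variance_components(rng.normal(size=(20, 5))))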