Detecting Halo Effects in Performance-Based Examinations

Timo M. Bechger, Cito, Arnhem
Gunter Maris, Cito, Arnhem & University of Amsterdam
Ya Ping Hsiao, Cito, Arnhem

Abstract

The main purpose of this paper is to demonstrate how halo-effects may be detected and quantified using two independent ratings of the same person. A practical illustration is given to show how halo-effects can be avoided.

Keywords: rated data, halo-effect, performance-based testing, language testing, classical test theory.

Introduction

So-called productive abilities (e.g., speaking, writing) that require active behavior of examinees are usually measured via human judgement. That is, examinees demonstrate their ability on a number of assignments or exercises, and experts assess the quality of each response. This simple fact gives rise to a myriad of complications, the most conspicuous one being that judges will usually disagree. In this paper, we focus on the halo-effect, which occurs when the judgement of one rated characteristic influences judgements of other characteristics in a positive or negative direction (e.g., McDonald, 1999, p. 24). Thus, ratings are influenced by former ratings, causing dependencies that cannot be explained by the tendency of examinees to produce responses of similar quality. Our main purpose is to demonstrate how halo-effects may be detected and quantified using an incomplete block design with two independent ratings of each examinee's performance. This design is often used in large-scale examinations when many trained raters are available, but it is not practical to have all raters rate all examinees. Unlike a fully crossed design, where each rater rates every examinee, generalizability theory cannot be used to identify or control halo-effects in this case (Hoyt, 2000).

Correspondence should be addressed to: Timo M. Bechger, Cito, Nieuwe Overstraat 50, NL-6811JB, Arnhem, The Netherlands; e-mail: timo.bechger[at]cito.nl


The outline of this paper is as follows. We will first discuss the halo-effect and define the correlation matrix that is the basis for the analysis. Then, we explain the basic tenets of our approach, followed by an application concerning a large-scale language exam in the Netherlands. Note that our approach is based on well-known concepts from classical test theory (e.g., Lord & Novick, 1968; Steyer & Eid, 1993). It does not require a detailed psychometric model to describe what individual raters do when they pass judgement on the performances of specific individuals. To achieve this aim we make two assumptions: 1) the test can be divided into two parallel test-halves, and 2) raters are assigned randomly to examinees. Following the explanation of our procedure there is a section where we discuss the consequences of violating these assumptions.

If a halo-effect can be detected, the next question is how to deal with it. A number of halo reduction methods have been suggested, but none of them has been proven to be effective under all circumstances (Cooper, 1981). Another possibility is to correct for halo-effects using statistical (e.g., item response theory) modeling. Both strategies are hampered by the lack of a good understanding of the nature of the halo-effect and the mechanisms causing it. In our view, halo-effects are therefore best avoided. The obvious way to do this is to have different raters judge different performances of the same candidate (see also Hoyt, 2000). An illustration is given in a later section, where we describe an experiment in which raters are assigned at random to different combinations of candidates and assignments. The paper ends with a discussion.

Halo-Effects

We assume that the same performance is rated on one or more aspects. For example, a piece of writing is judged on content, spelling, coherence, etc. In the early 20th century it was discovered that associations between ratings of different aspects are often inflated (Wells, 1907; Thorndike, 1920). Thorndike (1920), in particular, characterized the halo effect as “suffusing ratings of special features with a halo belonging to the individual as a whole” (p. 25). The definition of the halo-effect used here is slightly more general in the sense that halo error may also affect ratings of different performances of the same examinee. Furthermore, we recognize that halo-effects may be positive or negative (Murphy, Jako, & Anhalt, 1993). That is, the correlations between the ratings may be increased or decreased: they are simply different from the correlations that one would find without halo-effects.

Halo-effects can occur for many reasons. For example, judges may form a general impression after having seen a few performances, and subsequent judgements may be heavily influenced by this first impression. Some raters may simply stop paying attention to the examinees' performances, while others may (unconsciously) be tempted to make subsequent ratings consistent with earlier ratings. Halo-effects may also be due to contrast-effects; that is, the standard of a previously rated student may influence the ratings (Hales & Tokar, 1975). Unless the data have been collected in a carefully designed experiment, we can only speculate about the nature and causes of halo-effects, for little is known about the complex cognitive processes of human scoring (e.g., Lumley, 2005; Vaughan, 1991). This is why we believe that it is useful to have a method to detect halo-effects that does not require detailed assumptions about rater behavior. Such assumptions are quite likely to be wrong, which would lead to wrong conclusions about raters and examinees.


Figure 1. The Assignment of Raters to Examinees in an Incomplete Block Design. When a Rater is Assigned to an Examinee There is a Cross.

Halo-effects imply a decrease in the number of independent opportunities for the candidate to demonstrate his or her proficiency, and they diminish the reliability of the test. At the same time, the correlation between ratings of the same candidate may be increased, which gives the impression that the test is very reliable. In the extreme case, only the first performance is rated and all subsequent ratings are equal to the first, which leads to high correlations between scores on different parts of the examination. However, since only one performance was rated, we should not expect to be able to make very precise statements about the examinee's ability. The observation that a halo-effect may bias an estimate of the reliability of the test inspired the method proposed in this paper. Essentially, we compare two estimates of reliability, only one of which can be affected by halo-effects.

Finally, note that the halo-effect is not synonymous with rater bias. Rater bias refers to a systematic under- or overestimation of the quality of a performance, while the halo-effect refers to (local) dependencies between ratings. Halo-effects do, however, increase the probability that rater biases affect the examination marks. This is clearly demonstrated in the case of contrast-effects (Hales & Tokar, 1975). Thus, halo-effects may harm the validity of the examination. There have, however, also been cases where halo-effects were found to increase the accuracy of the ratings (Murphy et al., 1993).

Rating, Half-Test Correlations, and Random Assignment of Raters

The incomplete block design gives rise to a sparse data set with many missing observations. Raters must be selected in such a way that each of the examinees is assigned two different raters, while at the same time the workload of each rater is controlled. As illustrated in Figure 1, this means that we need to fill a matrix such that there are two crosses in each row and not too many or too few crosses in each column. In this paper, we require that the distribution of the crosses is independent of the person and the identity of the rater. To this end, the assignment of rater pairs to examinees must be random. A brief outline of a selection procedure is given in the Appendix. Figure 2 shows how we construct a complete data set with two independent ratings, R1 and R2, for each examinee. One simply places the available ratings in two columns


Figure 2. From Incomplete Data with Raters to Complete Data with Ratings.

ignoring the fact that the ratings come from different raters. When the ratings are done by a randomly selected pair of raters, each rating represents the average rater. Formally, the ratings are exchangeable, which means that the indices of the rating (1 or 2) may be exchanged freely (De Finetti, 1974). Finally, let the examination be divided into two half-tests. Based upon a single rating, the two half-tests are scored separately for each examinee. When each examinee is rated by two independent raters there are two scores for each half-test, T1 and T2 (see Figure 3). The matrix of correlations between the four ratings is given in Figure 4. This correlation matrix will be the main unit of analysis in the sequel. Note that the correlation matrix is similar to the so-called multitrait-multimethod matrix (Campbell & Fiske, 1959). Here, multiple raters are the multiple methods and halo is a kind of method variance.
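To make the construction concrete, the following Python sketch builds such a four-column data set and its correlation matrix for synthetic data. The data-generating model (a common ability plus an error component shared within each rating, standing in for a possible halo) and all variable names are our own illustrative assumptions, not part of the STEX design.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000                                  # number of examinees
    ability = rng.normal(size=n)              # what the half-tests measure

    def one_rating(halo=0.5):
        # Two half-test scores from one rating of an examinee's performances.
        # `halo` scales an error component shared by both half-tests of the
        # same rating; with halo=0 the judgement errors are independent.
        shared = halo * rng.normal(size=n)
        t1 = ability + shared + rng.normal(scale=0.7, size=n)
        t2 = ability + shared + rng.normal(scale=0.7, size=n)
        return t1, t2

    r1t1, r1t2 = one_rating()                 # first rating (R1)
    r2t1, r2t2 = one_rating()                 # second, independent rating (R2)

    scores = np.column_stack([r1t1, r1t2, r2t1, r2t2])      # columns: R1T1, R1T2, R2T1, R2T2
    print(np.round(np.corrcoef(scores, rowvar=False), 3))   # the matrix of Figure 4

In this simulation the within-rating correlations exceed the between-rating correlations, which is exactly the pattern that a positive halo-effect produces in the matrix of Figure 6.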

Theory

The observation that a halo-effect may bias the estimated reliability of the test inspired the method proposed in this paper. Essentially, we compare two estimates of reliability, only one of which can be affected by halo-effects.

Figure 3. Constructing the Data to Calculate the Correlations.


Figure 4. Correlation Matrix with Two Ratings and Two Test-Halves

          R1T1   R1T2   R2T1   R2T2
R1T1      1
R1T2      ρ21    1
R2T1      ρ31    ρ32    1
R2T2      ρ41    ρ42    ρ43    1

Figure 5. Correlation Matrix with Two Raters and Parallel Test-Halves

          R1T1   R1T2   R2T1   R2T2
R1T1      1
R1T2      ρ21    1
R2T1      ρ3     ρ32    1
R2T2      ρ41    ρ3     ρ43    1

Two Reliabilities

Suppose the examination is divided into two parallel half-tests. Based on the first rating, the two half-tests are scored separately for each examinee and the correlation coefficient ρ21 is computed between these two sets of scores. Since the half-tests are parallel, ρ21 equals the reliability of each half-test (Lord & Novick, 1968, Ch. 2). Using the Spearman-Brown (SB) prophecy formula (Brown, 1910; Spearman, 1910), the reliability of the full-length examination is calculated as:

    ρXX′ = 2ρ21 / (1 + ρ21).    (1)

When each examinee is rated by two independent raters there are two scores for each half-test (see Figure 3). Schematically, the correlation matrix between these scores is given in Figure 5. In Figure 5, ρ31 = ρ42 = ρ3, which is a consequence of having parallel test-halves. There are now four different split-half correlations: ρ21, ρ41, ρ32, and ρ43. The remaining correlation, ρ3, between two ratings of the same test-half is a measure of rater reliability. However, all these correlations would be affected by halo-effects, which complicates their interpretation. Differences between ρ21 and ρ43, or between ρ41 and ρ32, are due to differences between raters. When the two raters that rate each student's product have been chosen at random, there will be no such differences and the correlations show the pattern in Figure 6. Now we are left with two, possibly different, split-half correlations: ρ1 and ρ2. However, differences between ρ1 and ρ2 have a simple interpretation. Specifically, a halo-effect will increase or decrease the dependencies among ratings by the same rater, so that ρ1 will become different


Figure 6. Correlation Matrix with Parallel Test-Halves and Two Exchangeable Raters

          R1T1   R1T2   R2T1   R2T2
R1T1      1
R1T2      ρ1     1
R2T1      ρ3     ρ2     1
R2T2      ρ2     ρ3     ρ1     1

from ρ2 . Hence,

    ρ*XX′ = 2ρ2 / (1 + ρ2)  ≠  ρXX′ = 2ρ1 / (1 + ρ1),    (2)

with equality when there are no halo-effects. Thus, compared to ρ*XX′, ρXX′ may be inflated or deflated due to halo-effects.

An Effect-Size Measure

To quantify the size of a halo-effect we employ the general form of the Spearman-Brown formula:

    ρXX′ = kρ*XX′ / (1 + (k − 1)ρ*XX′),    (3)

where k is the number of times the test would have to be lengthened to raise ρ*XX′ to the value of ρXX′. Solving for k gives:

    k = ρXX′(1 − ρ*XX′) / (ρ*XX′(1 − ρXX′))    (4)

(cf. Lord & Novick, 1968, Theorem 5.12.2). Thus, the reliability of the test is over- (or under-) estimated due to halo-effects, and k expresses this effect in terms of examination length, where examination length is defined in terms of the number of ratings of each examinee. The index k⁻¹ has a more intuitive interpretation. Specifically, it expresses the halo-effect as the proportion of independent ratings in the examination when all ratings are done by the same rater. When k⁻¹ < 1 there is a positive halo-effect: due to halo-effects the number of independent ratings has decreased. When k⁻¹ > 1 the halo-effect is negative: due to halo-effects the number of independent ratings has increased. Note that test reliability is non-linear with respect to test length, and k is larger when the reliability of the test is higher. For example, when reliability is increased from 0.85 to 0.89, k = 1.43 (k⁻¹ = 0.70), while k = 1.18 (k⁻¹ = 0.85) when reliability is increased from 0.55 to 0.59.

Extensions

Extension to more than two sub-tests is straightforward. With more than two sub-tests, ρ1 and ρ2 become sub-matrices of correlations whose entries are all equal when the sub-tests are parallel. With g ≥ 2 exchangeable raters there will be g correlations equal to ρ1, and g − 1 correlations equal to ρ2 or ρ3. Hence, the pattern of the correlation matrix does not change.
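Equations 1-4 are easy to put into a few lines of code. The sketch below is ours (the function names are not from the paper); it reproduces the numerical example given above.

    def spearman_brown(rho, k=2.0):
        # Reliability of a test lengthened k times (Equations 1 and 3).
        return k * rho / (1.0 + (k - 1.0) * rho)

    def halo_k(rho_xx, rho_star):
        # Equation 4: the factor k by which a test of reliability rho_star
        # would have to be lengthened to reach reliability rho_xx.
        return (rho_xx * (1.0 - rho_star)) / (rho_star * (1.0 - rho_xx))

    k = halo_k(0.89, 0.85)
    print(round(k, 2), round(1.0 / k, 2))     # 1.43 and 0.70, as in the text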


Estimation and Testing

The correlations ρi are population quantities. In practice, we estimate ρXX′ and ρ*XX′ using sample correlations, rij. The simplest way to test for the presence of halo-effects is to compare r21 with r32: a halo-effect is present when r21 ≠ r32. These correlations are based on independent data, so that standard tests can be used (e.g., Steiger, 2005, and references therein). Unless the ratings are exchangeable by design, we would first test whether ρ21 equals ρ43, and whether ρ41 equals ρ32. A more efficient test, i.e., a test based on more data, would be to consider the entire pattern in the correlation matrix as a null hypothesis (Steiger, 2005). Steiger's MULTICORR program (Steiger, 1979) can be used for this purpose.¹ When the hypothesis of equal correlations cannot be rejected, we simply average the sample correlations. Hence, ρ1 is estimated as the average of the two within-rating correlations, and ρ2 is estimated as the average of the two between-rating, between-halves correlations. Formally, this gives ordinary least-squares estimates.
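When the pattern of Figure 6 is not rejected, the averaging can be written as a small helper function. The sketch below assumes the sample correlation matrix is ordered as R1T1, R1T2, R2T1, R2T2; the function name is ours.

    import numpy as np

    def average_correlations(R):
        # R is the 4x4 sample correlation matrix ordered R1T1, R1T2, R2T1, R2T2.
        rho1 = (R[0, 1] + R[2, 3]) / 2.0   # within-rating, between-halves
        rho2 = (R[0, 3] + R[1, 2]) / 2.0   # between-ratings, between-halves
        rho3 = (R[0, 2] + R[1, 3]) / 2.0   # between-ratings, same half (rater reliability)
        return rho1, rho2, rho3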

Halo-effects in the State Examination Dutch as a Second Language

The State Examination Dutch as a Second Language (STEX) measures the ability of non-native speakers of Dutch to use and understand Dutch as it is spoken, written, and heard in work and educational settings. The STEX includes separate exams for the productive abilities (speaking and writing) and the receptive abilities (reading and listening). Here, we consider the examination for speaking. More background information on the STEX can be found in Bechger, Kuijper, and Maris (2009).

The examination consists of a number of assignments. Each assignment presents the examinee with a practical situation and he or she responds by speaking aloud. The utterances are recorded and sent to two independent raters for judgement. The raters are chosen from a file of available raters such that no rater is assigned the same examinee twice. Raters are instructed to listen to the performance on each assignment and answer a set of questions concerning different aspects such as tempo, content, or vocabulary. Each rater passes judgement on all performances of an examinee and there is a real risk that halo-effects occur.

To investigate the size of the halo-effect, we took data from the examination administered in July 2006. Parallel test-halves were constructed by randomly assigning assignments to the two half-test forms. The scores were simple sums of the marks. The resulting correlations are in Table 1. The pattern of the correlations in Table 1 strongly suggests that the ratings are exchangeable, and there is no real need in this case for a statistical test. The difference between the estimated ρ1 = 0.85547 and ρ2 = 0.71462 suggests that a positive halo-effect has occurred. It is easily calculated that ρXX′ = 0.9221 and ρ*XX′ = 0.8335. Using Equation 4, it follows that k⁻¹ = 0.42. Due to halo-effects, the examination has become less than half its size. Similar findings were obtained for examinations administered at other dates. Thus, halo-effects occur and are quite large.

¹ At the time of writing, this program can be downloaded without cost from: http://www.statpower.net/page5.html.


Table 1: Correlations July 2006 Examination: n = 1801

          R1T1      R1T2      R2T1      R2T2
R1T1      1
R1T2      0.855178  1
R2T1      0.768225  0.715133  1
R2T2      0.714115  0.769148  0.855767  1
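As a check, the reported quantities follow directly from Table 1 with a few lines of arithmetic:

    rho1 = (0.855178 + 0.855767) / 2          # within-rating correlations
    rho2 = (0.715133 + 0.714115) / 2          # between-rating, between-halves correlations

    rho_xx = 2 * rho1 / (1 + rho1)            # Equation 1
    rho_star = 2 * rho2 / (1 + rho2)          # halo-free counterpart (Equation 2)
    k = rho_xx * (1 - rho_star) / (rho_star * (1 - rho_xx))        # Equation 4
    print(round(rho_xx, 3), round(rho_star, 3), round(1 / k, 2))   # 0.922 0.834 0.42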

Non-parallel Test-Halves

With parallel test-halves and exchangeable raters, we know which correlations to compare to see whether halo-effects have occurred. Without these assumptions, we need more detailed statistical modelling to formulate hypotheses about the pattern of correlations. The assumption that raters are exchangeable can be made true by random assignment. Thus, we focus here on the situation where the test-halves are not parallel.

In practice, tests are divided into test-halves that are “as parallel as possible”. In the application discussed above we divided the test into random halves, but one could also use the matched random subtests method proposed by Gulliksen (1950), using a computer algorithm when necessary (e.g., Sanders & Verschoor, 1998). When the test-halves are not exactly parallel, the SB formula will give a biased estimate of test reliability. For example, when test-halves are random, this leads to an over-estimation of measurement error variance, so that the SB formula will tend to under-estimate the reliability. This will affect both reliabilities, ρXX′ and ρ*XX′, in the same way, however, and k⁻¹ (or k) would still be useful as an index for the size of a halo-effect.

The method described earlier was developed to detect a halo-effect when there is little knowledge of how the halo-effect manifests itself. When in fact there is a hypothesis about how the halo-effect expresses itself, it may be wise not to make the test-halves parallel. Suppose, for example, that we suspect that judges rate two performances and subsequently copy their responses. To test this hypothesis, the test may be divided into two halves, with the first half containing the ratings of the first two performances and the second half the remaining ones. Figure 7 shows a path-diagram which represents the situation where the ratings on the second test-half are (almost) exact copies of the first test-half. The circle labelled F represents the unobserved or latent variable measured by both ratings. The absence of a direct arrow from F to R1T2 and R2T2 means that ratings on the second test-half are independent of F conditional upon the ratings on the first part. Any differences between them are due to random errors, represented by E2. The two paths originating from F are the same and the ratings are exchangeable. However, the test-halves are not parallel (i.e., ρ31 ≠ ρ42) because they have different relations to F.


Figure 7. Path Model Representing the Situation Where the Ratings on the Second Test-Half Depend on the First Test-Half.

A path model can be fitted to the correlations (or covariances) using any program for Structural Equation Modeling (SEM), such as LISREL (Jöreskog & Sörbom, 1993), Mx² (Neale, Boker, Xie, & Maes, 2003), EQS (Bentler, 1995; Byrne, 2006), Mplus (Muthén & Muthén, 1998-2007; Croudace, Dunn, & Pickles, 2009), or Amos (Arbuckle, 2006; Byrne, 2009).

² Now an open-source R package; consult http://www.vcu.edu/mx/
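To make the implied pattern concrete, the sketch below writes out the correlation matrix implied by one reading of the path model in Figure 7, assuming standardized variables with loading a from F to the first-half ratings and path h from a rater's first-half rating to that rater's second-half rating. The parameter values are arbitrary illustrations.

    import numpy as np

    def implied_corr(a, h):
        # Model-implied correlations, ordered R1T1, R1T2, R2T1, R2T2.
        # Note that corr(R1T1, R2T1) = a**2 while corr(R1T2, R2T2) = (a*h)**2,
        # so the two test-halves are not parallel, as stated in the text.
        return np.array([
            [1.0,         h,            a**2,         a**2 * h],
            [h,           1.0,          a**2 * h,     a**2 * h**2],
            [a**2,        a**2 * h,     1.0,          h],
            [a**2 * h,    a**2 * h**2,  h,            1.0],
        ])

    print(np.round(implied_corr(a=0.9, h=0.95), 3))

Fitting such a model to an observed matrix then amounts to estimating a and h and inspecting the residual correlations, which any of the SEM programs listed above can do.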

Random Assignment of Raters to Examinee/Assignment Combinations

An obvious way to eliminate the halo-effect in the STEX is to have different raters judge the performances of an examinee on different assignments. The findings presented in the previous section convinced the leadership of the STEX of the need to bring this into practice, and a small pilot study was organized to see whether this was feasible. For the pilot, 50 examinees were drawn from those who took the July 2006 examination and their performances were re-rated. On this occasion, rater pairs were randomly assigned to combinations of examinees and assignments. Hence, a halo-effect cannot occur because different exercises are rated by different raters. The correlation matrix is given in Table 2.

Averaging the relevant correlations, we find that ρ1 = 0.8068 and ρ2 = 0.8127. Hence, ρXX′ = 0.8931, and ρ*XX′ = 0.8967. In this case, k⁻¹ = 1.04, suggesting at most a small halo-effect. As an aside, we note that the rater reliability ρ3 is higher than the rater reliability for the regular examination (χ² = 8.7583, d.f. = 2, p = 0.0125). This could be due to the stimulating effect of participating in the pilot. In the history of the STEX we have seen this on a number of occasions. Motivated raters do better!


Table 2: Correlations Pilot: n = 50

          R1T1      R1T2      R2T1      R2T2
R1T1      1
R1T2      0.757038  1
R2T1      0.851374  0.82865   1
R2T2      0.796855  0.878345  0.856656  1

In this case, we know that a halo-effect cannot have occurred, so k differs from 1 due to sampling error alone. To gain an impression of the sampling variation of k we used a resampling scheme. Specifically, we randomly switched the first and second rater in each of the rater pairs a number of times and each time recalculated k. The mean estimated k was equal to 1.022 (k⁻¹ = 0.98). The variance due to the sampling of raters was estimated to be 0.1513.
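The resampling scheme is straightforward to implement. The sketch below assumes the scores are stored in an (n, 4) array ordered R1T1, R1T2, R2T1, R2T2; the layout and function names are ours, and the raw pilot data are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(1)

    def k_from_scores(scores):
        # k (Equation 4) computed from an (n, 4) array ordered R1T1, R1T2, R2T1, R2T2.
        R = np.corrcoef(scores, rowvar=False)
        rho1 = (R[0, 1] + R[2, 3]) / 2
        rho2 = (R[0, 3] + R[1, 2]) / 2
        rho_xx = 2 * rho1 / (1 + rho1)
        rho_star = 2 * rho2 / (1 + rho2)
        return rho_xx * (1 - rho_star) / (rho_star * (1 - rho_xx))

    def resample_k(scores, n_rep=1000):
        # Randomly switch the first and second rating of each examinee and recompute k.
        ks = np.empty(n_rep)
        for i in range(n_rep):
            flip = rng.random(len(scores)) < 0.5
            s = scores.copy()
            s[flip] = s[flip][:, [2, 3, 0, 1]]    # swap (R1T1, R1T2) with (R2T1, R2T2)
            ks[i] = k_from_scores(s)
        return ks

The mean and variance of the resampled k values then play the role of the 1.022 and 0.1513 reported above.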

Discussion

We have discussed how a halo-effect can be detected and how its size can be expressed in terms of examination length. It was the following observation that led to this result. On the one hand, one could argue that halo-effects decrease reliability. On the other hand, a halo-effect will in many cases result in an overestimate of the reliability calculated using a split-half method with single ratings. This apparent contradiction is solved if one considers that Equation 1 is based on the assumption that measurement errors are independent, and that this assumption is violated when a halo-effect occurs. When the observed scores are the result of human judgement, measurement errors include variation in the quality of an examinee's performances as well as errors of judgement. When a halo-effect occurs, the latter are positively (or negatively) correlated across different ratings and Equation 1 should no longer be interpreted as a reliability. Note that this problem was recognized long ago in classical test theory. For example, Kelley (1924) and Guilford (1936) warned that items corresponding to a common stimulus should be placed in the same test-half when computing split-half reliabilities.

Our approach does not require a detailed measurement model to detect halo-effects. Basically, the only assumption is that the response variables are random (e.g., Bechger, Maris, Verstralen, & Béguin, 2003). Unidimensionality, for example, has not been assumed. Thus, we have a simple way to see whether halo-effects occur before they are investigated in more detail. It is surprising how little is known about the halo-effect, and it is a good topic for future research.

We have illustrated how CFA may be used to investigate halo-effects when test-halves are not parallel. In fact, any of the CFA models developed for the analysis of multitrait-multimethod problems can be employed (e.g., Bechger & Maris, 2004; Eid, 2000; Goffin & Jackson, 1992; Marsh & Butler, 1984; Solomonson & Lance, 1997). However, the possibilities for modeling are limited when the data are reduced to only four measures. Item response theory (IRT) provides an alternative to CFA. An IRT analysis would


focus on the individual aspect ratings as the basis for analysis. In the parlance of IRT, a halo-effect entails local item dependence; that is, a dependence between aspect ratings that cannot be explained by the latent trait(s) in the model. Testing for local dependence is easier when raters are assigned randomly to examinees. Maris and Bechger (2007) show that the resulting exchangeability of the raters frees us from the need to model the behavior of individual raters, so that standard tests can be used to detect local dependencies (e.g., Yen, 1984; Verguts & De Boeck, 2001; Ip, Smits, & De Boeck, 2009). Compared to the approach proposed here, these tests require a good idea of where local dependencies might turn up; i.e., which items will be dependent and in which way. Furthermore, we must be able to distinguish halo-effects from other causes of local dependence such as multidimensionality. In fact, the method discussed here was born out of frustration because we failed to detect halo-effects using tests for local dependence in the STEX. We now believe that, in the STEX, halo-effects cause small local dependencies between individual items which are difficult to detect. The present approach provides a measure of their combined effect.

When raters are not assigned at random, one needs a model for the behavior of individual raters, which complicates the analysis. To see IRT in action in this situation, the reader is referred to Wang and Wilson (2005), who use a unidimensional random-effects FACETS model. This model is a generalization of the FACETS model (Linacre, 1994). The FACETS model is a Rasch-type (Rasch, 1960) model in which individual raters are characterized by a severity parameter. The random-effects FACETS model includes a random effect to model local dependencies. Interestingly, Wang and Wilson (2005) expressed the halo-effect as an increase in the reliability of the estimated abilities (Bechger et al., 2003). Although Wang and Wilson (2005) did not use k as an index, they followed basically the same line of thinking that has been proposed here.

In the context of high-stakes examinations like the STEX, we believe that halo-effects are best avoided. A halo-effect always implies a decrease in the number of independent opportunities that an examinee has to demonstrate his or her proficiency. We believe that this is unacceptable. Furthermore, the need to deal with possible halo-effects complicates the analysis of the data. Many testing companies use IRT to equate different examinations, and failure to model local dependencies may bias the results. On the other hand, modeling solutions are complicated and require assumptions that may be difficult to justify. The most obvious way to avoid halo-effects is to assign raters at random to combinations of examinees and assignments. The pilot study suggested that this is feasible in practice and does not diminish the reliability of the examination. We conjecture that halo-effects are often a consequence of bad rater instructions: overly complex or ambiguous rater instructions entice raters to base ratings on their personal impression of each examinee. Thus, halo-effects may be a signal that raters experience time-pressure, require better instructions, or that the scoring rubrics are not clear. Our experience is that, with halo-effects out of the way, it is simpler to investigate the quality of the ratings and pinpoint aspects, assignments, or examinees that are difficult to assess (e.g., Maris & Bechger, 2007).
Although such a data design is not ideal for this purpose, random assignment also facilitates the detection of aberrant raters, because each individual rater can be compared to the average rater. In closing, we mention that the scope of this paper extends beyond educational


measurement. Halo effects are widely known and discussed in the management science (e.g., Rosenzweig, 2007) and organizational psychology literatures (e.g., employee performance appraisals or assessment centres). One reviewer pointed out that assessment centres are particularly relevant, because multiple raters are often used to evaluate examinee performance across multiple exercises, so one can construct the types of data matrices described in Figures 1-3 (e.g., Thornton, 1992; Woodruffe, 1998).

References

Arbuckle, J. L. (2006). Amos (Version 7.0) [Computer software manual]. Chicago: SPSS.
Bechger, T. M., Kuijper, H., & Maris, G. (2009). Standard setting in relation to the Common European Framework of Reference for Languages: The case of the state examinations Dutch as a second language. Language Assessment Quarterly, 6, 126-150.
Bechger, T. M., & Maris, G. (2004). Structural equation modeling of multiple facet data: Extending models for multitrait-multimethod data. Psicológica, 25, 253-274.
Bechger, T. M., Maris, G., Verstralen, H. H. F. M., & Béguin, A. A. (2003). Using classical test theory in combination with item response theory. Applied Psychological Measurement, 27, 319-334.
Bentler, P. M. (1995). EQS structural equations program manual [Computer software manual]. Encino, CA.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
Byrne, B. (2006). Structural equation modeling with EQS: Basic concepts, applications, and programming (2nd ed.). Lawrence Erlbaum Associates.
Byrne, B. (2009). Structural equation modeling with AMOS: Basic concepts, applications, and programming (2nd ed.). New York: Taylor and Francis Group.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Cooper, W. H. (1981). Ubiquitous halo. Psychological Bulletin, 90, 218-244.
Croudace, T., Dunn, G., & Pickles, A. (2009). General latent variable modelling using Mplus (1st ed.). London: Chapman & Hall.
De Finetti, B. (1974). Theory of probability. New York: Wiley.
Eid, M. (2000). A multitrait-multimethod model with minimal assumptions. Psychometrika, 65, 241-261.
Goffin, R. D., & Jackson, D. N. (1992). Analysis of multitrait-multirater performance appraisal data: Composite direct product method versus confirmatory factor analysis. Multivariate Behavioral Research, 27, 363-385.
Guilford, J. P. (1936). Psychometric methods. New York: McGraw-Hill.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Hales, L. W., & Tokar, E. (1975). The effect of the quality of preceding responses on the grades assigned to subsequent responses to an essay question. Journal of Educational Measurement, 12, 115-117.
Hoyt, W. (2000). Rater bias in psychological research: When is it a problem and what can we do about it? Psychological Methods, 5(1), 64-86.
Ip, E. H., Smits, D. J. M., & De Boeck, P. (2009). Locally dependent linear logistic test model with person covariates. Applied Psychological Measurement, 33, 555-569.
Jöreskog, K. G., & Sörbom, D. (1993). LISREL 8: Structural equation modeling with the SIMPLIS command language. Chicago: Scientific Software International.
Kelley, T. L. (1924). Note on the reliability of a test: A reply to Dr. Crumm's criticism. The Journal of Educational Psychology, 15, 193-204.


Linacre, J. M. (1994). Many-facetted Rasch measurement (2nd ed.). Chicago: MESA Press.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lumley, P. (2005). Assessing second language writing: The rater's perspective. Frankfurt: Peter Lang.
Maris, G., & Bechger, T. M. (2007). Scoring open ended questions. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Psychometrics (Vol. 26, pp. 663-680). Amsterdam: Elsevier.
Marsh, H. W., & Butler, S. (1984). Evaluating reading diagnostic tests: An application of confirmatory factor analysis to multitrait-multimethod data. Applied Psychological Measurement, 8, 307-320.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
Murphy, K. R., Jako, R. A., & Anhalt, R. L. (1993). The nature and consequences of halo error: A critical analysis. Journal of Applied Psychology, 78, 218-225.
Muthén, L. K., & Muthén, B. O. (1998-2007). Mplus user's guide [Computer software manual]. Los Angeles, CA.
Neale, M. C., Boker, S. M., Xie, G., & Maes, H. H. (2003). Mx: Statistical modeling (6th ed.) [Computer software manual]. Box 900126, Richmond, VA 23298.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute of Educational Research. (Expanded edition, 1980, Chicago: The University of Chicago Press.)
Rosenzweig, P. (2007). The halo effect. New York: Free Press.
Sanders, P. F., & Verschoor, A. J. (1998). Parallel test construction using classical item parameters. Applied Psychological Measurement, 22, 212-223.
Solomonson, A. L., & Lance, C. E. (1997). Examination of the relationship between true halo and halo error in performance ratings. Journal of Applied Psychology, 82, 665-674.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.
Steiger, J. H. (1979). MULTICORR: A computer program for fast, accurate, small-sample tests of correlational pattern hypotheses. Educational and Psychological Measurement, 39, 677-680.
Steiger, J. H. (2005). Comparing correlations. In A. Maydeu-Olivares (Ed.), Contemporary psychometrics: A festschrift for Roderick P. McDonald. Mahwah, NJ: Lawrence Erlbaum Associates.
Steyer, R., & Eid, M. (1993). Messen und Testen. Berlin: Springer-Verlag.
Thorndike, E. L. (1920). A constant error in psychological ratings. Journal of Applied Psychology, 4, 25-29.
Thornton, G. C. (1992). Assessment centers in human resource management. Addison-Wesley.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts. Norwood, NJ: Ablex Publishing Corporation.
Verguts, T., & De Boeck, P. (2001). Some Mantel-Haenszel tests of Rasch model assumptions. British Journal of Mathematical and Statistical Psychology, 54, 21-37.
Wang, W., & Wilson, M. (2005). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29, 296-318.
Wells, F. J. (1907). A statistical study of literary merit. Archives of Psychology, 1, 1-30.
Woodruffe, C. (1998). Assessment centers: Identifying and developing competence. London: Institute of Personnel Management.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125-145.


Appendix
A Procedure for Random Assignment

Here we outline a possible procedure to assign raters randomly to combinations of examinees and assignments. Specifically, when confronted with an assignment or question, an examinee produces an answer which we will call a performance. These performances are then assigned to rater pairs in such a way that the workload of each rater is controlled. The same procedure can be used to assign rater pairs to examinees.

Init. Each of n examinees responds to N open questions. The n × N performances are placed in two lists, Box 1 and Box 2, such that each performance occurs in each list exactly once. The workload of a rater is the maximum number of judgements that he or she is to perform. One must make sure that the total workload is sufficient to perform all ratings.

Assign. We handle each rater separately. When a rater enters the system we:

1. Choose one of the two lists (Box 1 or Box 2) with probability proportional to the number of performances in the list. That is, the probability that a rater is assigned Box 1 is

    length(Box 1) / (length(Box 1) + length(Box 2)).    (5)

If Box 1 is not chosen, the rater is assigned Box 2.

2. From the list chosen in the first step, we sample a number of performances. This number is equal to the minimum of the workload of the rater and the number of performances left in the list. If a performance is assigned to a rater, it is removed from the list.
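The assignment step can be sketched in a few lines of Python. The data structures (lists of performance identifiers and a per-rater workload) are illustrative assumptions, not a description of the actual STEX software.

    import random

    def assign_rater(workload, box1, box2):
        # Assign performances to one rater who enters the system.
        # box1 and box2 each contain every performance exactly once,
        # so that every performance eventually receives two ratings.
        total = len(box1) + len(box2)
        if total == 0:
            return []
        # Step 1: choose a box with probability proportional to its length (Equation 5).
        box = box1 if random.random() < len(box1) / total else box2
        # Step 2: sample min(workload, remaining) performances and remove them from the box.
        chosen = random.sample(box, min(workload, len(box)))
        for performance in chosen:
            box.remove(performance)
        return chosen

    # Example: 4 examinees responding to 2 assignments each.
    performances = [(examinee, assignment) for examinee in range(1, 5) for assignment in range(1, 3)]
    box1, box2 = list(performances), list(performances)
    print(assign_rater(workload=3, box1=box1, box2=box2))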
