Music Perception, Fall 2003, Vol. 21, No. 1, 21–41. © 2003 by The Regents of the University of California. All rights reserved.

Evaluating Evaluation: Musical Performance Assessment as a Research Tool

SAM THOMPSON & AARON WILLIAMON
Royal College of Music, London

Much applied research into musical performance requires a method of quantifying differences and changes between performances; for this purpose, researchers have commonly used performance assessment schemes taken from educational contexts. This article considers some conceptual and practical problems with using judgments of performance quality as a research tool. To illustrate some of these, data are reported from a study in which three experienced evaluators watched performances given by students at the Royal College of Music, London, and assessed them according to a marking scheme based on that of the Associated Board of the Royal Schools of Music. Correlations between evaluators were only moderate, and some evidence of bias according to the evaluators’ own instrumental experience was found. Strong positive correlations were found between items on the assessment scheme, indicating an extremely limited range of discrimination between categories. Implications for the use of similar assessment systems as dependent measures in empirical work are discussed, and suggestions are made for developing scales with greater utility in such work.

Received July 18, 2002; accepted May 25, 2003.

Address correspondence to Sam Thompson, Centre for the Study of Music Performance, Royal College of Music, Prince Consort Road, London SW7 2BS, U.K. (e-mail: [email protected]).
Quality judgments form a routine part of musical listening. Depending on the particular situation, such judgments may be more or less formal, ranging from a postconcert discussion to the thoughts that lead to marks being awarded in an examination context. Arguably, in fact, it is impossible to listen to music completely dispassionately, making no such judgments at all. As a form of measurement, quality judgments appear to have the advantage of intrinsic ecological validity. Moreover, formalized quality judgments can be seen as providing a structured output to a process that is already taking place as part of listening.
Accordingly, they form the primary means of assessment in many musical education institutions worldwide, and there is a widely held assumption that such judgments can provide a reliable means of quantifying the difference(s) between several performances and of feeding back information about these differences to the performer.

From the perspective of music psychology, structured quality judgments thus represent an appealing research tool. Furthermore, in recent years such judgments have played a larger role as the focus of research interest has gradually shifted toward understanding and explaining real-world performance issues. Much of this applied research is concerned with improving musical performance itself (e.g., Gruzelier, Egner, Valentine, & Williamon, 2002; Juslin & Laukka, 2000), investigating practice and memorization strategies (e.g., Williamon & Valentine, 2000, 2002), and trying to deduce the factors that make one performer appear more successful than another in a real performance environment (e.g., Davidson & Coimbra, 2001). In all of these contexts, it seems important for researchers to have access to performance assessments that are not only reliable and consistent but able to specify differences between two or more performances, be they given by several performers or by the same performer over a period of time. For example, it may be hypothesized that some particular intervention will reduce performance anxiety and so improve stage presence. To measure the efficacy of the intervention, it might be desirable not only to take state anxiety measures (see Steptoe & Fidler, 1987; Valentine, Fitzgerald, Gorton, Hudson, & Symonds, 1995) but also to use some form of performance assessment and look for specific changes in the evaluators’ perceptions. Similarly, it is commonly argued that memorized performances are somehow better than those given from the printed score (see Williamon, 1999); here a performance measurement tool could be used to investigate the perceived difference in quality between the two cases. Successful performance assessment schemes can, therefore, form an essential part of the integrated music psychology that, as we have argued elsewhere (Williamon & Thompson, in press), seems most likely to provide tangible benefits to practicing musicians.

Although a large amount of research has investigated aspects of the performance assessment process in educational contexts (e.g., Mills, 1991; Delzell & Leppla, 1992; G. E. McPherson, 1995; Elliott, 1995/1996; Hunter & Russ, 1996; Johnson, 1997; Davidson & Coimbra, 2001; Wapnick, Mazza, & Darrow, 2000; J. McPherson, 2002), no published work has specifically addressed the issue of how existing performance assessment protocols might be developed into reliable tools for researchers. In attempting to achieve this goal, some important questions must be answered. These include (but are not limited to): Can adequate levels of interrater reliability and consistency be established? Can the use of discriminant criteria be justified and, if so, what are the most suitable criteria? Can a single evaluation method be generally applied across all instrumental groups?
The answer to some of these questions might well be different for psychologists than for educators (see “Wider Implications” later).

The primary aim of this article is to initiate the development of a research tool by examining some of the assumptions and implications that are inherent in any formal system of musical performance assessment. To illustrate how these may operate in the real-world application of a typical performance assessment protocol, we present data from a study in which a structured assessment system was employed in a highly naturalistic, ecologically valid experimental context. The issues of reliability between evaluators, and the utility of the assessment system (and, by implication, others similar to it) as a research tool are specifically examined. These analyses are subsequently used in a more general discussion to highlight a number of significant challenges that any performance assessment system must overcome if it is to be used successfully in a research context.
On Formal Assessment Schemes in Music Performance

PRINCIPLES
In an effort both to standardize the performance assessment process and to improve the level of feedback given to those assessed, a number of more or less formal assessment “systems” have been developed by various institutions and working groups over the years (a review of some of these can be found in Zdzinski, 1991). A recent example is the clear system devised at the Guildhall School of Music and Drama, now used in all of their graded examinations (detailed in Hollis, 2001a, 2001b, although the system has been in use since around 1997). Other examples include the marking scheme of the Associated Board of the Royal Schools of Music (Harvey, 1994) and the systems developed at the Sydney Conservatorium (Stanley, Brooker, & Gilbert, 2002) and the Queensland Conservatorium (Wrigley, Emmerson, & Thomas, 2002). These systems are all “criterion based” (McPherson & Schubert, in press), in that they require the evaluator(s) to judge the performance—or aspects of it—relative to a set of written criteria.

It appears that, conceptually speaking, all performance assessment systems based on structured quality judgments make at least three fundamental assumptions. The first of these is that musical performance quality is a dimension with a common psychological reality for experienced listeners (in this paper, the term “performance quality” is intended to refer to the quality of the performance independent of the quality of the composition). This assumption has two parts. First, it is assumed that describing one musical performance as being superior to another is objectively meaningful, referring both to an actual property of the performance and to a listener’s awareness of that property.
Second, it is assumed that this property can be apprehended by most listeners who are familiar with a musical style—experienced listeners—such that those listeners will tend to agree about the relative merits of several performances within that style (we take “experienced” to imply a high level of acculturation, but not necessarily the possession of explicit musical knowledge). This premise is necessary because the alternative—an entirely individualistic, phenomenological definition of performance quality (e.g., a performance is better only if one thinks it is)—seems to render any talk of assessing relative quality rather meaningless. One consequence of the assumption is that musical performance quality must rely to some degree on features of the performance that are independent of the individual listener’s perception and thus open, in principle, to description in nonphenomenological terms (this implies the possibility of developing “acoustic” measures of quality; see Juslin, Friberg, & Bresin, 2001/2002). The notion that performance quality, in this case, may turn out to be a “composite factor” (as in structural equation modeling) is logically permissible under the assumption and is an issue that several assessment systems have implicitly addressed (see “Structure of Assessment Schemes” next).

Second, it is commonly assumed that experienced musicians are able to offer consistent judgments of music performance quality. According to the first assumption, the fact that it is believed possible to judge music performance quality at all implies that any listener should be able to recognize a good performance when they hear one, as long as they are sufficiently familiar with features of the musical style in question. In general, however, researchers and educators alike tend to rely for their performance assessments on so-called “expert” evaluators, who are typically experienced performers themselves. For example, the U.K.’s Associated Board of the Royal Schools of Music, a major provider of graded music exams, holds a list of approved evaluators who are trained in assessment procedures and whose marking is subject to regular scrutiny. Although some evidence indicates that the experience of evaluators may be rather less significant than often thought (Byo & Brooks, 1994), consistency of assessment is most commonly assumed to be a skill that can be learned, or at least improved with practice.

A third assumption is that experienced musicians are able to distinguish between aspects of a performance such as technique and interpretation. This is related to the first two assumptions, although in terms of implementation within educational systems it is arguably more recent, following (although not necessarily attributable to) a general movement in education to give students detailed feedback on assessed work. The usual manifestation is to have marking scales that consist of a number of subcategories such as technique, musicality, interpretation, and so on. The reductionist implications of this assumption are conveniently in keeping with the prevailing ideology of cognitive psychology; in fact, however, the hypothesis that musical performances might consist of perceptually distinct aspects is
ingrained in the teaching of musical instruments and even in graded examination syllabi themselves. The Associated Board of the Royal Schools of Music, for instance, specifies that at Grade 5 “musicality becomes a more important element,” implying that it both develops and can be assessed independently of technical aspects (Harvey, 1994).

STRUCTURE OF ASSESSMENT SCHEMES
The kind of evaluation implied by the third assumption just presented, and increasingly deployed in examination contexts, has been described by Mills (1991) as segmented, as opposed to holistic marking in which only an overall grade is given. Of the two, holistic marking schemes seem to maintain the higher level of ecological validity by keeping the degree of intervention into the existing evaluation process to a minimum. By contrast, segmented marking systems sacrifice an amount of ecological validity in favor of greater post hoc utility. Specifically, they enable performers to be given detailed feedback about areas of their performance that might be improved and, from a research point of view, provide precisely the kind of quantitative data that enable statistical analysis of subtle performance differences and/or changes to be conducted. Confidence in the validity of a segmented approach is thus a necessary preliminary for virtually any performance quality measurement system used in empirical research.

However, the difference between segmented and holistic assessment is often not so clearly defined in practice. Mills (1991) notes that most segmented schemes still retain an “overall quality” mark of some kind. Is this mark genuinely holistic? This can depend on the design of the scheme, which may or may not be arranged such that the overall grade is the linear sum of the categories. These two possibilities have clear implications for the direction in which the evaluators’ thoughts are guided. In the case where the overall mark is a mathematical function of the marks awarded across several segmented categories, the evaluator is not explicitly required to consider the overall quality of the performance. In the alternative case, where the evaluator is required to produce an overall mark that is nominally unrelated to the subcategory marks, they might decide for themselves on a strategy for combining or weighting the designated subcategories. For example, one evaluator may choose to arrive at a holistic mark first and complete the subcategories accordingly; another may devise his or her own mathematical or quasi-mathematical relation between the subcategories and the overall mark.

RELIABILITY
It has become something of a cliché to observe that performance assessment is highly subjective and, so the implication goes, unreliable. In fact, empirical research suggests that this need not always be the case, and
specifically that the level of reliability can be dependent on the nature of the assessment scheme being used. Fiske (1977) carried out a study comparing evaluations of trumpet performance made using a segmented mark scheme. The scheme included an overall quality mark that did not stand in any specified relationship to the other scale categories. He found that the variation between evaluators was greater on the various scale categories than for the overall assessment category. This result could be taken to imply that the evaluators internally weighted the segmented categories differently in arriving at their overall mark, or alternatively that they interpreted the meaning of the categories differently.

Mills (1991) reported a study in which groups of evaluators were asked to rate a performance both holistically and according to a 12-category segmented assessment scheme. She found that the segmented mark scheme accounted for around 70% of the variability between holistic marks. She suggested that there might thus be no assessment advantage in using a segmented mark scheme as it may not adequately reflect the process of arriving at a holistic, overall mark. Moreover, she argued that, conceptually speaking, holistic assessment is more “musically credible” (p. 179) than segmented assessment, appearing closer to the kind of informal quality judgments made in everyday listening.

However, the fact that a marking scheme is not outwardly segmented and appears to have good reliability across evaluators should not imply that the evaluative processes themselves are necessarily similar. W. F. Thompson, Diamond, and Balkwill (1998) have given a persuasive demonstration that evaluators may make holistic judgments according to internalized, personal criteria that are difficult to express verbally and do not necessarily relate to those of others. Using recordings of a Chopin étude, they interviewed five experienced piano adjudicators and elicited six bipolar constructs (after Kelly, 1955) from each; five of these constructs were intended to refer to specific characteristics of the performance, whereas one was intended to capture the overall quality. Categorizing by semantic similarity, the researchers found a certain amount of overlap between the constructs elicited from each adjudicator, although there were also some marked differences. For example, several adjudicators chose constructs highly specific to the particular piece (for one, the performance of bar 27 was apparently a crucial indicator of performance quality). In the second part of the study, the adjudicators heard six different recorded performances of the same work and were required to assess each one on a seven-point Likert scale corresponding to their own six bipolar constructs. A comparison of the various “overall preference” constructs revealed only a moderate mean correlation (r = 0.53) between the adjudicators. This research suggests that experienced musicians may develop a kind of internal segmented marking scheme, perhaps even one that is specific to the piece being performed.
The forced-choice nature of the initial task (the adjudicators being required to develop six constructs, from material suggested by the experimenters) arguably limits the extent to which the constructs elicited can be claimed to be identical with (or even equivalent to) those that adjudicators would employ in a real, nonexperimental context. Clearly, however, the potential existence of highly personal category systems is problematic for the researcher faced with the challenge of reliably quantifying performance difference or change.

In certain contexts, moreover, judgments of performance quality may be affected by cultural expectation. Specifically, extramusical factors such as the performer’s appearance, or adherence to accepted performance protocol, may have a strong influence on perceived quality. In the Western art music tradition, for instance, playing from memory is a skill that is expected of expert performers. Williamon (1999) found that memorized performances in which a music stand was in view (but, unbeknownst to the viewers, did not hold any music) were awarded lower marks than those given from memory where no music stand was visible. This result suggests that listeners implicitly weighted their quality judgments according to their expectations of how—in a visual sense—a good performer should perform. Davidson and Coimbra (2001) carried out a study of the musical assessment procedures at a U.K. conservatory. Qualitative analysis of the examiners’ discussions after a series of vocal performances suggested that the performers’ physical appearance was an important factor in determining overall quality. However, this was not stated explicitly, and did not appear on any of the final reports. Perhaps more worryingly yet, Elliott (1995/1996) prepared video tapes in which identical audio tracks were dubbed over performances by a white male, white female, black male, and black female on both the trumpet and flute. Eighty-eight music education majors evaluated the performances, and subsequent analysis revealed a main effect of race, with black performers rated significantly lower than white performers. Taken together, these findings suggest that performance quality may sometimes be influenced by the “wrong” (or at least inappropriate) criteria—this point is returned to in “Wider Implications” later.

A STUDY OF COMMON-PRACTICE ASSESSMENT
For applied research aiming to enhance, or otherwise influence, a musical performance, it is clearly crucial that the results of any intervention be easily observed in typical, real-world situations. This is not to make any assertion about the magnitude of difference a performance intervention must provide to be detectable, but rather to insist that the sensitivity of the detection tools must be sufficient to identify any such difference under normal conditions. The criteria that define normal conditions imply—as a
minimum—a formal concert or examination setting, a well-rehearsed and reasonably extended program, and an impartial jury consisting of between one and five experienced members. However, obtaining data from such assessment scenarios is far from straightforward. There is an understandable reluctance on the part of music educational institutions to change assessment systems purely for the sake of research. Plainly, a research methodology that risks interfering, even indirectly, with the outcome of a real assessment will be fraught with ethical difficulties.

To address this issue, we designed a large-scale controlled study of common-practice assessment under ecologically valid conditions. The study was embedded within a larger research project at the Royal College of Music (RCM) in which students participated in performance enhancement training programs, and the efficacy of those programs was measured, in part, by evaluations of performance quality. The performances reported in this article took place before the students underwent training. By taking this approach, we hoped to shed light on two key areas:

1. Interevaluator agreement and difference (to what extent do the evaluators’ marks concur, and in what ways do they differ?)
2. Utility of the measurement system (to what extent is the system capable of reliably discriminating between features of the performance?)
Method

PARTICIPANTS
Sixty-one students at the RCM (17 male and 44 female) volunteered to give a performance for the study. Of these students, 15 were keyboard players, 10 woodwind, 24 string, and 12 other (i.e., harp, guitar, brass, and voice). The evaluators were a female pianist, a male cellist, and a male clarinetist. All three were professional performing musicians in the U.K., each of whom had substantial experience of evaluating at conservatory level. The three evaluators were external to the RCM, did not know any of the participating students, and were paid for their participation.

MATERIALS
All performances were recorded on a digital video camera with an external microphone to ensure good sound quality (sound sampled at 48 kHz). Digital video editing was conducted using Adobe Premiere software.

PROCEDURE
The student participants were asked to give recitals up to 15 minutes long on their first study instrument. Each recital consisted of two contrasting pieces of the performer’s choice.
The recitals were performed under typical examination conditions before a live panel of instrumental professors from the RCM. Video recordings of the performances were subsequently edited into a random order, in such a way that the recordings began and ended approximately 3 s before and after each performance. The resulting set of tapes was sent to the three external evaluators, who were asked to assess each performance according to a segmented marking scale. This consisted of an overall quality mark and 13 further categories; in this study, the performance as a whole was assessed (i.e., both pieces combined, rather than each piece separately) as this is the standard protocol within examinations at the RCM and at other similar institutions. The assessment form is given in the Appendix.

The main categories (technical competence, musicality, and communication and presentation) were taken directly from the guidelines of the Associated Board of the Royal Schools of Music (Harvey, 1994), and the component subcategories were defined through a consultation exercise involving instrumental professors at the RCM. The scheme of the Associated Board of the Royal Schools of Music was chosen as the basis because of its ubiquity within the U.K. music education system and the fact that, more than most comparable systems, it has been in constant use and development for many years. Each of the 14 categories was marked on a scale of 1–10; this scale was chosen instead of the common 1–7 Likert scale because it mapped more directly onto the 100-point scale frequently used in the educational contexts with which the evaluators were familiar. The evaluators were given no specific instructions as to the order in which they should complete the form—that is, whether they should arrive at an overall mark first and then fill out the form, or vice versa. In particular, it was not specified whether the various subcategories should stand in any specific relationship to the overall categories, or to the overall mark.
Results

INTEREVALUATOR AGREEMENT AND DIFFERENCE
The Shapiro-Wilk test for normality of distribution was significant for all categories, for all three evaluators, indicating that the marks awarded were not normally distributed. Therefore, nonparametric statistical methods were used in all analyses. To investigate the level of agreement between the evaluators, correlation coefficients (Spearman’s rho) were calculated between each combination of evaluators for the complete set of performances (Table 1). Over the complete set of performances, evaluators’ marks did correlate positively (mean r = 0.498, range 0.332–0.651, p < .05) across the full set of categories; however, this level of correlation is rather moderate, accounting for only about 25% of the observed variance. Judgments of overall quality were no more strongly correlated across evaluators than any of the others (and were actually more weakly correlated in some cases). These results do not, therefore, concur with the findings of Fiske (1977) that overall judgments are more reliable across examiners than segmented judgments of different performance aspects (although the data may not be directly comparable because of various methodological differences).
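To make this agreement analysis concrete, the following is a minimal sketch of the normality check and the pairwise Spearman correlations described above. The array layout, variable names, and placeholder data are illustrative assumptions rather than the study’s actual code or data.

```python
# Hypothetical illustration: marks stored as (evaluators, performances, categories).
from itertools import combinations

import numpy as np
from scipy.stats import shapiro, spearmanr

rng = np.random.default_rng(0)
marks = rng.integers(1, 11, size=(3, 61, 14)).astype(float)  # placeholder marks, 1-10 scale

# Shapiro-Wilk test of normality for each evaluator and category
for ev in range(marks.shape[0]):
    for cat in range(marks.shape[2]):
        w, p = shapiro(marks[ev, :, cat])
        # significant p values here would motivate nonparametric statistics

# Spearman's rho between each evaluator pairing, per category (cf. Table 1)
for ev_a, ev_b in combinations(range(marks.shape[0]), 2):
    rhos = [spearmanr(marks[ev_a, :, c], marks[ev_b, :, c]).correlation
            for c in range(marks.shape[2])]
    print(f"Evaluator {ev_a + 1} vs Evaluator {ev_b + 1}: "
          f"mean rho = {np.mean(rhos):.3f}, range {min(rhos):.3f}-{max(rhos):.3f}")
```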
TABLE 1
Correlation Coefficients (Spearman’s r) between Each Evaluator Pairing, per Category

Category                       Evaluator 1 vs 2   Evaluator 1 vs 3   Evaluator 2 vs 3
Overall quality                     0.486*             0.583*             0.569*
Instrumental competence             0.520*             0.569*             0.585*
Technical security                  0.501*             0.435*             0.599*
Rhythmic accuracy                   0.405*             0.565*             0.542*
Tonal quality                       0.332*             0.477*             0.647*
Musical understanding               0.407*             0.460*             0.651*
Stylistic accuracy                  0.454*             0.396*             0.627*
Interpretative imagination          0.367*             0.356*             0.616*
Expressive range                    0.413*             0.469*             0.572*
Communicative ability               0.464*             0.465*             0.585*
Deportment on stage                 0.366*             0.410*             0.512*
Deportment with instrument          0.455*             0.468*             0.552*
Emotional commitment                0.442*             0.417*             0.514*
Ability to cope with stress         0.526*             0.476*             0.651*

* p < .005.
It could be hypothesized that evaluators will mark performances on their own instrument, or instruments closely related to their own, differently from how they will mark performances on instruments with which they are not so familiar. This difference might be expected particularly on the technical competence questions, where an in-depth knowledge of the technical requirements of playing an instrument may enable the evaluator specializing in that instrument to notice features of the performance that are missed by others who have experience with different instruments. Such bias could feasibly operate in either direction—familiarity with the music for their own instrument may lead evaluators to find a performance more pleasing (see North & Hargreaves, 1997), whereas more detailed knowledge of, for example, relevant technical issues could result in them marking more harshly.

Using the Median Test (a nonparametric equivalent to the one-way analysis of variance), a separate calculation was performed for each evaluator, with instrumental group as the independent variable and overall quality and overall instrumental competence as the dependent variables. No significant differences were discovered for Evaluators 1 and 3. For Evaluator 2, however, significant differences emerged for both variables (overall quality: χ²(3) = 7.864, p < .05; overall instrumental competence: χ²(3) = 9.275, p < .05). The same calculation was made for each subcategory, yielding the same pattern of results.

Figure 1 is a graph of the mean “overall quality” scores for each evaluator on each instrumental group. Although Evaluators 1 and 3 appear to have been relatively even-handed, the graph suggests that Evaluator 2 assigned lower ratings to the strings (his own instrumental group) than the other instrumental groups.
Fig. 1. Mean marks awarded by each evaluator, per group. Error bars show standard error.
This was confirmed by a post hoc Mann-Whitney test, comparing Evaluator 2’s marks for the string group against those for all other groups (U = 73, p < .005).

The same procedure was used to examine the possibility of gender bias—for example, evaluators awarding higher marks to male performers or vice versa. No significant effect was discovered, although the unbalanced sample sizes could mask any existing bias to some extent (see also “Interevaluator Agreement and Difference” in the Discussion section).
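A sketch of the bias checks just described is given below, assuming the marks have been arranged in a pandas DataFrame with one row per evaluator-performance pair; the column names and helper function are hypothetical, not taken from the study. The same pattern can be repeated with gender in place of instrumental group.

```python
import pandas as pd
from scipy.stats import median_test, mannwhitneyu

def check_instrument_bias(df: pd.DataFrame, evaluator: int) -> None:
    """Median test across instrumental groups, plus a post hoc strings-vs-rest comparison."""
    sub = df[df["evaluator"] == evaluator]
    groups = [g["overall_quality"].to_numpy() for _, g in sub.groupby("inst_group")]
    stat, p, grand_median, table = median_test(*groups)
    print(f"Evaluator {evaluator}: Median Test chi2 = {stat:.3f}, p = {p:.3f}")

    # post hoc comparison: strings versus all other instrumental groups
    strings = sub.loc[sub["inst_group"] == "string", "overall_quality"]
    others = sub.loc[sub["inst_group"] != "string", "overall_quality"]
    u, p_u = mannwhitneyu(strings, others, alternative="two-sided")
    print(f"Evaluator {evaluator}: Mann-Whitney U = {u:.1f}, p = {p_u:.3f}")
```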
UTILITY OF MEASUREMENT SYSTEM

The third assumption of formal evaluation (see “Principles,” earlier) implies not only that these performance aspects should be readily distinguishable but also that they should be, in principle and in practice, autonomous. Furthermore, they should stand in some composite relationship (though not necessarily linear; Mills, 1991) with the overall quality mark. If the categories on a given segmented scale were truly independent, therefore, a regression analysis would be expected to reveal that they each account for a separate proportion of the variance in the overall quality mark. To test this hypothesis, a regression analysis was planned for each evaluator, using the three main categories (overall rating of instrumental competence, overall rating of musical understanding, and overall rating of communicative ability) as predictors of the overall quality mark. However, inspection of the data revealed a high degree of multicollinearity between predictor variables, as indicated by the coefficients given in Table 2. For each evaluator, the three predictor variables are highly correlated with the dependent variable overall quality. Hence, in a regression model, they would each account independently for almost all of the observed variance; such a model would thus be of little explanatory value.
TABLE 2
Correlation Coefficients (Spearman’s r) for Each Predictor Variable with Overall Quality

Predictor Variable          Evaluator 1   Evaluator 2   Evaluator 3
Instrumental competence        0.922*        0.966*        0.919*
Musical understanding          0.870*        0.950*        0.909*
Communicative ability          0.860*        0.975*        0.938*

* p < .001 (one-tailed).
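The multicollinearity screen summarized in Table 2 amounts to inspecting the rank correlations among the intended predictors and the criterion before any regression is fitted. A minimal sketch is given below; the DataFrame and column names are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import spearmanr

PREDICTORS = ["instrumental_competence", "musical_understanding", "communicative_ability"]

def correlation_screen(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation matrix of the three main predictors and overall quality."""
    cols = PREDICTORS + ["overall_quality"]
    rho, _ = spearmanr(df[cols])
    # coefficients near 1 among the predictors indicate that a regression could not
    # apportion the variance in overall quality between them meaningfully
    return pd.DataFrame(rho, index=cols, columns=cols).round(3)
```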
The high level of intercorrelation between categories within a given performance raises the following question: to what extent were evaluators able to discriminate between different aspects of performance? Schemes in which the evaluators are required to award a numerical mark for each category are obviously of limited utility in a research context if those using them simply tend to give the same mark across all categories in any given performance.

For each evaluator, the standard deviation of marks awarded across all the categories for each performance was calculated and then averaged across all performances to give a mean standard deviation for each evaluator per performance. This value thus represents the average range of marks used by each evaluator in assessing a typical performance. These data are presented in Figure 2. The graph suggests that Evaluator 2 displayed a substantially lower mean standard deviation than did Evaluators 1 and 3.
Fig. 2. Mean standard deviation of marks awarded across all 14 categories, per performance. Error bars show standard error.
As the standard deviations were again significantly non-normally distributed (W = 0.958, p < .001), the Median Test statistic was calculated with standard deviation per performance as the dependent variable and evaluator as the independent variable. This statistic was significant (χ²(2) = 61.787, p < .001), and a post hoc Mann-Whitney test between Evaluator 2 and Evaluators 1 and 3 (combined) revealed a significant difference (U = 1017, p < .005). Evaluator 2 showed an average standard deviation of only 0.435 on the 10-point scale, indicating an extremely narrow degree of discrimination between categories.
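The within-performance discrimination measure can be sketched as follows, reusing the assumed (evaluators × performances × categories) marks array from the earlier sketch; as before, names and data are illustrative only.

```python
import numpy as np
from scipy.stats import median_test, mannwhitneyu

def category_spread(marks: np.ndarray) -> np.ndarray:
    # standard deviation across the 14 categories, per evaluator and performance
    return marks.std(axis=2, ddof=1)

def compare_spread(marks: np.ndarray) -> None:
    spread = category_spread(marks)              # shape: (n_evaluators, n_performances)
    print("mean SD per evaluator:", spread.mean(axis=1).round(3))
    stat, p, _, _ = median_test(*spread)         # evaluator as the grouping factor
    u, p_u = mannwhitneyu(spread[1],             # Evaluator 2
                          np.concatenate([spread[0], spread[2]]),
                          alternative="two-sided")
    print(f"Median Test chi2 = {stat:.3f} (p = {p:.3f}); "
          f"Evaluator 2 vs others: U = {u:.1f} (p = {p_u:.3f})")
```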
Discussion

INTEREVALUATOR AGREEMENT AND DIFFERENCE
Regarding overall agreement, at this stage it remains an open question whether correlation coefficients of around 0.5 are the best that can be expected in a performance assessment context; this is an issue to be addressed by future research. Mills (1991) achieved agreement (r) between assessors of between 0.2 and 0.7, and W. F. Thompson et al. (1998) demonstrated mean agreement between their adjudicators of 0.53. It should be noted again that both of these studies used substantially fewer performances, different groups of evaluators, and different assessment methodologies than the present study. Moreover, both sets of correlations are based on rank ordering of performances by perceived quality, not, as in the present study, on a direct quantitative measure. The results are, therefore, not directly comparable, a fact that amply illustrates a major justification for the development of an “industry standard” performance assessment research tool in the first place. Nevertheless, that a naturalistic study should yield similar results to these various approaches could be taken as a good sign; at least, one might argue, realistic assessment appears to be no less reliable than that conducted under more heavily controlled conditions.

In general, however, the difficulty in demonstrating a strong correlation between the assessments of only three highly experienced evaluators is worrying from the perspective of the first assumption. If performance quality really is a dimension with a common psychological reality for a majority of experienced listeners, then it is not at all clear whether the assessment systems considered here have come close to quantifying it.

Another factor to bear in mind when considering the correlation statistics is that positive coefficients do not specify whether two sets of numbers are in the same range, only that they vary in approximately the same way. So, even if the correlations had returned particularly high coefficients,
it is entirely possible that one evaluator could have been marking consistently higher or lower than the others. This would not represent a problem as such, depending on the use for which the evaluations were intended. As suggested earlier, a researcher might be interested in measuring performance quality before, during, and after some intervention (i.e., a repeated measures design); as long as the evaluators were self-consistent, the absolute mark might then be less relevant than the degree of change observed. In a between-groups experimental design, by contrast, the absolute value awarded is obviously of crucial importance.

Related to reliability is the question of the nature of specific differences between evaluators. The above analysis highlighted one aspect of disparity, namely the apparent tendency of Evaluator 2 to award lower marks to the string players than to players of other instruments. Clearly, this alone does not account for the relatively indifferent level of correlation overall, and it is certainly possible that other similar, generic biases might have influenced the evaluations. An obvious candidate would be gender bias, although in the present study no evidence of this was found. It is worth noting that the word “bias” is not used here to imply any deliberate adjustment of marks by the evaluators. However, in addition to the statistical data presented earlier, further evidence suggests that the existence of certain biases might be conscious to some degree. During a debriefing session held after the evaluation process was completed, one of the evaluators (not Evaluator 2) specifically mentioned that he found assessing performance on certain instruments (in instrumental groups other than his own) difficult. He requested to see the marks given by the other evaluators as a comparison, being particularly concerned that his own marks for these performances might be “wrong,” in the sense of being out of step with those awarded by the specialist.

What are the implications of this? If a broader study were to show that instrumental biases were prevalent in the assessment of musical performance, it would suggest that instrument-specific evaluators and perhaps even instrument-specific marking schemes may be particularly undesirable in some contexts (or desirable, depending on the direction of the bias). It could even be that the criteria that constitute a good violin performance are actually different from those constituting, say, a good piano performance (cf. W. F. Thompson et al., 1998). Again, this is not necessarily a problem for the researcher so long as evaluators are self-consistent, but if the marks of different instrumental specialists are to be treated as equivalent and used to draw conclusions in empirical studies, then obviously more caution may be required. In any case, this would still not explain the evaluator’s own feeling of unease, which seemed to reflect a perceived deficit in his own ability rather than a flaw in the process itself.
UTILITY OF THE MEASUREMENT SYSTEM
In terms of practical applications for measuring performance difference and/or change, the second strand of analysis is particularly illuminating. The designated predictor variables in the proposed regression exhibited a high degree of multicollinearity, each being strongly correlated with the dependent variable and, by implication, with one another. Consequently, they were each able to account for an approximately equal proportion of the variance (around 90%). The comparison of standard deviations demonstrated that none of the evaluators had an average standard deviation per performance of above 1, with Evaluator 2 exhibiting a significantly more limited range of marks per performance than the other two. In other words, the range of discrimination between categories, within a given performance, appeared to be rather narrow, which explains the strong correlations observed between variables.

There seem to be four likely explanations for this result. The first is that the evaluators completed the form carelessly or hastily. For example, after each performance they may have arrived initially at a holistic mark and then simply given that same mark for all performance categories unless some particular feature of the performance made them do otherwise (e.g., an obvious technical slip leading to a revised mark for “Instrumental Competence”). A variation on this would be that, in the absence of any specific guidelines for completing the form, they deliberately adopted a strategy along these lines.

The second possibility is that the evaluators were simply unable to distinguish in any meaningful way between the categories suggested on the form, either because of limitations in their own discriminative abilities or because the particular categories employed do not correspond to those with which the evaluators were familiar. If this was the case, it would directly contradict the third key assumption presented earlier and would, thus, have rather worrying implications for the whole endeavor of structured performance assessment (or at least for this particular system and the numerous others that share its basic configuration). However, this possibility seems unlikely because the categories were not arbitrary but developed by highly experienced musicians—individuals with professional experience and skills similar to the present evaluators.

A third option is that the categories themselves did have psychological reality for the evaluators, but that they were causally linked, and hence correlated, with each other. For example, although one can hypothesize a performance that is highly musical and yet technically poor, in reality perhaps the technical problems would lead to the level of musical understanding being perceived as inferior. Although not contradicting any of the preceding assumptions directly, this situation would inevitably lead to a narrow range of marks being awarded across categories and would again call into question the validity of segmented assessment.
A fourth possibility is that none of the performances in this sample displayed wide disparities between aspects of performance. In other words, all the performers could have exhibited approximately the same degree of competence across all the specified categories.

One tempting solution to these last two possibilities would be to increase the sensitivity of the scale employed, from 1–10 up to 1–20, 1–50, or even 1–100. However, although this might give the appearance of broadening the range of marks, it would (a) introduce more statistical “noise” into the analyses, (b) be likely to increase the degree to which the marks are perceived as arbitrary by those being evaluated, and (c) likely have little or no effect—in proportional terms—on the relationship between range of marks awarded and range of scale.
Wider Implications

On the basis of the results of the study reported here, it is difficult to be positive about the utility of structured assessment schemes such as those used here as tools in psychological research. Whether the pattern of results observed is peculiar to the particular circumstances of the current study, or to the chosen assessment scheme, cannot be verified without attempts at replication. However, to the extent that these circumstances are not unique to the present study but are, in fact, highly typical of real-world assessment situations, it seems reasonable to interpret the results as being characteristic of normal practice rather than an empirical anomaly.

If this sounds excessively bleak, it is worth remembering that the problem is partially one of perspective. A psychologist may well have different expectations than an educator regarding what an assessment tool should be expected to achieve. It may be too simplistic, for example, to assume that everyone is interested in achieving maximum levels of interevaluator correlation. Psychologists are often more interested in generic abilities or characteristics that are common across groups of similar subjects than in the particular differences that might obtain between individuals, and hence will tend to emphasize the question of agreement. By contrast, the prominence (evidenced by the weighting of marks) given by examining institutions to manifestly imperfect performance assessment protocols might suggest, in fact, that musicians and educators have a rather more sanguine, or at least accepting, attitude toward the whole issue of evaluator differences.

Nonetheless, if it is true that existing protocols do not satisfy scientific standards of reliability, replicability, and discrimination, what can be done to improve them? On the assumption that considerable individual differences between evaluators are both real and unavoidable, it seems that there are essentially two approaches that can be taken to dealing with them in empirical studies.
The first, mentioned briefly earlier, is to use many more evaluators to assess each musical performance, in the hope that such individual differences as exist between them will be rendered statistically insignificant when the data set is viewed as a whole. Given that the overall level of correlation observed in the present study is similar to that reported in studies using more evaluators (pace the comments just made regarding the difficulty of making such comparisons at all), it is not clear whether this method would really have an effect in terms of improving interevaluator agreement, although this is an issue for future research to clarify. More importantly, it is clearly impractical to use large numbers of evaluators in most real performance situations. However, it is essential that applied research in music performance be conducted, at least partially, in precisely such situations. Even if high reliability could be demonstrated in a study using, say, 20 evaluators, this result would not help provide a satisfactory solution to the problem faced by applied researchers.

We are currently exploring a second, less conventional approach to the issue of individual differences. If a number of self-assessment measures are included within the assessment procedure and completed by the evaluators themselves, these can then be used as covariates in subsequent statistical analyses, providing a means of controlling for biases. This approach has its own associated problems, not least of which are the potential hostility of evaluators to a process that seems to focus scrutiny in the wrong direction (i.e., on themselves rather than the performance) and the need for relatively complex post hoc statistical analysis. However, such an approach has the distinctive advantage of controlling for individual differences at source; in principle, this should render immaterial the number of evaluators involved in making the assessment and the degree of experimental control exerted over the performance situation. Precisely what questions need to be asked of the evaluators in order to provide suitable covariates is an empirical issue that will be resolved only through further investigation.

Regarding the use of segmented-type scales, it may plausibly be argued that a major constraint on between-category discrimination is the semantic problem posed by the assessment categories. As discussed above, and illustrated by W. F. Thompson et al. (1998), evaluators do not necessarily share the same idea of what factors contribute (in the “segmented” sense) to a good performance. More seriously, there is no way to guarantee that several evaluators’ understanding of, say, “musical understanding” is the same or even equivalent in any quantifiable way. One route to achieving higher levels of between-category discrimination on a given assessment scheme may be to provide more precise guidelines for the completion of the evaluation form, defining each category in great detail. This is the approach taken by the Guildhall School’s clear system (Hollis, 2001a, 2001b), and in numerous other systems (e.g., Hunter & Russ, 1996; Stanley et al., 2002; see Johnson, 1997, for a persuasive critique of this approach).
It is certainly conceivable that a large-scale survey could identify with reasonable accuracy the most commonly perceived performance categories and from these build an assessment tool. In use, such a tool would have the practical limitation that evaluators would be required to spend a good deal of time familiarizing themselves with the scheme. More seriously, however, this approach seems dangerous in that it necessitates an uncomfortable degree of intervention into the evaluative process and, consequently, a serious reduction of ecological validity. Applied researchers must decide for themselves whether such a trade-off justifies the more detailed data output that can be achieved.

Considering the empirical approach to performance evaluation more generally, there seems to be a danger in the temptation to whitewash all individual differences. As has been argued throughout this article, performance quality as a research dimension is meaningless without real-world validity, and thus, in an important sense, it is actually irrelevant whether an evaluator arrives at their final quality judgment through the weighing-up of purely “musical” factors or because they take exception to a particular item of clothing worn by the performer. These may not be the criteria on which one would wish evaluators to make their judgments, or even on which they “should” make them; however, this is not an issue about which researchers interested in performance enhancement can afford to be dogmatic (or even, as the case may be, politically correct).

For now, researchers who wish to employ performance assessment as a dependent measure in experimental studies may have to accept that musical performances are simply not open to reliable and consistent scrutiny of the type they might wish. At the very least, such researchers must be aware that simply using the same protocols that are prevalent in educational contexts may not provide them with the degrees of reliability, replicability, and discrimination demanded by scientific standards. Whether future work will lead to the development of a research tool capable of meeting these demands is, at this stage, unclear.1
1. This work was supported by a grant from The Leverhulme Trust. The authors thank Yonty Solomon, Brian Hawkins, and the three evaluators for their help with the performance assessment, and Tânia Lisboa, Gary McPherson, Janet Mills, Elizabeth Valentine, the CSMP discussion group, Bill Thompson, and three anonymous reviewers for helpful comments on an earlier draft.

References

Byo, J. L., & Brooks, R. (1994). A comparison of junior high musicians’ and music educators’ performance evaluations of instrumental music. Contributions to Music Education, 21, 26–38.
Davidson, J., & Coimbra, D. D. C. (2001). Investigating performance evaluation by assessors of singers in a music college setting. Musicæ Scientiæ, 5, 33–50.
Delzell, J. K., & Leppla, D. A. (1992). Gender associations of musical instruments and preferences of fourth-grade students for selected instruments. Journal of Research in Music Education, 40, 93–103.
Elliott, C. A. (1995/1996). Race and gender as factors in judgments of musical performance. Bulletin of the Council for Research in Music Education, 127, 50–56.
Fiske, H. E. (1977). Relationship of selected factors in trumpet performance adjudication reliability. Journal of Research in Music Education, 25, 256–263.
Gruzelier, J. H., Egner, T., Valentine, E., & Williamon, A. (2002). Comparing learned EEG self-regulation and the Alexander Technique as a means of enhancing musical performance. In C. Stevens, D. Burnham, G. McPherson, E. Schubert, & J. Renwick (Eds.), Proceedings of the 7th International Conference on Music Perception and Cognition, Sydney, Australia. Adelaide: Causal Productions.
Harvey, J. (1994). These music exams. London: ABRSM Publications.
Hollis, E. (2001a). The Guildhall School’s clear performance assessment system: How clear works. London: Guildhall School of Music and Drama Publications.
Hollis, E. (2001b). The Guildhall School’s clear performance assessment system: Marking schemes for the assessment categories. London: Guildhall School of Music and Drama Publications.
Hunter, D., & Russ, M. (1996). Peer assessment in performance studies. British Journal of Music Education, 13, 67–78.
Johnson, P. (1997). Performance as experience: The problem of assessment criteria. British Journal of Music Education, 14, 271–282.
Juslin, P. N., & Laukka, P. (2000). Improving emotional communication in music performance through cognitive feedback. Musicæ Scientiæ, 4, 151–183.
Juslin, P. N., Friberg, A., & Bresin, R. (2001/2002). Towards a computational model of expression in music performance: The GERM model. Musicæ Scientiæ, special issue, 63–122.
Kelly, G. (1955). The psychology of personal constructs. New York: Norton.
McPherson, G. E. (1995). The assessment of musical performance: Development and validation of five new measures. Psychology of Music, 23, 142–161.
McPherson, G. E., & Schubert, E. (in press). Measuring performance enhancement in music. In A. Williamon (Ed.), Enhancing musical performance. Oxford: Oxford University Press.
McPherson, J. (2002). Musical performance: Holistic assessment using the SOLO taxonomy. In C. Stevens, D. Burnham, G. McPherson, E. Schubert, & J. Renwick (Eds.), Proceedings of the 7th International Conference on Music Perception and Cognition, Sydney, Australia. Adelaide: Causal Productions.
Mills, J. (1991). Assessing musical performance musically. Educational Studies, 17, 173–181.
North, A. C., & Hargreaves, D. J. (1997). Experimental aesthetics and everyday music listening. In D. J. Hargreaves & A. C. North (Eds.), The social psychology of music. Oxford: Oxford University Press.
Stanley, M., Brooker, R., & Gilbert, R. (2002). Examiner perceptions of using criteria in music performance assessment. Research Studies in Music Education, 18, 43–52.
Steptoe, A., & Fidler, H. (1987). Stage fright in musicians: A study of cognitive and behavioural strategies in performance anxiety. British Journal of Psychology, 78, 241–249.
Thompson, W. F., Diamond, C. T. P., & Balkwill, L. (1998). The adjudication of six performances of a Chopin etude: A study of expert knowledge. Psychology of Music, 26, 154–174.
Valentine, E. R., Fitzgerald, D. F. P., Gorton, T. L., Hudson, J. A., & Symonds, E. R. C. (1995). The effect of lessons in the Alexander technique on music performance in high and low stress situations. Psychology of Music, 23, 129–141.
Wapnick, J., Mazza, J. K., & Darrow, A. A. (2000). Effects of performer attractiveness, stage behavior, and dress on evaluation of children’s piano performances. Journal of Research in Music Education, 48, 323–335.
Williamon, A. (1999). The value of performing from memory. Psychology of Music, 27, 87–95.
Williamon, A., & Thompson, S. (in press). Psychology and the music practitioner. In J. Davidson & H. Eiholzer (Eds.), The music practitioner. Aldershot, U.K.: Ashgate Publishing.
Williamon, A., & Valentine, E. (2000). Quantity and quality of musical practice as predictors of performance quality. British Journal of Psychology, 91, 353–376.
Williamon, A., & Valentine, E. (2002). The role of retrieval structures in memorizing music. Cognitive Psychology, 44, 1–32.
Wrigley, W. J., Emmerson, S. D., & Thomas, P. R. (2002). Improving the accountability and educational utility of the Queensland Conservatorium music performance assessment process. Paper presented at the SRPMME conference Investigating Music Performance, Royal College of Music, London, U.K.
Zdzinski, S. F. (1991). Measurement of solo instrumental music performance: A review of literature. Bulletin of the Council for Research in Music Education, 109, 47–58.
Appendix
Rating Form Used by Evaluators

DIRECTIONS: Circle the appropriate number to the right of each statement and write additional comments beneath.

OVERALL QUALITY
1. Overall rating of performance quality          1 2 3 4 5 6 7 8 9 10
   Comments:

PERCEIVED INSTRUMENTAL COMPETENCE
1. Overall rating of instrumental competence      1 2 3 4 5 6 7 8 9 10
   Comments:
2. Level of technical security                    1 2 3 4 5 6 7 8 9 10
3. Rhythmic accuracy                              1 2 3 4 5 6 7 8 9 10
4. Tonal quality and spectrum                     1 2 3 4 5 6 7 8 9 10

MUSICALITY
1. Overall rating of musical understanding        1 2 3 4 5 6 7 8 9 10
   Comments:
2. Stylistic accuracy                             1 2 3 4 5 6 7 8 9 10
3. Interpretive imagination                       1 2 3 4 5 6 7 8 9 10
4. Expressive range                               1 2 3 4 5 6 7 8 9 10

COMMUNICATION
1. Overall rating of communicative ability        1 2 3 4 5 6 7 8 9 10
   Comments:
2. Deportment on stage                            1 2 3 4 5 6 7 8 9 10
3. Deportment with instrument                     1 2 3 4 5 6 7 8 9 10
4. Communication of emotional commitment
   and conviction                                 1 2 3 4 5 6 7 8 9 10
5. Ability to cope with the stress of the
   situation                                      1 2 3 4 5 6 7 8 9 10