Using rating scales to assess interpretation: Practices, problems and prospects

Chao Han
Southwest University (China)
Over the past decade, interpretation assessment has played an increasingly important role in interpreter education, professional certification, and interpreting research. The time-honored assessment method is based on analysis of (para)linguistic features of interpretation (including such items as omissions, substitutions, un/filled pauses and self-corrections). Recently, use of descriptor-based rating scales to assess interpretation has emerged as a viable alternative (e.g., Angelelli 2009; Han 2015, 2016; J. Lee 2008; Tiselius 2009), arguably providing a basis for reliable, valid and practical assessments. However, little work has been done in interpreting studies to ascertain the assumed benefits of this emerging assessment practice. Based on 17 international peer-reviewed journals over the last twelve years (2004–2015), and other related publications (e.g., scholarly books, reports, documents), this article provides an overview of practices in scale-based interpretation assessment, focusing on four major aspects: (a) rating scales; (b) raters; (c) rating procedures; (d) reporting of assessment outcomes. Problem areas and possible emerging trends in interpretation assessment are examined, identifying a number of future research needs.

Keywords: interpretation assessment, rating scales, reliability, validity
1. Introduction
Interpretation is assessed in different contexts and for various purposes. In interpreter education, assessment is conducted to screen applicants for under-/postgraduate level interpreting courses/programs (Russo 2011; Wang 2007); diagnose trainee interpreters’ weaknesses (Ribas 2010; Schjoldager 1995; Y.-H. Lee 2005); gauge their progress (Campbell & Hale 2003; Postigo Pinazo 2008; Wu 2010); determine their level of achievement on completion of their course/program (Liu et al. 2008; Matthews & Ardemagni 2013); and, sometimes, evaluate the effectiveness of course/program content for accountability purposes (Lim 2006; Petronio & Hale 2009).
In real-life interpreting, one of the important reasons for assessing interpretation is professional certification (Chen 2009; Clifford 2001, 2005; Han 2016; Han & Slatyer 2016), as witnessed by the increasing number of interpreter certification performance tests (ICPT) developed and administered in different parts of the world (see Chen 2009; Han 2016; Hlavac 2013; Kelly 2007; Liu 2013; Napier 2004; Roat 2006). Another reason is to enable quality control by clients, employers, or even interpreters themselves (Clifford 2001; Skyba 2014). Interpretation assessment is also conducted for research purposes, though this tends to attract far less attention in the interpreting community at large. Usually, the various types of assessment mentioned above generate quantitative outcomes (e.g., scores) and thus provide at least one dependent variable, to be correlated either with a (psychological) construct singled out for study (e.g., Rosiers et al. 2011; Timarová et al. 2014; Yan et al. 2010) or with prior completion of a given pedagogical intervention (e.g., Hale & Ozolins 2014; Ko 2008). Although the stakes can be very different from one type of assessment to another, each of them provides significant information for such decisions as admission, degree-track selection, degree conferral and certification, as well as for interpreting pedagogy, research and/or practice in general. Assessment-based decisions in turn produce washback effects (positive or negative) for academe and industry, involving all relevant stakeholders (Garant 2009; Roat 2006; Strong & Rudser 1985; Vermeiren et al. 2009). Given the prevalence and the criticality of interpretation assessment, the overarching concern is to ensure its reliability and validity (Angelelli 2009; Campbell & Hale 2003; Clifford 2005; Han 2016; Han & Slatyer 2016; Liu et al. 2008; Sawyer 2004), so that assessment outcomes can be used with confidence for the benefit of stakeholders.

The most common and time-honored approach to assessing interpretation is arguably based on what Goulden (1992) calls the atomistic method: assessors focus on points of content in an interpretation and/or its (para)linguistic features (including such items as omissions, substitutions, errors, pauses, false starts and repetitions). This generates frequency data as indicators of interpretation quality, as can be seen in the early literature (e.g., Barik 1973; Goldman-Eisler 1967) and in more recent research (e.g., Mead 2005; Pio 2003). The atomistic method has the potential to generate a nuanced description of interpretation, and has thus been used mostly in research where there is a scholarly and pedagogic need to gain finer-grained understanding, and where a relatively small sample makes such an approach manageable. Some researchers, however, argue that the atomistic method is essentially reductionist (Beeby 2000; Strong & Rudser 1985), focusing on a lexical instead of a discoursal level of interpretation (Clifford 2001), and highly subjective (Gile 1999; McAlester 2000). In addition, the method has not yet fully demonstrated