Some reflections on task-based language performance assessment

Lyle F. Bachman
University of California, Los Angeles
The complexities of task-based language performance assessment (TBLPA) are leading language testers to reconsider many of the fundamental issues about what we want to assess, how we go about it and what sorts of evidence we need to provide in order to justify the ways in which we use our assessments. One claim of TBLPA is that such assessments can be used to make predictions about performance on future language use tasks outside the test itself. I argue that there are several problems with supporting such predictions. These problems are related to task selection, generalizability and extrapolation. Because of the complexity and diversity of tasks in most ‘real-life’ domains, the evidence of content relevance and representativeness that is required to support the use of test scores for prediction is extremely difficult to provide. A more general problem is the way in which difficulty is conceptualized, both in the way tasks are described and in current measurement models. The conceptualization of ‘difficulty features’ confounds task characteristics with test-takers’ language ability and introduces a hypothetical ‘difficulty’ factor as a determinant of test performance. In current measurement models, ‘difficulty’ is essentially an artifact of test performance, and not a characteristic of assessment tasks themselves. Because of these problems, current approaches to using task characteristics alone to predict difficulty are unlikely to yield consistent or meaningful results. As a way forward, a number of suggestions are provided for both language testing research and practice.
Address for correspondence: Lyle F. Bachman, Applied Linguistics & TESL, Rolfe Hall, Room 3314/3316, Los Angeles, CA 90095-1543, USA; email: [email protected]

Language Testing 2002; 19(4): 453–476. DOI: 10.1191/0265532202lt240oa. © 2002 Arnold

I Introduction

Task-based language performance assessment (TBLPA) has brought language testers into closer contact with researchers in other areas of applied linguistics and in educational assessment, while at the same time creating a more complex arena in which we practice our profession, both as researchers and practitioners. The complexities of TBLPA are leading us to reconsider many of the fundamental issues about what we want to assess, how we go about it and what sorts of arguments and evidence we need to provide in order to justify the inferences and decisions we make on the basis of our assessments. The ‘performance’ part of this term is, of course, one of long-standing
discussion in the language assessment literature, going back to the ‘direct testing’ movement of the 1970s (for overviews of this literature, see Bachman, 1990; McNamara, 1996) and has drawn researchers in other areas of applied linguistics into the debate (e.g., Lantolf and Frawley, 1985; 1988; Kramsch, 1986; van Lier, 1989; Young and Milanovic, 1992; Young, 1995; Tarone, 1998; Young and He, 1998). The ‘task-based’ aspect is of relatively more recent lineage, deriving much of its impetus from research in second language acquisition (SLA) and language pedagogy (e.g., Long, 1985; Candlin, 1987; Brindley, 1988; 1989; 1994; Crookes and Gass, 1993a; 1993b; Long and Crookes, 1992; Skehan, 1998).

In this article I first discuss some of the distinctions that have been made between ‘task-centred’ and ‘ability’ or ‘construct-centred’ approaches to language assessment, and the claims that have been made for a task-based approach. I then discuss some issues related to supporting one kind of use – prediction of future performance – that has been claimed for TBLPA. Next, I discuss what I see as some of the problems with attempts to predict task difficulty on the basis of so-called ‘task difficulty features’, as well as what I see as a fundamental contradiction in the way we presently problematize the notion of task difficulty. Throughout, I draw on research into TBLPA that has been conducted at the University of Hawaii at Manoa (Norris et al., 1998; Brown et al., in press). I discuss this research because it is the most fully conceptualized, operationalized and researched exemplification of this approach with which I am familiar. Finally, I propose an agenda for research and practice that I see as a way forward for investigating these issues.

II Tasks and constructs in language assessment

Discussions of the distinction between task-centred and construct-centred approaches to test design can be found in both educational measurement (e.g., Messick, 1994b; Nichols and Sugrue, 1999) and language testing (e.g., McNamara, 1996; Skehan, 1998; Chalhoub-Deville, 2001). An underlying premise of virtually all of these discussions of task-based language assessment is that the inferences we want to make are about underlying ‘language ability’ or ‘capacity for language use’ or ‘ability for use’. Thus, Brindley (1994: 75) explicitly includes, in his definition of task-based language assessment, ‘the view of language proficiency as encompassing both knowledge and ability for use’. In a similar vein, McNamara (1996) discusses issues of construct validity in performance assessments, stating that ‘such [task-centred] tests, no matter what their attention to content validity, still require us to address the fundamental problem in test validation,
that is, the question of justifying inferences from test performance’ (McNamara, 1996: 17). Skehan (1998), while advocating a task-based approach to language assessment, nevertheless is explicit that the inferences to be made are about underlying ability, or what he calls ability for language use: ‘So what we need to do is . . . design testing procedures to probe as broad a capacity, on the part of the language learner, as can be used to cope with a range of realistic processing conditions’ (Skehan, 1998: 180; italics in the original). More recently, Chalhoub-Deville (2001) has discussed the results of her research on several popular oral assessments, which she refers to as ‘task-based assessments’. These results demonstrate the lack of comparability of scores and ‘the inability to uncover the specific language abilities underlying performance’ on these oral assessments, leading her to conclude that:

language testers and researchers need to expand their test specifications to include the knowledge and skills that underlie the language construct. Such specifications should be informed by theory and research on the language construct and the language-learning process as well as by systematic observations of the particulars of a given context. (Chalhoub-Deville, 2001: 225)
A different approach to defining TBLPA can be found in the work of researchers at the University of Hawaii at Manoa (Norris et al., 1998; Brown et al., in press). These researchers see TBLPA as one kind of performance assessment, drawing on the following definition, which they cite from Long and Norris (2000):

[T]ask-based language assessment takes the task itself as the fundamental unit of analysis motivating item selection, test instrument construction, and the rating of task performance. Task-based assessment does not simply utilize the real-world task as a means for eliciting particular components of the language system which are then measured or evaluated; on the contrary, the construct of interest in task-based assessment is performance on the task itself. (Brown et al., in press: 14–15)
Thus, in Brown et al.’s approach, the inferences to be made are about ‘students’ abilities to accomplish particular tasks or task types’ (Brown et al., in press: 15). The notion of employing ‘authentic’ assessment tasks that correspond to tasks outside the test itself and that engage test-takers in authentic language use has informed many discussions of language assessment over the years, whether these are explicitly called ‘task-based’ or not (e.g., Clark, 1975; Alderson, 1983; Shohamy and Reves, 1985; Spolsky, 1985; Brindley, 1994; Bachman and Palmer, 1996; McNamara, 1997; Skehan, 1998). The critical difference in Brown et al.’s approach thus lies not in the kinds of assessment tasks that are used, but rather in the kinds of inferences – predictions about future performance on real-world tasks – they claim they can make
on the basis of test-takers’ performance on assessment tasks. In taking this approach, Brown et al. are essentially defining the construct in terms of what Upshur (1979) called ‘pragmatic ascription’, or what test-takers can do, and in so doing, according to Upshur, limiting their interpretation to predictions about future performance.

Another way to characterize the distinction that Brown et al. make between TBLPA and other types of performance assessment is in the way one interprets consistencies in responses across a set of assessment tasks. Brown et al.’s approach corresponds to what Chapelle (1998) has called a ‘behaviourist perspective’ on construct definition, in which response consistencies are interpreted as what Messick (1989) has called ‘samples of response classes’. In contrast to this, the ‘trait perspective’ – which has been advocated by other proponents of performance assessment – would interpret these response consistencies as evidence of underlying processes or structures (Messick, 1989: 15).¹ Or, to put it in Chapelle’s (1998: 34) terms, Brown et al. ‘attribute consistencies to contextual factors’, while others would attribute response consistencies to characteristics of the test-taker.²

One way of interpreting the differences between these two approaches is represented in Figure 1.³ In addition to representing the salient differences between these two kinds of interpretation, Figure 1 also illustrates the putative difference between so-called ‘ability-based’ and ‘task-based’ approaches to language test development. One formulation of this difference is that given by Norris et al. (1998):
a. in developing a performance assessment, focus either on constructs or on tasks:
i. Begin construct-based test development by focusing on the construct of interest and then develop tasks based on the performance attributes of the construct, score uses, scoring constraints, and so forth.
ii. Begin task-centered test development by deciding which performances are the desired ones. Then, score uses, scoring criteria, and so forth become part of the performance test itself. (Norris et al., 1998: 25)
¹ I grant that some interpretations of ‘traits’ may be based on little more than response consistencies, or statistically reliable variance, without adequate construct definitions built into the assessment design.
² I would note that Chapelle also discusses a third approach, an interactionalist approach, which attributes response consistencies to ‘the result of traits, contextual features, and their interaction’ (Chapelle, 1998: 34).
³ I recognize that many other factors – such as test-takers’ topical knowledge and other personal attributes and features of the context (e.g., interlocutors, setting, random events) – also affect response consistencies. However, for purposes of this discussion, I focus on the two factors that are central to the distinction between these two interpretations, namely language ability and task characteristics.
Figure 1 Different interpretations of response consistencies on language assessment tasks: (a) ‘Ability-based’ inferences about language ability and (b) ‘Task-based’ predictions about future performance on ‘real-world’ tasks
From this account, it would appear that the so-called ‘construct-based’ approach must consider both constructs and tasks, while the so-called ‘task-based’ approach need consider only performances on tasks. The former approach is in line with most recent textbooks in language assessment, which make it quite clear that sound procedures for the design, development and use of language tests must incorporate both a specification of the assessment tasks to be included and definitions of the abilities to be assessed (e.g., Alderson et al., 1995; Bachman and Palmer, 1996; Brown, 1996; Alderson, 2000; Douglas, 2000; Read, 2000; Buck, 2001; Davidson and Lynch, 2002). The demands and requirements for validation arguments and the kinds of evidence that need to be collected in support of inferences of ability, particularly in the context of performance assessment, have been discussed extensively in the literature (e.g., Linn et al., 1991; Brindley, 1994; Messick, 1994a; 1995; McNamara, 1997; Kane et al., 1999; Bachman, 2000), and the issues associated with construct validation are well known. Rather than rehearsing these here, I discuss the issues involved in supporting predictions about future performance on the basis of TBLPAs. This discussion focuses on three issues:
· defining tasks, or content domain specification;
· identifying and selecting tasks for use in language assessments, and content relatedness; and
· the relationship between real-life tasks and assessment tasks.
III ‘What is this thing called task?’: Content domain specification

One of the key tenets of TBLPA is that the tasks to be used on the test must be related in some way to tasks outside the test itself. Thus, if we intend to incorporate the notion of task into our approach to performance assessment, and to operationalize this in designing and developing a particular assessment, we clearly need to define what we mean by task. In the literature on SLA, language pedagogy and language assessment, one finds little consensus on what, precisely, a task is. Definitions vary from including virtually anything that is done (Long, 1985) to distinguishing between ‘real-world tasks’ and ‘pedagogic tasks’ (e.g., Long, 1985; Nunan, 1989; Long and Crookes, 1992) to Skehan’s (1998) extended definition.

Two recent approaches to defining tasks can be found in the language assessment literature. Norris et al. (1998: 331) define task ‘as those activities that people do in everyday life and which require language for their accomplishment’. For Norris et al. a task is essentially a real-world activity, and they apparently do not distinguish between these and assessment tasks. They employ Skehan’s (1996; 1998) list of task dimensions as the salient features for describing tasks. Bachman and Palmer (1996: 44) define a ‘language use task’ as ‘an activity that involves individuals in using language for the purpose of achieving a particular goal or objective in a particular situation’. As with Norris et al., this definition clearly focuses on tasks that involve language, but adds to this the notions – drawn from the SLA literature (e.g., Pica et al., 1993) – that tasks are goal oriented and are situated in specific settings. Bachman and Palmer argue that this definition is intended to be general enough to include assessment tasks as well as target language use (TLU) tasks, or tasks in relevant settings outside the test itself, including tasks intended specifically for language teaching and learning purposes.

Two critical issues arise in these discussions of tasks:
· precisely how ‘real-life’ task types are identified, selected and characterized; and
· how pedagogic or assessment tasks are related to these.

Task specifications constitute the definition of the content domain to which our assessment-based inferences about ability extrapolate, or of the domain of real-life tasks about which we want to make predictions. Therefore, vagueness in task specification inevitably leads to vagueness in measurement. Inconsistencies across tasks affect generalizability, or the extent to which our inferences generalize across a set of assessment tasks. Ill-defined or indeterminate relationships between assessment tasks and TLU tasks affect extrapolation, or the extent to
which our inferences extend beyond the set of assessment tasks to tasks in a real-world domain.

1 ‘Which tasks do we use?’: Identifying and selecting assessment tasks

One critical issue that must be addressed in the design, development and use of any language assessment is the specification of the procedures we will use to elicit instances of language use. While these procedures will include features such as the setting, administration, method and criteria for scoring, a central element in the procedures is what we present to the test-taker: the test or assessment ‘task’. One reason that the specification of assessment tasks is critical is that the particular tasks we include in our assessment will provide the basis for one part of a validity argument: that of content relevance and representativeness. Another reason is that the degree of correspondence between the test tasks and tasks outside the test itself provides a basis for investigating the authenticity of the test tasks.

2 Content relevance and content representativeness

Content relevance and representativeness provide important types of evidence in support of a validation argument. Content relevance refers to the extent to which the areas of ability to be assessed are in fact assessed by the task (Messick, 1989: 39). The investigation of content relevance requires ‘the specification of the behavioral domain in question and the attendant specification of the task or test domain’ (Messick, 1980: 1017). Content representativeness refers to the extent to which the test adequately samples the content domain of interest and provides a basis for investigating what Messick (1989) calls task generalizability and Kane et al. (1999) call extrapolation.

The problem of investigating and demonstrating content relevance and representativeness is two-fold. First, we must identify the TLU domain, which Bachman and Palmer (1996: 44) define as ‘a set of specific language use tasks that the test-taker is likely to encounter outside the test itself, and to which we want our inferences about language ability to generalize’. Then we need to select tasks from that domain which will form the basis for language assessment tasks. In situations where there is a well-specified syllabus content, or even widespread consensus among the various stakeholders on what the TLU domain is, identifying this may be relatively unproblematic. However, there are many situations in which test-takers may come from a variety of backgrounds or have widely ranging language use needs, or in which there is no well-specified course content. In cases
such as these, the TLU domain is not at all clear, or is ill-defined, making it difficult to investigate or provide evidence for content relevance and representativeness.

To address the second issue, that of selecting tasks from the TLU domain upon which to base assessment tasks, a number of researchers (e.g., Long, 1985; Bachman and Palmer, 1996; Norris et al., 1998) have recommended using needs analysis. However, even when a well-defined TLU domain can be identified, selecting specific tasks from within that domain for use as assessment tasks may be problematic. Bachman and Palmer (1996) suggest three reasons why real-life tasks may not always be appropriate as a basis for developing assessment tasks. First, not all TLU tasks will engage the areas of ability we want to assess. Secondly, some TLU tasks may not be practical to administer in an assessment in their entirety. Thirdly, some TLU tasks may not be appropriate or fair for all test-takers if they presuppose prior knowledge or experience that some test-takers may not possess.

A more critical problem is that of sampling adequacy. What criteria can we use as a basis for selecting some tasks and not others from the TLU domain? How do we know that the tasks we select are equally relevant and representative of the TLU domain? Consider, for example, some of the tasks implemented by researchers at the University of Hawaii (Norris et al., 1998; Brown et al., in press), who identify a set of real-world tasks and then modify these in some way to meet the exigencies of assessment. These are hypothetical tasks used for research purposes in order to investigate what one form of TBLPA might look like and how it might occur in order to inform inferences about the accomplishment of target tasks taken from some set of educational objectives, learner needs and so forth (John Norris, personal communication). One of their modified ‘real-world’ tasks is to evaluate menus based on a written description of types of vegetarians and to select the most appropriate restaurant, based on the items listed in their menus (Brown et al., in press: 210):

Item C14
· Situation: You want to take two friends to dinner tonight. They are both vegetarians (they do not eat meat). One is a lacto-ovo vegetarian. The other is a vegan. You need to find out what each of them can and cannot eat.
· Task (part 1): Read the short article on types of vegetarians. Pay close attention to what foods each type will and will not eat. Remember, one friend is a lacto-ovo vegetarian and the other is a vegan. [A short reading passage is provided.]
· Task (part 2): Now look at the menus from three restaurants. Identify only those items on each menu which could be ordered by either of your two friends. For those items that are okay for lacto-ovo, write LO beside them. For those items that are okay for the vegan, write V beside them. When you
have finished, decide which restaurant you will visit, based on which menu gives the most choices for your two friends. Place a check (✓) mark at the top of the one menu you think is best. [Several menus are provided.]
According to Brown et al., this task is a ‘personal-response type’ (Brown et al., in press), is taken from the general content area ‘food and dining’, and the thematic subdivision ‘comparing food types on menus and making a recommendation’ (Norris et al., 1998: 112), which I assume to be their specification of the TLU domain and task type to which this assessment task corresponds. An example task from the same TLU domain, given in Norris et al. (1998: 112), involves a friend who has just had angioplasty and has some dietary restrictions. In this task, the test-taker reads a list of restrictions from the doctor and the menus from restaurants, and is then asked to leave a phone message for the friend, explaining which restaurant seems to be the best choice and describing the options available on the menu.

While these two tasks may be taken from the same TLU domain, they are clearly quite different. The first task involves the content of vegetarian dietary restrictions, while the second includes content about post-operation dietary restrictions. The first task apparently involves the test-taker in simply identifying the appropriate items in the three menus and indicating a choice of menu, while the second task requires an oral production. Thus, these tasks differ in both their topical content and in the kind of response required, so that I would regard their comparability as questionable. If these tasks were to be used in an actual test, there would, in my view, be serious questions about both the generalizability of scores on these two tasks to the universe of assessment tasks these tasks are intended to represent and the extent to which these scores extrapolate to a real-life TLU domain. If a test-taker were to perform these tasks successfully, could we predict that he or she would also be able to select restaurants for a group of international visitors, one of whom is Hindu, one Muslim and a third who is allergic to shellfish: a ‘real-life’ task with which I myself was once faced?

In summary, I would argue that there are serious problems with the claim that TBLPA’s distinctive characteristic is that it enables us to make predictions about future performance. Brown et al. explicitly claim that performance on the assessment task itself is the construct of interest (Brown et al., in press: 15), so that this approach to assessment essentially precludes the possibility of making inferences about test-takers’ ability. I would argue that this problem makes the approach inappropriate for many educational purposes, such as diagnosis and assessing the achievement of learning objectives, which are typically stated in terms of areas of language ability to be learned. Furthermore, the evidence of content relevance and representativeness
that is required to support even the limited use of test scores for prediction is extremely difficult to provide, given the complexity and diversity of most ‘real-life’ TLU domains. Without such evidence, we cannot demonstrate that performance on one assessment task generalizes to other assessment tasks, or that it extrapolates to performance on tasks in the TLU domain. This problem is implicit in Brown et al.’s very limited claim about the use of ratings based on their test tasks: ‘These findings provide initial evidence in support of the use of examinee’s overall task-dependent ratings on a given ALP test form for the purposes of distinguishing among examinees according to their apparent overall abilities to perform the collection of tasks found on these ALP tests’ (Brown et al., in press: 129). In other words, scores from these tasks tell us about test-takers’ ability to perform these test tasks. What this implies, of course, is that we really cannot extrapolate beyond the test, so that even the limited purpose of prediction is unsupported by Brown et al.’s study.

IV ‘How hard is it?’: The difficulty with difficulty

The notion that test items or tasks themselves differ in difficulty is deeply ingrained both in the way we conceptualize the difficulty of assessment tasks and in how we operationalize difficulty in most current measurement models. A common analogy that is used to explain difficulty is to sporting events, such as the high jump, in which different heights of the bar are asserted to correspond to differing degrees of difficulty. While this may be the case on average, it is clearly not the case across individuals. When I was in high school, world-class athletes were regularly beginning to jump higher than six feet. As a high school athlete at the time, my goal was to be able to clear a height of five and a half feet. Bar heights of five feet eight inches and five feet ten inches would have been too difficult for me, and the two inch difference between them inconsequential. Similarly, these heights would have been far too easy for the world-class high jumpers of the day, and the two inch difference inconsequential for them. The point of this example is that difficulty does not reside in the task alone, but is relative to any given test-taker.

The conceptualization of difficulty as a characteristic that resides in test tasks has led researchers to try to understand, explain or predict how difficult a given task will be. Two general approaches have been taken in recent years by researchers in language testing. One approach has been to identify a number of task characteristics that are considered to be essentially independent of ability and then investigate the relationships between these characteristics and empirical indicators of difficulty (e.g., Anderson et al., 1991; Freedle and Kostin,
1993; 1999; Perkins and Brutten, 1993; Bachman et al., 1996). The other approach has been to explicitly identify ‘difficulty features’, which are essentially combinations of ability requirements and task characteristics, that are hypothesized to affect the difficulty of a given task. Studies in this approach have investigated how various individuals – typically ‘experts’ – rate different tasks that include different combinations of these factors, either in terms of difficulty features or their expected difficulty, or how different groups of individuals perform on such tasks (e.g., Skehan, 1996; 1998; Norris et al., 1998; Brown et al., in press; Brindley and Slatyer, this issue; Elder et al., this issue).

All of these studies have sought to clarify the question of task difficulty, many in the hope that this will enable us to design and develop language assessments that are more appropriate for their intended uses and test-takers. Nevertheless, as a whole, these studies provide widely varying and inconsistent results. Studies of the first type, which have used a variety of statistical analyses to investigate the relationships among task characteristics and empirical indicators of task difficulty, have yielded some task characteristics that appear to predict empirical task difficulty for some sets of tasks. Nevertheless, because they have employed widely differing sets of task characteristics and have operationalized these quantitatively in very different ways, it is virtually impossible to generalize across studies. Studies of the second approach, which have investigated differences among different ratings of task difficulty factors, or relationships among these ratings and the performance of different groups of individuals on different tasks, have yielded consistent results: that there is virtually no systematic relationship between a priori estimates of difficulty based on difficulty factors and empirical indicators of difficulty.

Thus, despite the fact that language testers have been researching the relationships among task characteristics and task difficulty for over a decade, the results of this research seem to have brought us no closer to an understanding of this relationship. All of the researchers in these studies have offered possible explanations for these outcomes, most of which are related to methodological limitations in the studies: the participants, or the types of tasks and kinds of analyses used. One possible explanation that I feel warrants serious consideration is that offered by Elder et al. (this issue: 362):

The fact that our results differ so consistently and markedly from those of previous SLA research . . . may have to do with differences between testing and pedagogic contexts, with the former producing a cognitive focus on display rather than on task fulfilment or getting the message across.
The suggestion that assessment tasks may be fundamentally different from pedagogic or, by extension, ‘real-life’ tasks not only raises questions about the validity of assessing certain aspects of language ability with certain types of tasks, which these researchers point out, but also raises a much more general and perplexing question about the generalizability of research with SLA and pedagogic tasks to assessment tasks. While this is a question that language testers must surely consider at some point, it is beyond the scope of this article. What I focus on in the next section is what I consider to be the major reason for our lack of progress in understanding the relationships among the characteristics of language assessment tasks and empirical indicators of task difficulty.

In the section that follows, I argue that the problem with current approaches to investigating task difficulty is two-fold: conceptual and methodological. I argue that most of the ‘difficulty features’ that have been proposed are not inherent in tasks themselves, but are functions of the interactions between a given test-taker and a given test task. Next, I argue that empirical estimates of task difficulty are not estimates of a separate entity, ‘difficulty’, but are themselves artifacts of the interaction between the test-taker’s ability and the characteristics of the task. Finally, I argue that approaches to investigating the effects of task characteristics on test performance that use empirical estimates of difficulty based on current measurement models are unlikely to advance our understanding of how task characteristics affect performance.

V Problems with ‘difficulty factors’

Over the years, measurement theorists (e.g., Cronbach, 1947; Lorge, 1951; Thorndike, 1951; Stanley, 1971; Cronbach et al., 1972) and language testers (e.g., Carroll, 1968; Clark, 1972; Weir, 1983; Bachman, 1990; Bachman and Palmer, 1996) have discussed a large number of sources of variation, or factors that may affect test performance. One example of such a formulation is that provided by Bachman (1990), who identifies several distinct sets of factors (language ability of the test-taker, test-task characteristics, personal characteristics of the test-taker and random/unpredictable factors) which are hypothesized to affect test performance. This formulation recognizes that these factors may well be correlated with each other to some degree (except for the random factors, which are by definition uncorrelated with anything else). In this formulation there is no factor identified as ‘difficulty’.

A different conceptualization has been proposed by Skehan (1996; 1998) and the Hawaii group (Norris et al., 1998; Brown et al., in
press), in which task difficulty is conceptualized as ‘the classification of tasks according to an integration of ability requirements and task characteristics’ (Norris et al., 1998: 47). Skehan (1996; 1998) proposes three sets of features that he hypothesizes affect performance on tasks:
· code complexity: the language required to accomplish the task;
· cognitive complexity: the thinking required to accomplish the task; and
· communicative stress: the performance conditions for accomplishing a task (Skehan, 1998: 88).
Norris et al. (1998) hypothesize that these task difficulty features ‘can affect the difficulty of a given task (ostensibly for all learners, regardless of individual differences), and . . . can be manipulated to increase or decrease task difficulty’ (Norris et al., 1998: 50). This formulation, which hypothesizes that task difficulty is a major determinant of test performance, can be represented as in Figure 2. The following quote from Norris et al. expresses this formulation:

These task difficulty features comprise the ability requirements and task characteristics inherent in a given second language task. As such, they offer a convenient and well-motivated starting point for gradation of task difficulty in order to predict performance outcomes. (Norris et al., 1998: 50; italics in the original)
Figure 2 Difficulty features

Since these task difficulty features are explicitly described as a combination of ability requirements and task characteristics, they would appear to be the primary determinants of task difficulty and, hence, of test performance. This view is consistent with these researchers’
view, discussed above, that the tasks themselves define the construct to be measured. There are two problems with this formulation, in my view. First, the difficulty features confound the effects of the test-taker’s ability with the effects of the test tasks. Secondly, this approach introduces a hypothetical entity, ‘task difficulty’, as a primary determinant of test performance.

Looking at the first problem, it seems to me that only one of the three features posited – code complexity – is uniquely a characteristic of test tasks themselves. The other two – cognitive complexity and communicative stress – entail assumptions about the kinds and amounts of processing required of the test-taker in order to accomplish a given task, and are thus, in my view, functions of the interaction between the test-taker and the task. Cognitive complexity is seen to be a function of two components: cognitive processing and cognitive familiarity, both of which are likely to vary from one test-taker to another. Thus, this feature clearly involves the interaction of the test-taker’s attributes with task characteristics. Similarly, one of the components of communicative stress is the extent to which individuals can control or influence the task (Norris et al., 1998: 50). As the degree of control is likely to depend on a number of test-taker characteristics, such as level of language ability, risk-taking, cognitive style and affect, communicative stress is also likely to vary from one test-taker to another, so that this feature also involves the interaction of test-taker attributes with task characteristics.

The second problem – introducing difficulty as a separate factor – becomes apparent, I believe, when one considers the complexities of a typical performance task, an oral interview, as illustrated in Figure 3.

Figure 3 An expanded model of oral test performance (Source: adapted from Skehan, 1998: 172)

In looking at the various components in this task, and the interactions among them, we need to ask, ‘Wherein lies difficulty?’ Candidates, who will differ in their underlying competencies and ability for use, may find tasks with different qualities and conditions differentially difficult to perform. Different candidates will find different examiners and other interactants differentially easy or difficult to interact with. Different raters may apply the scale criteria differently to different performances, so that they may be differentially lenient or severe. So where is ‘difficulty’? I would argue that difficulty is not a separate factor at all, but ‘resides’ in the interactions among all of these components involved in the assessment.

1 Problems with the way difficulty is operationalized in current measurement models

Although difficulty is operationalized in different ways in different measurement models, all of these are, in my view, problematic. First,
some indicators of difficulty are averages of performance across facets of measurement, and do not consider differential performance of different individuals. The proportion correct (‘p-value’) of classical test theory, for example, is the average performance across test-takers on a given item, while the mean for a given facet of measurement in generalizability theory is an average across test-takers, not an individual estimate.⁴ The IRT b parameter is simply the intercept on the latent ability scale that corresponds to an arbitrarily specified probability of getting the item correct. As with the classical ‘p-value’, this ‘difficulty’ estimate is defined with reference to the probability of getting the item correct. Nevertheless, the item characteristic function clearly illustrates that this probability varies across ability levels. That is, the item characteristic function operationalizes the interaction between the latent ability and performance on a given item, so that what we call ‘difficulty’ is really nothing more than an artifact of this interaction. The multifaceted Rasch b parameter and logit capture essentially the same types of interactions as the IRT b parameter.

⁴ Recent developments in multivariate g-theory have made it possible to estimate means for individual conditions in a given facet of measurement (see, for example, Marcoulides and Drezner, 1993; 1997; Marcoulides, 1999).
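To make the source of this artifact concrete, the display below gives the textbook definitions behind the two empirical indices just mentioned; these are the standard classical test theory and Rasch formulations, not formulas taken from the studies under discussion.

\[ p_i \;=\; \frac{1}{N}\sum_{p=1}^{N} x_{pi} \qquad\qquad P(x_{pi}=1 \mid \theta_p, b_i) \;=\; \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)} \]

The classical p-value p_i is simply the average of the scored responses x_{pi} of whichever N test-takers happened to take item i, so it changes whenever the sample changes. In the Rasch item characteristic function, the probability of success depends jointly on the person parameter \theta_p and the item parameter b_i; an estimate of b_i is therefore recovered entirely from person–item response patterns, which is precisely the interaction described above.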
In summary, in all of these measurement models, ‘difficulty’ is operationalized either as an average of scores on a given task or facet of measurement across a group of test-takers, or as an interaction between the latent trait and performance on a given task. Thus, in current measurement models, ‘difficulty’ is essentially an artifact of test performance, and not a characteristic of assessment tasks themselves. Such unidimensional, empirical definitions of difficulty are thus a serious bottleneck in modelling performance with complex tasks (Robert Mislevy, personal communication).⁵

⁵ For a recent approach to overcoming this problem, see Mislevy et al., 2001.

2 Problem with trying to predict empirical difficulty from task characteristics

Because of the problems discussed above, current approaches to using task characteristics to predict task difficulty are, in my view, unlikely to yield consistent or meaningful results. The approach of using task characteristics to predict empirical estimates of item difficulty, referred to above, is problematic because these item statistics are themselves a function of interactions between test-takers and test tasks. Thus, any correlations between the predictors (task characteristics) and the criterion (empirical estimate of difficulty) will be partly due to autocorrelation, and the predictive power of a given task characteristic, or of a set of task characteristics, cannot be interpreted unambiguously: it can be due either to the effectiveness of the task characteristics as predictors, or to the fact that the predictors are part of the interaction that is in the dependent variable.
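As an illustrative aside rather than part of the original argument, the following toy simulation (in Python, with the task feature, parameter values and sample sizes all invented for the example) shows how an empirical difficulty index inherits the properties of the particular sample under assumed Rasch responses: the same tasks, with identical characteristics, yield different proportion-incorrect values and a different apparent feature–difficulty relationship depending on who takes them.

import numpy as np

rng = np.random.default_rng(0)

# Twenty hypothetical tasks: one observable task feature drives, by
# construction, the latent task parameter in a Rasch response model.
n_tasks = 20
feature = rng.normal(0.0, 1.0, n_tasks)
latent_b = 0.8 * feature + rng.normal(0.0, 0.3, n_tasks)

def empirical_difficulty(theta, b):
    """Proportion incorrect per task for a sample with abilities theta."""
    p_correct = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    responses = rng.random(p_correct.shape) < p_correct
    return 1.0 - responses.mean(axis=0)

# Two samples of test-takers who differ only in ability: one well matched
# to the tasks, one for whom nearly every task is too hard (compare the
# high-jump example above).
samples = {
    "matched sample":    rng.normal(0.0, 0.5, 1000),
    "mismatched sample": rng.normal(-3.0, 0.5, 1000),
}

for label, theta in samples.items():
    diff = empirical_difficulty(theta, latent_b)
    slope = np.polyfit(feature, diff, 1)[0]  # apparent effect of the feature
    print(f"{label}: mean difficulty {diff.mean():.2f}, "
          f"spread {diff.max() - diff.min():.2f}, "
          f"slope on feature {slope:.2f}")

The point is not the particular numbers but that the dependent variable is computed from person-by-task response patterns, so what looks like the predictive power of a task characteristic cannot be separated from the ability distribution of the sample that generated the criterion.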
VI A way forward?

Although there are clearly differences in the field, in terms of assumptions, hypotheses and research approaches, I believe that there is considerable consensus among language testing researchers and practitioners that the issue of how assessment tasks affect performance is critical to language performance assessment. This issue is central because it has clear implications for the way in which we interpret and use test results, and for the validity of these interpretations and uses. In addition, a better understanding of the effects of assessment tasks is crucial for the way we design and develop performance assessment tasks. I do not believe that language
testers are all likely to follow the same ‘paradigm’, either in terms of the task characteristics they choose to investigate or the analytical methods they use. Furthermore, given the value of diverse approaches to a common research issue, I do not believe that it would be desirable for everyone to approach this issue in the same way. What I would like to propose are some suggestions for conceptualizing and implementing research into task effects that I believe will lead us forward.

1 General suggestions

1) Conceptualize tasks as sets of characteristics, rather than as holistic entities.
2) Clearly distinguish among three sets of factors that can affect test performance:
· characteristics inherent in the task itself: I would propose that task characteristics are those that can be determined simply by examining the task, and that require no assumptions regarding test-takers or how they may interact with the task (e.g., how they might process the task, how familiar they might be with the task content or conditions, or how they might appraise the task affectively);
· attributes of test-takers; and
· interactions between test-takers and task characteristics.
3) Conceptualize interactions as interactions.
Cognitive complexity and communicative stress, for example, could be seen not as different factors that affect performance directly, but as interactions among the other determinants of test performance (i.e., language ability, personal attributes and task characteristics). I would argue that there are two advantages to this formulation. First, it recognizes the direct effects of test-taker attributes and task characteristics on test performance, and clearly distinguishes between these. Second, it explicitly recognizes that test-takers’ interactions with task characteristics may also affect performance.

2 Research

1) State research hypotheses in a way that explicitly specifies the effects of the various factors that can affect test performance.
2) Conduct research that systematically varies the facets of the assessment procedure so as to examine the interactions among these and between these and test-takers.
3) Collect data on test-takers’ characteristics (e.g., background, nature of prior exposure to language).
4) Collect data on test-takers’ responses to individual assessment tasks (e.g., scores, ratings, oral discourse, written texts).
5) Collect data on the processes or strategies that test-takers use in responding to assessment tasks (e.g., verbal protocols, observations, questionnaires, interviews, discourse that is created in the assessment process, such as speech or writing samples, physiological and neurophysiological responses) and utilize appropriate qualitative analyses to investigate the ways in which test-takers process language assessment tasks.
6) Utilize alternatives to current measurement models for modelling quantitative data. Two promising methodologies, in my view, are structural equation modelling, particularly a latent variable approach to multivariate g-theory (e.g., Marcoulides, 1996; Marcoulides and Drezner, 1997), and evidence-centred design (e.g., Almond et al., 2001; Mislevy et al., in press; this issue).
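To indicate what the quantitative side of suggestion 6 involves, the display below sketches the familiar one-facet person-by-task (p × t) design from generalizability theory; this is the standard textbook decomposition rather than a specification drawn from any of the studies cited, and the symbol n_t (the number of tasks an examinee takes) is introduced here only for the illustration.

\[ \sigma^2(X_{pt}) \;=\; \sigma^2_p + \sigma^2_t + \sigma^2_{pt,e} \qquad\qquad E\rho^2 \;=\; \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pt,e}/n_t} \]

Only \sigma^2_t treats ‘difficulty’ as a property of tasks alone; the confounded person-by-task (plus error) component \sigma^2_{pt,e} is where the interactions discussed throughout this article surface empirically, and it is this component, not \sigma^2_t, that depresses the generalizability coefficient for relative decisions when the number of tasks is small. A latent variable (structural equation modelling) approach to the same design allows these variance components to be estimated and the interaction term to be examined directly.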
3 Test design

While it is important that research continues, practical test development cannot wait for definitive answers. Furthermore, even if we had a full understanding of the effects of various task characteristics and how test-takers interact with these, it is not clear to what extent it is feasible to incorporate this knowledge into the design and development of a particular assessment.

1) Ensure that test design is both construct-based and task-based: Although I have argued in this article that an approach to test design that is solely based on the specification of tasks is at best extremely limited, and at worst fraught with what I see as intractable problems of generalizability and extrapolation, I would also point out that an approach that is solely based on the definition of abilities is also highly problematic. In my view, any test design that ignores either task specification or construct definition is likely to lead to results that are not useful for their intended purpose.
2) Develop and refine a set of task characteristics that are uniquely attributable to facets of the particular assessment being designed, and use these as a basis for test design: Such task characteristics can be based on some existing framework, or can be derived from an analysis of the TLU domain, using techniques of needs analysis, including task analysis. Although a number of frameworks have been proposed in the literature, test-developers need to select those characteristics that are most relevant to their own particular testing situation.
3) Construct a definition of the specific areas of language ability to
be assessed: This definition may involve a number of distinct components, or it may entail a global definition. The definition may be based on some theoretical model of language ability, or on the content of a course syllabus, or it may be derived from a needs analysis of the TLU domain. It may attempt to measure all or parts of either the theoretical model, the syllabus content or the abilities required in the TLU domain.
4) Use a variety of procedures to collect information about test performance, along with a variety of analytic approaches, both quantitative and qualitative, to tease out and describe in rich detail the interactions among specific tasks and specific test-takers.
VII Conclusions

Understanding the effects of assessment tasks on test performance and how test-takers interact with these tasks is, in my view, the most pressing issue facing language performance assessment. A fundamental aim of most language performance assessments is to present test-takers with tasks that correspond to tasks in ‘real-world’ settings, and that will engage test-takers in language use or the creation of discourse. Performance assessments are typically designed to assess complex abilities that cannot easily be defined in terms of a single trait, and typically present test-takers with tasks that are much more complex than traditional constructed-response items. Drawing on the conceptualization of tasks in the research on SLA and pedagogy, some advocates of ‘task-based language performance assessment’ argue for a reorientation in both the way we interpret test scores and how we design language tests. I have pointed out what I believe to be intractable problems with a solely task-based approach, and have argued that the most useful assessments in all situations will be those that are based on the planned integration of both tasks and constructs in the way they are designed, developed and used. Clearly, task specification will also present challenges to such an integrated approach. Nevertheless, an integrated approach, in my view, makes it possible for test-users to make a variety of inferences about the capacity for language use that test-takers have, or about what they can or cannot do. Furthermore, it makes available to test-developers the full range of validation arguments that can be developed in support of a given inference or use.

Acknowledgements

This is a revised version of a response paper given at the symposium ‘Putting tasks to the test’ at the 22nd Annual Language Testing
Research Colloquium, Vancouver, 2000. I am grateful to Bob Mislevy, George Marcoulides and John Norris for their comments on an earlier version of this article. I am also grateful to J.D. Brown, Thom Hudson, John Norris and William Bonk for making the manuscript of their forthcoming book available to me and for permission to use one of their tasks as an example in my article.

VIII References

Alderson, J.C. 1983: Response to Harrison: who needs jam? In Hughes, A. and Porter, D., editors, Current developments in language testing. London: Academic Press, 87–92.
—— 2000: Assessing reading ability. Cambridge: Cambridge University Press.
Alderson, J.C., Clapham, C. and Wall, D. 1995: Language test construction and evaluation. Cambridge: Cambridge University Press.
Almond, R.G., Steinberg, L.S. and Mislevy, R.J. 2001: A sample assessment using the four process framework (CSE Technical Report 543). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Center for Studies in Education.
Anderson, N., Bachman, L.F., Cohen, A.D. and Perkins, K. 1991: An exploratory study into the construct validity of a reading comprehension test: triangulation of data sources. Language Testing 8, 41–66.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
—— 2000: Some construct validity issues in interpreting scores from performance assessments of language ability. In Cooper, R.L., Shohamy, E. and Walters, J., editors, New perspectives and issues in educational language policy: a festschrift for Bernard Dov Spolsky. Amsterdam: John Benjamins.
Bachman, L.F., Davidson, F. and Milanovic, M. 1996: The use of test methods in the content analysis and design of EFL proficiency tests. Language Testing 13, 125–50.
Bachman, L.F. and Palmer, A.S. 1996: Language testing in practice: designing and developing useful language tests. Oxford: Oxford University Press.
Brindley, G. 1988: Factors affecting task difficulty. In Nunan, D., editor, The learner-centred curriculum. Cambridge: Cambridge University Press.
—— 1989: Assessing achievement in the learner-centred curriculum. Sydney: National Centre for English Language Teaching and Research, Macquarie University.
—— 1994: Task-centred assessment in language learning: the promise and the challenge. In Bird, N., Falvey, P., Tsui, A., Allison, D. and McNeill, A., editors, Language and learning: papers presented at the Annual International Language in Education Conference (Hong Kong, 1993). Hong Kong: Hong Kong Education Department, 73–94.
Brown, J.D. 1996: Testing in language programs. Upper Saddle River, NJ: Prentice Hall Regents.
Brown, J.D., Hudson, T.D., Norris, J.M. and Bonk, W. in press: An investigation of second language task-based performance assessments. Honolulu, HI: University of Hawaii Press.
Buck, G. 2001: Assessing listening. Cambridge: Cambridge University Press.
Candlin, C. 1987: Towards task-based learning. In Candlin, C. and Murphy, D.F., editors, Language learning tasks. Englewood Cliffs, NJ: Prentice-Hall.
Carroll, J.B. 1968: The psychology of language testing. In Davies, A., editor, Language testing symposium: a psycholinguistic approach. London: Oxford University Press, 46–69.
Chalhoub-Deville, M. 2001: Task-based assessments: characteristics and validity evidence. In Bygate, M., Skehan, P. and Swain, M., editors, Researching pedagogic tasks: second language learning, teaching and testing. Harlow, England: Longman, 210–28.
Chapelle, C. 1998: Construct definition and validity inquiry in SLA research. In Bachman, L.F. and Cohen, A.D., editors, Interfaces between second language acquisition and language testing research. New York: Cambridge University Press, 32–70.
Clark, J.L.D. 1972: Foreign language testing: theory and practice. Philadelphia, PA: Center for Curriculum Development.
—— 1975: Theoretical and technical considerations in oral proficiency testing. In Jones, R.L. and Spolsky, B., editors, Testing language proficiency. Arlington, VA: Center for Applied Linguistics, 10–24.
Cronbach, L.J. 1947: Test reliability: its meaning and determination. Psychometrika 12, 1–16.
Cronbach, L.J., Gleser, G.C., Nanda, H. and Rajaratnam, N. 1972: The dependability of behavioral measurements: theory of generalizability for scores and profiles. New York: John Wiley.
Crookes, G.V. and Gass, S. 1993a: Tasks and language learning: integrating theory and practice. Clevedon: Multilingual Matters.
—— 1993b: Tasks in a pedagogical context: integrating theory and practice. Clevedon: Multilingual Matters.
Davidson, F. and Lynch, B.K. 2002: Testcraft: a teacher’s guide to writing and using language test specifications. New Haven, CT: Yale University Press.
Douglas, D. 2000: Assessing language for specific purposes: theory and practice. Cambridge: Cambridge University Press.
Freedle, R. and Kostin, I. 1993: The prediction of TOEFL reading item difficulty: implications for construct validity. Language Testing 10, 133–70.
—— 1999: Does the text matter in a multiple-choice test of comprehension? The case for the construct validity of TOEFL’s minitalks. Language Testing 16, 2–32.
Kane, M., Crooks, T. and Cohen, A. 1999: Validating measures of performance. Educational Measurement: Issues and Practice 18, 5–17.
Kramsch, C. 1986: From language proficiency to interactional competence. The Modern Language Journal 70, 366–72.
Lantolf, J.P. and Frawley, W. 1985: Oral proficiency testing: a critical analysis. Modern Language Journal 69, 337–45.
—— 1988: Proficiency: understanding the construct. Studies in Second Language Acquisition 10, 181–95.
Linn, R.L., Baker, E.L. and Dunbar, S.B. 1991: Complex, performance-based assessment: expectations and validation criteria. Educational Researcher 20, 15–21.
Long, M.H. 1985: A role for instruction in second language acquisition: task-based language teaching. In Hyltenstam, K. and Pienemann, M., editors, Modelling and assessing second language acquisition. Clevedon: Multilingual Matters, 77–99.
Long, M.H. and Crookes, G.V. 1992: Three approaches to task-based syllabus design. TESOL Quarterly 26, 27–56.
Long, M.H. and Norris, J.M. 2000: Task-based language teaching and assessment. In Byram, M., editor, Encyclopedia of language teaching. London: Routledge, 597–603.
Lorge, I. 1951: The fundamental nature of measurement. In Lindquist, E.F., editor, Educational Measurement. Washington, DC: American Council on Education, 533–59.
Marcoulides, G.A. 1996: Estimating variance components in generalizability theory: the covariance structure analysis approach. Structural Equation Modeling 3, 290–99.
—— 1999: Generalizability theory: picking up where the Rasch IRT model leaves off? In Embretson, S.E. and Hershberger, S.L., editors, The new rules of measurement: what every psychologist and educator should know. Mahwah, NJ: Lawrence Erlbaum Associates, 129–52.
Marcoulides, G.A. and Drezner, Z. 1993: A procedure for transforming points in multi-dimensional space to a two-dimensional representation. Educational and Psychological Measurement 53, 933–40.
—— 1997: A method for analyzing performance assessments. In Wilson, M., Draney, K. and Engelhard, G., editors, Objective measurement: theory into practice. Norwood, NJ: Ablex, 261–77.
McNamara, T. 1997: Performance testing. In Clapham, C. and Corson, D., editors, Encyclopedia of language and education, Volume 7: Language testing and assessment. Dordrecht: Kluwer Academic Publishers, 131–39.
McNamara, T.F. 1996: Measuring second language performance. London: Longman.
Messick, S. 1980: Test validity and the ethics of assessment. American Psychologist 35, 1012–27.
—— 1989: Validity. In Linn, R.L., editor, Educational Measurement. 3rd edition. New York: American Council on Education and Macmillan, 13–103.
—— 1994a: Alternative modes of assessment, uniform standards of validity. Paper presented at the Conference on Evaluating
Alternatives to Traditional Testing for Selection, Bowling Green State University, October 25–26.
—— 1994b: The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher 23, 13–23.
—— 1995: Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice 14, 5–8.
Mislevy, R.J., Steinberg, L.S. and Almond, R.G. in press: On the structure of assessment arguments. Measurement: Interdisciplinary Research and Perspectives 1.
Nichols, P. and Sugrue, B. 1999: The lack of fidelity between cognitively complex constructs and conventional test development practice. Educational Measurement: Issues and Practice 18, 18–29.
Norris, J.M., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments (SLTCC Technical Report 18). Honolulu: Second Language Teaching and Curriculum Center, University of Hawaii at Manoa.
Nunan, D. 1989: Designing tasks for the communicative classroom. Cambridge: Cambridge University Press.
Perkins, K. and Brutten, S.R. 1993: A model of ESL reading comprehension difficulty. In Huhta, A., Sajavaara, K. and Takala, S., editors, Language testing: new openings. Jyväskylä: University of Jyväskylä, 205–18.
Pica, T., Kanagy, R. and Falodun, J. 1993: Choosing and using communication tasks for second language instruction. In Crookes, G. and Gass, S.M., editors, Tasks and language learning: integrating theory and practice. Clevedon: Multilingual Matters, 9–34.
Read, J. 2000: Assessing vocabulary. Cambridge: Cambridge University Press.
Shohamy, E. and Reves, T. 1985: Authentic language tests: where from and where to? Language Testing 2, 48–59.
Skehan, P. 1996: A framework for the implementation of task-based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Spolsky, B. 1985: The limits of authenticity in language testing. Language Testing 2, 31–40.
Stanley, J. 1971: Reliability. In Thorndike, R.L., editor, Educational Measurement. 2nd edition. Washington, DC: American Council on Education, 356–442.
Tarone, E. 1998: Research on interlanguage variation: implications for language testing. In Bachman, L.F. and Cohen, A.D., editors, Interfaces between second language acquisition and language testing research. Cambridge: Cambridge University Press.
Thorndike, R.L. 1951: Reliability. In Lindquist, E.F., editor, Educational Measurement. Washington, DC: American Council on Education, 569–620.
Upshur, J.A. 1979: Functional proficiency theory and a research role for
language tests. In Brière, E. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL, 75–100.
van Lier, L. 1989: Reeling, writhing, drawling, stretching, and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Weir, C.J. 1983: The Associated Examining Board’s Test in English for Academic Purposes: an exercise in content validation. In Hughes, A. and Porter, D., editors, Current developments in language testing. London: Academic Press, 147–53.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and He, A., editors, 1998: Talking and testing. Amsterdam: John Benjamins Publishing Company.
Young, R. and Milanovic, M. 1992: Discourse validation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.